16 Dec, 2009

40 commits

  • Commit c04fc586c (mm: show node to memory section relationship with
    symlinks in sysfs) created symlinks from nodes to memory sections, e.g.

    /sys/devices/system/node/node1/memory135 -> ../../memory/memory135

    If you're examining the memory section though and are wondering what node
    it might belong to, you can find it by grovelling around in sysfs, but
    it's a little cumbersome.

    Add a reverse symlink for each memory section that points back to the
    node to which it belongs.
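
    A minimal sketch of the reverse link, assuming the sysdev-based node and
    memory_block structures of that era (the helper name is illustrative, not
    the exact code from the commit):

        /* memoryX/ gains a symlink back to its owning nodeY/ */
        static int link_mem_section_to_node(struct memory_block *mem_blk,
                                            struct node *node)
        {
                return sysfs_create_link(&mem_blk->sysdev.kobj,
                                         &node->sysdev.kobj,
                                         kobject_name(&node->sysdev.kobj));
        }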

    Signed-off-by: Alex Chiang
    Cc: Gary Hade
    Cc: Badari Pulavarty
    Cc: Ingo Molnar
    Acked-by: David Rientjes
    Cc: Greg KH
    Cc: Randy Dunlap
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Chiang
     
  • When do_nonlinear_fault() realizes that the page table must have been
    corrupted for it to have been called, it does print_bad_pte() and returns
    ... VM_FAULT_OOM, which is hard to understand.

    It made some sense when I did it for 2.6.15, when do_page_fault() just
    killed the current process; but nowadays it lets the OOM killer decide who
    to kill - so page table corruption in one process would be liable to kill
    another.

    Change it to return VM_FAULT_SIGBUS instead: that doesn't guarantee that
    the process will be killed, but is good enough for such a rare
    abnormality, accompanied as it is by the "BUG: Bad page map" message.

    And recent HWPOISON work has copied that code into do_swap_page(), when it
    finds an impossible swap entry: fix that to VM_FAULT_SIGBUS too.
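
    Roughly, the change amounts to (simplified from the mm/memory.c path):

        if (unlikely(!(vma->vm_flags & VM_NONLINEAR))) {
                /*
                 * Page table corruption: report it, then punish only the
                 * faulting process instead of invoking the OOM killer.
                 */
                print_bad_pte(vma, address, orig_pte, NULL);
                return VM_FAULT_SIGBUS;         /* was VM_FAULT_OOM */
        }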

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Nick Piggin
    Reviewed-by: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Wu Fengguang
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • CONFIG_DEBUG_SPINLOCK adds 12 or 16 bytes to a 32- or 64-bit spinlock_t,
    and CONFIG_DEBUG_LOCK_ALLOC adds another 12 or 24 bytes to it: lockdep
    enables both of those, and CONFIG_LOCK_STAT adds 8 or 16 bytes to that.

    When 2.6.15 placed the split page table lock inside struct page (usually
    sized 32 or 56 bytes), only CONFIG_DEBUG_SPINLOCK was a possibility, and
    we ignored the enlargement (but fitted in CONFIG_GENERIC_LOCKBREAK's 4 by
    letting the spinlock_t occupy both page->private and page->mapping).

    Should these debugging options be allowed to double the size of a struct
    page, when only one minority use of the page (as a page table) needs to
    fit a spinlock in there? Perhaps not.

    Take the easy way out: switch off SPLIT_PTLOCK_CPUS when DEBUG_SPINLOCK or
    DEBUG_LOCK_ALLOC is in force. I've sometimes tried to be cleverer,
    kmallocing a cacheline for the spinlock when it doesn't fit, but given up
    each time. Falling back to mm->page_table_lock (as we do when ptlock is
    not split) lets lockdep check out the strictest path anyway.

    And now that some arches allow 8192 cpus, use 999999 for infinity.

    (What has this got to do with KSM swapping? It doesn't care about the
    size of struct page, but may care about random junk in page->mapping - to
    be explained separately later.)
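
    The effect shows up in how the pte lock is selected (a sketch of the
    include/linux/mm.h logic; CONFIG_SPLIT_PTLOCK_CPUS now defaults to 999999
    when DEBUG_SPINLOCK or DEBUG_LOCK_ALLOC is set):

        #define USE_SPLIT_PTLOCKS  (NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)

        #if USE_SPLIT_PTLOCKS
        /* per-page-table lock, embedded in struct page */
        #define pte_lockptr(mm, pmd)  ({ (void)(mm); \
                                         &pmd_page(*(pmd))->ptl; })
        #else
        /* fall back to the mm-wide lock, which lockdep can still check */
        #define pte_lockptr(mm, pmd)  ({ (void)(pmd); \
                                         &(mm)->page_table_lock; })
        #endif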

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Nick Piggin
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Wu Fengguang
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • KSM swapping will know where page_referenced_one() and try_to_unmap_one()
    should look. It could hack page->index to get them to do what it wants,
    but it seems cleaner now to pass the address down to them.

    Make the same change to page_mkclean_one(), since it follows the same
    pattern; but there's no real need in its case.
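
    The callees then take the address explicitly (signatures sketched from
    mm/rmap.c after this change):

        static int page_referenced_one(struct page *page,
                        struct vm_area_struct *vma, unsigned long address,
                        unsigned int *mapcount, unsigned long *vm_flags);

        static int try_to_unmap_one(struct page *page,
                        struct vm_area_struct *vma, unsigned long address,
                        enum ttu_flags flags);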

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Nick Piggin
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Wu Fengguang
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove three degrees of obfuscation, left over from when we had
    CONFIG_UNEVICTABLE_LRU. MLOCK_PAGES is CONFIG_HAVE_MLOCKED_PAGE_BIT is
    CONFIG_HAVE_MLOCK is CONFIG_MMU. rmap.o (and memory-failure.o) are only
    built when CONFIG_MMU, so don't need such conditions at all.

    Somehow, I feel no compulsion to remove the CONFIG_HAVE_MLOCK* lines from
    169 defconfigs: leave those to evolve in due course.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Nick Piggin
    Reviewed-by: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Wu Fengguang
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • There's contorted mlock/munlock handling in try_to_unmap_anon() and
    try_to_unmap_file(), which we'd prefer not to repeat for KSM swapping.
    Simplify it by moving it all down into try_to_unmap_one().

    One thing is then lost: try_to_munlock()'s distinction between when no vma
    holds the page mlocked, and when a vma does mlock it, but we could not get
    mmap_sem to set the page flag. But its only caller takes no interest in
    that distinction (and is better testing SWAP_MLOCK anyway), so let's keep
    the code simple and return SWAP_AGAIN for both cases.

    try_to_unmap_file()'s TTU_MUNLOCK nonlinear handling was particularly
    amusing: once unravelled, it turns out to have been choosing between two
    different ways of doing the same nothing. Ah, no, one way was actually
    returning SWAP_FAIL when it meant to return SWAP_SUCCESS.

    [kosaki.motohiro@jp.fujitsu.com: comment adding to mlocking in try_to_unmap_one]
    [akpm@linux-foundation.org: remove test of MLOCK_PAGES]
    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Nick Piggin
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Wu Fengguang
    Cc: Minchan Kim
    Signed-off-by: KOSAKI Motohiro
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • At present we define PageAnon(page) by the low PAGE_MAPPING_ANON bit set
    in page->mapping, with the higher bits a pointer to the anon_vma; and have
    defined PageKsm(page) as that with NULL anon_vma.

    But KSM swapping will need to store a pointer there: so in preparation for
    that, now define PAGE_MAPPING_FLAGS as the low two bits, including
    PAGE_MAPPING_KSM (always set along with PAGE_MAPPING_ANON, until some
    other use for the bit emerges).

    Declare page_rmapping(page) to return the pointer part of page->mapping,
    and page_anon_vma(page) to return the anon_vma pointer when that's what it
    is. Use these in a few appropriate places: notably, unuse_vma() has been
    testing page->mapping, but is better off testing page_anon_vma() (cases
    may be added in which flag bits are set without any pointer).
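
    A sketch of the new encoding and accessors (close to the actual
    include/linux/mm.h additions):

        #define PAGE_MAPPING_ANON   1
        #define PAGE_MAPPING_KSM    2
        #define PAGE_MAPPING_FLAGS  (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM)

        static inline void *page_rmapping(struct page *page)
        {
                /* strip the low flag bits to recover the bare pointer */
                return (void *)((unsigned long)page->mapping &
                                                ~PAGE_MAPPING_FLAGS);
        }

        static inline struct anon_vma *page_anon_vma(struct page *page)
        {
                if (((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) !=
                                                PAGE_MAPPING_ANON)
                        return NULL;    /* file-backed, or KSM */
                return page_rmapping(page);
        }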

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Nick Piggin
    Cc: KOSAKI Motohiro
    Reviewed-by: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Wu Fengguang
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • If reclaim fails to make sufficient progress, the priority is raised.
    Once the priority is higher, kswapd starts waiting on congestion.
    However, if the zone is below the min watermark then kswapd needs to
    continue working without delay as there is a danger of an increased rate
    of GFP_ATOMIC allocation failure.

    This patch changes the conditions under which kswapd waits on congestion
    by only going to sleep if the min watermarks are being met.
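
    In balance_pgdat() the wait becomes conditional, roughly as follows
    (helper and counter names as in the vmscan.c of this series):

        if (total_scanned && priority < DEF_PRIORITY - 2) {
                if (has_under_min_watermark_zone)
                        /* below min: keep working, GFP_ATOMIC is at risk */
                        count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT);
                else
                        congestion_wait(BLK_RW_ASYNC, HZ/10);
        }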

    [mel@csn.ul.ie: add stats to track how relevant the logic is]
    [mel@csn.ul.ie: make kswapd only check its own zones and rename the relevant counters]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • After kswapd balances all zones in a pgdat, it goes to sleep. In the
    event of no IO congestion, kswapd can go to sleep very shortly after the
    high watermark was reached. If there is a constant stream of allocations
    from parallel processes, it can mean that kswapd went to sleep too quickly
    and the high watermark is not being maintained for a sufficient length of
    time.

    This patch makes kswapd go to sleep as a two-stage process. It first
    tries to sleep for HZ/10. If it is woken up by another process or the
    high watermark is no longer met, it's considered a premature sleep and
    kswapd continues work. Otherwise it goes fully to sleep.

    This adds more counters to distinguish between fast and slow breaches of
    watermarks. A "fast" premature sleep is one where the low watermark was
    hit in a very short time after kswapd going to sleep. A "slow" premature
    sleep indicates that the high watermark was breached after a very short
    interval.
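
    The two-stage sleep in kswapd(), in outline (simplified; the counter
    names shown are the post-rename ones):

        prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
        if (!sleeping_prematurely(pgdat, order, remaining))
                remaining = schedule_timeout(HZ/10);    /* trial catnap */
        if (!sleeping_prematurely(pgdat, order, remaining))
                schedule();                             /* sleep for real */
        else if (remaining)
                count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);   /* "fast" */
        else
                count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);  /* "slow" */
        finish_wait(&pgdat->kswapd_wait, &wait);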

    Signed-off-by: Mel Gorman
    Cc: Frans Pop
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    When the code jumps to the `out' label, `referenced' is still zero. So there is
    no need to check it.

    Signed-off-by: Huang Shijie
    Acked-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie
     
  • Just simplify the code when `mlocked' is true.

    Signed-off-by: Huang Shijie
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie
     
  • Fix the comment for try_to_unmap_anon() with the new arguments.

    Signed-off-by: Huang Shijie
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie
     
  • Commit 543ade1fc9 ("Streamline generic_file_* interfaces and filemap
    cleanups") removed generic_file_write() in filemap. Change the comment in
    vmscan pageout() to __generic_file_aio_write().

    Signed-off-by: Vincent Li
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vincent Li
     
  • Seems that page_io.c doesn't really need to know that page_private(page)
    is the swp_entry 'val'. Rework map_swap_page() to do what its name says
    and map a page to a page offset in the swap space.

    The only other caller of map_swap_page() is internal to mm/swapfile.c and
    it does want to map a swap entry to the 'sector'. So rename
    map_swap_page() to map_swap_entry(), make it 'static' and implement
    map_swap_page() as a wrapper around that.
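
    The wrapper that results is tiny (as in mm/swapfile.c):

        sector_t map_swap_page(struct page *page, struct block_device **bdev)
        {
                swp_entry_t entry;
                entry.val = page_private(page);
                return map_swap_entry(entry, bdev);
        }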

    Signed-off-by: Lee Schermerhorn
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Reorder (and comment) the fields of swap_info_struct, to make better
    use of its cachelines: it's good for swap_duplicate() in particular
    if unsigned int max and swap_map are near the start.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • While we're fiddling with the swap_map values, let's assign a particular
    value to shmem/tmpfs swap pages: their swap counts are never incremented,
    and it helps swapoff's try_to_unuse() a little if it can immediately
    distinguish those pages from process pages.

    Since we've no use for SWAP_MAP_BAD | COUNT_CONTINUED,
    we might as well use that 0xbf value for SWAP_MAP_SHMEM.
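
    For orientation, the unsigned char swap_map encoding after this series
    (values as in the 2.6.33-era headers):

        #define SWAP_HAS_CACHE  0x40    /* page is in swap cache */
        #define COUNT_CONTINUED 0x80    /* count continues off-map */
        #define SWAP_MAP_MAX    0x3e    /* highest in-place count */
        #define SWAP_MAP_BAD    0x3f    /* unusable entry */
        #define SWAP_MAP_SHMEM  0xbf    /* SWAP_MAP_BAD | COUNT_CONTINUED */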

    Signed-off-by: Hugh Dickins
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Swap is duplicated (reference count incremented by one) whenever the same
    swap page is inserted into another mm (when forking finds a swap entry in
    place of a pte, or when reclaim unmaps a pte to insert the swap entry).

    swap_info_struct's vmalloc'ed swap_map is the array of these reference
    counts: but what happens when the unsigned short (or unsigned char since
    the preceding patch) is full? (and its high bit is kept for a cache flag)

    We then lose track of it, never freeing, leaving it in use until swapoff:
    at which point we _hope_ that a single pass will have found all instances,
    assume there are no more, and will lose user data if we're wrong.

    Swapping of KSM pages has not yet been enabled; but it is implemented,
    and makes it very easy for a user to overflow the maximum swap count:
    possible with ordinary process pages, but unlikely, even when pid_max
    has been raised from PID_MAX_DEFAULT.

    This patch implements swap count continuations: when the count overflows,
    a continuation page is allocated and linked to the original vmalloc'ed
    map page, and this is used to hold the continuation counts for that entry
    and its neighbours. These continuation pages are seldom referenced:
    the common paths all work on the original swap_map, only referring to
    a continuation page when the low "digit" of a count is incremented or
    decremented through SWAP_MAP_MAX.
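
    Callers then retry on overflow, roughly as in the new swap_duplicate():

        int swap_duplicate(swp_entry_t entry)
        {
                int err = 0;

                /*
                 * __swap_duplicate() refuses with -ENOMEM when the in-place
                 * count would pass SWAP_MAP_MAX; attach a continuation page
                 * and try again.
                 */
                while (!err && __swap_duplicate(entry, 1) == -ENOMEM)
                        err = add_swap_count_continuation(entry, GFP_ATOMIC);
                return err;
        }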

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Halve the vmalloc'ed swap_map array from unsigned shorts to unsigned
    chars: it's still very unusual to reach a swap count of 126, and the
    next patch allows it to be extended indefinitely.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Though swap_count() is useful, I'm finding that swap_has_cache() and
    encode_swapmap() obscure what happens in the swap_map entry, just at
    those points where I need to understand it. Remove them, and pass
    more usable "usage" values to scan_swap_map(), swap_entry_free() and
    __swap_duplicate(), instead of the SWAP_MAP and SWAP_CACHE enum.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Move CONFIG_HIBERNATION's swapdev_block() into the main CONFIG_HIBERNATION
    block, remove extraneous whitespace and return, fix typo in a comment.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
    Make better use of the space by folding the first swap_extent into its
    swap_info_struct, instead of just the list_head: swap partitions need
    only that one, and for others it's used as a circular list anyway.

    [jirislaby@gmail.com: fix crash on double swapon]
    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Jiri Slaby
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The swap_info_struct is only 76 or 104 bytes, but it does seem wrong
    to reserve an array of about 30 of them in bss, when most people will
    want only one. Change swap_info[] to an array of pointers.

    That does need a "type" field in the structure: pack it as a char with
    next type and short prio (aha, char is unsigned by default on PowerPC).
    Use the (admittedly peculiar) name "type" throughout for this index.

    /proc/swaps does not take swap_lock: I wouldn't want it to, but do take
    care with barriers when adding a new item to the array (never removed).
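
    Publication of a new entry is ordered so that a lock-free reader never
    sees a half-initialised slot (sketch of the swapon-side code):

        p->type = type;
        swap_info[type] = p;
        /*
         * Write swap_info[type] before nr_swapfiles: a racing /proc/swaps
         * reader checks nr_swapfiles first, then follows the pointer.
         */
        smp_wmb();
        nr_swapfiles++;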

    Signed-off-by: Hugh Dickins
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The swap_info_struct is mostly private to mm/swapfile.c, with only
    one other in-tree user: get_swap_bio(). Adjust its interface to
    map_swap_page(), so that we can then remove get_swap_info_struct().

    But there is a popular user out-of-tree, TuxOnIce: so leave the
    declaration of swap_info_struct in linux/swap.h.

    Signed-off-by: Hugh Dickins
    Cc: Nigel Cunningham
    Cc: KAMEZAWA Hiroyuki
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • - avoid wasting more precious resources (DMA or DMA32 pools), when
      being called through vmalloc_32{,_user}()
    - explicitly allow using high memory here even if the outer allocation
      request doesn't allow it
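
    A sketch of the idea in __vmalloc_area_node() (simplified; exact helper
    names vary): the page-pointer array for a large area is itself allocated
    via a nested vmalloc, and that nested request keeps only the
    reclaim-related gfp bits while adding __GFP_HIGHMEM:

        /* GFP_RECLAIM_MASK as in mm/internal.h; DMA/DMA32 zone modifiers
         * from the outer request must not apply to this bookkeeping array */
        gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;

        if (array_size > PAGE_SIZE)     /* too big for kmalloc */
                pages = __vmalloc(array_size, nested_gfp | __GFP_HIGHMEM,
                                  PAGE_KERNEL);
        else
                pages = kmalloc(array_size, nested_gfp);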

    Signed-off-by: Jan Beulich
    Acked-by: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Beulich
     
  • Objects passed to NODEMASK_ALLOC() are relatively small in size and are
    backed by slab caches that are not of large order, traditionally never
    greater than PAGE_ALLOC_COSTLY_ORDER.

    Thus, using GFP_KERNEL for these allocations on large machines when
    CONFIG_NODES_SHIFT > 8 will cause the page allocator to loop endlessly in
    the allocation attempt, each time invoking both direct reclaim and the oom
    killer.

    This is of particular interest when using NODEMASK_ALLOC() from a
    mempolicy context (either directly in mm/mempolicy.c or the mempolicy
    constrained hugetlb allocations) since the oom killer always kills current
    when allocations are constrained by mempolicies. So for all present use
    cases in the kernel, current would end up being oom killed when direct
    reclaim fails. That would allow the NODEMASK_ALLOC() to succeed but
    current would have sacrificed itself upon returning.

    This patch adds gfp flags to NODEMASK_ALLOC() to pass to kmalloc() on
    CONFIG_NODES_SHIFT > 8; this parameter is a nop on other configurations.
    All current use cases, either directly from hugetlb code or indirectly via
    NODEMASK_SCRATCH(), union __GFP_NORETRY into the flags to avoid direct
    reclaim and the oom killer when the slab allocator needs to allocate
    additional pages.

    The side-effect of this change is that all current use cases of either
    NODEMASK_ALLOC() or NODEMASK_SCRATCH() need appropriate -ENOMEM handling
    when the allocation fails (never for CONFIG_NODES_SHIFT <= 8).
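
    The macro pair then looks roughly like this (sketched from the
    include/linux/nodemask.h of this series):

        #if NODES_SHIFT > 8 /* nodemask_t is too large for the stack */
        #define NODEMASK_ALLOC(type, name, gfp_flags)   \
                        type *name = kmalloc(sizeof(*name), gfp_flags)
        #define NODEMASK_FREE(m)                kfree(m)
        #else
        #define NODEMASK_ALLOC(type, name, gfp_flags)   \
                        type _##name, *name = &_##name
        #define NODEMASK_FREE(m)                do {} while (0)
        #endif
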
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Mel Gorman
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Cc: Andi Kleen
    Cc: David Rientjes
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Offload the registration and unregistration of per node hstate sysfs
    attributes to a worker thread rather than attempt the
    allocation/attachment or detachment/freeing of the attributes in the
    context of the memory hotplug handler.

    I don't know that this is absolutely required, but the registration can
    sleep in allocations and other mem hot plug handlers do it this way. If
    it turns out this is NOT required, we can drop this patch.

    N.B., Only tested build, boot, libhugetlbfs regression.
    i.e., no memory hotplug testing.

    Signed-off-by: Lee Schermerhorn
    Reviewed-by: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Mel Gorman
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Cc: David Rientjes
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Register per node hstate attributes only for nodes with memory. As
    suggested by David Rientjes.

    With Memory Hotplug, memory can be added to a memoryless node and a node
    with memory can become memoryless. Therefore, add a memory on/off-line
    notifier callback to [un]register a node's attributes on transition
    to/from memoryless state.

    N.B., Only tested build, boot, libhugetlbfs regression.
    i.e., no memory hotplug testing.

    Signed-off-by: Lee Schermerhorn
    Reviewed-by: Andi Kleen
    Acked-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Mel Gorman
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • When memory is hot-removed, its node must be cleared in N_HIGH_MEMORY if
    there are no present pages left.

    In such a situation, kswapd must also be stopped since it has nothing left
    to do.
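
    In outline (a sketch of the memory_hotplug.c offline path; kswapd_stop()
    is the helper this patch adds to vmscan.c):

        if (!node_present_pages(node)) {
                node_clear_state(node, N_HIGH_MEMORY);
                kswapd_stop(node);      /* nothing left for it to reclaim */
        }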

    Signed-off-by: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: Yasunori Goto
    Cc: Mel Gorman
    Cc: Rafael J. Wysocki
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Mel Gorman
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Cc: Andi Kleen
    Cc: David Rientjes
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Register per node hstate sysfs attributes only for nodes with memory.
    Global replacement of "all online nodes" with "all nodes with memory" in
    mm/hugetlb.c. Suggested by David Rientjes.

    A subsequent patch will handle adding/removing of per node hstate sysfs
    attributes when nodes transition to/from memoryless state via memory
    hotplug.

    NOTE: this patch has not been tested with memoryless nodes.

    Signed-off-by: Lee Schermerhorn
    Reviewed-by: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Acked-by: David Rientjes
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Update the kernel huge tlb documentation to describe the numa memory
    policy based huge page management. Additionally, the patch includes a fair
    amount of rework to improve consistency, eliminate duplication and set the
    context for documenting the memory policy interaction.

    Signed-off-by: Lee Schermerhorn
    Acked-by: David Rientjes
    Acked-by: Mel Gorman
    Reviewed-by: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Add the per huge page size control/query attributes to the per node
    sysdevs:

    /sys/devices/system/node/node<nid>/hugepages/hugepages-<size>/
    nr_hugepages - r/w
    free_huge_pages - r/o
    surplus_huge_pages - r/o

    The patch attempts to re-use/share as much of the existing global hstate
    attribute initialization and handling, and the "nodes_allowed" constraint
    processing as possible.

    Calling set_max_huge_pages() with no node indicates a change to global
    hstate parameters. In this case, any non-default task mempolicy will be
    used to generate the nodes_allowed mask. A valid node id indicates an
    update to that node's hstate parameters, and the count argument specifies
    the target count for the specified node. From this info, we compute the
    target global count for the hstate and construct a nodes_allowed node mask
    containing only the specified node.

    Setting the node specific nr_hugepages via the per node attribute
    effectively ignores any task mempolicy or cpuset constraints.

    With this patch:

    (me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB
    ./ ../ free_hugepages nr_hugepages surplus_hugepages

    Starting from:
    Node 0 HugePages_Total: 0
    Node 0 HugePages_Free: 0
    Node 0 HugePages_Surp: 0
    Node 1 HugePages_Total: 0
    Node 1 HugePages_Free: 0
    Node 1 HugePages_Surp: 0
    Node 2 HugePages_Total: 0
    Node 2 HugePages_Free: 0
    Node 2 HugePages_Surp: 0
    Node 3 HugePages_Total: 0
    Node 3 HugePages_Free: 0
    Node 3 HugePages_Surp: 0
    vm.nr_hugepages = 0

    Allocate 16 persistent huge pages on node 2:
    (me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages

    [Note that this is equivalent to:
    numactl -m 2 hugeadm --pool-pages-min 2M:+16
    ]

    Yields:
    Node 0 HugePages_Total: 0
    Node 0 HugePages_Free: 0
    Node 0 HugePages_Surp: 0
    Node 1 HugePages_Total: 0
    Node 1 HugePages_Free: 0
    Node 1 HugePages_Surp: 0
    Node 2 HugePages_Total: 16
    Node 2 HugePages_Free: 16
    Node 2 HugePages_Surp: 0
    Node 3 HugePages_Total: 0
    Node 3 HugePages_Free: 0
    Node 3 HugePages_Surp: 0
    vm.nr_hugepages = 16

    Global controls work as expected--reduce pool to 8 persistent huge pages:
    (me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

    Node 0 HugePages_Total: 0
    Node 0 HugePages_Free: 0
    Node 0 HugePages_Surp: 0
    Node 1 HugePages_Total: 0
    Node 1 HugePages_Free: 0
    Node 1 HugePages_Surp: 0
    Node 2 HugePages_Total: 8
    Node 2 HugePages_Free: 8
    Node 2 HugePages_Surp: 0
    Node 3 HugePages_Total: 0
    Node 3 HugePages_Free: 0
    Node 3 HugePages_Surp: 0

    Signed-off-by: Lee Schermerhorn
    Acked-by: Mel Gorman
    Reviewed-by: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Cc: David Rientjes
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Move definition of NUMA_NO_NODE from ia64 and x86_64 arch specific headers
    to generic header 'linux/numa.h' for use in generic code. NUMA_NO_NODE
    replaces bare '-1' where it's used in this series to indicate "no node id
    specified". Ultimately, it can be used to replace the -1 elsewhere where
    it is used similarly.
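
    The definition itself is a one-liner; generic code can then write
    nid == NUMA_NO_NODE rather than a bare -1:

        /* linux/numa.h */
        #define NUMA_NO_NODE    (-1)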

    Signed-off-by: Lee Schermerhorn
    Acked-by: David Rientjes
    Acked-by: Mel Gorman
    Reviewed-by: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • This patch derives a "nodes_allowed" node mask from the numa mempolicy of
    the task modifying the number of persistent huge pages to control the
    allocation, freeing and adjusting of surplus huge pages when the pool page
    count is modified via the new sysctl or sysfs attribute
    "nr_hugepages_mempolicy". The nodes_allowed mask is derived as follows:

    * For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
    is produced. This will cause the hugetlb subsystem to use
    node_online_map as the "nodes_allowed". This preserves the
    behavior before this patch.
    * For "preferred" mempolicy, including explicit local allocation,
    a nodemask with the single preferred node will be produced.
    "local" policy will NOT track any internode migrations of the
    task adjusting nr_hugepages.
    * For "bind" and "interleave" policy, the mempolicy's nodemask
    will be used.
    * Other than to inform the construction of the nodes_allowed node
    mask, the actual mempolicy mode is ignored. That is, all modes
    behave like interleave over the resulting nodes_allowed mask
    with no "fallback".

    See the updated documentation [next patch] for more information
    about the implications of this patch.

    Examples:

    Starting with:

    Node 0 HugePages_Total: 0
    Node 1 HugePages_Total: 0
    Node 2 HugePages_Total: 0
    Node 3 HugePages_Total: 0

    Default behavior [with or without this patch] balances persistent
    hugepage allocation across nodes [with sufficient contiguous memory]:

    sysctl vm.nr_hugepages[_mempolicy]=32

    yields:

    Node 0 HugePages_Total: 8
    Node 1 HugePages_Total: 8
    Node 2 HugePages_Total: 8
    Node 3 HugePages_Total: 8

    Of course, we only have nr_hugepages_mempolicy with the patch,
    but with default mempolicy, nr_hugepages_mempolicy behaves the
    same as nr_hugepages.

    Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
    '--membind' because it allows multiple nodes to be specified
    and it's easy to type]--we can allocate huge pages on
    individual nodes or sets of nodes. So, starting from the
    condition above, with 8 huge pages per node, add 8 more to
    node 2 using:

    numactl -m 2 sysctl vm.nr_hugepages_mempolicy=40

    This yields:

    Node 0 HugePages_Total: 8
    Node 1 HugePages_Total: 8
    Node 2 HugePages_Total: 16
    Node 3 HugePages_Total: 8

    The incremental 8 huge pages were restricted to node 2 by the
    specified mempolicy.

    Similarly, we can use mempolicy to free persistent huge pages
    from specified nodes:

    numactl -m 0,1 sysctl vm.nr_hugepages_mempolicy=32

    yields:

    Node 0 HugePages_Total: 4
    Node 1 HugePages_Total: 4
    Node 2 HugePages_Total: 16
    Node 3 HugePages_Total: 8

    The 8 huge pages freed were balanced over nodes 0 and 1.

    [rientjes@google.com: accommodate reworked NODEMASK_ALLOC]
    Signed-off-by: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Acked-by: Mel Gorman
    Reviewed-by: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Factor init_nodemask_of_node() out of the nodemask_of_node() macro.

    This will be used to populate the huge pages "nodes_allowed" nodemask for
    a single node when basing nodes_allowed on a preferred/local mempolicy or
    when a persistent huge page pool page count is modified via a per node
    sysfs attribute.
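
    The factored-out helper (essentially as added to linux/nodemask.h):

        static inline void init_nodemask_of_node(nodemask_t *mask, int node)
        {
                nodes_clear(*mask);
                node_set(node, *mask);
        }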

    Signed-off-by: Lee Schermerhorn
    Acked-by: Mel Gorman
    Reviewed-by: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Acked-by: David Rientjes
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • In preparation for constraining huge page allocation and freeing by the
    controlling task's numa mempolicy, add a "nodes_allowed" nodemask pointer
    to the allocate, free and surplus adjustment functions. For now, pass
    NULL to indicate default behavior--i.e., use node_online_map. A
    subsequent patch will derive a non-default mask from the controlling
    task's numa mempolicy.

    Note that this method of updating the global hstate nr_hugepages under the
    constraint of a nodemask simplifies keeping the global state
    consistent--especially the number of persistent and surplus pages relative
    to reservations and overcommit limits. There are undoubtedly other ways
    to do this, but this works for both interfaces: mempolicy and per node
    attributes.

    [rientjes@google.com: fix HIGHMEM compile error]
    Signed-off-by: Lee Schermerhorn
    Reviewed-by: Mel Gorman
    Acked-by: David Rientjes
    Reviewed-by: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Cc: Andi Kleen
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Modify the hstate_next_node* functions to allow them to be called to
    obtain the "start_nid". Then, whereas prior to this patch we
    unconditionally called hstate_next_node_to_{alloc|free}(), whether or not
    we successfully allocated/freed a huge page on the node, now we only call
    these functions on failure to alloc/free, to advance to the next allowed node.

    Factor out the next_node_allowed() function to handle wrap at end of
    node_online_map. In this version, the allowed nodes include all of the
    online nodes.
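
    The wrap-around helper, roughly as factored out here (nodes_allowed is
    still node_online_map at this point in the series):

        static int next_node_allowed(int nid, nodemask_t *nodes_allowed)
        {
                nid = next_node(nid, *nodes_allowed);
                if (nid == MAX_NUMNODES)
                        nid = first_node(*nodes_allowed);
                VM_BUG_ON(nid >= MAX_NUMNODES);
                return nid;
        }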

    Signed-off-by: Lee Schermerhorn
    Reviewed-by: Mel Gorman
    Acked-by: David Rientjes
    Reviewed-by: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Cc: Andi Kleen
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • This is a series of patches to provide control over the location of the
    allocation and freeing of persistent huge pages on a NUMA platform.
    Please consider for merging into mmotm.

    This series uses two mechanisms to constrain the nodes from which
    persistent huge pages are allocated: 1) the task NUMA mempolicy of the
    task modifying a new sysctl "nr_hugepages_mempolicy", based on a
    suggestion by Mel Gorman; and 2) a subset of the hugepages hstate sysfs
    attributes have been added [in V4] to each node system device under:

    /sys/devices/system/node/node[0-9]*/hugepages

    The per node attributes allow direct assignment of a huge page count on a
    specific node, regardless of the task's mempolicy or cpuset constraints.

    This patch:

    NODEMASK_ALLOC(x, m) assumes x is a type of struct, which is unnecessary.
    It's perfectly reasonable to use this macro to allocate a nodemask_t,
    which is anonymous, either dynamically or on the stack depending on
    NODES_SHIFT.

    Signed-off-by: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Cc: Andi Kleen
    Cc: David Rientjes
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
    Christoph pointed out inc_zone_page_state(NR_ISOLATED) should be placed
    right after isolate_page().

    This patch does it.

    Reviewed-by: Christoph Lameter
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Signed-off-by: Wu Fengguang
    Cc: Andi Kleen
    Cc: Avi Kivity
    Cc: Greg Kroah-Hartman
    Cc: Johannes Berg
    Cc: Marcelo Tosatti
    Cc: Mark Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Also rename "len" to "sz". No behavior change.

    Signed-off-by: Wu Fengguang
    Cc: Andi Kleen
    Cc: Avi Kivity
    Cc: Greg Kroah-Hartman
    Cc: Johannes Berg
    Cc: Marcelo Tosatti
    Cc: Mark Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang