14 Nov, 2014

1 commit

  • The current pageblock isolation logic can isolate each pageblock
    individually. This causes a freepage accounting problem if a freepage
    of pageblock order on an isolated pageblock is merged with another
    freepage on a normal pageblock. We can prevent such merging by
    restricting the maximum merge order to pageblock order when the
    freepage is on an isolated pageblock.

    A side effect of this change is that an un-merged buddy freepage may
    remain even after pageblock isolation finishes, because undoing the
    isolation merely moves freepages from the isolate buddy list to the
    normal buddy list without considering merging. So the patch also makes
    the undo path consider freepage merging: when un-isolating, a freepage
    with more than pageblock order and its buddy are checked, and if they
    are on a normal pageblock the freepage is isolated and freed again,
    instead of just moved, so that it gets merged. A sketch of the
    merge-order limit follows this entry.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Yasuaki Ishimatsu
    Cc: Zhang Yanfei
    Cc: Tang Chen
    Cc: Naoya Horiguchi
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Wen Congyang
    Cc: Marek Szyprowski
    Cc: Michal Nazarewicz
    Cc: Laura Abbott
    Cc: Heesub Shin
    Cc: "Aneesh Kumar K.V"
    Cc: Ritesh Harjani
    Cc: Gioh Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
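
    A minimal C sketch of the merge-order limit described in this entry.
    max_merge_order() is a hypothetical helper written here only for
    illustration; the actual patch open-codes an equivalent check in the
    buddy freeing path, and only is_migrate_isolate(), pageblock_order and
    MAX_ORDER are real kernel symbols.

    static inline unsigned int max_merge_order(int migratetype)
    {
            /*
             * A freepage in an isolated pageblock must never merge past
             * pageblock_order, or the merged buddy would span a normal
             * pageblock and break freepage accounting.
             */
            if (is_migrate_isolate(migratetype))
                    return pageblock_order;

            return MAX_ORDER - 1;       /* the usual buddy merge limit */
    }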
     

10 Oct, 2014

4 commits

  • struct compact_control currently converts the gfp mask to a migratetype,
    but we need the entire gfp mask in a follow-up patch.

    Pass the entire gfp mask as part of struct compact_control.

    Signed-off-by: David Rientjes
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Zhang Yanfei
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The migration scanner skips PageBuddy pages, but does not consider their
    order as checking page_order() is generally unsafe without holding the
    zone->lock, and acquiring the lock just for the check wouldn't be a good
    tradeoff.

    Still, this could avoid some iterations over the rest of the buddy
    page, and if we are careful, the race window between the PageBuddy()
    check and page_order() is small, and the worst thing that can happen
    is that we skip too much and miss some isolation candidates. This is
    not that bad, as compaction can already fail for many other reasons,
    such as parallel allocations, and those have a much larger race window.

    This patch therefore makes the migration scanner obtain the buddy page
    order and use it to skip the whole buddy page, if the order appears to be
    in the valid range.

    It's important that the page_order() is read only once, so that the value
    used in the checks and in the pfn calculation is the same. But in theory
    the compiler can replace the local variable by multiple inlines of
    page_order(). Therefore, the patch introduces page_order_unsafe() that
    uses ACCESS_ONCE to prevent this.

    Testing with stress-highalloc from mmtests shows a 15% reduction in
    the number of pages scanned by the migration scanner. The reduction is
    >60% with __GFP_NO_KSWAPD allocations, along with success rates better
    by a few percent.

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Zhang Yanfei
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
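
    Roughly what the helper and its use in the migration scanner look
    like, per the description above; this is a sketch, not the verbatim
    patch, and the surrounding scanner loop is omitted.

    /* Read the order once, without zone->lock; only usable as a hint. */
    static inline unsigned long page_order_unsafe(struct page *page)
    {
            return ACCESS_ONCE(page_private(page));
    }

    /* In the migration scanner: skip what appears to be a buddy page. */
    if (PageBuddy(page)) {
            unsigned long freepage_order = page_order_unsafe(page);

            if (freepage_order > 0 && freepage_order < MAX_ORDER)
                    low_pfn += (1UL << freepage_order) - 1;
            continue;
    }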
     
  • Async compaction aborts when it detects zone lock contention or when
    need_resched() is true. David Rientjes has reported that in practice,
    most direct async compactions for THP allocation abort due to
    need_resched(). This means that a second direct compaction is never
    attempted, which might be OK for a page fault, but khugepaged is
    intended to attempt a sync compaction in such cases, and with this
    behaviour it won't.

    This patch replaces "bool contended" in compact_control with an int that
    distinguishes between aborting due to need_resched() and aborting due to
    lock contention. This allows propagating the abort through all compaction
    functions as before, but passing the abort reason up to
    __alloc_pages_slowpath() which decides when to continue with direct
    reclaim and another compaction attempt.

    Another problem is that try_to_compact_pages() did not act upon the
    reported contention (either need_resched() or lock contention)
    immediately and would proceed with another zone from the zonelist.
    When need_resched() is true, that means initializing compaction of
    another zone, only to check need_resched() again in
    isolate_migratepages() and abort. For zone lock contention, the
    unintended consequence is that the lock contended status reported back
    to the allocator is determined by the last zone where compaction was
    attempted, which is rather arbitrary.

    This patch fixes the problem in the following way:
    - async compaction of a zone aborting due to need_resched() or fatal signal
    pending means that further zones should not be tried. We report
    COMPACT_CONTENDED_SCHED to the allocator.
    - aborting zone compaction due to lock contention means we can still
    try another zone, since it has a different set of locks. We report
    back COMPACT_CONTENDED_LOCK only if compaction was aborted due to lock
    contention in *all* zones where it was attempted.

    As a result of these fixes, khugepaged will proceed with second sync
    compaction as intended, when the preceding async compaction aborted due to
    need_resched(). Page fault compactions aborting due to need_resched()
    will spare some cycles previously wasted by initializing another zone
    compaction only to abort again. Lock contention will be reported only
    when compaction in all zones aborted due to lock contention, and therefore
    it's not a good idea to try again after reclaim.

    In stress-highalloc from mmtests configured to use __GFP_NO_KSWAPD, this
    has improved number of THP collapse allocations by 10%, which shows
    positive effect on khugepaged. The benchmark's success rates are
    unchanged as it is not recognized as khugepaged. Numbers of compact_stall
    and compact_fail events have however decreased by 20%, with
    compact_success still a bit improved, which is good. With benchmark
    configured not to use __GFP_NO_KSWAPD, there is 6% improvement in THP
    collapse allocations, and only slight improvement in stalls and failures.

    [akpm@linux-foundation.org: fix warnings]
    Reported-by: David Rientjes
    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
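
    An approximate sketch of the new state, following the description
    above; field placement and the exact comments are assumptions.

    /* Why compaction gave up, so the allocator can decide what to do next. */
    enum compact_contended {
            COMPACT_CONTENDED_NONE = 0, /* no contention detected */
            COMPACT_CONTENDED_SCHED,    /* need_resched() or fatal signal */
            COMPACT_CONTENDED_LOCK,     /* zone lock or lru_lock contended */
    };

    struct compact_control {
            /* ... other fields ... */
            enum compact_contended contended;   /* was: bool contended */
    };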
     
  • isolate_migratepages_range() is the main function of the compaction
    scanner, called either on a single pageblock by isolate_migratepages()
    during regular compaction, or on an arbitrary range by CMA's
    __alloc_contig_migrate_range(). It currently performs two
    pageblock-wide compaction suitability checks, and because of the CMA
    callpath, it tracks whether it crossed a pageblock boundary in order
    to repeat those checks.

    However, closer inspection shows that those checks are always true for CMA:
    - isolation_suitable() is true because CMA sets cc->ignore_skip_hint to true
    - migrate_async_suitable() check is skipped because CMA uses sync compaction

    We can therefore move the compaction-specific checks to
    isolate_migratepages() and simplify isolate_migratepages_range().
    Furthermore, we can mimic the freepage scanner family of functions,
    which have the isolate_freepages_block() function called both by
    compaction from isolate_freepages() and by CMA from
    isolate_freepages_range(), where each use case adds its own specific
    glue code. This allows further code simplification.

    Thus, we rename isolate_migratepages_range() to
    isolate_migratepages_block() and limit its functionality to a single
    pageblock (or its subset). For CMA, a new different
    isolate_migratepages_range() is created as a CMA-specific wrapper for the
    _block() function. The checks specific to compaction are moved to
    isolate_migratepages(). As part of the unification of these two families
    of functions, we remove the redundant zone parameter where applicable,
    since zone pointer is already passed in cc->zone.

    Furthermore, going back to compact_zone() and compact_finished() when
    a pageblock is found unsuitable (now by isolate_migratepages()) is
    wasteful - the checks are meant to skip pageblocks quickly. The patch
    therefore also introduces a simple loop into isolate_migratepages() so
    that it does not return immediately on failed pageblock checks, but
    keeps going until isolate_migratepages_range() gets called once.
    Similarly to isolate_freepages(), the function periodically checks
    whether it needs to reschedule or abort async compaction.

    [iamjoonsoo.kim@lge.com: fix isolated page counting bug in compaction]
    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
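
    A simplified sketch of the CMA-facing wrapper described above: walk
    the requested range one pageblock-sized chunk at a time and hand each
    chunk to isolate_migratepages_block(). Error handling and the exact
    isolation flags are simplified assumptions, not the verbatim patch.

    unsigned long
    isolate_migratepages_range(struct compact_control *cc,
                               unsigned long start_pfn, unsigned long end_pfn)
    {
            unsigned long pfn = start_pfn;
            unsigned long block_end_pfn;

            while (pfn < end_pfn) {
                    /* Never cross a pageblock boundary in a single call. */
                    block_end_pfn = min(ALIGN(pfn + 1, pageblock_nr_pages),
                                        end_pfn);

                    pfn = isolate_migratepages_block(cc, pfn, block_end_pfn,
                                                     ISOLATE_UNEVICTABLE);
                    if (!pfn)
                            break;      /* isolation was aborted */
            }
            return pfn;
    }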
     

07 Aug, 2014

1 commit


05 Jun, 2014

5 commits

  • Compaction uses compact_checklock_irqsave() function to periodically check
    for lock contention and need_resched() to either abort async compaction,
    or to free the lock, schedule and retake the lock. When aborting,
    cc->contended is set to signal the contended state to the caller. Two
    problems have been identified in this mechanism.

    First, compaction also calls cond_resched() directly in both scanners
    when no lock is yet taken. This call neither aborts async compaction
    nor sets cc->contended appropriately. This patch introduces a new
    compact_should_abort() function to achieve both. In isolate_freepages(),
    the check frequency is reduced to once per SWAP_CLUSTER_MAX pageblocks
    to match what the migration scanner does in the preliminary page
    checks. In case a pageblock is found suitable for calling
    isolate_freepages_block(), the checks inside it are done at a higher
    frequency.

    Second, isolate_freepages() does not check if isolate_freepages_block()
    aborted due to contention, and advances to the next pageblock. This
    violates the principle of aborting on contention, and might result in
    pageblocks not being scanned completely, since the scanning cursor is
    advanced. This problem has been noticed in the code by Joonsoo Kim when
    reviewing related patches. This patch makes isolate_freepages_block()
    check the cc->contended flag and abort.

    In case isolate_freepages() has already isolated some pages before
    aborting due to contention, page migration will proceed, which is OK
    since we do not want to waste the work that has been done, and page
    migration has its own checks for contention. However, we do not want
    another isolation attempt by either of the scanners, so a check of the
    cc->contended flag is also added to compaction_alloc() and
    compact_finished() to make sure compaction is aborted right after the
    migration.

    The outcome of the patch should be reduced lock contention by async
    compaction and lower latencies for higher-order allocations where direct
    compaction is involved.

    [akpm@linux-foundation.org: fix typo in comment]
    Reported-by: Joonsoo Kim
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Naoya Horiguchi
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Michal Nazarewicz
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Acked-by: Michal Nazarewicz
    Tested-by: Shawn Guo
    Tested-by: Kevin Hilman
    Tested-by: Stephen Warren
    Tested-by: Fabio Estevam
    Cc: David Rientjes
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
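
    Approximately what the new helper does, per the description above (a
    sketch, not the verbatim patch; at this point cc->contended was still
    a bool and cc->mode held the migration mode).

    static inline bool compact_should_abort(struct compact_control *cc)
    {
            /* async compaction aborts if contended */
            if (need_resched()) {
                    if (cc->mode == MIGRATE_ASYNC) {
                            cc->contended = true;
                            return true;        /* caller must abort */
                    }
                    cond_resched();             /* sync modes just yield */
            }
            return false;
    }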
     
  • In a previous commit ("mm: use the light version __mod_zone_page_state
    in mlocked_vma_newpage()") an irq-unsafe __mod_zone_page_state() is
    used. As suggested by Andrew, to reduce the risk that new call sites
    use mlocked_vma_newpage() incorrectly without realizing they are
    introducing races, this patch folds mlocked_vma_newpage() into its
    only call site, page_add_new_anon_rmap(), so that the logic is
    open-coded and people can see what is going on.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Jianyu Zhan
    Suggested-by: Andrew Morton
    Suggested-by: Hugh Dickins
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan
     
  • mlocked_vma_newpage() is called with the pte lock (a spinlock) held,
    which implies preemption is disabled, and the vm stat counter is not
    modified from interrupt context, so we need not use the irq-safe
    mod_zone_page_state() here; using the lightweight
    __mod_zone_page_state() is OK (see the sketch after this entry).

    This patch also documents __mod_zone_page_state() and some of its
    callsites. The comment above __mod_zone_page_state() is from Hugh
    Dickins, and acked by Christoph.

    Most of the credit goes to Hugh and Christoph for clarifying the usage
    of __mod_zone_page_state().

    [akpm@linux-foundation.org: coding-style fixes]
    Suggested-by: Andrew Morton
    Acked-by: Hugh Dickins
    Signed-off-by: Jianyu Zhan
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan
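
    The distinction, sketched: both calls update the same per-zone vmstat
    counter, but only the plain version pays for disabling interrupts.
    The function names are the real kernel APIs; the NR_MLOCK example
    context is illustrative.

    /* IRQ-safe: callable from any context, disables/restores interrupts. */
    mod_zone_page_state(page_zone(page), NR_MLOCK, hpage_nr_pages(page));

    /*
     * Light version: the caller guarantees no preemption/interrupt races,
     * e.g. because the pte lock (a spinlock) is already held.
     */
    __mod_zone_page_state(page_zone(page), NR_MLOCK, hpage_nr_pages(page));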
     
  • We're going to want to manipulate the migration mode for compaction in the
    page allocator, and currently compact_control's sync field is only a bool.

    Currently, we only do MIGRATE_ASYNC or MIGRATE_SYNC_LIGHT compaction
    depending on the value of this bool. Convert the bool to enum
    migrate_mode and pass the migration mode in directly. Later, we'll want
    to avoid MIGRATE_SYNC_LIGHT for thp allocations in the pagefault patch to
    avoid unnecessary latency.

    This also alters compaction triggered from sysfs, either for the entire
    system or for a node, to force MIGRATE_SYNC.

    [akpm@linux-foundation.org: fix build]
    [iamjoonsoo.kim@lge.com: use MIGRATE_SYNC in alloc_contig_range()]
    Signed-off-by: David Rientjes
    Suggested-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Greg Thelen
    Cc: Naoya Horiguchi
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
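
    The enum itself already existed in include/linux/migrate_mode.h; the
    change is to carry it in compact_control instead of a bool. A sketch
    follows; the comments paraphrase the intent rather than quote the
    header.

    enum migrate_mode {
            MIGRATE_ASYNC,          /* never block, abort on contention */
            MIGRATE_SYNC_LIGHT,     /* may block, but avoid the worst stalls */
            MIGRATE_SYNC,           /* fully synchronous migration */
    };

    struct compact_control {
            /* ... other fields ... */
            enum migrate_mode mode; /* replaces: bool sync */
    };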
     
  • mm/memory.c is overloaded: over 4k lines. The get_user_pages() code is
    pretty much self-contained, so let's move it to a separate file.

    No other changes made.

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

08 Apr, 2014

2 commits

  • Commit f9acc8c7b35a ("readahead: sanify file_ra_state names") left
    ra_submit with a single function call.

    Move ra_submit to internal.h and inline it to save some stack. Thanks
    to Andrew Morton for commenting on the different versions.

    Signed-off-by: Fabian Frederick
    Suggested-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
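
    Approximately the inlined helper after the move (a sketch based on the
    description above; the exact argument list of
    __do_page_cache_readahead() is as I recall it, not verbatim).

    /* Submit the readahead request described by file_ra_state. */
    static inline unsigned long ra_submit(struct file_ra_state *ra,
                    struct address_space *mapping, struct file *filp)
    {
            return __do_page_cache_readahead(mapping, filp,
                                             ra->start, ra->size,
                                             ra->async_size);
    }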
     
  • On NUMA systems, a node may start thrashing cache or even swap anonymous
    pages while there are still free pages on remote nodes.

    This is a result of commits 81c0a2bb515f ("mm: page_alloc: fair zone
    allocator policy") and fff4068cba48 ("mm: page_alloc: revert NUMA aspect
    of fair allocation policy").

    Before those changes, the allocator would first try all allowed zones,
    including those on remote nodes, before waking any kswapds. But now,
    the allocator fastpath doubles as the fairness pass, which in turn can
    only consider the local node to prevent remote spilling based on
    exhausted fairness batches alone. Remote nodes are only considered in
    the slowpath, after the kswapds are woken up. But if remote nodes still
    have free memory, kswapd should not be woken to rebalance the local node
    or it may thrash cache or swap prematurely.

    Fix this by adding one more unfair pass over the zonelist that is
    allowed to spill to remote nodes after the local fairness pass fails but
    before entering the slowpath and waking the kswapds.

    This also gets rid of the GFP_THISNODE exemption from the fairness
    protocol because the unfair pass is no longer tied to kswapd, which
    GFP_THISNODE is not allowed to wake up.

    However, because remote spills can be more frequent now - we prefer them
    over local kswapd reclaim - the allocation batches on remote nodes could
    underflow more heavily. When resetting the batches, use
    atomic_long_read() directly instead of zone_page_state() to calculate the
    delta as the latter filters negative counter values.

    Signed-off-by: Johannes Weiner
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Cc: [3.12+]

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

30 Jan, 2014

1 commit

  • The VM is currently heavily tuned to avoid swapping. Whether that is
    good or bad is a separate discussion, but as long as the VM won't swap
    to make room for dirty cache, we can not consider anonymous pages when
    calculating the amount of dirtyable memory, the baseline to which
    dirty_background_ratio and dirty_ratio are applied.

    A simple workload that occupies a significant size (40+%, depending on
    memory layout, storage speeds etc.) of memory with anon/tmpfs pages and
    uses the remainder for a streaming writer demonstrates this problem. In
    that case, the actual cache pages are a small fraction of what is
    considered dirtyable overall, which results in a relatively large
    portion of the cache pages being dirtied. As kswapd starts rotating
    these, random tasks enter direct reclaim and stall on IO.

    Only consider free pages and file pages dirtyable.

    Signed-off-by: Johannes Weiner
    Reported-by: Tejun Heo
    Tested-by: Tejun Heo
    Reviewed-by: Rik van Riel
    Cc: Mel Gorman
    Cc: Wu Fengguang
    Reviewed-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
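
    The resulting baseline, roughly: free pages plus file LRU pages only.
    This is a sketch of the idea; the real function also subtracts the
    dirty balance reserve and handles highmem, which is omitted here.

    static unsigned long global_dirtyable_memory(void)
    {
            unsigned long x;

            x  = global_page_state(NR_FREE_PAGES);
            x += global_page_state(NR_INACTIVE_FILE);
            x += global_page_state(NR_ACTIVE_FILE);

            /* Anonymous/tmpfs pages are deliberately not counted anymore. */

            return x + 1;       /* make sure we never return 0 */
    }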
     

24 Jan, 2014

3 commits

  • Developers occasionally try to optimise PFN scanners by using
    page_order but miss that, in general, it requires zone->lock. This has
    happened twice for compaction.c and been rejected both times. This
    patch clarifies the documentation of page_order and adds a note to
    compaction.c explaining why page_order is not used there.

    [akpm@linux-foundation.org: tweaks]
    [lauraa@codeaurora.org: Corrected a page_zone(page)->lock reference]
    Signed-off-by: Mel Gorman
    Acked-by: Rafael Aquini
    Acked-by: Minchan Kim
    Cc: Laura Abbott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • min_free_kbytes may be raised during THP's initialization. Sometimes
    this will change a value that was set by the user. Showing a message
    in that case will clear up the confusion.

    Per Michal Hocko's suggestion, only show this message when changing a
    value which was set by the user.

    Per Dave Hansen's suggestion, also show the old value of
    min_free_kbytes. This gives the user the chance to restore the old
    value of min_free_kbytes.

    Signed-off-by: Han Pingtian
    Reviewed-by: Michal Hocko
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Han Pingtian
     
  • Most of the VM_BUG_ON assertions are performed on a page. Usually, when
    one of these assertions fails we'll get a BUG_ON with a call stack and
    the registers.

    Based on recent requests to add a small piece of code that dumps the
    page at various VM_BUG_ON sites, I've noticed that the page dump is
    quite useful to people debugging issues in mm.

    This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what
    VM_BUG_ON() does, also dumps the page before executing the actual
    BUG_ON.

    [akpm@linux-foundation.org: fix up includes]
    Signed-off-by: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
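
    The shape of the added macro, approximately; the exact dump_page()
    signature and the !CONFIG_DEBUG_VM stub are simplified here.

    #ifdef CONFIG_DEBUG_VM
    #define VM_BUG_ON_PAGE(cond, page)                                  \
            do {                                                        \
                    if (unlikely(cond)) {                               \
                            dump_page(page); /* flags, mapping, counts */ \
                            BUG();                                      \
                    }                                                   \
            } while (0)
    #else
    #define VM_BUG_ON_PAGE(cond, page) do { } while (0)
    #endif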
     

22 Jan, 2014

2 commits

  • Cleanup. Change __get_page_tail_foll() to use get_huge_page_tail()
    to avoid the code duplication.

    Signed-off-by: Oleg Nesterov
    Cc: Thomas Gleixner
    Cc: Dave Jones
    Cc: Darren Hart
    Cc: Peter Zijlstra
    Cc: Mel Gorman
    Acked-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • This skips the _mapcount mangling for slab and hugetlbfs pages.

    The main trouble in doing this is to guarantee that PageSlab and
    PageHeadHuge remain constant for all get_page/put_page runs on the
    tail of slab or hugetlbfs compound pages. Otherwise, if they're set
    during get_page but not set during put_page, the _mapcount of the tail
    page would underflow.

    PageHeadHuge will remain true until the compound page is released and
    enters the buddy allocator, so there is no risk of it changing even if
    the tail page is the last reference left on the page.

    PG_slab instead is cleared before the slab frees the head page with
    put_page, so if the tail pin is released after the slab freed the
    page, we would have a problem. But in the slab case the tail pin
    cannot be the last reference left on the page. This is because the
    slab code is free to reuse the compound page after a
    kfree/kmem_cache_free without having to check if there's any tail pin
    left. In turn, all tail pins must always be released while the head is
    still pinned by the slab code, and so we know PG_slab will still be
    set too.

    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Khalid Aziz
    Cc: Pravin Shelar
    Cc: Greg Kroah-Hartman
    Cc: Ben Hutchings
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

12 Sep, 2013

1 commit

  • This patch is based on KOSAKI's work and I added a little more
    description; please refer to https://lkml.org/lkml/2012/6/14/74.

    I found that the system can enter a state where a zone has lots of
    free pages but only order-0 and order-1 pages, which means the zone is
    heavily fragmented. A high-order allocation can then cause a long
    stall in the direct reclaim path (e.g. 60 seconds), especially in a
    no-swap, no-compaction environment. This problem happened on v3.4, but
    the issue seems to still be present in the current tree. The reason is
    that do_try_to_free_pages() enters a livelock:

    kswapd will go to sleep if the zones have been fully scanned and are
    still not balanced, as kswapd thinks there's little point in trying all
    over again and wants to avoid an infinite loop. Instead it changes the
    order from high-order to 0-order, because kswapd considers order-0 the
    most important; see commit 73ce02e9 for details. If the watermarks are
    ok, kswapd will go back to sleep and may leave zone->all_unreclaimable
    = 0. It assumes high-order users can still perform direct reclaim if
    they wish.

    Direct reclaim continues to reclaim for a high order which is not a
    COSTLY_ORDER, without invoking the oom-killer, until kswapd turns on
    zone->all_unreclaimable. This is to avoid a too-early oom-kill. So it
    means direct reclaim depends on kswapd to break this loop.

    In the worst case, direct reclaim may continue page reclaim forever
    while kswapd sleeps forever, until someone like a watchdog detects it
    and finally kills the process. As described in:
    http://thread.gmane.org/gmane.linux.kernel.mm/103737

    We can't turn on zone->all_unreclaimable from the direct reclaim path
    because the direct reclaim path doesn't take any lock, so that would be
    racy. Thus this patch removes the zone->all_unreclaimable field
    completely and recalculates the zone's reclaimable state every time.

    Note: we can't adopt the idea of having direct reclaim look at
    zone->pages_scanned directly while kswapd continues to use
    zone->all_unreclaimable, because that is racy. Commit 929bea7c71
    ("vmscan: all_unreclaimable() use zone->all_unreclaimable as a name")
    describes the details.

    [akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()]
    Cc: Aaditya Kumar
    Cc: Ying Han
    Cc: Nick Piggin
    Acked-by: Rik van Riel
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Bob Liu
    Cc: Neil Zhang
    Cc: Russell King - ARM Linux
    Reviewed-by: Michal Hocko
    Acked-by: Minchan Kim
    Acked-by: Johannes Weiner
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Lisa Du
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lisa Du
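
    The replacement heuristic, roughly as described: a zone counts as
    reclaimable while the number of pages scanned stays below a multiple
    of the pages that could still be reclaimed. The factor of six and the
    exact counters are my reading of the patch; treat them as assumptions.

    static unsigned long zone_reclaimable_pages(struct zone *zone)
    {
            unsigned long nr;

            nr = zone_page_state(zone, NR_ACTIVE_FILE) +
                 zone_page_state(zone, NR_INACTIVE_FILE);

            if (get_nr_swap_pages() > 0)
                    nr += zone_page_state(zone, NR_ACTIVE_ANON) +
                          zone_page_state(zone, NR_INACTIVE_ANON);

            return nr;
    }

    static bool zone_reclaimable(struct zone *zone)
    {
            /* Replaces the zone->all_unreclaimable flag. */
            return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
    }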
     

10 Jul, 2013

1 commit


28 Feb, 2013

1 commit

  • munlock_vma_pages_range() was always incrementing addresses by PAGE_SIZE
    at a time. When munlocking THP pages (or the huge zero page), this
    resulted in taking the mm->page_table_lock 512 times in a row.

    We can do better by making use of the page_mask returned by
    follow_page_mask (for the huge zero page case), or the size of the page
    munlock_vma_page() operated on (for the true THP page case).

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

24 Feb, 2013

1 commit

  • In find_extend_vma(), we don't need mlock_vma_pages_range() to verify
    the vma type - we know we're working with a stack. So, we can call
    directly into __mlock_vma_pages_range(), and remove the last
    make_pages_present() call site.

    Note that we don't use mm_populate() here, so we can't release the
    mmap_sem while allocating new stack pages. This is deemed acceptable,
    because the stack vmas grow by a bounded number of pages at a time, and
    these are anon pages so we don't have to read from disk to populate
    them.

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Tested-by: Andy Lutomirski
    Cc: Greg Ungerer
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

12 Jan, 2013

1 commit

  • Eric Wong reported on 3.7 and 3.8-rc2 that ppoll() got stuck when
    waiting for POLLIN on a local TCP socket. It was easier to trigger if
    there was disk IO and dirty pages at the same time and he bisected it to
    commit 1fb3f8ca0e92 ("mm: compaction: capture a suitable high-order page
    immediately when it is made available").

    The intention of that patch was to improve high-order allocations under
    memory pressure after changes made to reclaim in 3.6 drastically hurt
    THP allocations but the approach was flawed. For Eric, the problem was
    that page->pfmemalloc was not being cleared for captured pages leading
    to a poor interaction with swap-over-NFS support causing the packets to
    be dropped. However, I identified a few more problems with the patch
    including the fact that it can increase contention on zone->lock in some
    cases which could result in async direct compaction being aborted early.

    In retrospect the capture patch took the wrong approach. What it should
    have done is mark the pageblock being migrated as MIGRATE_ISOLATE if it
    was allocating for THP and avoided races that way. While the patch was
    showing to improve allocation success rates at the time, the benefit is
    marginal given the relative complexity and it should be revisited from
    scratch in the context of the other reclaim-related changes that have
    taken place since the patch was first written and tested. This patch
    partially reverts commit 1fb3f8ca "mm: compaction: capture a suitable
    high-order page immediately when it is made available".

    Reported-and-tested-by: Eric Wong
    Tested-by: Eric Dumazet
    Cc: stable@vger.kernel.org
    Signed-off-by: Mel Gorman
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

17 Dec, 2012

1 commit

  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for these workloads driven by perf, which is
    bad but is potentially explained by the lack of scheduler smarts.
    Thomas' results show balancenuma improves on mainline but falls far
    short of numacore or autonuma. Srikar's results indicate we all suffer
    on a large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers, which is not reported by
    the tool by default and is sometimes missed in reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore batch
    handles PTEs but I no longer think this is the case. It's possible
    numacore is just able to trigger it due to higher rates of migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

12 Dec, 2012

1 commit

  • Several places need to find the pmd for a given (mm_struct, address)
    pair, so introduce a function to simplify this.

    [akpm@linux-foundation.org: fix warning]
    Signed-off-by: Bob Liu
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Ni zhan Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
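
    Roughly the shape of the new helper (a sketch of the then three-level
    page-table walk; the prototype is what lands in mm/internal.h, and the
    body below is approximate).

    pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
    {
            pgd_t *pgd;
            pud_t *pud;
            pmd_t *pmd = NULL;

            pgd = pgd_offset(mm, address);
            if (!pgd_present(*pgd))
                    goto out;

            pud = pud_offset(pgd, address);
            if (!pud_present(*pud))
                    goto out;

            pmd = pmd_offset(pud, address);
            if (!pmd_present(*pmd))
                    pmd = NULL;
    out:
            return pmd;
    }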
     

11 Dec, 2012

1 commit

  • Note: This is very heavily based on a patch from Peter Zijlstra with
    fixes from Ingo Molnar, Hugh Dickins and Johannes Weiner. That patch
    put a lot of migration logic into mm/huge_memory.c where it does
    not belong. This version tries to share some of the migration
    logic with migrate_misplaced_page. However, it should be noted
    that migrate.c is now doing more with the pagetable manipulation
    than is preferred. The end result is barely recognisable, so as
    before, the signed-offs had to be removed but will be re-added if
    the original authors are ok with it.

    Add THP migration for the NUMA working set scanning fault case.

    It uses the page lock to serialize. No migration pte dance is
    necessary because the pte is already unmapped when we decide
    to migrate.

    [dhillf@gmail.com: Fix memory leak on isolation failure]
    [dhillf@gmail.com: Fix transfer of last_nid information]
    Signed-off-by: Mel Gorman

    Mel Gorman
     

09 Oct, 2012

12 commits

  • NR_MLOCK is only accounted in single page units: there's no logic to
    handle transparent hugepages. This patch checks the appropriate number of
    pages to adjust the statistics by so that the correct amount of memory is
    reflected.

    Currently:

    $ grep Mlocked /proc/meminfo
    Mlocked: 19636 kB

    #define MAP_SIZE (4 << 30)  /* 4GB */

    void *ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
    mlock(ptr, MAP_SIZE);

    $ grep Mlocked /proc/meminfo
    Mlocked: 29844 kB

    munlock(ptr, MAP_SIZE);

    $ grep Mlocked /proc/meminfo
    Mlocked: 19636 kB

    And with this patch:

    $ grep Mlock /proc/meminfo
    Mlocked: 19636 kB

    mlock(ptr, MAP_SIZE);

    $ grep Mlock /proc/meminfo
    Mlocked: 4213664 kB

    munlock(ptr, MAP_SIZE);

    $ grep Mlock /proc/meminfo
    Mlocked: 19636 kB

    Signed-off-by: David Rientjes
    Reported-by: Hugh Dickens
    Acked-by: Hugh Dickins
    Reviewed-by: Andrea Arcangeli
    Cc: Naoya Horiguchi
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Reviewed-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
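
    The gist of the fix, sketched at one accounting site (illustrative,
    not the full patch): adjust NR_MLOCK by the number of base pages the
    page covers instead of always one.

    /* e.g. when mlocking a page: */
    if (!TestSetPageMlocked(page))
            mod_zone_page_state(page_zone(page), NR_MLOCK,
                                hpage_nr_pages(page)); /* 512 for a 2MB THP */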
     
  • Presently CMA cannot migrate mlocked pages so it ends up failing to allocate
    contiguous memory space.

    This patch makes mlocked pages be migrated out. Of course, it can affect
    realtime processes but in CMA usecase, contiguous memory allocation failing
    is far worse than access latency to an mlocked page being variable while
    CMA is running. If someone wants to make the system realtime, he shouldn't
    enable CMA because stalls can still happen at random times.

    [akpm@linux-foundation.org: tweak comment text, per Mel]
    Signed-off-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Marek Szyprowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • We had thought that pages could no longer get freed while still marked as
    mlocked; but Johannes Weiner posted this program to demonstrate that
    truncating an mlocked private file mapping containing COWed pages is still
    mishandled:

    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    int main(void)
    {
            char *map;
            int fd;

            system("grep mlockfreed /proc/vmstat");
            fd = open("chigurh", O_CREAT|O_EXCL|O_RDWR);
            unlink("chigurh");
            ftruncate(fd, 4096);
            map = mmap(NULL, 4096, PROT_WRITE, MAP_PRIVATE, fd, 0);
            map[0] = 11;
            mlock(map, sizeof(fd));
            ftruncate(fd, 0);
            close(fd);
            munlock(map, sizeof(fd));
            munmap(map, 4096);
            system("grep mlockfreed /proc/vmstat");
            return 0;
    }

    The anon COWed pages are not caught by truncation's clear_page_mlock() of
    the pagecache pages; but unmap_mapping_range() unmaps them, so we ought to
    look out for them there in page_remove_rmap(). Indeed, why should
    truncation or invalidation be doing the clear_page_mlock() when removing
    from pagecache? mlock is a property of mapping in userspace, not a
    property of pagecache: an mlocked unmapped page is nonsensical.

    Reported-by: Johannes Weiner
    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Michel Lespinasse
    Cc: Ying Han
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • page_evictable(page, vma) is an irritant: almost all its callers pass
    NULL for vma. Remove the vma arg and use mlocked_vma_newpage(vma, page)
    explicitly in the couple of places it's needed. But in those places we
    don't even need page_evictable() itself! They're dealing with a freshly
    allocated anonymous page, which has no "mapping" and cannot be mlocked yet.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Michel Lespinasse
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • This is almost entirely based on Rik's previous patches and discussions
    with him about how this might be implemented.

    Order > 0 compaction stops when enough free pages of the correct page
    order have been coalesced. When doing subsequent higher order
    allocations, it is possible for compaction to be invoked many times.

    However, the compaction code always starts out looking for things to
    compact at the start of the zone, and for free pages to compact things to
    at the end of the zone.

    This can cause quadratic behaviour, with isolate_freepages starting at the
    end of the zone each time, even though previous invocations of the
    compaction code already filled up all free memory on that end of the zone.
    This can cause isolate_freepages to take enormous amounts of CPU with
    certain workloads on larger memory systems.

    This patch caches where the migration and free scanner should start from
    on subsequent compaction invocations using the pageblock-skip information.
    When compaction starts it begins from the cached restart points and will
    update the cached restart points until a page is isolated or a pageblock
    is skipped that would have been scanned by synchronous compaction.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Richard Davies
    Cc: Shaohua Li
    Cc: Avi Kivity
    Acked-by: Rafael Aquini
    Cc: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When compaction was implemented it was known that scanning could
    potentially be excessive. The ideal was that a counter be maintained for
    each pageblock but maintaining this information would incur a severe
    penalty due to a shared writable cache line. It has reached the point
    where the scanning costs are a serious problem, particularly on
    long-lived systems where a large process starts and allocates a large
    number of THPs at the same time.

    Instead of using a shared counter, this patch adds another bit to the
    pageblock flags called PG_migrate_skip. If a pageblock is scanned by
    either migrate or free scanner and 0 pages were isolated, the pageblock is
    marked to be skipped in the future. When scanning, this bit is checked
    before any scanning takes place and the block skipped if set.

    The main difficulty with a patch like this is "when to ignore the cached
    information?" If it's ignored too often, the scanning rates will still be
    excessive. If the information is too stale then allocations that might
    otherwise have succeeded will fail. In this patch

    o CMA always ignores the information
    o If the migrate and free scanner meet then the cached information will
    be discarded if it's at least 5 seconds since the last time the cache
    was discarded
    o If there are a large number of allocation failures, discard the cache.

    The time-based heuristic is very clumsy but there are few choices for a
    better event. Depending solely on multiple allocation failures still
    allows excessive scanning when THP allocations are failing in quick
    succession due to memory pressure. Waiting until memory pressure is
    relieved would cause compaction to continually fail instead of using
    reclaim/compaction to try to allocate the page. The time-based mechanism is
    clumsy but a better option is not obvious.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Richard Davies
    Cc: Shaohua Li
    Cc: Avi Kivity
    Acked-by: Rafael Aquini
    Cc: Fengguang Wu
    Cc: Michal Nazarewicz
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Kyungmin Park
    Cc: Mark Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
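
    A sketch of how the new bit is consumed; the helper names follow the
    description above, and the exact call sites are simplified.

    /* Per-pageblock "nothing was isolated here last time" hint. */
    static inline bool isolation_suitable(struct compact_control *cc,
                                          struct page *page)
    {
            if (cc->ignore_skip_hint)   /* e.g. CMA: always scan */
                    return true;

            return !get_pageblock_skip(page);
    }

    /* After a full scan of the pageblock that isolated nothing: */
    if (nr_isolated == 0)
            set_pageblock_skip(page);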
     
  • This reverts commit 7db8889ab05b ("mm: have order > 0 compaction start
    off where it left") and commit de74f1cc ("mm: have order > 0 compaction
    start near a pageblock with free pages"). These patches were a good
    idea and tests confirmed that they massively reduced the amount of
    scanning but the implementation is complex and tricky to understand. A
    later patch will cache what pageblocks should be skipped and
    reimplements the concept of compact_cached_free_pfn on top for both
    migration and free scanners.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Richard Davies
    Cc: Shaohua Li
    Cc: Avi Kivity
    Acked-by: Rafael Aquini
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • isolate_migratepages_range() might isolate no pages, for example when
    zone->lru_lock is contended and compaction is running asynchronously.
    In this case we should abort compaction; otherwise compact_zone will
    run a useless loop and make zone->lru_lock even more contended.

    An additional check is added to ensure that cc.migratepages and
    cc.freepages get properly drained when compaction is aborted.

    [minchan@kernel.org: Putback pages isolated for migration if aborting]
    [akpm@linux-foundation.org: compact_zone_order requires non-NULL arg contended]
    [akpm@linux-foundation.org: make compact_zone_order() require non-NULL arg `contended']
    [minchan@kernel.org: Putback pages isolated for migration if aborting]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Shaohua Li
    Signed-off-by: Mel Gorman
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • * Add ALLOC_CMA alloc flag and pass it to [__]zone_watermark_ok()
    (from Minchan Kim).

    * During watermark check decrease available free pages number by
    free CMA pages number if necessary (unmovable allocations cannot
    use pages from CMA areas).

    Signed-off-by: Bartlomiej Zolnierkiewicz
    Signed-off-by: Kyungmin Park
    Cc: Marek Szyprowski
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bartlomiej Zolnierkiewicz
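
    The watermark-check half, sketched from the bullet points above: if
    the caller cannot use CMA pageblocks, free CMA pages must not count
    toward the watermark (approximate, not the verbatim patch).

    /* in __zone_watermark_ok(): */
    #ifdef CONFIG_CMA
            /* Unmovable allocations cannot be satisfied from CMA areas. */
            if (!(alloc_flags & ALLOC_CMA))
                    free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
    #endif
            if (free_pages <= min + z->lowmem_reserve[classzone_idx])
                    return false;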
     
  • Drop clean cache pages instead of migrating them during
    alloc_contig_range(), to minimise allocation latency by reducing the
    amount of migration that is necessary. This is useful for CMA because
    the latency of migration matters more than evicting the background
    process's working set. In addition, as pages are reclaimed, fewer free
    pages are required as migration targets, so it avoids reclaiming
    memory just to obtain free pages, which is a contributory factor to
    increased latency.

    I measured elapsed time of __alloc_contig_migrate_range() which migrates
    10M in 40M movable zone in QEMU machine.

    Before - 146ms, After - 7ms

    [akpm@linux-foundation.org: fix nommu build]
    Signed-off-by: Mel Gorman
    Signed-off-by: Minchan Kim
    Reviewed-by: Mel Gorman
    Cc: Marek Szyprowski
    Acked-by: Michal Nazarewicz
    Cc: Rik van Riel
    Tested-by: Kyungmin Park
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Make sure the #endif that terminates the standard #ifndef / #define /
    #endif construct gets labeled, and gets positioned at the end of the file
    as is normally the case.

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • While compaction is migrating pages to free up large contiguous blocks
    for allocation it races with other allocation requests that may steal
    these blocks or break them up. This patch alters direct compaction to
    capture a suitable free page as soon as it becomes available to reduce
    this race. It uses similar logic to split_free_page() to ensure that
    watermarks are still obeyed.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

22 Aug, 2012

1 commit

  • Jim Schutt reported a problem that pointed at compaction contending
    heavily on locks. The workload is straightforward and, in his own
    words:

    The systems in question have 24 SAS drives spread across 3 HBAs,
    running 24 Ceph OSD instances, one per drive. FWIW these servers
    are dual-socket Intel 5675 Xeons w/48 GB memory. I've got ~160
    Ceph Linux clients doing dd simultaneously to a Ceph file system
    backed by 12 of these servers.

    Early in the test everything looks fine

    procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    31 15 0 287216 576 38606628 0 0 2 1158 2 14 1 3 95 0 0
    27 15 0 225288 576 38583384 0 0 18 2222016 203357 134876 11 56 17 15 0
    28 17 0 219256 576 38544736 0 0 11 2305932 203141 146296 11 49 23 17 0
    6 18 0 215596 576 38552872 0 0 7 2363207 215264 166502 12 45 22 20 0
    22 18 0 226984 576 38596404 0 0 3 2445741 223114 179527 12 43 23 22 0

    and then it goes to pot

    procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    163 8 0 464308 576 36791368 0 0 11 22210 866 536 3 13 79 4 0
    207 14 0 917752 576 36181928 0 0 712 1345376 134598 47367 7 90 1 2 0
    123 12 0 685516 576 36296148 0 0 429 1386615 158494 60077 8 84 5 3 0
    123 12 0 598572 576 36333728 0 0 1107 1233281 147542 62351 7 84 5 4 0
    622 7 0 660768 576 36118264 0 0 557 1345548 151394 59353 7 85 4 3 0
    223 11 0 283960 576 36463868 0 0 46 1107160 121846 33006 6 93 1 1 0

    Note that system CPU usage is very high while the rate of blocks being
    written out has dropped by 42%. He analysed this with perf and found

    perf record -g -a sleep 10
    perf report --sort symbol --call-graph fractal,5
    34.63% [k] _raw_spin_lock_irqsave
    |
    |--97.30%-- isolate_freepages
    | compaction_alloc
    | unmap_and_move
    | migrate_pages
    | compact_zone
    | compact_zone_order
    | try_to_compact_pages
    | __alloc_pages_direct_compact
    | __alloc_pages_slowpath
    | __alloc_pages_nodemask
    | alloc_pages_vma
    | do_huge_pmd_anonymous_page
    | handle_mm_fault
    | do_page_fault
    | page_fault
    | |
    | |--87.39%-- skb_copy_datagram_iovec
    | | tcp_recvmsg
    | | inet_recvmsg
    | | sock_recvmsg
    | | sys_recvfrom
    | | system_call
    | | __recv
    | | |
    | | --100.00%-- (nil)
    | |
    | --12.61%-- memcpy
    --2.70%-- [...]

    There was other data but primarily it is all showing that compaction is
    contended heavily on the zone->lock and zone->lru_lock.

    commit [b2eef8c0: mm: compaction: minimise the time IRQs are disabled
    while isolating pages for migration] noted that it was possible for
    migration to hold the lru_lock for an excessive amount of time. Very
    broadly speaking this patch expands the concept.

    This patch introduces compact_checklock_irqsave() to check if a lock
    is contended or the process needs to be scheduled. If either condition
    is true then async compaction is aborted and the caller is informed.
    The page allocator will fail a THP allocation if compaction failed due
    to contention. This patch also introduces compact_trylock_irqsave()
    which will acquire the lock only if it is not contended and the process
    does not need to schedule.

    Reported-by: Jim Schutt
    Tested-by: Jim Schutt
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
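
    Approximately the logic of compact_checklock_irqsave() as described
    above (a sketch; at this point cc->sync was still a bool).

    static bool compact_checklock_irqsave(spinlock_t *lock,
                                          unsigned long *flags, bool locked,
                                          struct compact_control *cc)
    {
            if (need_resched() || spin_is_contended(lock)) {
                    if (locked) {
                            spin_unlock_irqrestore(lock, *flags);
                            locked = false;
                    }

                    /* Async compaction aborts rather than fight for it. */
                    if (!cc->sync) {
                            cc->contended = true;
                            return false;
                    }

                    cond_resched();
            }

            if (!locked)
                    spin_lock_irqsave(lock, *flags);

            return true;
    }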