07 Nov, 2015

4 commits

  • Let's try to be consistent about the data type of page order.

    [sfr@canb.auug.org.au: fix build (type of pageblock_order)]
    [hughd@google.com: some configs end up with MAX_ORDER and pageblock_order having different types]
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Reviewed-by: Andrea Arcangeli
    Cc: "Paul E. McKenney"
    Cc: Andi Kleen
    Cc: Aneesh Kumar K.V
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Sergey Senozhatsky
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Hugh has pointed out that a compound_head() call can be unsafe in some
    contexts. Here's one example:

    CPU0                                        CPU1

    isolate_migratepages_block()
      page_count()
        compound_head()
          !!PageTail() == true
                                                put_page()
                                                  tail->first_page = NULL
          head = tail->first_page
                                                alloc_pages(__GFP_COMP)
                                                  prep_compound_page()
                                                    tail->first_page = head
                                                    __SetPageTail(p);
          !!PageTail() == true

    The race is purely theoretical. I don't think it's possible to trigger it
    in practice. But who knows.

    We can fix the race by changing how PageTail() and compound_head() are
    encoded within struct page, so that both can be updated in one shot.

    The patch introduces page->compound_head into the third double word block
    in front of compound_dtor and compound_order. Bit 0 encodes PageTail() and
    the remaining bits are a pointer to the head page if bit zero is set.
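
    A minimal sketch of the encoding described above; the real helpers end up
    in include/linux/page-flags.h and mm_types.h, so treat exact placement and
    naming as an approximation:

      static inline void set_compound_head(struct page *tail, struct page *head)
      {
              /* bit 0 marks a tail page, the rest is the head page pointer */
              WRITE_ONCE(tail->compound_head, (unsigned long)head + 1);
      }

      static inline int PageTail(struct page *page)
      {
              return READ_ONCE(page->compound_head) & 1;
      }

      static inline struct page *compound_head(struct page *page)
      {
              unsigned long head = READ_ONCE(page->compound_head);

              /* the tail flag and the head pointer are read in one shot */
              if (unlikely(head & 1))
                      return (struct page *)(head - 1);
              return page;
      }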

    The patch moves page->pmd_huge_pte out of that word, just in case an
    architecture defines pgtable_t as something that can have bit 0 set.

    hugetlb_cgroup uses page->lru.next in the second tail page to store a
    pointer to struct hugetlb_cgroup. The patch switches it to use
    page->private in the second tail page instead. The space is free since
    ->first_page is removed from the union.

    The patch also opens the possibility of removing the
    HUGETLB_CGROUP_MIN_ORDER limitation, since there's now space in the first
    tail page to store a struct hugetlb_cgroup pointer. But that's out of
    scope for this patch.

    That means page->compound_head shares storage space with:

    - page->lru.next;
    - page->next;
    - page->rcu_head.next;

    That's too long a list to be absolutely sure, but it looks like nobody
    uses bit 0 of the word.

    page->rcu_head.next is guaranteed[1] to have bit 0 clear as long as we use
    call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But a future
    call_rcu_lazy() is not allowed, as it makes use of the bit and we could
    get a false positive PageTail().

    [1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.com

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Reviewed-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Acked-by: Paul E. McKenney
    Cc: Aneesh Kumar K.V
    Cc: Andi Kleen
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Andrew stated the following:

    We have quite a history of remote parts of the kernel using
    weird/wrong/inexplicable combinations of __GFP_ flags. I tend
    to think that this is because we didn't adequately explain the
    interface.

    And I don't think that gfp.h really improved much in this area as
    a result of this patchset. Could you go through it some time and
    decide if we've adequately documented all this stuff?

    This patch first moves some GFP flag combinations that are part of the MM
    internals to mm/internal.h. The rest of the patch documents the __GFP_FOO
    bits under various headings and then documents the flag combinations. It
    will not help callers that are brain damaged, but the clarity might
    motivate some fixes and avoid future mistakes.

    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Vitaly Wool
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • File-backed pages that will be immediately written are balanced between
    zones. This heuristic tries to avoid having a single zone filled with
    recently dirtied pages but the checks are unnecessarily expensive. Move
    consider_zone_balanced into the alloc_context instead of checking bitmaps
    multiple times. The patch also gives the parameter a more meaningful
    name.

    Signed-off-by: Mel Gorman
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Johannes Weiner
    Cc: Christoph Lameter
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

06 Nov, 2015

1 commit

  • Commit e6c509f85455 ("mm: use clear_page_mlock() in page_remove_rmap()")
    in v3.7 inadvertently made mlock_migrate_page() impotent: page migration
    unmaps the page from userspace before migrating, and that commit clears
    PageMlocked on the final unmap, leaving mlock_migrate_page() with
    nothing to do. Not a serious bug, the next attempt at reclaiming the
    page would fix it up; but a betrayal of page migration's intent - the
    new page ought to emerge as PageMlocked.

    I don't see how to fix it for mlock_migrate_page() itself; but easily
    fixed in remove_migration_pte(), by calling mlock_vma_page() when the vma
    is VM_LOCKED - under pte lock as in try_to_unmap_one().

    Delete mlock_migrate_page()? Not quite, it does still serve a purpose for
    migrate_misplaced_transhuge_page(): where we could replace it by a test,
    clear_page_mlock(), mlock_vma_page() sequence; but would that be an
    improvement? mlock_migrate_page() is fairly lean, and let's make it
    leaner by skipping the irq save/restore now clearly not needed.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Acked-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Sasha Levin
    Cc: Dmitry Vyukov
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

09 Sep, 2015

1 commit

  • We cache isolate_start_pfn before entering isolate_migratepages(). If a
    pageblock is skipped in isolate_migratepages() for whatever reason,
    cc->migrate_pfn can be far from isolate_start_pfn, hence we flush pages
    that were freed. For example, the following scenario is possible:

    - assume order-9 compaction, pageblock order is 9
    - start_isolate_pfn is 0x200
    - isolate_migratepages()
      - skip a number of pageblocks
      - start to isolate from pfn 0x600
      - cc->migrate_pfn = 0x620
      - return
    - last_migrated_pfn is set to 0x200
    - check flushing condition
      - current_block_start is set to 0x600
      - last_migrated_pfn < current_block_start then do useless flush

    This wrong flush does not help the performance or success rate, so this
    patch fixes it. One simple way to know the exact position where we start
    to isolate migratable pages is to cache it in isolate_migratepages()
    before entering the actual isolation. This patch implements that and
    fixes the problem.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: David Rientjes
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

05 Sep, 2015

2 commits

  • If a PTE is unmapped and it's dirty then it was writable recently. Due to
    deferred TLB flushing, it's best to assume a writable TLB cache entry
    exists. With that assumption, the TLB must be flushed before any IO can
    start or the page is freed to avoid lost writes or data corruption. This
    patch defers flushing of potentially writable TLBs as long as possible.
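
    A minimal sketch of the deferral, assuming the per-task batch structure
    from the batched-flush change below grows a writable flag (the helper
    names follow the description; treat the exact API as an assumption):

      /* Force the flush only when a potentially writable TLB entry may still
       * be cached, i.e. right before the page is handed to I/O or freed. */
      static void try_to_unmap_flush_dirty(struct tlbflush_unmap_batch *ubc)
      {
              if (ubc->writable)
                      try_to_unmap_flush(ubc);
      }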

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Dave Hansen
    Acked-by: Ingo Molnar
    Cc: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • An IPI is sent to flush remote TLBs when a page is unmapped that was
    potentially accessed by other CPUs. There are many circumstances where
    this happens but the obvious one is kswapd reclaiming pages belonging to a
    running process as kswapd and the task are likely running on separate
    CPUs.

    On small machines, this is not a significant problem but as machines get
    larger with more cores and more memory, the cost of these IPIs can be
    high. This patch uses a simple structure that tracks CPUs that
    potentially have TLB entries for pages being unmapped. When the unmapping
    is complete, the full TLB is flushed on the assumption that a refill cost
    is lower than flushing individual entries.
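
    A minimal sketch of the structure and flow described above; the real code
    keeps this per task and adds accounting and tracing, so treat the exact
    layout and signatures as assumptions:

      struct tlbflush_unmap_batch {
              struct cpumask cpumask; /* CPUs that may cache a stale TLB entry */
              bool flush_required;
              bool writable;          /* used by the dirty-deferral change above */
      };

      /* Called while unmapping: remember which CPUs the mm recently ran on. */
      static void set_tlb_ubc_flush_pending(struct tlbflush_unmap_batch *ubc,
                                            struct mm_struct *mm, bool writable)
      {
              cpumask_or(&ubc->cpumask, &ubc->cpumask, mm_cpumask(mm));
              ubc->flush_required = true;
              ubc->writable |= writable;
      }

      /* Called once the batch of unmaps is complete: one full flush per CPU
       * instead of one IPI per unmapped page. */
      static void try_to_unmap_flush(struct tlbflush_unmap_batch *ubc)
      {
              if (!ubc->flush_required)
                      return;
              /* x86 helper; other architectures supply their own equivalent */
              flush_tlb_others(&ubc->cpumask, NULL, 0, TLB_FLUSH_ALL);
              cpumask_clear(&ubc->cpumask);
              ubc->flush_required = false;
              ubc->writable = false;
      }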

    Architectures wishing to do this must give the following guarantee.

    If a clean page is unmapped and not immediately flushed, the
    architecture must guarantee that a write to that linear address
    from a CPU with a cached TLB entry will trap a page fault.

    This is essentially what the kernel already depends on but the window is
    much larger with this patch applied and is worth highlighting. The
    architecture should consider whether the cost of the full TLB flush is
    higher than sending an IPI to flush each individual entry. An additional
    architecture helper called flush_tlb_local is required. It's a trivial
    wrapper with some accounting in the x86 case.

    The impact of this patch depends on the workload as measuring any benefit
    requires both mapped pages co-located on the LRU and memory pressure. The
    case with the biggest impact is multiple processes reading mapped pages
    taken from the vm-scalability test suite. The test case uses NR_CPU
    readers of mapped files that consume 10*RAM.

    Linear mapped reader on a 4-node machine with 64G RAM and 48 CPUs

                                            4.2.0-rc1          4.2.0-rc1
                                              vanilla       flushfull-v7
    Ops lru-file-mmap-read-elapsed       159.62 (  0.00%)   120.68 ( 24.40%)
    Ops lru-file-mmap-read-time_range     30.59 (  0.00%)     2.80 ( 90.85%)
    Ops lru-file-mmap-read-time_stddv      6.70 (  0.00%)     0.64 ( 90.38%)

               4.2.0-rc1    4.2.0-rc1
                 vanilla flushfull-v7
    User          581.00       611.43
    System       5804.93      4111.76
    Elapsed       161.03       122.12

    This shows that the readers completed 24.40% faster with 29% less system
    CPU time. From vmstats, it is known that the vanilla kernel was
    interrupted roughly 900K times per second during the steady phase of the
    test while the patched kernel was interrupted roughly 180K times per
    second.

    The impact is lower on a single socket machine.

                                            4.2.0-rc1          4.2.0-rc1
                                              vanilla       flushfull-v7
    Ops lru-file-mmap-read-elapsed        25.33 (  0.00%)    20.38 ( 19.54%)
    Ops lru-file-mmap-read-time_range      0.91 (  0.00%)     1.44 (-58.24%)
    Ops lru-file-mmap-read-time_stddv      0.28 (  0.00%)     0.47 (-65.34%)

               4.2.0-rc1    4.2.0-rc1
                 vanilla flushfull-v7
    User           58.09        57.64
    System        111.82        76.56
    Elapsed        27.29        22.55

    It's still a noticeable improvement, with vmstat showing that interrupts
    went from roughly 500K per second to 45K per second.

    The patch will have no impact on workloads with no memory pressure or that
    have relatively few mapped pages. It will have an unpredictable impact on
    the workload running on the CPU being flushed, as it'll depend on how many
    TLB entries need to be refilled and how long that takes. In the worst
    case, the TLB will be completely cleared of active entries when the target
    PFNs were not resident at all.

    [sasha.levin@oracle.com: trace tlb flush after disabling preemption in try_to_unmap_flush]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Dave Hansen
    Acked-by: Ingo Molnar
    Cc: Linus Torvalds
    Signed-off-by: Sasha Levin
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

01 Jul, 2015

5 commits

  • Waiman Long reported that 24TB machines hit OOM during basic setup when
    struct page initialisation was deferred. One approach is to initialise
    memory on demand but it interferes with page allocator paths. This patch
    creates dedicated threads to initialise memory before basic setup. It
    then blocks on a rw_semaphore until completion, as a wait_queue and
    counter would be overkill. This may be slower to boot but it's simpler
    overall and also gets rid of a section mangling which existed so kswapd
    could do the initialisation.
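
    A sketch of the synchronisation described above, assuming one kthread per
    node and an rwsem used purely as a completion (names are close to, but
    not necessarily identical to, the real ones):

      static DECLARE_RWSEM(pgdat_init_rwsem);

      static int __init deferred_init_memmap(void *data)
      {
              pg_data_t *pgdat = data;

              down_read(&pgdat_init_rwsem);
              /* ... initialise the remaining struct pages of this node ... */
              up_read(&pgdat_init_rwsem);
              return 0;
      }

      void __init page_alloc_init_late(void)
      {
              int nid;

              for_each_node_state(nid, N_MEMORY)
                      kthread_run(deferred_init_memmap, NODE_DATA(nid),
                                  "pgdatinit%d", nid);

              /* Block until every initialisation thread has finished. */
              down_write(&pgdat_init_rwsem);
              up_write(&pgdat_init_rwsem);
      }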

    [akpm@linux-foundation.org: include rwsem.h, use DECLARE_RWSEM, fix comment, remove unneeded cast]
    Signed-off-by: Mel Gorman
    Cc: Waiman Long
    Cc: Dave Hansen
    Cc: Scott Norton
    Tested-by: Daniel J Blueman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • mminit_verify_page_links() is an extremely paranoid check that was
    introduced when memory initialisation was being heavily reworked.
    Profiles indicated that up to 10% of parallel memory initialisation was
    spent on checking this for every page. The cost could be reduced but in
    practice this check only found problems very early during the
    initialisation rewrite and has found nothing since. This patch removes an
    expensive unnecessary check.

    Signed-off-by: Mel Gorman
    Tested-by: Nate Zimmer
    Tested-by: Waiman Long
    Tested-by: Daniel J Blueman
    Acked-by: Pekka Enberg
    Cc: Robin Holt
    Cc: Nate Zimmer
    Cc: Dave Hansen
    Cc: Waiman Long
    Cc: Scott Norton
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Only a subset of struct pages are initialised at the moment. When this
    patch is applied, kswapd initialises the remaining struct pages in
    parallel.

    This should boot faster by spreading the work to multiple CPUs and
    initialising data that is local to the CPU. The user-visible effect on
    large machines is that free memory will appear to rapidly increase early
    in the lifetime of the system until kswapd reports that all memory is
    initialised in the kernel log. Once initialised there should be no other
    user-visible effects.

    Signed-off-by: Mel Gorman
    Tested-by: Nate Zimmer
    Tested-by: Waiman Long
    Tested-by: Daniel J Blueman
    Acked-by: Pekka Enberg
    Cc: Robin Holt
    Cc: Nate Zimmer
    Cc: Dave Hansen
    Cc: Waiman Long
    Cc: Scott Norton
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patch initialises all low memory struct pages and 2G of the highest
    zone on each node during memory initialisation if
    CONFIG_DEFERRED_STRUCT_PAGE_INIT is set. That config option cannot be set
    yet but will be available in a later patch. Parallel initialisation of
    struct page depends on some features from memory hotplug and it is
    necessary to alter section annotations.

    Signed-off-by: Mel Gorman
    Tested-by: Nate Zimmer
    Tested-by: Waiman Long
    Tested-by: Daniel J Blueman
    Acked-by: Pekka Enberg
    Cc: Robin Holt
    Cc: Nate Zimmer
    Cc: Dave Hansen
    Cc: Waiman Long
    Cc: Scott Norton
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • __free_pages_bootmem prepares a page for release to the buddy allocator
    and assumes that the struct page is initialised. Parallel initialisation
    of struct pages defers initialisation and __free_pages_bootmem can be
    called for struct pages that cannot yet map struct page to PFN. This
    patch passes PFN to __free_pages_bootmem with no other functional change.

    Signed-off-by: Mel Gorman
    Tested-by: Nate Zimmer
    Tested-by: Waiman Long
    Tested-by: Daniel J Blueman
    Acked-by: Pekka Enberg
    Cc: Robin Holt
    Cc: Nate Zimmer
    Cc: Dave Hansen
    Cc: Waiman Long
    Cc: Scott Norton
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

16 Apr, 2015

1 commit

  • We converted some of the usages of ACCESS_ONCE to READ_ONCE in the mm/
    tree since it doesn't work reliably on non-scalar types.

    This patch removes the rest of the usages of ACCESS_ONCE and uses the new
    READ_ONCE API for the read accesses. This makes things cleaner, instead
    of using separate/multiple sets of APIs.

    Signed-off-by: Jason Low
    Acked-by: Michal Hocko
    Acked-by: Davidlohr Bueso
    Acked-by: Rik van Riel
    Reviewed-by: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Low
     

15 Apr, 2015

2 commits

  • Compaction has an anti-fragmentation algorithm: if no freepage is found
    in the requested migratetype's buddy list, a freepage of at least
    pageblock order must exist for compaction to finish. This is for
    mitigating fragmentation, but there is a lack of migratetype
    consideration and it is too strict compared to the page allocator's
    anti-fragmentation algorithm.

    Not considering migratetype can cause a premature finish of compaction.
    For example, if the allocation request is for the unmovable migratetype,
    a freepage with CMA migratetype doesn't help that allocation, and
    compaction should not be stopped. But the current logic regards this
    situation as compaction no longer being needed, so it finishes the
    compaction.

    Secondly, the condition is too strict compared to the page allocator's
    logic. We can steal freepages from other migratetypes and change
    pageblock migratetype under more relaxed conditions in the page
    allocator. This is designed to prevent fragmentation and we can use it
    here. Imposing a hard constraint only on compaction doesn't help much in
    this case since the page allocator would cause fragmentation again.

    To solve these problems, this patch borrows the anti-fragmentation logic
    from the page allocator. It will reduce premature compaction finishes in
    some cases and reduce excessive compaction work.

    The stress-highalloc test in mmtests with non-movable order-7 allocations
    shows a considerable increase in compaction success rate.

    Compaction success rate (Compaction success * 100 / Compaction stalls, %)
    31.82 : 42.20

    I tested it with five non-reboot runs of the stress-highalloc benchmark
    and found no further degradation in allocation success rate compared to
    before. That roughly means this patch doesn't result in more
    fragmentation.

    Vlastimil suggested an additional idea: only test for fallbacks when the
    migration scanner has scanned a whole pageblock. It looked good for
    fragmentation because the chance of stealing increases by making more
    free pages in a certain pageblock. So I tested it, but it results in a
    decreased compaction success rate, roughly 38.00. I guess the reason is
    that if the system is in a low-memory condition, the watermark check can
    fail due to not enough order-0 free pages and so, sometimes, we can't
    reach the fallback check even though migrate_pfn is aligned to
    pageblock_nr_pages. I could insert code to cope with this situation but
    it makes the code more complicated, so I don't include his idea in this
    patch.

    [akpm@linux-foundation.org: fix CONFIG_CMA=n build]
    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Rik van Riel
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • __mlock_vma_pages_range() doesn't necessarily mlock pages. It depends on
    vma flags. The same codepath is used for MAP_POPULATE.

    Let's rename __mlock_vma_pages_range() to populate_vma_page_range().

    This patch also drops mlock_vma_pages_range() references from
    documentation. It has gone in cea10a19b797 ("mm: directly use
    __mlock_vma_pages_range() in find_extend_vma()").

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Linus Torvalds
    Acked-by: David Rientjes
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

13 Feb, 2015

1 commit

  • All users of mminit_dprintk pass a compile-time constant as level, so this
    just makes gcc emit a single printk call instead of two.

    Signed-off-by: Rasmus Villemoes
    Cc: Vlastimil Babka
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Vishnu Pratap Singh
    Cc: Pintu Kumar
    Cc: Michal Nazarewicz
    Cc: Mel Gorman
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Tim Chen
    Cc: Hugh Dickins
    Cc: Li Zefan
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     

12 Feb, 2015

1 commit

  • Expand the usage of the struct alloc_context introduced in the previous
    patch to also cover calling try_to_compact_pages(), to reduce the number
    of its parameters. Since the function is in a different compilation unit,
    we need to move the alloc_context definition into the shared
    mm/internal.h header.
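
    A sketch of the context structure being shared (the exact set of fields
    at the time may differ slightly):

      /* mm/internal.h */
      struct alloc_context {
              struct zonelist *zonelist;
              nodemask_t *nodemask;
              struct zone *preferred_zone;
              int classzone_idx;
              int migratetype;
      };

    try_to_compact_pages() can then take a pointer to this context instead of
    several of its former parameters.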

    With this change we get simpler code and small savings of code size and stack
    usage:

    add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-27 (-27)
    function                         old     new   delta
    __alloc_pages_direct_compact     283     256     -27

    add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-13 (-13)
    function                         old     new   delta
    try_to_compact_pages             582     569     -13

    Stack usage of __alloc_pages_direct_compact goes from 24 to none (per
    scripts/checkstack.pl).

    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Zhang Yanfei
    Cc: Minchan Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Cc: "Kirill A. Shutemov"
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

11 Dec, 2014

2 commits

  • Compaction caches the migration and free scanner positions between
    compaction invocations, so that the whole zone gets eventually scanned and
    there is no bias towards the initial scanner positions at the
    beginning/end of the zone.

    The cached positions are continuously updated as scanners progress and the
    updating stops as soon as a page is successfully isolated. The reasoning
    behind this is that a pageblock where isolation succeeded is likely to
    succeed again in near future and it should be worth revisiting it.

    However, the downside is that potentially many pages are rescanned without
    successful isolation. At worst, there might be a page where isolation
    from LRU succeeds but migration fails (potentially always). So upon
    encountering this page, cached position would always stop being updated
    for no good reason. It might have been useful to let such page be
    rescanned with sync compaction after async one failed, but this is now
    handled by caching scanner position for async and sync mode separately
    since commit 35979ef33931 ("mm, compaction: add per-zone migration pfn
    cache for async compaction").

    After this patch, cached positions are updated unconditionally. In the
    stress-highalloc benchmark, this has decreased the number of scanned
    pages by a few percent, without affecting allocation success rates.

    To prevent free scanner from leaving free pages behind after they are
    returned due to page migration failure, the cached scanner pfn is changed
    to point to the pageblock of the returned free page with the highest pfn,
    before leaving compact_zone().

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Acked-by: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Compaction relies on zone watermark checks for decisions such as if it's
    worth to start compacting in compaction_suitable() or whether compaction
    should stop in compact_finished(). The watermark checks take
    classzone_idx and alloc_flags parameters, which are related to the memory
    allocation request. But from the context of compaction they are currently
    passed as 0, including the direct compaction which is invoked to satisfy
    the allocation request, and could therefore know the proper values.

    The lack of proper values can lead to mismatch between decisions taken
    during compaction and decisions related to the allocation request. Lack
    of proper classzone_idx value means that lowmem_reserve is not taken into
    account. This has manifested (during recent changes to deferred
    compaction) when DMA zone was used as fallback for preferred Normal zone.
    compaction_suitable() without proper classzone_idx would think that the
    watermarks are already satisfied, but watermark check in
    get_page_from_freelist() would fail. Because of this problem, deferring
    compaction has extra complexity that can be removed in the following
    patch.

    The issue (not confirmed in practice) with missing alloc_flags is opposite
    in nature. For allocations that include ALLOC_HIGH, ALLOC_HIGHER or
    ALLOC_CMA in alloc_flags (the last includes all MOVABLE allocations on
    CMA-enabled systems) the watermark checking in compaction with 0 passed
    will be stricter than in get_page_from_freelist(). In these cases
    compaction might be running for a longer time than is really needed.

    Another issue with compaction_suitable() is that the check for "does the
    zone need compaction at all?" comes only after the check "does the zone
    have enough free pages to succeed compaction?". The latter considers
    extra pages for migration and can therefore in some situations fail and
    return COMPACT_SKIPPED, although the high-order allocation would succeed
    and we should return COMPACT_PARTIAL.

    This patch fixes these problems by adding alloc_flags and classzone_idx to
    struct compact_control and related functions involved in direct compaction
    and watermark checking. Where possible, all other callers of
    compaction_suitable() pass proper values where those are known. This is
    currently limited to classzone_idx, which is sometimes known in kswapd
    context. However, the direct reclaim callers should_continue_reclaim()
    and compaction_ready() do not currently know the proper values, so the
    coordination between reclaim and compaction may still not be as accurate
    as it could be. This can be fixed later, if it's shown to be an issue.

    Additionally, the checks in compact_suitable() are reordered to address
    the second issue described above.

    The effect of this patch should be slightly better high-order allocation
    success rates and/or less compaction overhead, depending on the type of
    allocations and presence of CMA. It allows simplifying deferred
    compaction code in a followup patch.

    When testing with stress-highalloc, there was some slight improvement
    (which might be just due to variance) in success rates of non-THP-like
    allocations.

    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Acked-by: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

14 Nov, 2014

1 commit

  • The current pageblock isolation logic can isolate each pageblock
    individually. This causes a freepage accounting problem if a freepage of
    pageblock order on an isolated pageblock is merged with another freepage
    on a normal pageblock. We can prevent merging by restricting the max
    order of merging to pageblock order if the freepage is on an isolated
    pageblock.

    A side effect of this change is that there can be non-merged buddy
    freepages even after finishing pageblock isolation, because undoing
    pageblock isolation just moves freepages from the isolate buddy list to
    the normal buddy list rather than considering merging. So the patch also
    makes undoing pageblock isolation consider freepage merging. When
    un-isolating, freepages of more than pageblock order, and their buddies,
    are checked. If they are on normal pageblocks, instead of just moving
    them, we isolate the freepages and free them in order to get them merged.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Yasuaki Ishimatsu
    Cc: Zhang Yanfei
    Cc: Tang Chen
    Cc: Naoya Horiguchi
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Wen Congyang
    Cc: Marek Szyprowski
    Cc: Michal Nazarewicz
    Cc: Laura Abbott
    Cc: Heesub Shin
    Cc: "Aneesh Kumar K.V"
    Cc: Ritesh Harjani
    Cc: Gioh Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

10 Oct, 2014

4 commits

  • struct compact_control currently converts the gfp mask to a migratetype,
    but we need the entire gfp mask in a follow-up patch.

    Pass the entire gfp mask as part of struct compact_control.

    Signed-off-by: David Rientjes
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Zhang Yanfei
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The migration scanner skips PageBuddy pages, but does not consider their
    order as checking page_order() is generally unsafe without holding the
    zone->lock, and acquiring the lock just for the check wouldn't be a good
    tradeoff.

    Still, this could avoid some iterations over the rest of the buddy page,
    and if we are careful, the race window between the PageBuddy() check and
    page_order() is small, and the worst thing that can happen is that we
    skip too much and miss some isolation candidates. This is not that bad,
    as compaction can already fail for many other reasons like parallel
    allocations, and those have a much larger race window.

    This patch therefore makes the migration scanner obtain the buddy page
    order and use it to skip the whole buddy page, if the order appears to be
    in the valid range.

    It's important that page_order() is read only once, so that the value
    used in the checks and in the pfn calculation is the same. But in theory
    the compiler can replace the local variable with multiple inlines of
    page_order(). Therefore, the patch introduces page_order_unsafe(), which
    uses ACCESS_ONCE to prevent this.
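
    A sketch of the helper the text describes (the real one sits in
    mm/internal.h next to page_order()):

      /*
       * Like page_order(), but for callers that cannot hold zone->lock. The
       * single volatile access keeps the compiler from re-reading the field,
       * so the value used for validation and for advancing the pfn is the
       * same; the result is still only a hint and must be range-checked.
       */
      static inline unsigned int page_order_unsafe(struct page *page)
      {
              return ACCESS_ONCE(page_private(page));
      }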

    Testing with stress-highalloc from mmtests shows a 15% reduction in the
    number of pages scanned by the migration scanner. The reduction is >60%
    with __GFP_NO_KSWAPD allocations, along with success rates better by a
    few percent.

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Zhang Yanfei
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Async compaction aborts when it detects zone lock contention or
    need_resched() is true. David Rientjes has reported that in practice,
    most direct async compactions for THP allocation abort due to
    need_resched(). This means that a second direct compaction is never
    attempted, which might be OK for a page fault, but khugepaged is intended
    to attempt a sync compaction in such a case, and in these cases it won't.

    This patch replaces "bool contended" in compact_control with an int that
    distinguishes between aborting due to need_resched() and aborting due to
    lock contention. This allows propagating the abort through all compaction
    functions as before, but passing the abort reason up to
    __alloc_pages_slowpath() which decides when to continue with direct
    reclaim and another compaction attempt.
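
    A sketch of the tri-state that replaces the bool (the constant names
    match those used below; the exact definition may differ):

      /* mm/internal.h */
      enum compact_contended {
              COMPACT_CONTENDED_NONE = 0,  /* no contention detected */
              COMPACT_CONTENDED_SCHED,     /* need_resched() or fatal signal */
              COMPACT_CONTENDED_LOCK,      /* zone lock or lru_lock contended */
      };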

    Another problem is that try_to_compact_pages() did not act upon the
    reported contention (either need_resched() or lock contention)
    immediately and would proceed with another zone from the zonelist. When
    need_resched() is true, that means initializing another zone compaction,
    only to check need_resched() again in isolate_migratepages() and abort.
    For zone lock contention, the unintended consequence is that the lock
    contended status reported back to the allocator is determined by the last
    zone where compaction was attempted, which is rather arbitrary.

    This patch fixes the problem in the following way:
    - async compaction of a zone aborting due to need_resched() or a fatal
      signal pending means that further zones should not be tried. We report
      COMPACT_CONTENDED_SCHED to the allocator.
    - aborting zone compaction due to lock contention means we can still try
      another zone, since it has a different set of locks. We report back
      COMPACT_CONTENDED_LOCK only if compaction was aborted due to lock
      contention in *all* zones where it was attempted.

    As a result of these fixes, khugepaged will proceed with second sync
    compaction as intended, when the preceding async compaction aborted due to
    need_resched(). Page fault compactions aborting due to need_resched()
    will spare some cycles previously wasted by initializing another zone
    compaction only to abort again. Lock contention will be reported only
    when compaction in all zones aborted due to lock contention, and therefore
    it's not a good idea to try again after reclaim.

    In stress-highalloc from mmtests configured to use __GFP_NO_KSWAPD, this
    has improved number of THP collapse allocations by 10%, which shows
    positive effect on khugepaged. The benchmark's success rates are
    unchanged as it is not recognized as khugepaged. Numbers of compact_stall
    and compact_fail events have however decreased by 20%, with
    compact_success still a bit improved, which is good. With benchmark
    configured not to use __GFP_NO_KSWAPD, there is 6% improvement in THP
    collapse allocations, and only slight improvement in stalls and failures.

    [akpm@linux-foundation.org: fix warnings]
    Reported-by: David Rientjes
    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • isolate_migratepages_range() is the main function of the compaction
    scanner, called either on a single pageblock by isolate_migratepages()
    during regular compaction, or on an arbitrary range by CMA's
    __alloc_contig_migrate_range(). It currently performs two pageblock-wide
    compaction suitability checks, and because of the CMA callpath, it tracks
    whether it crossed a pageblock boundary in order to repeat those checks.

    However, closer inspection shows that those checks are always true for CMA:
    - isolation_suitable() is true because CMA sets cc->ignore_skip_hint to true
    - migrate_async_suitable() check is skipped because CMA uses sync compaction

    We can therefore move the compaction-specific checks to
    isolate_migratepages() and simplify isolate_migratepages_range().
    Furthermore, we can mimic the freepage scanner family of functions, which
    has isolate_freepages_block() function called both by compaction from
    isolate_freepages() and by CMA from isolate_freepages_range(), where each
    use-case adds own specific glue code. This allows further code
    simplification.

    Thus, we rename isolate_migratepages_range() to
    isolate_migratepages_block() and limit its functionality to a single
    pageblock (or its subset). For CMA, a new different
    isolate_migratepages_range() is created as a CMA-specific wrapper for the
    _block() function. The checks specific to compaction are moved to
    isolate_migratepages(). As part of the unification of these two families
    of functions, we remove the redundant zone parameter where applicable,
    since zone pointer is already passed in cc->zone.

    Furthermore, going back to compact_zone() and compact_finished() when a
    pageblock is found unsuitable (now by isolate_migratepages()) is wasteful
    - the checks are meant to skip pageblocks quickly. The patch therefore
    also introduces a simple loop into isolate_migratepages() so that it does
    not return immediately on failed pageblock checks, but keeps going until
    isolate_migratepages_block() gets called once. Similarly to
    isolate_freepages(), the function periodically checks if it needs to
    reschedule or abort async compaction.

    [iamjoonsoo.kim@lge.com: fix isolated page counting bug in compaction]
    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

05 Jun, 2014

5 commits

  • Compaction uses the compact_checklock_irqsave() function to periodically
    check for lock contention and need_resched() and to either abort async
    compaction, or to free the lock, schedule and retake the lock. When
    aborting, cc->contended is set to signal the contended state to the
    caller. Two problems have been identified in this mechanism.

    First, compaction also calls cond_resched() directly in both scanners
    when no lock is yet taken. This call neither aborts async compaction nor
    sets cc->contended appropriately. This patch introduces a new
    compact_should_abort() function to achieve both. In isolate_freepages(),
    the check frequency is reduced to once per SWAP_CLUSTER_MAX pageblocks to
    match what the migration scanner does in the preliminary page checks. In
    case a pageblock is found suitable for calling isolate_freepages_block(),
    the checks within there are done at a higher frequency.

    Second, isolate_freepages() does not check if isolate_freepages_block()
    aborted due to contention, and advances to the next pageblock. This
    violates the principle of aborting on contention, and might result in
    pageblocks not being scanned completely, since the scanning cursor is
    advanced. This problem has been noticed in the code by Joonsoo Kim when
    reviewing related patches. This patch makes isolate_freepages_block()
    check the cc->contended flag and abort.

    In case isolate_freepages() has already isolated some pages before
    aborting due to contention, page migration will proceed, which is OK
    since we do not want to waste the work that has been done, and page
    migration has its own checks for contention. However, we do not want
    another isolation attempt by either of the scanners, so the cc->contended
    flag check is added also to compaction_alloc() and compact_finished() to
    make sure compaction is aborted right after the migration.

    The outcome of the patch should be reduced lock contention by async
    compaction and lower latencies for higher-order allocations where direct
    compaction is involved.

    [akpm@linux-foundation.org: fix typo in comment]
    Reported-by: Joonsoo Kim
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Naoya Horiguchi
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Michal Nazarewicz
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Acked-by: Michal Nazarewicz
    Tested-by: Shawn Guo
    Tested-by: Kevin Hilman
    Tested-by: Stephen Warren
    Tested-by: Fabio Estevam
    Cc: David Rientjes
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • In a previous commit ("mm: use the light version __mod_zone_page_state in
    mlocked_vma_newpage()") an irq-unsafe __mod_zone_page_state is used. And
    as suggested by Andrew, to reduce the risk that new call sites use
    mlocked_vma_newpage() incorrectly without knowing they are adding races,
    this patch folds mlocked_vma_newpage() into its only call site,
    page_add_new_anon_rmap, making it open-coded so people know what is going
    on.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Jianyu Zhan
    Suggested-by: Andrew Morton
    Suggested-by: Hugh Dickins
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan
     
  • mlocked_vma_newpage() is called with the pte lock held (a spinlock),
    which implies preemption is disabled, and the vm stat counter is not
    modified from interrupt context, so we need not use an irq-safe
    mod_zone_page_state() here; using the light-weight version
    __mod_zone_page_state() is OK.
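
    For reference, the irq-safe wrapper is essentially the light version with
    interrupts disabled around it (a sketch, not the exact mm/vmstat.c code):

      void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
                               long delta)
      {
              unsigned long flags;

              local_irq_save(flags);
              __mod_zone_page_state(zone, item, delta);  /* the light version */
              local_irq_restore(flags);
      }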

    This patch also documents __mod_zone_page_state() and some of its
    callsites. The comment above __mod_zone_page_state() is from Hugh
    Dickins, and acked by Christoph.

    Most credits to Hugh and Christoph for the clarification on the usage of
    the __mod_zone_page_state().

    [akpm@linux-foundation.org: coding-style fixes]
    Suggested-by: Andrew Morton
    Acked-by: Hugh Dickins
    Signed-off-by: Jianyu Zhan
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan
     
  • We're going to want to manipulate the migration mode for compaction in the
    page allocator, and currently compact_control's sync field is only a bool.

    Currently, we only do MIGRATE_ASYNC or MIGRATE_SYNC_LIGHT compaction
    depending on the value of this bool. Convert the bool to enum
    migrate_mode and pass the migration mode in directly. Later, we'll want
    to avoid MIGRATE_SYNC_LIGHT for thp allocations in the pagefault patch to
    avoid unnecessary latency.
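
    For context, a sketch of the enum that replaces the bool (the real
    definition lives in include/linux/migrate_mode.h):

      enum migrate_mode {
              MIGRATE_ASYNC,          /* never block */
              MIGRATE_SYNC_LIGHT,     /* may block, but avoid the worst stalls */
              MIGRATE_SYNC,           /* block as needed to migrate pages */
      };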

    This also alters compaction triggered from sysfs, either for the entire
    system or for a node, to force MIGRATE_SYNC.

    [akpm@linux-foundation.org: fix build]
    [iamjoonsoo.kim@lge.com: use MIGRATE_SYNC in alloc_contig_range()]
    Signed-off-by: David Rientjes
    Suggested-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Greg Thelen
    Cc: Naoya Horiguchi
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • mm/memory.c is overloaded: over 4k lines. The get_user_pages() code is
    pretty much self-contained; let's move it to a separate file.

    No other changes made.

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

08 Apr, 2014

2 commits

  • Commit f9acc8c7b35a ("readahead: sanify file_ra_state names") left
    ra_submit with a single function call.

    Move ra_submit to internal.h and inline it to save some stack. Thanks
    to Andrew Morton for commenting on different versions.
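
    The resulting helper is roughly the following (a sketch of the inline
    moved into mm/internal.h):

      static inline unsigned long ra_submit(struct file_ra_state *ra,
                                            struct address_space *mapping,
                                            struct file *filp)
      {
              return __do_page_cache_readahead(mapping, filp,
                                               ra->start, ra->size,
                                               ra->async_size);
      }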

    Signed-off-by: Fabian Frederick
    Suggested-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • On NUMA systems, a node may start thrashing cache or even swap anonymous
    pages while there are still free pages on remote nodes.

    This is a result of commits 81c0a2bb515f ("mm: page_alloc: fair zone
    allocator policy") and fff4068cba48 ("mm: page_alloc: revert NUMA aspect
    of fair allocation policy").

    Before those changes, the allocator would first try all allowed zones,
    including those on remote nodes, before waking any kswapds. But now,
    the allocator fastpath doubles as the fairness pass, which in turn can
    only consider the local node to prevent remote spilling based on
    exhausted fairness batches alone. Remote nodes are only considered in
    the slowpath, after the kswapds are woken up. But if remote nodes still
    have free memory, kswapd should not be woken to rebalance the local node
    or it may thrash cache or swap prematurely.

    Fix this by adding one more unfair pass over the zonelist that is
    allowed to spill to remote nodes after the local fairness pass fails but
    before entering the slowpath and waking the kswapds.

    This also gets rid of the GFP_THISNODE exemption from the fairness
    protocol because the unfair pass is no longer tied to kswapd, which
    GFP_THISNODE is not allowed to wake up.

    However, because remote spills can be more frequent now - we prefer them
    over local kswapd reclaim - the allocation batches on remote nodes could
    underflow more heavily. When resetting the batches, use
    atomic_long_read() directly instead of zone_page_state() to calculate the
    delta, as the latter filters negative counter values.
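
    A sketch of the per-zone reset described above (the surrounding loop and
    exact field names are an assumption):

      /* Refill the fairness batch to (high - low) watermark pages. Reading
       * the raw vmstat counter keeps underflowed (negative) values in the
       * delta, whereas zone_page_state() would clamp them to zero. */
      mod_zone_page_state(zone, NR_ALLOC_BATCH,
                          high_wmark_pages(zone) - low_wmark_pages(zone) -
                          atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));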

    Signed-off-by: Johannes Weiner
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Cc: [3.12+]

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

30 Jan, 2014

1 commit

  • The VM is currently heavily tuned to avoid swapping. Whether that is
    good or bad is a separate discussion, but as long as the VM won't swap
    to make room for dirty cache, we can not consider anonymous pages when
    calculating the amount of dirtyable memory, the baseline to which
    dirty_background_ratio and dirty_ratio are applied.

    A simple workload that occupies a significant size (40+%, depending on
    memory layout, storage speeds etc.) of memory with anon/tmpfs pages and
    uses the remainder for a streaming writer demonstrates this problem. In
    that case, the actual cache pages are a small fraction of what is
    considered dirtyable overall, which results in a relatively large portion
    of the cache pages being dirtied. As kswapd starts rotating
    these, random tasks enter direct reclaim and stall on IO.

    Only consider free pages and file pages dirtyable.
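
    A minimal sketch of the resulting calculation (the real
    global_dirtyable_memory() also handles highmem and the dirty balance
    reserve):

      static unsigned long dirtyable_memory_sketch(void)
      {
              unsigned long x;

              x  = global_page_state(NR_FREE_PAGES);
              x += global_page_state(NR_ACTIVE_FILE);
              x += global_page_state(NR_INACTIVE_FILE);
              /* NR_ACTIVE_ANON / NR_INACTIVE_ANON are deliberately excluded:
               * the VM will not swap anon pages to make room for dirty cache */
              return x;
      }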

    Signed-off-by: Johannes Weiner
    Reported-by: Tejun Heo
    Tested-by: Tejun Heo
    Reviewed-by: Rik van Riel
    Cc: Mel Gorman
    Cc: Wu Fengguang
    Reviewed-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

24 Jan, 2014

3 commits

  • Developers occasionally try to optimise PFN scanners by using page_order
    but miss that, in general, it requires zone->lock. This has happened
    twice for compaction.c and been rejected both times. This patch clarifies
    the documentation of page_order and adds a note to compaction.c
    explaining why page_order is not used there.

    [akpm@linux-foundation.org: tweaks]
    [lauraa@codeaurora.org: Corrected a page_zone(page)->lock reference]
    Signed-off-by: Mel Gorman
    Acked-by: Rafael Aquini
    Acked-by: Minchan Kim
    Cc: Laura Abbott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • min_free_kbytes may be raised during THP's initialization. Sometimes,
    this will change the value which was set by the user. Showing this
    message will clarify this confusion.

    Only show this message when changing a value which was set by the user
    according to Michal Hocko's suggestion.

    Show the old value of min_free_kbytes according to Dave Hansen's
    suggestion. This will give user the chance to restore old value of
    min_free_kbytes.

    Signed-off-by: Han Pingtian
    Reviewed-by: Michal Hocko
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Han Pingtian
     
  • Most of the VM_BUG_ON assertions are performed on a page. Usually, when
    one of these assertions fails we'll get a BUG_ON with a call stack and
    the registers.

    Based on recent requests to add a small piece of code that dumps the page
    at various VM_BUG_ON sites, I've noticed that the page dump is quite
    useful to people debugging issues in mm.

    This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what
    VM_BUG_ON() does, also dumps the page before executing the actual
    BUG_ON.
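
    The macro has roughly the following shape (a sketch; the dump_page()
    signature has varied between kernel versions):

      #ifdef CONFIG_DEBUG_VM
      #define VM_BUG_ON_PAGE(cond, page)                                  \
              do {                                                        \
                      if (unlikely(cond)) {                               \
                              dump_page(page);  /* print the struct page */ \
                              BUG();                                      \
                      }                                                   \
              } while (0)
      #else
      #define VM_BUG_ON_PAGE(cond, page) BUILD_BUG_ON_INVALID(cond)
      #endif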

    [akpm@linux-foundation.org: fix up includes]
    Signed-off-by: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

22 Jan, 2014

2 commits

  • Cleanup. Change __get_page_tail_foll() to use get_huge_page_tail()
    to avoid the code duplication.

    Signed-off-by: Oleg Nesterov
    Cc: Thomas Gleixner
    Cc: Dave Jones
    Cc: Darren Hart
    Cc: Peter Zijlstra
    Cc: Mel Gorman
    Acked-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • This skips the _mapcount mangling for slab and hugetlbfs pages.

    The main trouble in doing this is to guarantee that PageSlab and
    PageHeadHuge remains constant for all get_page/put_page run on the tail
    of slab or hugetlbfs compound pages. Otherwise if they're set during
    get_page but not set during put_page, the _mapcount of the tail page
    would underflow.

    PageHeadHuge will remain true until the compound page is released and
    enters the buddy allocator so it won't risk to change even if the tail
    page is the last reference left on the page.

    PG_slab instead is cleared before the slab frees the head page with
    put_page, so if the tail pin is released after the slab freed the page,
    we would have a problem. But in the slab case the tail pin cannot be
    the last reference left on the page. This is because the slab code is
    free to reuse the compound page after a kfree/kmem_cache_free without
    having to check if there's any tail pin left. In turn all tail pins
    must be always released while the head is still pinned by the slab code
    and so we know PG_slab will be still set too.

    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Khalid Aziz
    Cc: Pravin Shelar
    Cc: Greg Kroah-Hartman
    Cc: Ben Hutchings
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

12 Sep, 2013

1 commit

  • This patch is based on KOSAKI's work and I add a little more description;
    please refer to https://lkml.org/lkml/2012/6/14/74.

    Currently, I found the system can enter a state where there are lots of
    free pages in a zone but only order-0 and order-1 pages, which means the
    zone is heavily fragmented. Then a high-order allocation can make the
    direct reclaim path stall for a long time (e.g., 60 seconds), especially
    in a no-swap and no-compaction environment. This problem happened on
    v3.4, but it seems the issue still lives in the current tree; the reason
    is that do_try_to_free_pages enters a livelock:

    kswapd will go to sleep if the zones have been fully scanned and are
    still not balanced. As kswapd thinks there's little point trying all over
    again to avoid an infinite loop, it instead changes order from high-order
    to 0-order because kswapd thinks order-0 is the most important. Look at
    73ce02e9 in detail. If watermarks are ok, kswapd will go back to sleep
    and may leave zone->all_unreclaimable = 0. It assumes high-order users
    can still perform direct reclaim if they wish.

    Direct reclaim continues to reclaim for a high order which is not a
    COSTLY_ORDER, without the oom-killer, until kswapd turns on
    zone->all_unreclaimable. This is to avoid a too-early oom-kill. So it
    means direct reclaim depends on kswapd to break this loop.

    In the worst case, direct reclaim may continue to reclaim pages forever
    while kswapd sleeps forever, until someone like a watchdog detects it and
    finally kills the process. As described in:
    http://thread.gmane.org/gmane.linux.kernel.mm/103737

    We can't turn on zone->all_unreclaimable from the direct reclaim path
    because the direct reclaim path doesn't take any lock, so this way is
    racy. Thus this patch removes the zone->all_unreclaimable field
    completely and recalculates the zone reclaimable state every time.
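
    A sketch of the recalculated check (the factor of 6 mirrors the old
    heuristic for declaring a zone unreclaimable):

      static bool zone_reclaimable(struct zone *zone)
      {
              /* Give up once six times the reclaimable pages have been
               * scanned without the scan counter being reset by freeing. */
              return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
      }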

    Note: we can't take the approach of having direct reclaim look at
    zone->pages_scanned directly while kswapd continues to use
    zone->all_unreclaimable, because it is racy. Commit 929bea7c71 (vmscan:
    all_unreclaimable() use zone->all_unreclaimable as a name) describes the
    details.

    [akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()]
    Cc: Aaditya Kumar
    Cc: Ying Han
    Cc: Nick Piggin
    Acked-by: Rik van Riel
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Bob Liu
    Cc: Neil Zhang
    Cc: Russell King - ARM Linux
    Reviewed-by: Michal Hocko
    Acked-by: Minchan Kim
    Acked-by: Johannes Weiner
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Lisa Du
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lisa Du