09 Jan, 2020

1 commit

  • [ Upstream commit 89b15332af7c0312a41e50846819ca6613b58b4c ]

    One of our services is observing hanging ps/top/etc under heavy write
    IO, and the task states show this is an mmap_sem priority inversion:

    A write fault is holding the mmap_sem in read-mode and waiting for
    (heavily cgroup-limited) IO in balance_dirty_pages():

    balance_dirty_pages+0x724/0x905
    balance_dirty_pages_ratelimited+0x254/0x390
    fault_dirty_shared_page.isra.96+0x4a/0x90
    do_wp_page+0x33e/0x400
    __handle_mm_fault+0x6f0/0xfa0
    handle_mm_fault+0xe4/0x200
    __do_page_fault+0x22b/0x4a0
    page_fault+0x45/0x50

    Somebody tries to change the address space, contending for the mmap_sem in
    write-mode:

    call_rwsem_down_write_failed_killable+0x13/0x20
    do_mprotect_pkey+0xa8/0x330
    SyS_mprotect+0xf/0x20
    do_syscall_64+0x5b/0x100
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    The waiting writer locks out all subsequent readers to avoid lock
    starvation, and several threads can be seen hanging like this:

    call_rwsem_down_read_failed+0x14/0x30
    proc_pid_cmdline_read+0xa0/0x480
    __vfs_read+0x23/0x140
    vfs_read+0x87/0x130
    SyS_read+0x42/0x90
    do_syscall_64+0x5b/0x100
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    To fix this, do what we do for cache read faults already: drop the
    mmap_sem before calling into anything IO bound, in this case the
    balance_dirty_pages() function, and return VM_FAULT_RETRY.
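
    The drop-the-lock-and-retry pattern described above can be sketched as a
    toy userspace model. This is not the kernel code: the toy_* names, the
    reader counter standing in for mmap_sem, and FAULT_RETRY standing in for
    VM_FAULT_RETRY are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for "fault handled" and VM_FAULT_RETRY (hypothetical). */
enum fault_result { FAULT_DONE = 0, FAULT_RETRY = 1 };

/* Toy address space: counts holders of the lock in read mode. */
struct toy_mm { int readers; };

static void toy_down_read(struct toy_mm *mm) { mm->readers++; }

static void toy_up_read(struct toy_mm *mm)
{
    assert(mm->readers > 0);
    mm->readers--;
}

/* Records whether the (pretend) IO throttling ran with the lock held. */
static bool throttled_with_lock_held;

static void toy_throttle(const struct toy_mm *mm)
{
    throttled_with_lock_held = mm->readers > 0;
}

/*
 * Called with the lock held for read. If throttling is needed and the
 * caller permits a retry, drop the lock first, throttle, and ask the
 * caller to retry. Otherwise throttle with the lock held (old behaviour).
 */
static enum fault_result toy_write_fault(struct toy_mm *mm,
                                         bool must_throttle, bool allow_retry)
{
    if (must_throttle && allow_retry) {
        toy_up_read(mm);
        toy_throttle(mm);
        return FAULT_RETRY;
    }
    if (must_throttle)
        toy_throttle(mm);
    return FAULT_DONE;
}

/* Retry loop on the caller side; returns the number of attempts made. */
static int toy_handle_fault(struct toy_mm *mm, bool must_throttle)
{
    bool allow_retry = true;
    int attempts = 0;

    for (;;) {
        attempts++;
        toy_down_read(mm);
        if (toy_write_fault(mm, must_throttle, allow_retry) == FAULT_RETRY) {
            /* The lock was already dropped; retry once without throttling. */
            allow_retry = false;
            must_throttle = false;
            continue;
        }
        toy_up_read(mm);
        return attempts;
    }
}
```

    In this model a throttled fault takes two attempts, and the throttling
    step never runs while the read lock is held, so a queued writer is not
    stuck behind the IO wait.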

    Link: http://lkml.kernel.org/r/20190924194238.GA29030@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Matthew Wilcox (Oracle)
    Acked-by: Kirill A. Shutemov
    Cc: Josef Bacik
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Johannes Weiner
     

26 Sep, 2019

1 commit

  • Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7.

    - Background

    The Android terminology used for forking a new process and starting an app
    from scratch is a cold start, while resuming an existing app is a hot
    start. While we continually try to improve the performance of cold
    starts, hot starts will always be significantly less power hungry as well
    as faster so we are trying to make hot start more likely than cold start.

    To make hot starts more likely, Android userspace manages the order in
    which apps should be killed in a process called ActivityManagerService.
    ActivityManagerService tracks every Android app or service that the user
    could be interacting with at any time and translates that into a ranked
    list for lmkd (low memory killer daemon). Apps on that list are likely to
    be killed by lmkd if the system has to reclaim memory. In that sense they
    are similar to entries in any other cache. Those apps are kept alive for
    opportunistic performance improvements, but those improvements will vary
    based on the memory requirements of individual workloads.

    - Problem

    Naturally, cached apps were the dominant consumers of memory on the
    system. However, they were not significant consumers of swap even though
    they are good candidates for swap. On investigation, swapping out only
    begins once the low zone watermark is hit and kswapd wakes up, but the
    overall allocation rate in the system might trip lmkd thresholds first and
    cause a cached process to be killed. (We measured the performance of
    swapping out vs. zapping the memory by killing a process. Unsurprisingly,
    zapping is 10x faster even though we use zram, which is much faster than
    real storage.) A kill from lmkd will therefore often satisfy the high zone
    watermark, resulting in very few pages actually being moved to swap.

    - Approach

    The approach we chose was to use a new interface to allow userspace to
    proactively reclaim entire processes by leveraging platform information.
    This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
    that are known to be cold from userspace and to avoid races with lmkd by
    reclaiming apps as soon as they entered the cached state. Additionally,
    it gives the platform more opportunities to use the information it has to
    optimize memory efficiency.

    To achieve the goal, the patchset introduces two new options for madvise.
    One is MADV_COLD, which will deactivate activated pages, and the other is
    MADV_PAGEOUT, which will reclaim private pages instantly. These new
    options complement MADV_DONTNEED and MADV_FREE by adding non-destructive
    ways to gain some free memory space. MADV_PAGEOUT is similar to
    MADV_DONTNEED in that it hints to the kernel that the memory region is not
    currently needed and should be reclaimed immediately; MADV_COLD is similar
    to MADV_FREE in that it hints to the kernel that the memory region is not
    currently needed and should be reclaimed when memory pressure rises.

    This patch (of 5):

    When a process expects no accesses to a certain memory range, it can
    give a hint to the kernel that the pages can be reclaimed under memory
    pressure but that the data should be preserved for future use. This can
    reduce workingset eviction and so ends up increasing performance.

    This patch introduces the new MADV_COLD hint to the madvise(2) syscall.
    MADV_COLD can be used by a process to mark a memory range as not expected
    to be used in the near future. The hint can help the kernel decide which
    pages to evict early during memory pressure.

    It works on all LRU pages, like MADV_[DONTNEED|FREE]. In other words, it
    moves:

    active file page -> inactive file LRU
    active anon page -> inactive anon LRU

    Unlike MADV_FREE, it doesn't move active anonymous pages to the head of
    the inactive file LRU because MADV_COLD has slightly different semantics.
    MADV_FREE means it's okay to discard the pages under memory pressure
    because their content is *garbage*, so freeing such pages has almost zero
    overhead: we don't need to swap them out, and a later access causes just
    a minor fault. Thus, it makes sense to put those freeable pages on the
    inactive file LRU to compete with other used-once pages. It also makes
    sense from an implementation point of view, because the memory is no
    longer swap-backed until it is re-dirtied. It even gives a bonus: the
    pages can be reclaimed on a swapless system. However, MADV_COLD doesn't
    mean garbage, so reclaiming such pages requires swap-out/in in the end,
    which costs more. Since VM LRU aging is designed around a cost model,
    cold anonymous pages are better placed on the inactive anon LRU list, not
    the file LRU. Furthermore, this helps to avoid unnecessary scanning if the
    system doesn't have a swap device. Let's start with the simpler approach
    without adding complexity at this moment. However, keep in mind the caveat
    that workloads with a lot of page cache are likely to ignore MADV_COLD on
    anonymous memory because we rarely age anonymous LRU lists.
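
    The LRU placement rules described above can be summarised in a small toy
    function. It only models which list a page ends up on; the enum names and
    target_lru() are hypothetical, and leaving file pages untouched for the
    MADV_FREE case is an assumption of this sketch, not something the text
    states.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy LRU list identifiers (hypothetical names). */
enum lru_list {
    LRU_INACTIVE_ANON,
    LRU_ACTIVE_ANON,
    LRU_INACTIVE_FILE,
    LRU_ACTIVE_FILE
};

enum hint { HINT_COLD, HINT_FREE };

/* Return the list a page on 'cur' would be placed on after the hint. */
static enum lru_list target_lru(enum lru_list cur, enum hint h)
{
    assert(cur >= LRU_INACTIVE_ANON && cur <= LRU_ACTIVE_FILE);
    bool anon = (cur == LRU_INACTIVE_ANON || cur == LRU_ACTIVE_ANON);

    if (h == HINT_FREE) {
        /* Freeable anonymous pages compete with used-once file pages. */
        return anon ? LRU_INACTIVE_FILE : cur;
    }
    /* HINT_COLD: deactivate, but stay on the same (anon or file) side. */
    return anon ? LRU_INACTIVE_ANON : LRU_INACTIVE_FILE;
}
```

    For example, target_lru(LRU_ACTIVE_ANON, HINT_COLD) yields
    LRU_INACTIVE_ANON, while the same page under HINT_FREE yields
    LRU_INACTIVE_FILE.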

    * man-page material

    MADV_COLD (since Linux x.x)

    Pages in the specified regions will be treated as less-recently-accessed
    compared to pages in the system with similar access frequencies. In
    contrast to MADV_FREE, the contents of the region are preserved regardless
    of subsequent writes to pages.

    MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
    pages.
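
    From userspace, the hint might be issued roughly as follows. This is a
    minimal sketch: hint_cold() is a hypothetical wrapper, and the fallback
    definition of MADV_COLD assumes the value 20 from the generic Linux UAPI
    header for builds against older libc headers.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Fallback for older headers; 20 is the generic Linux UAPI value. */
#ifndef MADV_COLD
#define MADV_COLD 20
#endif

/*
 * Hint that [addr, addr + len) is not expected to be used soon.
 * Returns 0 on success and -1 with errno set otherwise, for example
 * EINVAL on a kernel that does not know MADV_COLD.
 */
static int hint_cold(void *addr, size_t len)
{
    assert(addr != NULL && len > 0);
    return madvise(addr, len, MADV_COLD);
}
```

    Because the contents are preserved, a caller can treat a failure as
    harmless and simply carry on; the data in the range stays readable either
    way.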

    [akpm@linux-foundation.org: resolve conflicts with hmm.git]
    Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Reported-by: kbuild test robot
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: James E.J. Bottomley
    Cc: Richard Henderson
    Cc: Ralf Baechle
    Cc: Chris Zankel
    Cc: Johannes Weiner
    Cc: Daniel Colascione
    Cc: Dave Hansen
    Cc: Hillf Danton
    Cc: Joel Fernandes (Google)
    Cc: Kirill A. Shutemov
    Cc: Oleksandr Natalenko
    Cc: Shakeel Butt
    Cc: Sonny Rao
    Cc: Suren Baghdasaryan
    Cc: Tim Murray
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

31 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 3029 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

06 Mar, 2019

9 commits

  • Compaction is inherently race-prone as a suitable page freed during
    compaction can be allocated by any parallel task. This patch uses a
    capture_control structure to isolate a page immediately when it is freed
    by a direct compactor in the slow path of the page allocator. The
    intent is to avoid redundant scanning.

                                        5.0.0-rc1            5.0.0-rc1
                                  selective-v3r17        capture-v3r19
    Amean fault-both-1         0.00 (  0.00%)       0.00 *  0.00%*
    Amean fault-both-3      2582.11 (  0.00%)    2563.68 (  0.71%)
    Amean fault-both-5      4500.26 (  0.00%)    4233.52 (  5.93%)
    Amean fault-both-7      5819.53 (  0.00%)    6333.65 ( -8.83%)
    Amean fault-both-12     9321.18 (  0.00%)    9759.38 ( -4.70%)
    Amean fault-both-18     9782.76 (  0.00%)   10338.76 ( -5.68%)
    Amean fault-both-24    15272.81 (  0.00%)   13379.55 * 12.40%*
    Amean fault-both-30    15121.34 (  0.00%)   16158.25 ( -6.86%)
    Amean fault-both-32    18466.67 (  0.00%)   18971.21 ( -2.73%)

    Latency is only moderately affected but the devil is in the details. A
    closer examination indicates that base page fault latency is reduced but
    latency of huge pages is increased as it takes greater care to succeed.
    Part of the "problem" is that allocation success rates are close to 100%
    even when under pressure and compaction gets harder.

                                        5.0.0-rc1            5.0.0-rc1
                                  selective-v3r17        capture-v3r19
    Percentage huge-3         96.70 (  0.00%)      98.23 (  1.58%)
    Percentage huge-5         96.99 (  0.00%)      95.30 ( -1.75%)
    Percentage huge-7         94.19 (  0.00%)      97.24 (  3.24%)
    Percentage huge-12        94.95 (  0.00%)      97.35 (  2.53%)
    Percentage huge-18        96.74 (  0.00%)      97.30 (  0.58%)
    Percentage huge-24        97.07 (  0.00%)      97.55 (  0.50%)
    Percentage huge-30        95.69 (  0.00%)      98.50 (  2.95%)
    Percentage huge-32        96.70 (  0.00%)      99.27 (  2.65%)

    And scan rates are reduced as expected by 6% for the migration scanner
    and 29% for the free scanner indicating that there is less redundant
    work.

    Compaction migrate scanned     20815362    19573286
    Compaction free scanned        16352612    11510663

    [mgorman@techsingularity.net: remove redundant check]
    Link: http://lkml.kernel.org/r/20190201143853.GH9565@techsingularity.net
    Link: http://lkml.kernel.org/r/20190118175136.31341-23-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Dan Carpenter
    Cc: David Rientjes
    Cc: YueHaibing
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • As compaction proceeds and creates high-order blocks, the free list
    search gets less efficient as the larger blocks are used as compaction
    targets. Eventually, the larger blocks will be behind the migration
    scanner for partially migrated pageblocks and the search fails. This
    patch round-robins the orders that are searched so that larger blocks
    can be ignored and smaller blocks that can be used as migration targets
    are found.

    The overall impact was small on 1-socket but it avoids corner cases
    where the migration/free scanners meet prematurely or situations where
    many of the pageblocks encountered by the free scanner are almost full
    instead of being properly packed. Previous testing had indicated that
    without this patch there were occasional large spikes in the free
    scanner.
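
    One way to picture a round-robin search over free-list orders is the toy
    helper below. It is a sketch of the general idea only, not the code from
    this patch: search_sequence() and next_start() are hypothetical names, and
    the downward-then-wrap visiting order is an assumption.

```c
#include <assert.h>

/*
 * Fill out[] with the orders to try, in visiting order:
 * start, start-1, ..., 0, then wrap to max_order-1 down to start+1.
 * Returns the number of entries written (always max_order).
 */
static int search_sequence(int max_order, int start, int *out)
{
    assert(max_order > 0 && start >= 0 && start < max_order);
    int n = 0;
    int order = start;

    do {
        out[n++] = order;
        order = (order == 0) ? max_order - 1 : order - 1;
    } while (order != start);

    return n;
}

/* Rotate the starting order so the next search begins somewhere else. */
static int next_start(int max_order, int start)
{
    assert(max_order > 0 && start >= 0 && start < max_order);
    return (start == 0) ? max_order - 1 : start - 1;
}
```

    With max_order = 4 and start = 2 the orders are visited as 2, 1, 0, 3,
    and the following search would start from order 1.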

    [dan.carpenter@oracle.com: fix static checker warning]
    Link: http://lkml.kernel.org/r/20190118175136.31341-20-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: YueHaibing
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Pageblocks are marked for skip when no pages are isolated after a scan.
    However, it's possible to hit corner cases where the migration scanner
    gets stuck near the boundary between the source and target scanner. Due
    to pages being migrated in blocks of COMPACT_CLUSTER_MAX, pages that are
    migrated can be reallocated before the pageblock is complete. The
    pageblock is not necessarily skipped so it can be rescanned multiple
    times. Similarly, a pageblock with some dirty/writeback pages may fail
    to migrate and be rescanned until writeback completes which is wasteful.

    This patch tracks if a pageblock is being rescanned. If so, then the
    entire pageblock will be migrated as one operation. This narrows the
    race window during which pages can be reallocated during migration.
    Secondly, if there are pages that cannot be isolated then the pageblock
    will still be fully scanned and marked for skipping. On the second
    rescan, the pageblock skip is set and the migration scanner makes
    progress.

                                        5.0.0-rc1            5.0.0-rc1
                                   findfree-v3r16       norescan-v3r16
    Amean fault-both-1         0.00 (  0.00%)       0.00 *  0.00%*
    Amean fault-both-3      3200.68 (  0.00%)    3002.07 (  6.21%)
    Amean fault-both-5      4847.75 (  0.00%)    4684.47 (  3.37%)
    Amean fault-both-7      6658.92 (  0.00%)    6815.54 ( -2.35%)
    Amean fault-both-12    11077.62 (  0.00%)   10864.02 (  1.93%)
    Amean fault-both-18    12403.97 (  0.00%)   12247.52 (  1.26%)
    Amean fault-both-24    15607.10 (  0.00%)   15683.99 ( -0.49%)
    Amean fault-both-30    18752.27 (  0.00%)   18620.02 (  0.71%)
    Amean fault-both-32    21207.54 (  0.00%)   19250.28 *  9.23%*

                                        5.0.0-rc1            5.0.0-rc1
                                   findfree-v3r16       norescan-v3r16
    Percentage huge-3         96.86 (  0.00%)      95.00 ( -1.91%)
    Percentage huge-5         93.72 (  0.00%)      94.22 (  0.53%)
    Percentage huge-7         94.31 (  0.00%)      92.35 ( -2.08%)
    Percentage huge-12        92.66 (  0.00%)      91.90 ( -0.82%)
    Percentage huge-18        91.51 (  0.00%)      89.58 ( -2.11%)
    Percentage huge-24        90.50 (  0.00%)      90.03 ( -0.52%)
    Percentage huge-30        91.57 (  0.00%)      89.14 ( -2.65%)
    Percentage huge-32        91.00 (  0.00%)      90.58 ( -0.46%)

    Negligible difference but this was likely a case when the specific
    corner case was not hit. A previous run of the same patch based on an
    earlier iteration of the series showed large differences where migration
    rates could be halved when the corner case was hit.

    The specific corner case, where migration scan rates go through the roof
    because of a dirty/writeback pageblock located at the boundary of the
    migration/free scanners, did not happen in this case. When it does
    happen, the scan rates multiply by massive margins.

    Link: http://lkml.kernel.org/r/20190118175136.31341-13-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Dan Carpenter
    Cc: David Rientjes
    Cc: YueHaibing
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The migration scanner is a linear scan of a zone with a potentially large
    search space. Furthermore, many pageblocks are unusable such as those
    filled with reserved pages or partially filled with pages that cannot
    migrate. These still get scanned in the common case of allocating a THP
    and the cost accumulates.

    The patch uses a partial search of the free lists to locate a migration
    source candidate that is marked as MOVABLE when allocating a THP. It
    prefers picking a block with a larger number of free pages already on
    the basis that there are fewer pages to migrate to free the entire
    block. The lowest PFN found during searches is tracked as the basis of
    the start for the linear search after the first search of the free list
    fails. After the search, the free list is shuffled so that the next
    search will not encounter the same page. If the search fails then the
    subsequent searches will be shorter and the linear scanner is used.

    If this search fails, or if the request is for a small or
    unmovable/reclaimable allocation, then the linear scanner is still used.
    It is somewhat pointless to use the list search in those cases. Small
    free pages must be used for the search and there is no guarantee that
    the movable pages located within that block are contiguous.

                                        5.0.0-rc1            5.0.0-rc1
                                    noboost-v3r10        findmig-v3r15
    Amean fault-both-3      3771.41 (  0.00%)    3390.40 ( 10.10%)
    Amean fault-both-5      5409.05 (  0.00%)    5082.28 (  6.04%)
    Amean fault-both-7      7040.74 (  0.00%)    7012.51 (  0.40%)
    Amean fault-both-12    11887.35 (  0.00%)   11346.63 (  4.55%)
    Amean fault-both-18    16718.19 (  0.00%)   15324.19 (  8.34%)
    Amean fault-both-24    21157.19 (  0.00%)   16088.50 * 23.96%*
    Amean fault-both-30    21175.92 (  0.00%)   18723.42 * 11.58%*
    Amean fault-both-32    21339.03 (  0.00%)   18612.01 * 12.78%*

                                        5.0.0-rc1            5.0.0-rc1
                                    noboost-v3r10        findmig-v3r15
    Percentage huge-3         86.50 (  0.00%)      89.83 (  3.85%)
    Percentage huge-5         92.52 (  0.00%)      91.96 ( -0.61%)
    Percentage huge-7         92.44 (  0.00%)      92.85 (  0.44%)
    Percentage huge-12        92.98 (  0.00%)      92.74 ( -0.25%)
    Percentage huge-18        91.70 (  0.00%)      91.71 (  0.02%)
    Percentage huge-24        91.59 (  0.00%)      92.13 (  0.60%)
    Percentage huge-30        90.14 (  0.00%)      93.79 (  4.04%)
    Percentage huge-32        90.03 (  0.00%)      91.27 (  1.37%)

    This shows an improvement in allocation latencies with similar
    allocation success rates. While not presented, there was a 31%
    reduction in migration scanning and an 8% reduction in system CPU usage.
    A 2-socket machine showed similar benefits.

    [mgorman@techsingularity.net: several fixes]
    Link: http://lkml.kernel.org/r/20190204120111.GL9565@techsingularity.net
    [vbabka@suse.cz: migrate block that was found-fast, some optimisations]
    Link: http://lkml.kernel.org/r/20190118175136.31341-10-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Dan Carpenter
    Cc: David Rientjes
    Cc: YueHaibing
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When compaction is finishing, it uses a flag to ensure the pageblock is
    complete but it makes sense to always complete migration of a pageblock.
    Minimally, skip information is based on a pageblock and partially
    scanned pageblocks may incur more scanning in the future. The pageblock
    skip handling also becomes more strict later in the series and the hint
    is more useful if a complete pageblock was always scanned.

    This potentially impacts latency as more scanning is done but it's not a
    consistent win or loss as the scanning is not always a high percentage
    of the pageblock and sometimes it is offset by future reductions in
    scanning. Hence, the results are not presented this time due to a
    misleading mix of gains/losses without any clear pattern. However, full
    scanning of the pageblock is important for later patches.

    Link: http://lkml.kernel.org/r/20190118175136.31341-8-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Dan Carpenter
    Cc: David Rientjes
    Cc: YueHaibing
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The last_migrated_pfn field is a bit dubious as to whether it really
    helps but either way, the information from it can be inferred without
    increasing the size of compact_control so remove the field.

    Link: http://lkml.kernel.org/r/20190118175136.31341-4-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Dan Carpenter
    Cc: David Rientjes
    Cc: YueHaibing
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • compact_control spans two cache lines with write-intensive lines on
    both. Rearrange so the most write-intensive fields are in the same
    cache line. This has a negligible impact on the overall performance of
    compaction and is more a tidying exercise than anything.

    Link: http://lkml.kernel.org/r/20190118175136.31341-3-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Dan Carpenter
    Cc: David Rientjes
    Cc: YueHaibing
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Patch series "Increase success rates and reduce latency of compaction", v3.

    This series reduces scan rates and improves success rates of compaction,
    primarily by using the free lists to shorten scans, by better control of
    skip information and of whether multiple scanners can target the same
    block, and by capturing pageblocks before they are stolen by parallel
    requests.
    The series is based on mmotm from January 9th, 2019 with the previous
    compaction series reverted.

    I'm mostly using thpscale to measure the impact of the series. The
    benchmark creates a large file, maps it, faults it, punches holes in the
    mapping so that the virtual address space is fragmented and then tries
    to allocate THP. It re-executes for different numbers of threads. From
    a fragmentation perspective, the workload is relatively benign but it
    does stress compaction.

    The overall impact on latencies for a 1-socket machine is

                                         baseline              patches
    Amean fault-both-3      3832.09 (  0.00%)    2748.56 * 28.28%*
    Amean fault-both-5      4933.06 (  0.00%)    4255.52 ( 13.73%)
    Amean fault-both-7      7017.75 (  0.00%)    6586.93 (  6.14%)
    Amean fault-both-12    11610.51 (  0.00%)    9162.34 * 21.09%*
    Amean fault-both-18    17055.85 (  0.00%)   11530.06 * 32.40%*
    Amean fault-both-24    19306.27 (  0.00%)   17956.13 (  6.99%)
    Amean fault-both-30    22516.49 (  0.00%)   15686.47 * 30.33%*
    Amean fault-both-32    23442.93 (  0.00%)   16564.83 * 29.34%*

    The allocation success rates are much improved

                                         baseline              patches
    Percentage huge-3         85.99 (  0.00%)      97.96 ( 13.92%)
    Percentage huge-5         88.27 (  0.00%)      96.87 (  9.74%)
    Percentage huge-7         85.87 (  0.00%)      94.53 ( 10.09%)
    Percentage huge-12        82.38 (  0.00%)      98.44 ( 19.49%)
    Percentage huge-18        83.29 (  0.00%)      99.14 ( 19.04%)
    Percentage huge-24        81.41 (  0.00%)      97.35 ( 19.57%)
    Percentage huge-30        80.98 (  0.00%)      98.05 ( 21.08%)
    Percentage huge-32        80.53 (  0.00%)      97.06 ( 20.53%)

    That's a nearly perfect allocation success rate.

    The biggest impact is on the scan rates

    Compaction migrate scanned     55893379    19341254
    Compaction free scanned       474739990    11903963

    The number of pages scanned for migration was reduced by 65% and the
    free scanner was reduced by 97.5%. So much less work in exchange for
    lower latency and better success rates.
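
    The quoted reductions follow directly from the scan counters above. A
    small helper makes the arithmetic explicit; reduction_pct() is a
    hypothetical name used only for this sketch.

```c
#include <assert.h>

/* Percentage by which 'after' is lower than 'before'. */
static double reduction_pct(unsigned long long before, unsigned long long after)
{
    assert(before > 0);
    return 100.0 * ((double)before - (double)after) / (double)before;
}
```

    Feeding in the migrate counters (55893379 and 19341254) gives a value
    between 65% and 66%, and the free scanner counters (474739990 and
    11903963) give roughly 97.5%, matching the figures in the text.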

    The series was also evaluated using a workload that heavily fragments
    memory but the benefits there are also significant, albeit not
    presented.

    It was commented that we should be rethinking scanning entirely and to a
    large extent I agree. However, to achieve that you need a lot of this
    series in place first so it's best to make the linear scanners as good
    as possible before ripping them out.

    This patch (of 22):

    The isolate and migrate scanners should never isolate more than a
    pageblock of pages, so unsigned int is sufficient, saving 8 bytes on a
    64-bit build.
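
    The saving can be checked with sizeof on a pair of toy structures. The
    struct and field names below are hypothetical, and the sketch assumes two
    counters are narrowed, which is what an 8-byte saving on a build with
    8-byte long and 4-byte int would correspond to.

```c
#include <assert.h>
#include <stddef.h>

/* Before: two page counters stored as unsigned long. */
struct counters_wide {
    unsigned long nr_free;
    unsigned long nr_migrate;
};

/* After: the same counters stored as unsigned int. */
struct counters_narrow {
    unsigned int nr_free;
    unsigned int nr_migrate;
};

/* Bytes saved by narrowing both counters. */
static size_t bytes_saved(void)
{
    assert(sizeof(struct counters_wide) >= sizeof(struct counters_narrow));
    return sizeof(struct counters_wide) - sizeof(struct counters_narrow);
}
```

    On an LP64 target this evaluates to 8; on a 32-bit target, where long and
    int have the same size, it evaluates to 0.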

    Link: http://lkml.kernel.org/r/20190118175136.31341-2-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: David Rientjes
    Cc: Andrea Arcangeli
    Cc: Dan Carpenter
    Cc: YueHaibing
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When pages are freed at a higher order, the time spent coalescing pages
    in the buddy allocator can be reduced. With a section size of 256MB, the
    hot add latency of a single section improves from 50-60 ms to less than
    1 ms, an improvement of roughly 60 times. Modify external providers of
    the online callback to align with the change.

    [arunks@codeaurora.org: v11]
    Link: http://lkml.kernel.org/r/1547792588-18032-1-git-send-email-arunks@codeaurora.org
    [akpm@linux-foundation.org: remove unused local, per Arun]
    [akpm@linux-foundation.org: avoid return of void-returning __free_pages_core(), per Oscar]
    [akpm@linux-foundation.org: fix it for mm-convert-totalram_pages-and-totalhigh_pages-variables-to-atomic.patch]
    [arunks@codeaurora.org: v8]
    Link: http://lkml.kernel.org/r/1547032395-24582-1-git-send-email-arunks@codeaurora.org
    [arunks@codeaurora.org: v9]
    Link: http://lkml.kernel.org/r/1547098543-26452-1-git-send-email-arunks@codeaurora.org
    Link: http://lkml.kernel.org/r/1538727006-5727-1-git-send-email-arunks@codeaurora.org
    Signed-off-by: Arun KS
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Reviewed-by: Oscar Salvador
    Reviewed-by: Alexander Duyck
    Cc: K. Y. Srinivasan
    Cc: Haiyang Zhang
    Cc: Stephen Hemminger
    Cc: Boris Ostrovsky
    Cc: Juergen Gross
    Cc: Dan Williams
    Cc: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: Greg Kroah-Hartman
    Cc: Mathieu Malaterre
    Cc: "Kirill A. Shutemov"
    Cc: Souptick Joarder
    Cc: Mel Gorman
    Cc: Aaron Lu
    Cc: Srivatsa Vaddagiri
    Cc: Vinayak Menon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun KS
     

29 Dec, 2018

3 commits

  • This is a preparation patch that copies the GFP flag __GFP_KSWAPD_RECLAIM
    into alloc_flags. It avoids having to pass gfp_mask through a long
    callchain in a future patch.

    Note that the setting in the fast path happens in alloc_flags_nofragment()
    and it may be claimed that this has nothing to do with ALLOC_NO_FRAGMENT.
    That's true in this patch but is not true later so it's done now for
    easier review to show where the flag needs to be recorded.

    No functional change.
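
    The idea of recording a GFP bit in alloc_flags once, so that deeper
    callers no longer need gfp_mask, can be sketched as follows. The TOY_*
    constants use made-up bit values and the function names are hypothetical;
    none of this is copied from the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Made-up bit values for illustration only. */
#define TOY_GFP_KSWAPD_RECLAIM 0x0800u
#define TOY_ALLOC_KSWAPD       0x0200u

/* Near the top of the allocator: translate the GFP bit once. */
static unsigned int toy_alloc_flags(unsigned int gfp_mask,
                                    unsigned int alloc_flags)
{
    if (gfp_mask & TOY_GFP_KSWAPD_RECLAIM)
        alloc_flags |= TOY_ALLOC_KSWAPD;
    return alloc_flags;
}

/* Deep in the call chain: only alloc_flags is needed for the decision. */
static bool toy_may_wake_kswapd(unsigned int alloc_flags)
{
    return (alloc_flags & TOY_ALLOC_KSWAPD) != 0;
}
```

    Other bits already present in alloc_flags are left untouched, so the
    translation adds information without changing existing behaviour.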

    [mgorman@techsingularity.net: ALLOC_KSWAPD flag needs to be applied in the !CONFIG_ZONE_DMA32 case]
    Link: http://lkml.kernel.org/r/20181126143503.GO23260@techsingularity.net
    Link: http://lkml.kernel.org/r/20181123114528.28802-4-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Reviewed-by: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Zi Yan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Patch series "Fragmentation avoidance improvements", v5.

    It has been noted before that fragmentation avoidance (aka
    anti-fragmentation) is not perfect. Given sufficient time or an adverse
    workload, memory gets fragmented and the long-term success of high-order
    allocations degrades. This series defines an adverse workload, defines
    external fragmentation events (including serious ones) and introduces
    changes that reduce the level of those fragmentation events.

    The details of the workload and the consequences are described in more
    detail in the changelogs. However, from patch 1, this is a high-level
    summary of the adverse workload. The exact details are found in the
    mmtests implementation.

    The broad details of the workload are as follows;

    1. Create an XFS filesystem (not specified in the configuration but done
    as part of the testing for this patch)
    2. Start 4 fio threads that write a number of 64K files inefficiently.
    Inefficiently means that files are created on first access and not
    created in advance (fio parameter create_on_open=1) and fallocate
    is not used (fallocate=none). With multiple IO issuers this creates
    a mix of slab and page cache allocations over time. The total size
    of the files is 150% physical memory so that the slabs and page cache
    pages get mixed
    3. Warm up a number of fio read-only threads accessing the same files
    created in step 2. This part runs for the same length of time it
    took to create the files. It'll fault back in old data and further
    interleave slab and page cache allocations. As it's now low on
    memory due to step 2, fragmentation occurs as pageblocks get
    stolen.
    4. While step 3 is still running, start a process that tries to allocate
    75% of memory as huge pages with a number of threads. The number of
    threads is based on a (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
    threads contending with fio, any other threads or forcing cross-NUMA
    scheduling. Note that the test has not been used on a machine with less
    than 8 cores. The benchmark records whether huge pages were allocated
    and what the fault latency was in microseconds
    5. Measure the number of events potentially causing external fragmentation,
    the fault latency and the huge page allocation success rate.
    6. Cleanup

    Overall the series reduces external fragmentation causing events by over 94%
    on 1 and 2 socket machines, which in turn impacts high-order allocation
    success rates over the long term. There are differences in latencies and
    high-order allocation success rates. Latencies are a mixed bag as they
    are vulnerable to exact system state and whether allocations succeeded
    so they are treated as a secondary metric.

    Patch 1 uses lower zones if they are populated and have free memory
    instead of fragmenting a higher zone. It's special cased to
    handle a Normal->DMA32 fallback with the reasons explained
    in the changelog.

    Patch 2-4 boosts watermarks temporarily when an external fragmentation
    event occurs. kswapd wakes to reclaim a small amount of old memory
    and then wakes kcompactd on completion to recover the system
    slightly. This introduces some overhead in the slowpath. The level
    of boosting can be tuned or disabled depending on the tolerance
    for fragmentation vs allocation latency.

    Patch 5 stalls some movable allocation requests to let kswapd from patch 4
    make some progress. The duration of the stalls is very low but it
    is possible to tune the system to avoid fragmentation events if
    larger stalls can be tolerated.

    The bulk of the improvement in fragmentation avoidance is from patches
    1-4 but patch 5 can deal with a rare corner case and provides the option
    of tuning a system for THP allocation success rates in exchange for
    some stalls to control fragmentation.

    This patch (of 5):

    The page allocator zone lists are iterated based on the watermarks of each
    zone which does not take anti-fragmentation into account. On x86, node 0
    may have multiple zones while other nodes have one zone. A consequence is
    that tasks running on node 0 may fragment ZONE_NORMAL even though
    ZONE_DMA32 has plenty of free memory. This patch special cases the
    allocator fast path such that it'll try an allocation from a lower local
    zone before fragmenting a higher zone. In this case, stealing of
    pageblocks or orders larger than a pageblock are still allowed in the fast
    path as they are uninteresting from a fragmentation point of view.
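
    The preference described above can be pictured with a toy zone picker.
    This is a simplified sketch, not the allocator fast path: struct toy_zone,
    its would_fragment flag and pick_zone() are hypothetical, and zones are
    assumed to be listed from highest to lowest.

```c
#include <assert.h>
#include <stddef.h>

/* Toy zone: free page count plus whether using it would steal a pageblock. */
struct toy_zone {
    const char *name;
    long free_pages;
    int would_fragment;
};

/*
 * Return the index of the first zone that can satisfy 'want' pages without
 * fragmenting. If every candidate would fragment, fall back to the first
 * zone with enough free pages. Returns -1 if no zone has enough memory.
 */
static int pick_zone(const struct toy_zone *zones, int nr, long want)
{
    assert(zones != NULL && nr >= 0 && want > 0);
    int fallback = -1;

    for (int i = 0; i < nr; i++) {
        if (zones[i].free_pages < want)
            continue;
        if (!zones[i].would_fragment)
            return i;
        if (fallback < 0)
            fallback = i;
    }
    return fallback;
}
```

    With a Normal zone that would fragment and a DMA32 zone with plenty of
    free memory, the lower zone is chosen; if the lower zone is empty, the
    request falls back to the higher zone.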

    This was evaluated using a benchmark designed to fragment memory before
    attempting THP allocations. It's implemented in mmtests as the following
    configurations

    configs/config-global-dhp__workload_thpfioscale
    configs/config-global-dhp__workload_thpfioscale-defrag
    configs/config-global-dhp__workload_thpfioscale-madvhugepage

    e.g. from mmtests
    ./run-mmtests.sh --run-monitor --config configs/config-global-dhp__workload_thpfioscale test-run-1

    The broad details of the workload are as follows:

    1. Create an XFS filesystem (not specified in the configuration but done
    as part of the testing for this patch).
    2. Start 4 fio threads that write a number of 64K files inefficiently.
    Inefficiently means that files are created on first access and not
    created in advance (fio parameter create_on_open=1) and fallocate
    is not used (fallocate=none). With multiple IO issuers this creates
    a mix of slab and page cache allocations over time. The total size
    of the files is 150% physical memory so that the slabs and page cache
    pages get mixed.
    3. Warm up a number of fio read-only processes accessing the same files
    created in step 2. This part runs for the same length of time it
    took to create the files. It'll refault old data and further
    interleave slab and page cache allocations. As it's now low on
    memory due to step 2, fragmentation occurs as pageblocks get
    stolen.
    4. While step 3 is still running, start a process that tries to allocate
    75% of memory as huge pages with a number of threads. The number of
    threads is based on a (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
    threads contending with fio, any other threads or forcing cross-NUMA
    scheduling. Note that the test has not been used on a machine with less
    than 8 cores. The benchmark records whether huge pages were allocated
    and what the fault latency was in microseconds.
    5. Measure the number of events potentially causing external fragmentation,
    the fault latency and the huge page allocation success rate.
    6. Cleanup the test files.
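
    The thread-count rule from step 4 can be written out as a one-line
    helper. This is only a sketch: the function name, the use of integer
    division and the floor of one thread are assumptions, and the CPU counts
    in the example are hypothetical values rather than the specifications of
    the test machines.

```python
def thp_threads(nr_cpus_socket, nr_fio_threads):
    """Number of THP-allocating threads: (NR_CPUS_SOCKET - NR_FIO_THREADS)/4.

    Integer division and the minimum of one thread are assumptions made
    for this sketch; the text only gives the basic formula.
    """
    return max(1, (nr_cpus_socket - nr_fio_threads) // 4)


# Hypothetical socket sizes, both with 4 fio threads.
print(thp_threads(8, 4))
print(thp_threads(24, 4))
```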

    Note that due to the use of IO and page cache, this benchmark is not
    suitable for running on large machines where the time to fragment memory
    may be excessive. Also note that while this is one mix that generates
    fragmentation, it is not the only one.
    Differences in workload that are more slab-intensive or whether SLUB is
    used with high-order pages may yield different results.

    When the page allocator fragments memory, it records the event using the
    mm_page_alloc_extfrag ftrace event. If the fallback_order is smaller than
    a pageblock order (order-9 on 64-bit x86) then it's considered to be an
    "external fragmentation event" that may cause issues in the future.
    Hence, the primary metric here is the number of external fragmentation
    events that occur with order < 9. The secondary metrics are allocation
    latency and huge page allocation success rates, but note that differences
    in latencies and success rates can also affect the number of external
    fragmentation events, which is why they are secondary metrics.
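
    The primary metric can be expressed as a small filter over the
    fallback_order values reported by mm_page_alloc_extfrag. The sketch
    below assumes the orders have already been extracted from the trace into
    a plain list; the function names are hypothetical and no trace parsing
    is shown.

```python
PAGEBLOCK_ORDER = 9  # order-9 on 64-bit x86, as stated in the text


def is_extfrag_event(fallback_order, pageblock_order=PAGEBLOCK_ORDER):
    """A fallback smaller than a pageblock counts as an external
    fragmentation event."""
    return fallback_order < pageblock_order


def count_extfrag_events(fallback_orders):
    """Count the events that contribute to the primary metric."""
    return sum(1 for order in fallback_orders if is_extfrag_event(order))


# Made-up sample of fallback orders, for illustration only.
print(count_extfrag_events([0, 3, 9, 10, 8]))
```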

    1-socket Skylake machine
    config-global-dhp__workload_thpfioscale XFS (no special madvise)
    4 fio threads, 1 THP allocating thread
    --------------------------------------

    4.20-rc3 extfrag events < order 9:  804694
    4.20-rc3+patch:                     408912 (49% reduction)

    thpfioscale Fault Latencies
                                    4.20.0-rc3          4.20.0-rc3
                                       vanilla        lowzone-v5r8
    Amean     fault-base-1    662.92 (  0.00%)    653.58 *  1.41%*
    Amean     fault-huge-1      0.00 (  0.00%)      0.00 (  0.00%)

                                    4.20.0-rc3          4.20.0-rc3
                                       vanilla        lowzone-v5r8
    Percentage huge-1           0.00 (  0.00%)      0.00 (  0.00%)

    Fault latencies are slightly reduced while allocation success rates remain
    at zero as this configuration does not make any special effort to allocate
    THP and fio is heavily active at the time and either filling memory or
    keeping pages resident. However, a 49% reduction of serious fragmentation
    events reduces the chances of external fragmentation being a problem in
    the future.

    Vlastimil asked during review for a breakdown of the allocation types
    that are falling back.

    vanilla
    3816 MIGRATE_UNMOVABLE
    800845 MIGRATE_MOVABLE
    33 MIGRATE_UNRECLAIMABLE

    patch
    735 MIGRATE_UNMOVABLE
    408135 MIGRATE_MOVABLE
    42 MIGRATE_UNRECLAIMABLE

    The majority of the fallbacks are due to movable allocations and this is
    consistent for the workload throughout the series so will not be presented
    again as the primary source of fallbacks is movable allocations.

    Movable fallbacks are sometimes considered "ok" because they
    can be migrated. The problem is that they can fill an
    unmovable/reclaimable pageblock, causing those allocations to fall back
    later and polluting pageblocks with pages that cannot move. If there is a
    movable fallback, it is pretty much guaranteed to affect an
    unmovable/reclaimable pageblock and while it might not be enough to
    actually cause an unmovable/reclaimable fallback in the future, we cannot
    know that in advance so the patch takes the only option available to it.
    Hence, it's important to control them. This point is also consistent
    throughout the series and will not be repeated.

    1-socket Skylake machine
    global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
    -----------------------------------------------------------------

    4.20-rc3 extfrag events < order 9:  291392
    4.20-rc3+patch:                     191187 (34% reduction)

    thpfioscale Fault Latencies
                                    4.20.0-rc3          4.20.0-rc3
                                       vanilla        lowzone-v5r8
    Amean     fault-base-1   1495.14 (  0.00%)   1467.55 (  1.85%)
    Amean     fault-huge-1   1098.48 (  0.00%)   1127.11 ( -2.61%)

    thpfioscale Percentage Faults Huge
                                    4.20.0-rc3          4.20.0-rc3
                                       vanilla        lowzone-v5r8
    Percentage huge-1          78.57 (  0.00%)     77.64 ( -1.18%)

    Fragmentation events were reduced quite a bit although this is known
    to be a little variable. The latencies and allocation success rates
    are similar but they were already quite high.

    2-socket Haswell machine
    config-global-dhp__workload_thpfioscale XFS (no special madvise)
    4 fio threads, 5 THP allocating threads
    ----------------------------------------------------------------

    4.20-rc3 extfrag events < order 9:  215698
    4.20-rc3+patch:                     200210 (7% reduction)

    thpfioscale Fault Latencies
                                    4.20.0-rc3          4.20.0-rc3
                                       vanilla        lowzone-v5r8
    Amean     fault-base-5   1350.05 (  0.00%)   1346.45 (  0.27%)
    Amean     fault-huge-5   4181.01 (  0.00%)   3418.60 ( 18.24%)

                                    4.20.0-rc3          4.20.0-rc3
                                       vanilla        lowzone-v5r8
    Percentage huge-5           1.15 (  0.00%)      0.78 (-31.88%)

    The reduction of external fragmentation events is slight and this is
    partially due to the removal of __GFP_THISNODE in commit ac5b2c18911f
    ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings") as THP
    allocations can now spill over to remote nodes instead of fragmenting
    local memory.

    2-socket Haswell machine
    global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
    -----------------------------------------------------------------

    4.20-rc3 extfrag events < order 9:  166352
    4.20-rc3+patch:                     147463 (11% reduction)

    thpfioscale Fault Latencies
                                    4.20.0-rc3          4.20.0-rc3
                                       vanilla        lowzone-v5r8
    Amean     fault-base-5   6138.97 (  0.00%)   6217.43 ( -1.28%)
    Amean     fault-huge-5   2294.28 (  0.00%)   3163.33 *-37.88%*

    thpfioscale Percentage Faults Huge
                                    4.20.0-rc3          4.20.0-rc3
                                       vanilla        lowzone-v5r8
    Percentage huge-5          96.82 (  0.00%)     95.14 ( -1.74%)

    There was a slight reduction in external fragmentation events although the
    latencies were higher. The allocation success rate is high enough that
    the system is struggling and there is quite a lot of parallel reclaim and
    compaction activity. There is also a certain degree of luck on whether
    processes start on node 0 or not for this patch but the relevance is
    reduced later in the series.

    Overall, the patch reduces the number of external fragmentation causing
    events so the success of THP over long periods of time would be improved
    for this adverse workload.
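
    The reduction percentages quoted for the four configurations follow
    directly from the event counts. The short check below recomputes them;
    the dictionary labels are shorthand introduced here, while the counts
    are the ones reported above.

```python
def reduction_percent(before, after):
    """Relative reduction in events, in percent."""
    return 100.0 * (before - after) / before


# (vanilla, patched) extfrag event counts with order < 9, from the text.
RESULTS = {
    "1-socket, no madvise": (804694, 408912),
    "1-socket, MADV_HUGEPAGE": (291392, 191187),
    "2-socket, no madvise": (215698, 200210),
    "2-socket, MADV_HUGEPAGE": (166352, 147463),
}

for label, (before, after) in RESULTS.items():
    print(f"{label}: {reduction_percent(before, after):.1f}% reduction")
```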

    Link: http://lkml.kernel.org/r/20181123114528.28802-2-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: David Rientjes
    Cc: Andrea Arcangeli
    Cc: Zi Yan
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Commit fa5e084e43eb ("vmscan: do not unconditionally treat zones that
    fail zone_reclaim() as full") changed the return value of
    node_reclaim(). The original return value 0 means NODE_RECLAIM_SOME
    after this commit.

    However, the return value of node_reclaim() when CONFIG_NUMA is n was not
    changed accordingly. This leads to zone_watermark_ok() being called again.

    This patch fixes the return value by adjusting to NODE_RECLAIM_NOSCAN.
    Since node_reclaim() is only called in page_alloc.c, move it to
    mm/internal.h.
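
    The effect of the stale return value can be modelled in a few lines.
    This is a sketch under assumptions: only "0 means NODE_RECLAIM_SOME" is
    taken from the text, while the other numeric values and the shape of the
    caller's decision are assumed for illustration, and the function names
    are hypothetical.

```python
# Only NODE_RECLAIM_SOME == 0 is stated in the text; the remaining
# values are assumptions made for this model.
NODE_RECLAIM_NOSCAN = -2
NODE_RECLAIM_FULL = -1
NODE_RECLAIM_SOME = 0
NODE_RECLAIM_SUCCESS = 1


def watermark_rechecks(node_reclaim_ret):
    """Extra watermark checks the caller performs for one zone.

    Assumed caller behaviour: NOSCAN and FULL skip the zone, anything
    else triggers another watermark check.
    """
    if node_reclaim_ret in (NODE_RECLAIM_NOSCAN, NODE_RECLAIM_FULL):
        return 0
    return 1


def node_reclaim_stub_old():
    """!CONFIG_NUMA stub before the fix: plain 0, now read as SOME."""
    return 0


def node_reclaim_stub_new():
    """!CONFIG_NUMA stub after the fix."""
    return NODE_RECLAIM_NOSCAN


print(watermark_rechecks(node_reclaim_stub_old()))
print(watermark_rechecks(node_reclaim_stub_new()))
```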

    Link: http://lkml.kernel.org/r/20181113080436.22078-1-richard.weiyang@gmail.com
    Signed-off-by: Wei Yang
    Acked-by: Michal Hocko
    Reviewed-by: Matthew Wilcox
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     

31 Oct, 2018

1 commit

  • The conversion is done using

    sed -i 's@__free_pages_bootmem@memblock_free_pages@' \
    $(git grep -l __free_pages_bootmem)

    Link: http://lkml.kernel.org/r/1536927045-23536-27-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Michal Hocko
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Kroah-Hartman
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: "James E.J. Bottomley"
    Cc: Jonas Bonn
    Cc: Jonathan Corbet
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Serge Semin
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

24 Aug, 2018

1 commit

  • Use new return type vm_fault_t for fault handler. For now, this is just
    documenting that the function returns a VM_FAULT value rather than an
    errno. Once all instances are converted, vm_fault_t will become a
    distinct type.

    Ref-> commit 1c8f422059ae ("mm: change return type to vm_fault_t")

    The aim is to change the return type of finish_fault() and
    handle_mm_fault() to vm_fault_t. As part of that cleanup, the return
    types of all other recursively called functions have been changed to
    vm_fault_t.

    The places from which handle_mm_fault() is invoked will be
    changed to vm_fault_t as well, but in a separate patch.

    vmf_error() is the newly introduced inline function in 4.17-rc6.
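
    The distinction between an errno and a VM_FAULT value can be illustrated
    with a small model of an errno-to-fault-code helper. This is a sketch
    and not the kernel source: the mapping shown (out-of-memory becomes an
    OOM fault, everything else a SIGBUS fault) and the numeric flag values
    are assumptions made for illustration.

```python
import errno

# Illustrative flag values; treat them as assumptions.
VM_FAULT_OOM = 0x0001
VM_FAULT_SIGBUS = 0x0002


def vmf_error(err):
    """Translate a negative errno into a VM_FAULT code (assumed mapping)."""
    if err == -errno.ENOMEM:
        return VM_FAULT_OOM
    return VM_FAULT_SIGBUS


print(hex(vmf_error(-errno.ENOMEM)))
print(hex(vmf_error(-errno.EIO)))
```

    Keeping the two value spaces apart is the point of the conversion: a
    handler that returned a raw negative errno where a VM_FAULT code is
    expected would be misread by its caller.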

    [akpm@linux-foundation.org: don't shadow outer local `ret' in __do_huge_pmd_anonymous_page()]
    Link: http://lkml.kernel.org/r/20180604171727.GA20279@jordon-HP-15-Notebook-PC
    Signed-off-by: Souptick Joarder
    Reviewed-by: Matthew Wilcox
    Reviewed-by: Andrew Morton
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Souptick Joarder
     

23 Aug, 2018

1 commit

  • __paginginit is the same thing as __meminit except for platforms without
    sparsemem, there it is defined as __init.

    Remove __paginginit and use __meminit. Use __ref in one single function
    that merges __meminit and __init sections: setup_usemap().

    Link: http://lkml.kernel.org/r/20180801122348.21588-4-osalvador@techadventures.net
    Signed-off-by: Pavel Tatashin
    Signed-off-by: Oscar Salvador
    Reviewed-by: Oscar Salvador
    Cc: Pasha Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     

06 Jun, 2018

1 commit

  • Pull xfs updates from Darrick Wong:
    "New features this cycle include the ability to relabel mounted
    filesystems, support for fallocated swapfiles, and using FUA for pure
    data O_DSYNC directio writes. With this cycle we begin to integrate
    online filesystem repair and refactor the growfs code in preparation
    for eventual subvolume support, though the road ahead for both
    features is quite long.

    There are also numerous refactorings of the iomap code to remove
    unnecessary log overhead, to disentangle some of the quota code, and
    to prepare for buffer head removal in a future upstream kernel.

    Metadata validation continues to improve, both in the hot path
    verifiers and the online filesystem check code. I anticipate sending a
    second pull request in a few days with more metadata validation
    improvements.

    This series has been run through a full xfstests run over the weekend
    and through a quick xfstests run against this morning's master, with
    no major failures reported.

    Summary:

    - Strengthen inode number and structure validation when allocating
    inodes.

    - Reduce pointless buffer allocations during cache miss

    - Use FUA for pure data O_DSYNC directio writes

    - Various iomap refactorings

    - Strengthen quota metadata verification to avoid unfixable broken
    quota

    - Make AGFL block freeing a deferred operation to avoid blowing out
    transaction reservations when running complex operations

    - Get rid of the log item descriptors to reduce log overhead

    - Fix various reflink bugs where inodes were double-joined to
    transactions

    - Don't issue discards when trimming unwritten extents

    - Refactor incore dquot initialization and retrieval interfaces

    - Fix some locking problems in the quota scrub code

    - Strengthen btree structure checks in scrub code

    - Rewrite swapfile activation to use iomap and support unwritten
    extents

    - Make scrub exit to userspace sooner when corruptions or
    cross-referencing problems are found

    - Make scrub invoke the data fork scrubber directly on metadata
    inodes

    - Don't do background reclamation of post-eof and cow blocks when the
    fs is suspended

    - Fix secondary superblock buffer lifespan hinting

    - Refactor growfs to use table-dispatched functions instead of long
    stringy functions

    - Move growfs code to libxfs

    - Implement online fs label getting and setting

    - Introduce online filesystem repair (in a very limited capacity)

    - Fix unit conversion problems in the realtime freemap iteration
    functions

    - Various refactorings and cleanups in preparation to remove buffer
    heads in a future release

    - Reimplement the old bmap call with iomap

    - Remove direct buffer head accesses from seek hole/data

    - Various bug fixes"

    * tag 'xfs-4.18-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (121 commits)
    fs: use ->is_partially_uptodate in page_cache_seek_hole_data
    fs: remove the buffer_unwritten check in page_seek_hole_data
    fs: move page_cache_seek_hole_data to iomap.c
    xfs: use iomap_bmap
    iomap: add an iomap-based bmap implementation
    iomap: add a iomap_sector helper
    iomap: use __bio_add_page in iomap_dio_zero
    iomap: move IOMAP_F_BOUNDARY to gfs2
    iomap: fix the comment describing IOMAP_NOWAIT
    iomap: inline data should be an iomap type, not a flag
    mm: split ->readpages calls to avoid non-contiguous pages lists
    mm: return an unsigned int from __do_page_cache_readahead
    mm: give the 'ret' variable a better name __do_page_cache_readahead
    block: add a lower-level bio_add_page interface
    xfs: fix error handling in xfs_refcount_insert()
    xfs: fix xfs_rtalloc_rec units
    xfs: strengthen rtalloc query range checks
    xfs: xfs_rtbuf_get should check the bmapi_read results
    xfs: xfs_rtword_t should be unsigned, not signed
    dax: change bdev_dax_supported() to support boolean returns
    ...

    Linus Torvalds
     

02 Jun, 2018

1 commit


25 May, 2018

1 commit

  • This reverts the following commits that change CMA design in MM.

    3d2054ad8c2d ("ARM: CMA: avoid double mapping to the CMA area if CONFIG_HIGHMEM=y")

    1d47a3ec09b5 ("mm/cma: remove ALLOC_CMA")

    bad8c6c0b114 ("mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE")

    Ville reported the following error on i386.

    Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
    microcode: microcode updated early to revision 0x4, date = 2013-06-28
    Initializing CPU#0
    Initializing HighMem for node 0 (000377fe:00118000)
    Initializing Movable for node 0 (00000001:00118000)
    BUG: Bad page state in process swapper pfn:377fe
    page:f53effc0 count:0 mapcount:-127 mapping:00000000 index:0x0
    flags: 0x80000000()
    raw: 80000000 00000000 00000000 ffffff80 00000000 00000100 00000200 00000001
    page dumped because: nonzero mapcount
    Modules linked in:
    CPU: 0 PID: 0 Comm: swapper Not tainted 4.17.0-rc5-elk+ #145
    Hardware name: Dell Inc. Latitude E5410/03VXMC, BIOS A15 07/11/2013
    Call Trace:
    dump_stack+0x60/0x96
    bad_page+0x9a/0x100
    free_pages_check_bad+0x3f/0x60
    free_pcppages_bulk+0x29d/0x5b0
    free_unref_page_commit+0x84/0xb0
    free_unref_page+0x3e/0x70
    __free_pages+0x1d/0x20
    free_highmem_page+0x19/0x40
    add_highpages_with_active_regions+0xab/0xeb
    set_highmem_pages_init+0x66/0x73
    mem_init+0x1b/0x1d7
    start_kernel+0x17a/0x363
    i386_start_kernel+0x95/0x99
    startup_32_smp+0x164/0x168

    The reason for this error is that the span of ZONE_MOVABLE is extended
    to the whole node span for future CMA initialization, and normal memory is
    wrongly freed here. I submitted a fix and it seems to work, but
    another problem then appeared.

    It is too late to fix the latter problem, so I decided to revert the
    series.

    Reported-by: Ville Syrjälä
    Acked-by: Laura Abbott
    Acked-by: Michal Hocko
    Cc: Andrew Morton
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

12 Apr, 2018

4 commits

    Now, all reserved pages for the CMA region belong to ZONE_MOVABLE,
    which only serves requests with GFP_HIGHMEM && GFP_MOVABLE.

    Therefore, we don't need to maintain ALLOC_CMA at all.

    Link: http://lkml.kernel.org/r/1512114786-5085-3-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Reviewed-by: Aneesh Kumar K.V
    Tested-by: Tony Lindgren
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Laura Abbott
    Cc: Marek Szyprowski
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Patch series "mm/cma: manage the memory of the CMA area by using the
    ZONE_MOVABLE", v2.

    0. History

    This patchset is the follow-up of the discussion about the "Introduce
    ZONE_CMA (v7)" [1]. Please reference it if more information is needed.

    1. What does this patch do?

    This patch changes how the memory of the CMA area is managed in
    the MM subsystem. Currently, the memory of the CMA area is managed by
    the zone to which its pfn belongs. However, this approach has some
    problems since the MM subsystem doesn't have enough logic to handle the
    situation where memory with different characteristics is in a single zone.
    To solve this issue, this patch tries to manage all the memory of the CMA
    area by using the MOVABLE zone. From the MM subsystem's point of view, the
    characteristics of the memory in the MOVABLE zone and the memory of the
    CMA area are the same. So, managing the memory of the CMA area by using
    the MOVABLE zone will not cause any problems.

    2. Motivation

    There are some problems with the current approach, listed below. Although
    these problems are not inherent and could be fixed without this
    conceptual change, doing so would require adding many hooks in various
    code paths, which would be intrusive to core MM and really error-prone.
    Therefore, I try to solve them with this new approach. The problems of
    the current implementation are as follows.

    CMA memory utilization:

    First, the following is the freepage calculation logic in MM.

    - For movable allocation: freepage = total freepage
    - For unmovable allocation: freepage = total freepage - CMA freepage

    Freepages in the CMA area are used only after the normal freepages in the
    zone to which the memory of the CMA area belongs are exhausted. At the
    moment when the number of normal freepages is zero:

    - For movable allocation: freepage = total freepage = CMA freepage
    - For unmovable allocation: freepage = 0

    If an unmovable allocation comes at this moment, the allocation request
    would fail to pass the watermark check and reclaim is started. After
    reclaim, normal freepages would exist again, so freepages in the CMA
    areas would not be used.
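
    The accounting above can be replayed with a toy calculation. The
    function names and the simple "greater than watermark" test are
    assumptions made for illustration; the two freepage formulas are the
    ones listed in the text.

```python
def usable_free_pages(normal_free, cma_free, movable):
    """Freepages visible to an allocation of the given type.

    movable:   freepage = total freepage
    unmovable: freepage = total freepage - CMA freepage
    """
    total_free = normal_free + cma_free
    return total_free if movable else total_free - cma_free


def passes_watermark(normal_free, cma_free, movable, watermark):
    """Simplified watermark check (an assumption of this sketch)."""
    return usable_free_pages(normal_free, cma_free, movable) > watermark


# Normal freepages exhausted, CMA freepages still plentiful.
print(usable_free_pages(0, 1000, movable=True))
print(usable_free_pages(0, 1000, movable=False))
print(passes_watermark(0, 1000, movable=False, watermark=10))
```

    With normal freepages at zero, the movable view still sees every CMA
    page while the unmovable view sees none, so the unmovable request fails
    the check and triggers reclaim even though plenty of memory is free.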

    FYI, there is another attempt [2] trying to solve this problem in lkml.
    And, as far as I know, Qualcomm also has out-of-tree solution for this
    problem.

    Useless reclaim:

    There is no logic to distinguish CMA pages in the reclaim path. Hence,
    a CMA page is reclaimed even if the system just needs a page that is
    usable for kernel allocation.

    Atomic allocation failure:

    This is also related to the fallback allocation policy for the memory of
    the CMA area. Consider the situation where the number of normal
    freepages is *zero* because a bunch of movable allocation requests
    have come in. Kswapd would not be woken up due to the following freepage
    calculation logic.

    - For movable allocation: freepage = total freepage = CMA freepage

    If an atomic unmovable allocation request comes at this moment, it would
    fail due to the following logic.

    - For unmovable allocation: freepage = total freepage - CMA freepage = 0

    It was reported by Aneesh [3].

    Useless compaction:

    The usual high-order allocation request is an unmovable allocation request
    and it cannot be served from the memory of the CMA area. In compaction, the
    migration scanner tries to migrate pages in the CMA area and make a
    high-order page there. As mentioned above, it cannot be used for the
    unmovable allocation request so it's just wasted effort.

    3. Current approach and new approach

    The current approach is that the memory of the CMA area is managed by the
    zone to which its pfn belongs. However, this memory should be
    distinguishable since it has a strong limitation. So, it is
    marked as MIGRATE_CMA in the pageblock flags and handled specially.
    However, as mentioned in section 2, the MM subsystem doesn't have enough
    logic to deal with this special pageblock, so many problems arise.

    The new approach is that the memory of the CMA area is managed by the
    MOVABLE zone. MM already has enough logic to deal with special zones
    such as the HIGHMEM and MOVABLE zones. So, managing the memory of the CMA
    area by the MOVABLE zone naturally works well because the constraint
    on the memory of the CMA area, that the memory should always be
    migratable, is the same as the constraint for the MOVABLE zone.

    There is one side-effect on the usability of the memory of the CMA
    area. The use of the MOVABLE zone is only allowed for a request with
    GFP_HIGHMEM && GFP_MOVABLE so now the memory of the CMA area is also
    only allowed for these gfp flags. Before this patchset, a request with
    GFP_MOVABLE could use it. IMO, it would not be a big issue since most
    GFP_MOVABLE requests also have the GFP_HIGHMEM flag, for example, file
    cache pages and anonymous pages. However, the file cache page for a
    blockdev file is an exception. A request for it has no GFP_HIGHMEM flag.
    There are pros and cons to this exception. In my experience, blockdev file
    cache pages are one of the top reasons that cma_alloc() fails
    temporarily. So, we get a stronger guarantee of cma_alloc() success by
    discarding this case.
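
    The change in eligibility can be summarised with two predicates. This
    is an illustrative model only: the Gfp flag class and both function
    names are hypothetical, and the flag values are arbitrary rather than
    the kernel's.

```python
from enum import Flag


class Gfp(Flag):
    # Arbitrary illustrative values, not the kernel's bit assignments.
    HIGHMEM = 1
    MOVABLE = 2


def may_use_cma_before(gfp):
    """Before the patchset: any request with GFP_MOVABLE may use CMA memory."""
    return bool(gfp & Gfp.MOVABLE)


def may_use_cma_after(gfp):
    """After the patchset: CMA memory lives in the MOVABLE zone, which
    requires GFP_HIGHMEM && GFP_MOVABLE."""
    need = Gfp.HIGHMEM | Gfp.MOVABLE
    return (gfp & need) == need


anon_or_file_cache = Gfp.HIGHMEM | Gfp.MOVABLE
blockdev_file_cache = Gfp.MOVABLE  # no GFP_HIGHMEM, the exception above

print(may_use_cma_before(blockdev_file_cache), may_use_cma_after(blockdev_file_cache))
print(may_use_cma_before(anon_or_file_cache), may_use_cma_after(anon_or_file_cache))
```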

    Note that nothing changes from the admin's point of view since this
    patchset is only an internal implementation change in the MM subsystem.
    The one minor difference for the admin is that the memory stats for the
    CMA area will be printed in the MOVABLE zone. That's all.

    4. Result

    The following is the experimental result related to the utilization problem.

    8 CPUs, 1024 MB, VIRTUAL MACHINE
    make -j16

    CMA area:        0 MB      512 MB
    Elapsed-time:    92.4      186.5
    pswpin:          82        18647
    pswpout:         160       69839

    CMA area:        0 MB      512 MB
    Elapsed-time:    93.1      93.4
    pswpin:          84        46
    pswpout:         183       92

    akpm: "kernel test robot" reported a 26% improvement in
    vm-scalability.throughput:
    http://lkml.kernel.org/r/20180330012721.GA3845@yexl-desktop

    [1]: lkml.kernel.org/r/1491880640-9944-1-git-send-email-iamjoonsoo.kim@lge.com
    [2]: https://lkml.org/lkml/2014/10/15/623
    [3]: http://www.spinics.net/lists/linux-mm/msg100562.html

    Link: http://lkml.kernel.org/r/1512114786-5085-2-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Reviewed-by: Aneesh Kumar K.V
    Tested-by: Tony Lindgren
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Laura Abbott
    Cc: Marek Szyprowski
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • No allocation callback is using this argument anymore. new_page_node
    used to use this parameter to convey node_id resp. migration error up
    to move_pages code (do_move_page_to_node_array). The error status never
    made it into the final status field and we have a better way to
    communicate node id to the status field now. All other allocation
    callbacks simply ignored the argument so we can drop it finally.

    [mhocko@suse.com: fix migration callback]
    Link: http://lkml.kernel.org/r/20180105085259.GH2801@dhcp22.suse.cz
    [akpm@linux-foundation.org: fix alloc_misplaced_dst_page()]
    [mhocko@kernel.org: fix build]
    Link: http://lkml.kernel.org/r/20180103091134.GB11319@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20180103082555.14592-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Zi Yan
    Cc: Andrea Reale
    Cc: Anshuman Khandual
    Cc: Kirill A. Shutemov
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Patch series "unclutter thp migration"

    Motivation:

    THP migration is hacked into the generic migration with rather
    surprising semantic. The migration allocation callback is supposed to
    check whether the THP can be migrated at once and if that is not the
    case then it allocates a simple page to migrate. unmap_and_move then
    fixes that up by splitting the THP into small pages while moving the
    head page to the newly allocated order-0 page. Remaining pages are
    moved to the LRU list by split_huge_page. The same happens if the THP
    allocation fails. This is really ugly and error prone [2].

    I also believe that split_huge_page to the LRU lists is inherently wrong
    because all tail pages are not migrated. Some callers will just work
    around that by retrying (e.g. memory hotplug). There are other pfn
    walkers which are simply broken though, e.g. madvise_inject_error will
    migrate the head and then advance the next pfn by the huge page size.
    do_move_page_to_node_array and queue_pages_range (migrate_pages, mbind)
    will simply split the THP before migration if THP migration is not
    supported and then fall back to single page migration, but they don't
    handle tail pages if the THP migration path is not able to allocate a
    fresh THP, so we end up with ENOMEM and fail the whole migration, which
    is questionable behavior. Page compaction doesn't try to migrate large
    pages so it should be immune.

    The first patch reworks do_pages_move which relies on a very ugly
    calling semantic when the return status is pushed to the migration path
    via private pointer. It uses pre allocated fixed size batching to
    achieve that. We simply cannot do the same if a THP is to be split
    during the migration path which is done in the patch 3. Patch 2 is
    follow up cleanup which removes the mentioned return status calling
    convention ugliness.

    On a side note:

    There are some semantic issues I have encountered on the way when
    working on patch 1 but I am not addressing them here. E.g. trying to
    move THP tail pages will result in either success or EBUSY (the latter
    one more likely once we isolate head from the LRU list). Hugetlb
    reports EACCES on tail pages. Some errors are reported via status
    parameter but migration failures are not even though the original
    `reason' argument suggests there was an intention to do so. From a
    quick look into git history this never worked. I have tried to keep the
    semantic unchanged.

    Then there is a relatively minor thing that the page isolation might
    fail because of pages not being on the LRU - e.g. because they are
    sitting on the per-cpu LRU caches. Easily fixable.

    This patch (of 3):

    do_pages_move is supposed to move user defined memory (an array of
    addresses) to the user defined numa nodes (an array of nodes one for
    each address). The user provided status array then contains resulting
    numa node for each address or an error. The semantic of this function
    is a little bit confusing because only some errors are reported back.
    Notably migrate_pages error is only reported via the return value. This
    patch doesn't try to address these semantic nuances but rather changes
    the underlying implementation.

    Currently we are processing user input (which can be really large) in
    batches which are stored to a temporarily allocated page. Each address
    is resolved to its struct page and stored to page_to_node structure
    along with the requested target numa node. The array of these
    structures is then conveyed down the page migration path via private
    argument. new_page_node then finds the corresponding structure and
    allocates the proper target page.

    What is the problem with the current implementation and why change
    it? Apart from being quite ugly it also doesn't cope with unexpected
    pages showing up on the migration list inside migrate_pages path. That
    doesn't happen currently but the follow up patch would like to make the
    thp migration code more clear and that would need to split a THP into
    the list for some cases.

    How does the new implementation work? Well, instead of batching into a
    fixed size array we simply batch all pages that should be migrated to
    the same node and isolate all of them into a linked list which doesn't
    require any additional storage. This should work reasonably well
    because page migration usually migrates larger ranges of memory to a
    specific node. So the common case should work equally well as the
    current implementation. Even if somebody constructs an input where the
    target numa nodes would be interleaved we shouldn't see a large
    performance impact because page migration alone doesn't really benefit
    from batching. mmap_sem batching for the lookup is quite questionable
    and isolate_lru_page which would benefit from batching is not using it
    even in the current implementation.
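
    The batching idea can be sketched as grouping the user's requests into
    runs that share a target node. This is a simplified model, not the
    kernel code: batch_by_node() is a hypothetical name, a Python list
    stands in for the linked list of isolated pages, and flushing a batch
    whenever the target node changes is an assumption about how "pages that
    should be migrated to the same node" are collected.

```python
def batch_by_node(requests):
    """Group (address, node) requests into runs with the same target node.

    Yields (node, [addresses]) each time the target node changes, which is
    the point where the collected batch would be handed to migration.
    """
    current_node = None
    pagelist = []
    for addr, node in requests:
        if pagelist and node != current_node:
            yield current_node, pagelist  # migrate the batch collected so far
            pagelist = []
        current_node = node
        pagelist.append(addr)
    if pagelist:
        yield current_node, pagelist  # final batch


requests = [(0x1000, 0), (0x2000, 0), (0x3000, 1), (0x4000, 0)]
for node, addrs in batch_by_node(requests):
    print(node, [hex(a) for a in addrs])
```

    A large range headed for one node becomes a single batch, while an
    input with interleaved target nodes degrades to many small batches,
    which matches the trade-off discussed above.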

    Link: http://lkml.kernel.org/r/20180103082555.14592-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Andrew Morton
    Cc: Anshuman Khandual
    Cc: Zi Yan
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Cc: Andrea Reale
    Cc: Kirill A. Shutemov
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

30 Nov, 2017

1 commit

  • This reverts commit 152e93af3cfe2d29d8136cc0a02a8612507136ee.

    It was a nice cleanup in theory, but as Nicolai Stange points out, we do
    need to make the page dirty for the copy-on-write case even when we
    didn't end up making it writable, since the dirty bit is what we use to
    check that we've gone through a COW cycle.

    Reported-by: Michal Hocko
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

28 Nov, 2017

1 commit

  • Currently we make page table entries dirty all the time regardless of
    access type and don't even consider if the mapping is write-protected.
    The reasoning is that we don't really need dirty tracking on THP and
    making the entry dirty upfront may save some time on first write to the
    page.

    Unfortunately, such an approach may result in a false-positive
    can_follow_write_pmd() for the huge zero page or a read-only shmem file.

    Let's make the page dirty only if we are about to write to the page
    anyway (as we do for small pages).

    I've restructured the code to make entry dirty inside
    maybe_p[mu]d_mkwrite(). It also takes into account if the vma is
    write-protected.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

18 Nov, 2017

1 commit

  • Pageblock skip hints were added as a heuristic for compaction, which
    shares core code with CMA. Since CMA reliability would suffer from the
    heuristics, compact_control flag ignore_skip_hint was added for the CMA
    use case. Since 6815bf3f233e ("mm/compaction: respect ignore_skip_hint
    in update_pageblock_skip") the flag also means that CMA won't *update*
    the skip hints in addition to ignoring them.

    Today, direct compaction can also ignore the skip hints in the last
    resort attempt, but there's no reason not to set them when isolation
    fails in such case. Thus, this patch splits off a new no_set_skip_hint
    flag to avoid the updating, which only CMA sets. This should improve
    the heuristics a bit, and allow us to simplify the persistent skip bit
    handling as the next step.

    Link: http://lkml.kernel.org/r/20171102121706.21504-2-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

07 Sep, 2017

2 commits

  • For ages we have been relying on TIF_MEMDIE thread flag to mark OOM
    victims and then, among other things, to give these threads full access
    to memory reserves. There are a few shortcomings of this implementation,
    though.

    First of all and the most serious one is that the full access to memory
    reserves is quite dangerous because we leave no safety room for the
    system to operate and potentially do last emergency steps to move on.

    Secondly this flag is per task_struct while the OOM killer operates on
    mm_struct granularity so all processes sharing the given mm are killed.
    Giving the full access to all these task_structs could lead to a quick
    memory reserves depletion. We have tried to reduce this risk by giving
    TIF_MEMDIE only to the main thread and the currently allocating task but
    that doesn't really solve this problem while it surely opens up room
    for corner cases - e.g. GFP_NO{FS,IO} requests might loop inside the
    allocator without access to memory reserves because a particular thread
    was not the group leader.

    Now that we have the oom reaper and that all oom victims are reapable
    after 1b51e65eab64 ("oom, oom_reaper: allow to reap mm shared by the
    kthreads") we can be more conservative and grant only partial access to
    memory reserves because there are reasonable chances of the parallel
    memory freeing. We still want some access to reserves because we do not
    want other consumers to eat up the victim's freed memory. oom victims
    will still contend with __GFP_HIGH users but those shouldn't be so
    aggressive as to starve oom victims completely.

    Introduce ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to
    half of the reserves. This makes the access to reserves independent
    of which task has passed through mark_oom_victim. Also drop any usage
    of TIF_MEMDIE from the page allocator proper and replace it by
    tsk_is_oom_victim as well which will make page_alloc.c completely
    TIF_MEMDIE free finally.

    CONFIG_MMU=n doesn't have oom reaper so let's stick to the original
    ALLOC_NO_WATERMARKS approach.
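
    The "half of the reserves" rule can be sketched as a watermark check.
    This is a simplified userspace model under stated assumptions: the
    SKETCH_* flag values, effective_min() and may_allocate() are
    hypothetical names, and the real kernel check also accounts for terms
    (such as the allocation order) that are left out here.

```c
#include <stdbool.h>

/* Hypothetical flag values, for illustration only. */
#define SKETCH_ALLOC_NO_WATERMARKS 0x1u
#define SKETCH_ALLOC_OOM           0x2u

/*
 * Number of free pages that must remain after an allocation.
 * min_wmark models the zone's min watermark, i.e. the size of the reserve.
 */
static unsigned long effective_min(unsigned long min_wmark,
				   unsigned int alloc_flags)
{
	if (alloc_flags & SKETCH_ALLOC_NO_WATERMARKS)
		return 0;                          /* full access to reserves */
	if (alloc_flags & SKETCH_ALLOC_OOM)
		return min_wmark - min_wmark / 2;  /* half of the reserves */
	return min_wmark;                          /* no access to reserves */
}

/* True if 'request' pages can be taken without dipping below the limit. */
static bool may_allocate(unsigned long free_pages, unsigned long request,
			 unsigned long min_wmark, unsigned int alloc_flags)
{
	if (free_pages < request)
		return false;
	return free_pages - request >= effective_min(min_wmark, alloc_flags);
}
```

    In this model an OOM victim can still allocate when an ordinary task
    already fails, but it cannot drain the reserve completely.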

    There is a demand to make the oom killer memcg aware which will imply
    many tasks killed at once. This change will allow such a usecase
    without worrying about complete memory reserves depletion.

    Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • build_all_zonelists gets a zone parameter to initialize zone's pagesets.
    There is only a single user which gives a non-NULL zone parameter and
    that one doesn't really need the rest of the build_all_zonelists (see
    commit 6dcd73d7011b ("memory-hotplug: allocate zone's pcp before
    onlining pages")).

    Therefore remove setup_zone_pageset from build_all_zonelists and call it
    from its only user directly. This will also remove a pointless zonelists
    rebuilding which is always good.

    Link: http://lkml.kernel.org/r/20170721143915.14161-5-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Shaohua Li
    Cc: Toshi Kani
    Cc: Wen Congyang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

03 Aug, 2017

1 commit

  • Nadav Amit identified a theoretical race between page reclaim and
    mprotect due to TLB flushes being batched outside of the PTL being held.

    He described the race as follows:

    CPU0                            CPU1
    ----                            ----
                                    user accesses memory using RW PTE
                                    [PTE now cached in TLB]
    try_to_unmap_one()
    ==> ptep_get_and_clear()
    ==> set_tlb_ubc_flush_pending()
                                    mprotect(addr, PROT_READ)
                                    ==> change_pte_range()
                                    ==> [ PTE non-present - no flush ]

                                    user writes using cached RW PTE
                                    ...

    try_to_unmap_flush()

    The same type of race exists for reads when protecting for PROT_NONE and
    also exists for operations that can leave an old TLB entry behind such
    as munmap, mremap and madvise.

    For some operations like mprotect, it's not necessarily a data integrity
    issue but it is a correctness issue as there is a window where an
    mprotect that limits access still allows access. For munmap, it's
    potentially a data integrity issue although the race is massive as an
    munmap, mmap and return to userspace must all complete between the
    window when reclaim drops the PTL and flushes the TLB. However, it's
    theoretically possible so handle this issue by flushing the mm if
    reclaim is potentially currently batching TLB flushes.

    Other instances where a flush is required for a present pte should be ok
    as either the page lock is held preventing parallel reclaim or a page
    reference count is elevated preventing a parallel free leading to
    corruption. In the case of page_mkclean there isn't an obvious path
    that userspace could take advantage of without using the operations that
    are guarded by this patch. Other users such as gup, when racing with
    reclaim, look just at PTEs. huge page variants should be ok as they
    don't race with reclaim. mincore only looks at PTEs. userfault also
    should be ok as if a parallel reclaim takes place, it will either fault
    the page back in or read some of the data before the flush occurs
    triggering a fault.

    Note that a variant of this patch was acked by Andy Lutomirski but this
    was for the x86 parts on top of his PCID work which didn't make the 4.13
    merge window as expected. His ack is dropped from this version and
    there will be a follow-on patch on top of PCID that will include his
    ack.

    [akpm@linux-foundation.org: tweak comments]
    [akpm@linux-foundation.org: fix spello]
    Link: http://lkml.kernel.org/r/20170717155523.emckq2esjro6hf3z@suse.de
    Reported-by: Nadav Amit
    Signed-off-by: Mel Gorman
    Cc: Andy Lutomirski
    Cc: [v4.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

13 Jul, 2017

1 commit

  • __GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to
    the page allocator. This has been true but only for allocation
    requests larger than PAGE_ALLOC_COSTLY_ORDER. It has always been
    ignored for smaller sizes. This is a bit unfortunate because there is
    no way to express the same semantic for those requests and they are
    considered too important to fail so they might end up looping in the
    page allocator forever, similarly to GFP_NOFAIL requests.

    Now that the whole tree has been cleaned up and accidental or misled
    usage of __GFP_REPEAT flag has been removed for !costly requests we can
    give the original flag a better name and more importantly a more useful
    semantic. Let's rename it to __GFP_RETRY_MAYFAIL which tells the user
    that the allocator would try really hard but there is no promise of a
    success. This will work independent of the order and overrides the
    default allocator behavior. Page allocator users have several levels of
    guarantee vs. cost options (take GFP_KERNEL as an example)

    - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
      attempt to free memory at all. The most light weight mode which even
      doesn't kick the background reclaim. Should be used carefully because
      it might deplete the memory and the next user might hit the more
      aggressive reclaim

    - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT) - optimistic
      allocation without any attempt to free memory from the current
      context but can wake kswapd to reclaim memory if the zone is below
      the low watermark. Can be used from either atomic contexts or when
      the request is a performance optimization and there is another
      fallback for a slow path.

    - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
      non sleeping allocation with an expensive fallback so it can access
      some portion of memory reserves. Usually used from interrupt/bh
      context with an expensive slow path fallback.

    - GFP_KERNEL - both background and direct reclaim are allowed and the
      _default_ page allocator behavior is used. That means that !costly
      allocation requests are basically nofail but there is no guarantee of
      that behavior so failures have to be checked properly by callers
      (e.g. OOM killer victim is allowed to fail currently).

    - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
      and all allocation requests fail early rather than cause disruptive
      reclaim (one round of reclaim in this implementation). The OOM killer
      is not invoked.

    - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
      behavior and all allocation requests try really hard. The request
      will fail if the reclaim cannot make any progress. The OOM killer
      won't be triggered.

    - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
      and all allocation requests will loop endlessly until they succeed.
      This might be really dangerous especially for larger orders.

    Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
    because they already had their semantic. No new users are added.
    __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
    there is no progress and we have already passed the OOM point.

    This means that all the reclaim opportunities have been exhausted except
    the most disruptive one (the OOM killer) and a user defined fallback
    behavior is more sensible than keep retrying in the page allocator.
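
    The retry behavior of the last four levels listed above can be sketched
    as a small decision function. This is a toy userspace model, not
    __alloc_pages_slowpath: enum sketch_policy and should_retry() are
    hypothetical, and the handling of costly default requests is a
    simplifying assumption, since the text above only says that !costly
    ones are basically nofail.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the gfp modifiers discussed above. */
enum sketch_policy {
	SKETCH_DEFAULT,        /* plain GFP_KERNEL */
	SKETCH_NORETRY,        /* GFP_KERNEL | __GFP_NORETRY */
	SKETCH_RETRY_MAYFAIL,  /* GFP_KERNEL | __GFP_RETRY_MAYFAIL */
	SKETCH_NOFAIL,         /* GFP_KERNEL | __GFP_NOFAIL */
};

/* Models PAGE_ALLOC_COSTLY_ORDER; requests above this order are "costly". */
#define SKETCH_COSTLY_ORDER 3

/*
 * After one failed round of reclaim, decide whether to try again.
 * made_progress says whether that round managed to free anything.
 */
static bool should_retry(enum sketch_policy policy, unsigned int order,
			 bool made_progress)
{
	switch (policy) {
	case SKETCH_NORETRY:
		return false;          /* fail early after one round */
	case SKETCH_RETRY_MAYFAIL:
		return made_progress;  /* try hard, fail once reclaim stalls */
	case SKETCH_NOFAIL:
		return true;           /* loop until success */
	case SKETCH_DEFAULT:
	default:
		/*
		 * !costly requests are "basically nofail". Costly ones are
		 * modelled as giving up once reclaim stalls (assumption).
		 */
		return order <= SKETCH_COSTLY_ORDER || made_progress;
	}
}
```

    The point of the model is that SKETCH_RETRY_MAYFAIL behaves the same for
    every order, which the default policy does not.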

    [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
    [mhocko@suse.com: semantic fix]
    Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
    [mhocko@kernel.org: address other thing spotted by Vlastimil]
    Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Alex Belits
    Cc: Chris Wilson
    Cc: Christoph Hellwig
    Cc: Darrick J. Wong
    Cc: David Daney
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: NeilBrown
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

09 May, 2017

3 commits

  • The main goal of direct compaction is to form a high-order page for
    allocation, but it should also help against long-term fragmentation when
    possible.

    Most lower-than-pageblock-order compactions are for non-movable
    allocations, which means that if we compact in a movable pageblock and
    terminate as soon as we create the high-order page, it's unlikely that
    the fallback heuristics will claim the whole block. Instead there might
    be a single unmovable page in a pageblock full of movable pages, and the
    next unmovable allocation might pick another pageblock and increase
    long-term fragmentation.

    To help against such scenarios, this patch changes the termination
    criteria for compaction so that the current pageblock is finished even
    though the high-order page already exists. Note that it might be
    possible that the high-order page formed elsewhere in the zone due to
    parallel activity, but this patch doesn't try to detect that.

    This is only done with sync compaction, because async compaction is
    limited to pageblock of the same migratetype, where it cannot result in
    a migratetype fallback. (Async compaction also eagerly skips
    order-aligned blocks where isolation fails, which is against the goal of
    migrating away as much of the pageblock as possible.)

    As a result of this patch, long-term memory fragmentation should be
    reduced.

    In testing based on 4.9 kernel with stress-highalloc from mmtests
    configured for order-4 GFP_KERNEL allocations, this patch has reduced
    the number of unmovable allocations falling back to movable pageblocks
    by 20%. The number

    Link: http://lkml.kernel.org/r/20170307131545.28577-9-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Preparation patch. We are going to need migratetype at lower layers
    than compact_zone() and compact_finished().

    Link: http://lkml.kernel.org/r/20170307131545.28577-7-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Patch series "try to reduce fragmenting fallbacks", v3.

    Last year, Johannes Weiner reported a regression in page mobility
    grouping [1] and while the exact cause was not found, I've come up with
    some ways to improve it by reducing the number of allocations falling
    back to different migratetype and causing permanent fragmentation.

    The series was tested with mmtests stress-highalloc modified to do
    GFP_KERNEL order-4 allocations, on 4.9 with "mm, vmscan: fix zone
    balance check in prepare_kswapd_sleep" (without that, kcompactd indeed
    wasn't woken up) on UMA machine with 4GB memory. There were 5 repeats
    of each run, as the extfrag stats are quite volatile (note the stats
    below are sums, not averages, as it was less perl hacking for me).

    Success rates are the same, already high due to the low allocation order
    used, so I'm not including them.

    Compaction stats:
    (the patches are stacked, and I haven't measured the non-functional-changes
    patches separately)

                                    patch 1    patch 2    patch 3    patch 4    patch 7    patch 8
    Compaction stalls                 22449      24680      24846      19765      22059      17480
    Compaction success                12971      14836      14608      10475      11632       8757
    Compaction failures                9477       9843      10238       9290      10426       8722
    Page migrate success            3109022    3370438    3312164    1695105    1608435    2111379
    Page migrate failure             911588    1149065    1028264    1112675    1077251    1026367
    Compaction pages isolated       7242983    8015530    7782467    4629063    4402787    5377665
    Compaction migrate scanned    980838938  987367943  957690188  917647238  947155598 1018922197
    Compaction free scanned       557926893  598946443  602236894  594024490  541169699  763651731
    Compaction cost                   10243      10578      10304       8286       8398       9440

    Compaction stats are mostly within noise until patch 4, which decreases
    the number of compactions, and migrations. Part of that could be due to
    more pageblocks marked as unmovable, and async compaction skipping
    those. This changes a bit with patch 7, but not so much. Patch 8
    increases free scanner stats and migrations, which comes from the
    changed termination criteria. Interestingly, the number of compactions
    decreases - probably the fully compacted pageblock satisfies multiple
    subsequent allocations, so it amortizes.

    Next comes the extfrag tracepoint, where "fragmenting" means that an
    allocation had to fall back to a pageblock of another migratetype which
    wasn't fully free (which is almost all of the fallbacks). I have
    locally added another tracepoint for "Page steal" into
    steal_suitable_fallback() which triggers in situations where we are
    allowed to do move_freepages_block(). If we decide to also do
    set_pageblock_migratetype(), it's "Pages steal with pageblock" with a
    breakdown for which allocation migratetype we are stealing and from
    which fallback migratetype. The last part "due to counting" comes from
    patch 4 and counts the events where the counting of movable pages
    allowed us to change pageblock's migratetype, while the number of free
    pages alone wouldn't be enough to cross the threshold.

                                                               patch 1   patch 2   patch 3   patch 4   patch 7   patch 8
    Page alloc extfrag event                                  10155066   8522968  10164959  15622080  13727068  13140319
    Extfrag fragmenting                                       10149231   8517025  10159040  15616925  13721391  13134792
    Extfrag fragmenting for unmovable                           159504    168500    184177     97835     70625     56948
    Extfrag fragmenting unmovable placed with movable           153613    163549    172693     91740     64099     50917
    Extfrag fragmenting unmovable placed with reclaim.            5891      4951     11484      6095      6526      6031
    Extfrag fragmenting for reclaimable                           4738      4829      6345      4822      5640      5378
    Extfrag fragmenting reclaimable placed with movable           1836      1902      1851      1579      1739      1760
    Extfrag fragmenting reclaimable placed with unmov.            2902      2927      4494      3243      3901      3618
    Extfrag fragmenting for movable                            9984989   8343696   9968518  15514268  13645126  13072466
    Pages steal                                                 179954    192291    210880    123254     94545     81486
    Pages steal with pageblock                                   22153     18943     20154     33562     29969     33444
    Pages steal with pageblock for unmovable                     14350     12858     13256     20660     19003     20852
    Pages steal with pageblock for unmovable from mov.           12812     11402     11683     19072     17467     19298
    Pages steal with pageblock for unmovable from recl.           1538      1456      1573      1588      1536      1554
    Pages steal with pageblock for movable                        7114      5489      5965     11787     10012     11493
    Pages steal with pageblock for movable from unmov.            6885      5291      5541     11179      9525     10885
    Pages steal with pageblock for movable from recl.              229       198       424       608       487       608
    Pages steal with pageblock for reclaimable                     689       596       933      1115       954      1099
    Pages steal with pageblock for reclaimable from unmov.         273       219       537       658       547       667
    Pages steal with pageblock for reclaimable from mov.           416       377       396       457       407       432
    Pages steal with pageblock due to counting                                                 11834     10075      7530
    ... for unmovable                                                                           8993      7381      4616
    ... for movable                                                                             2792      2653      2851
    ... for reclaimable                                                                           49        41        63

    What we can see is that "Extfrag fragmenting for unmovable" and "...
    placed with movable" drop with almost every patch, which is good as we
    are polluting fewer movable pageblocks with unmovable pages.

    The most significant change is patch 4 with movable page counting. On
    the other hand it increases "Extfrag fragmenting for movable" by 50%.
    "Pages steal" drops though, so these movable allocation fallbacks find
    only small free pages and are not allowed to steal whole pageblocks
    back. "Pages steal with pageblock" rises, because the patch increases
    the chances of pageblock migratetype changes to happen. This affects
    all migratetypes.

    The summary is that patch 4 is not a clear win wrt these stats, but I
    believe that the tradeoff it makes is a good one. There's less
    pollution of movable pageblocks by unmovable allocations. There's less
    stealing between pageblocks, and the steals that remain have a higher
    chance of also changing the migratetype of the pageblock itself, so it
    should more faithfully reflect the migratetype of the pages within the
    pageblock.
    The increase of movable allocations falling back to unmovable pageblock
    might look dramatic, but those allocations can be migrated by compaction
    when needed, and other patches in the series (7-9) improve that aspect.

    Patches 7 and 8 continue the trend of reduced unmovable fallbacks and
    also reduce the impact on movable fallbacks from patch 4.

    [1] https://www.spinics.net/lists/linux-mm/msg114237.html

    This patch (of 8):

    Currently there are (mostly by accident) no holes in struct
    compact_control (on x86_64), but we are going to add more bool flags, so
    place them all together at the end of the structure. While at it, just
    order all fields from largest to smallest.
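
    The effect of field ordering on padding holes can be illustrated with
    two hypothetical structs. Neither is the real compact_control; the
    field names and sketch_bytes_saved() exist only for this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* bool flags interleaved with word-sized fields: padding after each bool. */
struct sketch_flags_interleaved {
	bool a;
	unsigned long x;
	bool b;
	unsigned long y;
};

/* Largest fields first, bool flags grouped at the end: one tail pad only. */
struct sketch_flags_at_end {
	unsigned long x;
	unsigned long y;
	bool a;
	bool b;
};

/* Bytes saved by the reordering on the ABI this is compiled for. */
static size_t sketch_bytes_saved(void)
{
	return sizeof(struct sketch_flags_interleaved) -
	       sizeof(struct sketch_flags_at_end);
}
```

    On a typical LP64 ABI such as x86_64 the interleaved layout takes 32
    bytes and the grouped one 24. Further bool flags appended next to the
    existing ones can then reuse the tail padding instead of opening new
    holes.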

    Link: http://lkml.kernel.org/r/20170307131545.28577-2-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

04 May, 2017

3 commits

  • Introduce two helpers, is_migrate_highatomic() and is_migrate_highatomic_page().

    Simplify the code, no functional changes.
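
    A rough sketch of what such static inline helpers could look like. The
    enum, the struct and the sketch_ prefixed names are hypothetical
    stand-ins so the example compiles in userspace; they are not the
    kernel definitions.

```c
#include <stdbool.h>

/* Hypothetical, shortened list of migratetypes. */
enum sketch_migratetype {
	SKETCH_MIGRATE_UNMOVABLE,
	SKETCH_MIGRATE_MOVABLE,
	SKETCH_MIGRATE_RECLAIMABLE,
	SKETCH_MIGRATE_HIGHATOMIC,
};

/* Hypothetical stand-in for a page that knows its pageblock's type. */
struct sketch_page {
	enum sketch_migratetype pageblock_mt;
};

/* Type-checked replacement for an open-coded "mt == HIGHATOMIC" test. */
static inline bool sketch_is_migrate_highatomic(enum sketch_migratetype mt)
{
	return mt == SKETCH_MIGRATE_HIGHATOMIC;
}

/* Same test, starting from a page instead of a migratetype. */
static inline bool
sketch_is_migrate_highatomic_page(const struct sketch_page *page)
{
	return sketch_is_migrate_highatomic(page->pageblock_mt);
}
```

    Compared with macros, static inlines give the compiler argument types
    to check, which is presumably why they were preferred in review.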

    [akpm@linux-foundation.org: use static inlines rather than macros, per mhocko]
    Link: http://lkml.kernel.org/r/58B94F15.6060606@huawei.com
    Signed-off-by: Xishi Qiu
    Acked-by: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • NR_PAGES_SCANNED counts number of pages scanned since the last page free
    event in the allocator. This was used primarily to measure the
    reclaimability of zones and nodes, and determine when reclaim should
    give up on them. In that role, it has been replaced in the preceding
    patches by a different mechanism.

    Being implemented as an efficient vmstat counter, it was automatically
    exported to userspace as well. It's however unlikely that anyone
    outside the kernel is using this counter in any meaningful way.

    Remove the counter and the unused pgdat_reclaimable().

    Link: http://lkml.kernel.org/r/20170228214007.5621-8-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Hillf Danton
    Acked-by: Michal Hocko
    Cc: Jia He
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Patch series "mm: kswapd spinning on unreclaimable nodes - fixes and
    cleanups".

    Jia reported a scenario in which the kswapd of a node indefinitely spins
    at 100% CPU usage. We have seen similar cases at Facebook.

    The kernel's current method of judging its ability to reclaim a node (or
    whether to back off and sleep) is based on the amount of scanned pages
    in proportion to the amount of reclaimable pages. In Jia's and our
    scenarios, there are no reclaimable pages in the node, however, and the
    condition for backing off is never met. Kswapd busyloops in an attempt
    to restore the watermarks while having nothing to work with.

    This series reworks the definition of an unreclaimable node based not on
    scanning but on whether kswapd is able to actually reclaim pages in
    MAX_RECLAIM_RETRIES (16) consecutive runs. This is the same criterion
    the page allocator uses for giving up on direct reclaim and invoking the
    OOM killer. If it cannot free any pages, kswapd will go to sleep and
    leave further attempts to direct reclaim invocations, which will either
    make progress and re-enable kswapd, or invoke the OOM killer.

    Patch #1 fixes the immediate problem Jia reported, the remainder are
    smaller fixlets, cleanups, and overall phasing out of the old method.

    Patch #6 is the odd one out. It's a nice cleanup to get_scan_count(),
    and directly related to #5, but in itself not relevant to the series.

    If the whole series is too ambitious for 4.11, I would consider the
    first three patches fixes, the rest cleanups.

    This patch (of 9):

    Jia He reports a problem with kswapd spinning at 100% CPU when
    requesting more hugepages than memory available in the system:

    $ echo 4000 >/proc/sys/vm/nr_hugepages

    top - 13:42:59 up 3:37, 1 user, load average: 1.09, 1.03, 1.01
    Tasks: 1 total, 1 running, 0 sleeping, 0 stopped, 0 zombie
    %Cpu(s): 0.0 us, 12.5 sy, 0.0 ni, 85.5 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
    KiB Mem: 31371520 total, 30915136 used, 456384 free, 320 buffers
    KiB Swap: 6284224 total, 115712 used, 6168512 free. 48192 cached Mem

      PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
       76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3

    At that time, there are no reclaimable pages left in the node, but as
    kswapd fails to restore the high watermarks it refuses to go to sleep.

    Kswapd needs to back away from nodes that fail to balance. Up until
    commit 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of
    nodes") kswapd had such a mechanism. It considered zones whose
    theoretically reclaimable pages it had reclaimed six times over as
    unreclaimable and backed away from them. This guard was erroneously
    removed as the patch changed the definition of a balanced node.

    However, simply restoring this code wouldn't help in the case reported
    here: there *are* no reclaimable pages that could be scanned until the
    threshold is met. Kswapd would stay awake anyway.

    Introduce a new and much simpler way of backing off. If kswapd runs
    through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
    page, make it back off from the node. This is the same number of shots
    direct reclaim takes before declaring OOM. Kswapd will go to sleep on
    that node until a direct reclaimer manages to reclaim some pages, thus
    proving the node reclaimable again.
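
    The back-off rule can be sketched as a per-node failure counter. This
    is a userspace model under stated assumptions: struct sketch_node and
    the sketch_ functions are hypothetical, and only the limit of 16 is
    taken from the text above.

```c
#include <stdbool.h>

#define SKETCH_MAX_RECLAIM_RETRIES 16  /* value stated in the text above */

struct sketch_node {
	int kswapd_failures;  /* consecutive kswapd runs that freed nothing */
};

/* Called at the end of each kswapd run over the node. */
static void sketch_kswapd_run_done(struct sketch_node *node,
				   unsigned long nr_reclaimed)
{
	if (nr_reclaimed)
		node->kswapd_failures = 0;
	else if (node->kswapd_failures < SKETCH_MAX_RECLAIM_RETRIES)
		node->kswapd_failures++;
}

/* A direct reclaimer that frees pages proves the node reclaimable again. */
static void sketch_direct_reclaim_done(struct sketch_node *node,
				       unsigned long nr_reclaimed)
{
	if (nr_reclaimed)
		node->kswapd_failures = 0;
}

/* kswapd backs off from the node once the failure limit is reached. */
static bool sketch_kswapd_should_back_off(const struct sketch_node *node)
{
	return node->kswapd_failures >= SKETCH_MAX_RECLAIM_RETRIES;
}
```

    Unlike the old scanned-versus-reclaimable ratio, this check still
    triggers when there are no reclaimable pages to scan at all.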

    [hannes@cmpxchg.org: check kswapd failure against the cumulative nr_reclaimed count]
    Link: http://lkml.kernel.org/r/20170306162410.GB2090@cmpxchg.org
    [shakeelb@google.com: fix condition for throttle_direct_reclaim]
    Link: http://lkml.kernel.org/r/20170314183228.20152-1-shakeelb@google.com
    Link: http://lkml.kernel.org/r/20170228214007.5621-2-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Signed-off-by: Shakeel Butt
    Reported-by: Jia He
    Tested-by: Jia He
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

08 Apr, 2017

1 commit

  • We currently have 2 specific WQ_RECLAIM workqueues in the mm code.
    vmstat_wq for updating pcp stats and lru_add_drain_wq dedicated to drain
    per cpu lru caches. This seems more than necessary because both can run
    on a single WQ. Both do not block on locks requiring a memory
    allocation nor perform any allocations themselves. We will save one
    rescuer thread this way.

    On the other hand drain_all_pages() queues work on the system wq which
    doesn't have a rescuer and so this depends on memory allocation (when all
    workers are stuck allocating and new ones cannot be created).

    Initially we thought this would be more of a theoretical problem but
    Hugh Dickins has reported:

    : 4.11-rc has been giving me hangs after hours of swapping load. At
    : first they looked like memory leaks ("fork: Cannot allocate memory");
    : but for no good reason I happened to do "cat /proc/sys/vm/stat_refresh"
    : before looking at /proc/meminfo one time, and the stat_refresh stuck
    : in D state, waiting for completion of flush_work like many kworkers.
    : kthreadd waiting for completion of flush_work in drain_all_pages().

    This worker should be using WQ_RECLAIM as well in order to guarantee
    forward progress. We can reuse the same one as for lru draining and
    vmstat.

    Link: http://lkml.kernel.org/r/20170307131751.24936-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Suggested-by: Tetsuo Handa
    Acked-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Tested-by: Yang Li
    Tested-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

25 Feb, 2017

1 commit

  • Current rmap code can miss a VMA that maps PTE-mapped THP if the first
    subpage of the THP was unmapped from the VMA.

    We need to walk rmap for the whole range of offsets that THP covers, not
    only the first one.

    vma_address() also needs to be corrected to check the range instead of
    the first subpage.
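
    The difference between checking only the first subpage and checking
    the whole range can be sketched with plain page offsets. This is an
    illustrative userspace model: struct sketch_vma and both helpers are
    hypothetical and much simpler than the kernel's rmap code.

```c
#include <stdbool.h>

/* Hypothetical VMA: maps file offsets [pgoff, pgoff + nr_pages). */
struct sketch_vma {
	unsigned long pgoff;
	unsigned long nr_pages;
};

/* First-subpage check: only the offset of the head page is tested. */
static bool sketch_maps_first_subpage(const struct sketch_vma *vma,
				      unsigned long page_pgoff)
{
	return page_pgoff >= vma->pgoff &&
	       page_pgoff - vma->pgoff < vma->nr_pages;
}

/* Range check: does any of the nr subpages fall inside the VMA? */
static bool sketch_maps_any_subpage(const struct sketch_vma *vma,
				    unsigned long page_pgoff,
				    unsigned long nr)
{
	unsigned long page_end = page_pgoff + nr;            /* exclusive */
	unsigned long vma_end = vma->pgoff + vma->nr_pages;  /* exclusive */

	return page_pgoff < vma_end && vma->pgoff < page_end;
}
```

    For a 512-page THP at offset 512 and a VMA covering offsets 600 to 699,
    the first-subpage check misses the VMA while the range check finds it.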

    Link: http://lkml.kernel.org/r/20170129173858.45174-6-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Hillf Danton
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Srikar Dronamraju
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov