15 Aug, 2020

2 commits

  • The thp prefix is more frequently used than hpage and we should be
    consistent between the various functions.

    [akpm@linux-foundation.org: fix mm/migrate.c]

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Cc: Mike Kravetz
    Cc: David Hildenbrand
    Cc: "Kirill A. Shutemov"
    Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • This function returns the number of bytes in a THP. It is like
    page_size(), but compiles to just PAGE_SIZE if CONFIG_TRANSPARENT_HUGEPAGE
    is disabled.
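
    A minimal sketch of the kind of helper being described, assuming a
    thp_order() helper that collapses to a compile-time constant 0 when
    CONFIG_TRANSPARENT_HUGEPAGE is disabled (so the whole expression folds
    to PAGE_SIZE):

    static inline unsigned long thp_size(struct page *page)
    {
            /* without THP the order is constant 0, so this is just PAGE_SIZE */
            return PAGE_SIZE << thp_order(page);
    }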

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Cc: David Hildenbrand
    Cc: Mike Kravetz
    Cc: "Kirill A. Shutemov"
    Link: http://lkml.kernel.org/r/20200629151959.15779-5-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

13 Aug, 2020

3 commits

  • There is a well-defined migration target allocation callback. Use it.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Christoph Hellwig
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/1594622517-20681-7-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • There are some similar functions for migration target allocation. Since
    there is no fundamental difference between them, it's better to keep just
    one rather than keeping all the variants. This patch implements a base
    migration target allocation function. In the following patches, the
    variants will be converted to use this function.

    The changes should be mechanical, but, unfortunately, there are some
    differences. First, some callers' nodemask is assigned NULL, since a NULL
    nodemask is considered to mean all available nodes, that is,
    &node_states[N_MEMORY]. Second, for hugetlb page allocation, gfp_mask is
    redefined as the regular hugetlb allocation gfp_mask plus __GFP_THISNODE
    if the user-provided gfp_mask has it, because a future caller of this
    function requires this node constraint to be set. Lastly, if the provided
    nodeid is NUMA_NO_NODE, nodeid is set to the node where the migration
    source lives, which helps remove simple wrappers whose only purpose is
    setting up the nodeid.

    Note that the PageHighmem() call in the previous function is changed to
    an open-coded "is_highmem_idx()" check, which is more readable.

    [akpm@linux-foundation.org: tweak patch title, per Vlastimil]
    [akpm@linux-foundation.org: fix typo in comment]

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Christoph Hellwig
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/1594622517-20681-6-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • For some applications, we need to allocate almost all memory as hugepages.
    However, on a running system, higher-order allocations can fail if the
    memory is fragmented. The Linux kernel currently does on-demand compaction
    as we request more hugepages, but this style of compaction incurs very
    high latency. Experiments with one-time full memory compaction (followed
    by hugepage allocations) show that the kernel is able to restore a highly
    fragmented memory state to a fairly compacted one.

    1. Hugepage allocation latencies

    In this test, memory is fragmented first and 2M hugepage allocation
    latencies are then measured.

    - With 5.6.0-rc3 + this patch, with proactiveness=20

    sysctl -w vm.compaction_proactiveness=20

    percentile latency
    ---------- -------
             5       2
            10       2
            25       3
            30       3
            40       3
            50       4
            60       4
            75       4
            80       4
            90       5
            95     429

    Total 2M hugepages allocated = 384105 (750G worth of hugepages out of 762G
    total free => 98% of free memory could be allocated as hugepages)

    2. JAVA heap allocation

    In this test, we first fragment memory using the same method as for (1).

    Then, we start a Java process with a heap size set to 700G and request the
    heap to be allocated with THP hugepages. We also set THP to madvise to
    allow hugepage backing of this heap.

    /usr/bin/time
    java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch

    The above command allocates 700G of Java heap using hugepages.

    - With vanilla 5.6.0-rc3

    17.39user 1666.48system 27:37.89elapsed

    - With 5.6.0-rc3 + this patch, with proactiveness=20

    8.35user 194.58system 3:19.62elapsed

    Elapsed time remains around 3:15 as proactiveness is further increased.

    Note that proactive compaction happens throughout the runtime of these
    workloads. The situation of one-time compaction, sufficient to supply
    hugepages for following allocation stream, can probably happen for more
    extreme proactiveness values, like 80 or 90.

    In the above Java workload, proactiveness is set to 20. The test starts
    with a node's score of 80 or higher, depending on the delay between the
    fragmentation step and starting the benchmark, which gives more or less
    time for the initial round of compaction. As the benchmark consumes
    hugepages, the node's score quickly rises above the high threshold (90)
    and proactive compaction starts again, which brings the score back down
    to the low threshold level (80). Repeat.

    bpftrace also confirms proactive compaction running 20+ times during the
    runtime of this Java benchmark. kcompactd threads consume 100% of one of
    the CPUs while they try to bring a node's score within thresholds.

    Backoff behavior
    ================

    Above workloads produce a memory state which is easy to compact. However,
    if memory is filled with unmovable pages, proactive compaction should
    essentially back off. To test this aspect:

    - Created a kernel driver that allocates almost all memory as hugepages,
    followed by freeing the first 3/4 of each hugepage.
    - Set proactiveness=40.
    - Note that proactive_compact_node() is deferred the maximum number of
    times, with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
    (=> ~30 seconds between retries).

    [1] https://patchwork.kernel.org/patch/11098289/
    [2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
    [3] https://lwn.net/Articles/817905/

    Signed-off-by: Nitin Gupta
    Signed-off-by: Andrew Morton
    Tested-by: Oleksandr Natalenko
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Khalid Aziz
    Reviewed-by: Oleksandr Natalenko
    Cc: Vlastimil Babka
    Cc: Khalid Aziz
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Matthew Wilcox
    Cc: Mike Kravetz
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Nitin Gupta
    Cc: Oleksandr Natalenko
    Link: http://lkml.kernel.org/r/20200616204527.19185-1-nigupta@nvidia.com
    Signed-off-by: Linus Torvalds

    Nitin Gupta
     

10 Jun, 2020

2 commits

  • Convert comments that reference mmap_sem to reference mmap_lock instead.

    [akpm@linux-foundation.org: fix up linux-next leftovers]
    [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
    [akpm@linux-foundation.org: more linux-next fixups, per Michel]

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Daniel Jordan
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This change converts the existing mmap_sem rwsem calls to use the new mmap
    locking API instead.

    The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)
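
    For illustration, a typical call site before and after this conversion
    looks like:

    /* before */
    down_read(&mm->mmap_sem);
    /* ... walk or fault on the address space ... */
    up_read(&mm->mmap_sem);

    /* after */
    mmap_read_lock(mm);
    /* ... walk or fault on the address space ... */
    mmap_read_unlock(mm);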

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

04 Jun, 2020

3 commits

  • commit 3c710c1ad11b ("mm, vmscan: extract shrink_page_list reclaim
    counters into a struct") changed the data type used by the function, so
    change the return type of the function and its caller accordingly.

    Signed-off-by: Vaneet Narang
    Signed-off-by: Maninder Singh
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Amit Sahrawat
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/1588168259-25604-1-git-send-email-maninder1.s@samsung.com
    Signed-off-by: Linus Torvalds

    Maninder Singh
     
  • classzone_idx is just a different name for high_zoneidx now. So,
    integrate them and add a comment to struct alloc_context in order to
    reduce future confusion about the meaning of this variable.

    The accessor ac_classzone_idx() is also removed since it isn't needed
    after the integration.

    In addition to the integration, this patch also renames high_zoneidx to
    highest_zoneidx since that name represents its meaning more precisely.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Reviewed-by: Baoquan He
    Acked-by: Vlastimil Babka
    Acked-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Ye Xiaolong
    Link: http://lkml.kernel.org/r/1587095923-7515-3-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Patch series "integrate classzone_idx and high_zoneidx", v5.

    This patchset is a follow-up to a problem reported and discussed two
    years ago [1, 2]. The problem it solves is related to classzone_idx on
    NUMA systems: a problem arises when lowmem reserve protection exists for
    zones on one node that do not exist on other nodes.

    This problem was reported two years ago and, at that time, the solution
    got general agreement [2], but it was not upstreamed.

    [1]: http://lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop
    [2]: http://lkml.kernel.org/r/1525408246-14768-1-git-send-email-iamjoonsoo.kim@lge.com

    This patch (of 2):

    Currently, we use classzone_idx to calculate lowmem reserve protection
    for an allocation request. This classzone_idx causes a problem on NUMA
    systems when the lowmem reserve protection exists for some zones on a node
    that do not exist on other nodes.

    Before further explanation, I should first clarify how to compute the
    classzone_idx and the high_zoneidx.

    - ac->high_zoneidx is computed via the arcane gfp_zone(gfp_mask) and
    represents the index of the highest zone the allocation can use

    - classzone_idx was supposed to be the index of the highest zone on the
    local node that the allocation can use, that is actually available in
    the system

    Think about the following example. Node 0 has 4 populated zones,
    DMA/DMA32/NORMAL/MOVABLE. Node 1 has 1 populated zone, NORMAL. Some
    zones, such as MOVABLE, don't exist on node 1, and this makes the
    following difference.

    Assume that there is an allocation request whose gfp_zone(gfp_mask) is the
    MOVABLE zone. Then, its high_zoneidx is 3. If this allocation is
    initiated on node 0, its classzone_idx is 3 since the actually
    available/usable zone on the local node (node 0) is MOVABLE. If this
    allocation is initiated on node 1, its classzone_idx is 2 since the
    actually available/usable zone on the local node (node 1) is NORMAL.

    You can see that the classzone_idx of the allocation request differs
    depending on its starting node, even though its high_zoneidx is the same.

    Think more about these two allocation requests. If they are processed
    locally, there is no problem. However, if an allocation initiated on node
    1 is processed remotely, in this example at the NORMAL zone on node 0 due
    to memory shortage, a problem occurs. Their different classzone_idx leads
    to a different lowmem reserve and then a different min watermark. See the
    following example.

    root@ubuntu:/sys/devices/system/memory# cat /proc/zoneinfo
    Node 0, zone DMA
    per-node stats
    ...
    pages free 3965
    min 5
    low 8
    high 11
    spanned 4095
    present 3998
    managed 3977
    protection: (0, 2961, 4928, 5440)
    ...
    Node 0, zone DMA32
    pages free 757955
    min 1129
    low 1887
    high 2645
    spanned 1044480
    present 782303
    managed 758116
    protection: (0, 0, 1967, 2479)
    ...
    Node 0, zone Normal
    pages free 459806
    min 750
    low 1253
    high 1756
    spanned 524288
    present 524288
    managed 503620
    protection: (0, 0, 0, 4096)
    ...
    Node 0, zone Movable
    pages free 130759
    min 195
    low 326
    high 457
    spanned 1966079
    present 131072
    managed 131072
    protection: (0, 0, 0, 0)
    ...
    Node 1, zone DMA
    pages free 0
    min 0
    low 0
    high 0
    spanned 0
    present 0
    managed 0
    protection: (0, 0, 1006, 1006)
    Node 1, zone DMA32
    pages free 0
    min 0
    low 0
    high 0
    spanned 0
    present 0
    managed 0
    protection: (0, 0, 1006, 1006)
    Node 1, zone Normal
    per-node stats
    ...
    pages free 233277
    min 383
    low 640
    high 897
    spanned 262144
    present 262144
    managed 257744
    protection: (0, 0, 0, 0)
    ...
    Node 1, zone Movable
    pages free 0
    min 0
    low 0
    high 0
    spanned 262144
    present 0
    managed 0
    protection: (0, 0, 0, 0)

    - static min watermark for the NORMAL zone on node 0 is 750.

    - lowmem reserve for the request with classzone idx 3 at the NORMAL on
    node 0 is 4096.

    - lowmem reserve for the request with classzone idx 2 at the NORMAL on
    node 0 is 0.

    So, overall min watermark is:
    allocation initiated on node 0 (classzone_idx 3): 750 + 4096 = 4846
    allocation initiated on node 1 (classzone_idx 2): 750 + 0 = 750
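
    In rough __zone_watermark_ok() terms, the per-zone check behind these
    numbers is (simplified sketch):

    /* the allocation fails this zone if it would dip into protected memory */
    if (free_pages <= min + z->lowmem_reserve[classzone_idx])
            return false;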

    An allocation initiated on node 1 will take some precedence over an
    allocation initiated on node 0 because the min watermark of the former is
    lower. So, an allocation initiated on node 1 could succeed on node 0 when
    an allocation initiated on node 0 could not, and this could cause too many
    numa_miss allocations and degrade performance.

    Recently, there was a regression report about this problem with the CMA
    patches, since CMA memory is placed in ZONE_MOVABLE by those patches. I
    checked that the problem disappears with this fix, which uses high_zoneidx
    for classzone_idx.

    http://lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop

    Using high_zoneidx for classzone_idx is a more consistent approach than
    the previous one because the system's memory layout doesn't affect it at
    all. With this patch, the classzone_idx in both of the above cases will
    be 3, so they will have the same min watermark.

    allocation initiated on node 0: 750 + 4096 = 4846
    allocation initiated on node 1: 750 + 4096 = 4846

    One could wonder if there is a side effect whereby an allocation initiated
    on node 1 uses a higher bar when the allocation is handled locally, since
    classzone_idx could be higher than before. This will not happen because a
    zone without managed pages doesn't contribute to lowmem_reserve at all.

    Reported-by: Ye Xiaolong
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Tested-by: Ye Xiaolong
    Reviewed-by: Baoquan He
    Acked-by: Vlastimil Babka
    Acked-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Mel Gorman
    Link: http://lkml.kernel.org/r/1587095923-7515-1-git-send-email-iamjoonsoo.kim@lge.com
    Link: http://lkml.kernel.org/r/1587095923-7515-2-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

03 Jun, 2020

2 commits

  • ondemand_readahead has two callers, neither of which uses the return
    value. That means that both ra_submit and __do_page_cache_readahead()
    can return void, and we don't need to worry that a present page in the
    readahead window causes us to return a smaller nr_pages than we ought to
    have.

    Similarly, no caller uses the return value from
    force_page_cache_readahead().

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: Dave Chinner
    Reviewed-by: John Hubbard
    Reviewed-by: Christoph Hellwig
    Reviewed-by: William Kucharski
    Cc: Chao Yu
    Cc: Cong Wang
    Cc: Darrick J. Wong
    Cc: Eric Biggers
    Cc: Gao Xiang
    Cc: Jaegeuk Kim
    Cc: Joseph Qi
    Cc: Junxiao Bi
    Cc: Michal Hocko
    Cc: Zi Yan
    Cc: Johannes Thumshirn
    Cc: Miklos Szeredi
    Link: http://lkml.kernel.org/r/20200414150233.24495-3-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Patch series "Change readahead API", v11.

    This series adds a readahead address_space operation to replace the
    readpages operation. The key difference is that pages are added to the
    page cache as they are allocated (and then looked up by the filesystem)
    instead of passing them on a list to the readpages operation and having
    the filesystem add them to the page cache. It's a net reduction in code
    for each implementation, more efficient than walking a list, and solves
    the direct-write vs buffered-read problem reported by yu kuai at
    http://lkml.kernel.org/r/20200116063601.39201-1-yukuai3@huawei.com
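
    Roughly, the operation being replaced and its replacement have these
    shapes (a sketch based on the description above; the surrounding struct
    layouts are not shown):

    /* old: the filesystem gets a list of not-yet-cached pages to add itself */
    int (*readpages)(struct file *file, struct address_space *mapping,
                     struct list_head *pages, unsigned nr_pages);

    /* new: pages are already in the page cache; the fs just reads into them */
    void (*readahead)(struct readahead_control *rac);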

    The only unconverted filesystems are those which use fscache. Their
    conversion is pending Dave Howells' rewrite which will make the
    conversion substantially easier. This should be completed by the end of
    the year.

    I want to thank the reviewers/testers; Dave Chinner, John Hubbard, Eric
    Biggers, Johannes Thumshirn, Dave Sterba, Zi Yan, Christoph Hellwig and
    Miklos Szeredi have done a marvellous job of providing constructive
    criticism.

    These patches pass an xfstests run on ext4, xfs & btrfs with no
    regressions that I can tell (some of the tests seem a little flaky
    before and remain flaky afterwards).

    This patch (of 25):

    The readahead code is part of the page cache so should be found in the
    pagemap.h file. force_page_cache_readahead is only used within mm, so
    move it to mm/internal.h instead. Remove the parameter names where they
    add no value, and rename the ones which were actively misleading.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: John Hubbard
    Reviewed-by: Christoph Hellwig
    Reviewed-by: William Kucharski
    Reviewed-by: Johannes Thumshirn
    Cc: Chao Yu
    Cc: Cong Wang
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Eric Biggers
    Cc: Gao Xiang
    Cc: Jaegeuk Kim
    Cc: Joseph Qi
    Cc: Junxiao Bi
    Cc: Michal Hocko
    Cc: Zi Yan
    Cc: Miklos Szeredi
    Link: http://lkml.kernel.org/r/20200414150233.24495-1-willy@infradead.org
    Link: http://lkml.kernel.org/r/20200414150233.24495-2-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

08 Apr, 2020

1 commit

  • There are cases where we would benefit from avoiding having to go through
    the allocation and free cycle to return an isolated page.

    Examples for this might include page poisoning in which we isolate a page
    and then put it back in the free list without ever having actually
    allocated it.

    This will enable us to also avoid notifiers for the future free page
    reporting which will need to avoid retriggering page reporting when
    returning pages that have been reported on.
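
    A hedged sketch of the kind of interface this suggests (the exact name
    and signature may differ):

    /*
     * Hand a page that was isolated from the free list straight back to the
     * buddy allocator, without an allocation/free cycle and without the
     * free-path notifiers that would re-trigger page reporting.
     */
    void __putback_isolated_page(struct page *page, unsigned int order,
                                 int migratetype);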

    Signed-off-by: Alexander Duyck
    Signed-off-by: Andrew Morton
    Acked-by: David Hildenbrand
    Acked-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: Konrad Rzeszutek Wilk
    Cc: Luiz Capitulino
    Cc: Matthew Wilcox
    Cc: Michael S. Tsirkin
    Cc: Michal Hocko
    Cc: Nitesh Narayan Lal
    Cc: Oscar Salvador
    Cc: Pankaj Gupta
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Wei Wang
    Cc: Yang Zhang
    Cc: wei qi
    Link: http://lkml.kernel.org/r/20200211224624.29318.89287.stgit@localhost.localdomain
    Signed-off-by: Linus Torvalds

    Alexander Duyck
     

03 Apr, 2020

4 commits

  • Patch series "fix THP migration for CMA allocations", v2.

    Transparent huge pages are allocated with __GFP_MOVABLE, and can end up in
    CMA memory blocks. Transparent huge pages also have most of the
    infrastructure in place to allow migration.

    However, a few pieces were missing, causing THP migration to fail when
    attempting to use CMA to allocate 1GB hugepages.

    With these patches in place, THP migration from CMA blocks seems to work,
    both for anonymous THPs and for tmpfs/shmem THPs.

    This patch (of 2):

    Add information to struct compact_control to indicate that the allocator
    would really like to clear out this specific part of memory, used by for
    example CMA.

    Signed-off-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Zi Yan
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20200227213238.1298752-1-riel@surriel.com
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • This patch makes ALLOC_KSWAPD equal to __GFP_KSWAPD_RECLAIM (cast to int).

    Thanks to that code like:

    if (gfp_mask & __GFP_KSWAPD_RECLAIM)
            alloc_flags |= ALLOC_KSWAPD;

    can be changed to:

    alloc_flags |= (__force int)(gfp_mask & __GFP_KSWAPD_RECLAIM);

    Thanks to this, one branch less is generated in the assembly.

    In the case of the ALLOC_KSWAPD flag, two branches are saved: the first in
    code that always executes at the beginning of page allocation and the
    second in the loop in the page allocator slowpath.
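
    One way to express the required equality is a definition along these
    lines (a sketch; the actual patch may spell the define differently):

    /* ALLOC_KSWAPD shares the bit value of __GFP_KSWAPD_RECLAIM */
    #define ALLOC_KSWAPD ((__force int)__GFP_KSWAPD_RECLAIM)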

    Signed-off-by: Mateusz Nosek
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Link: http://lkml.kernel.org/r/20200304162118.14784-1-mateusznosek0@gmail.com
    Signed-off-by: Linus Torvalds

    Mateusz Nosek
     
  • The idea comes from a discussion between Linus and Andrea [1].

    Before this patch we only allowed a page fault to retry once. We achieved
    this by clearing the FAULT_FLAG_ALLOW_RETRY flag when doing
    handle_mm_fault() the second time. This was mainly used to avoid
    unexpected starvation of the system by looping forever to handle the page
    fault on a single page. However, that should hardly happen: after all,
    for each code path to return VM_FAULT_RETRY we'll first wait for a
    condition to happen (during which time we should possibly yield the cpu)
    before VM_FAULT_RETRY is really returned.

    This patch removes the restriction by keeping the FAULT_FLAG_ALLOW_RETRY
    flag when we receive VM_FAULT_RETRY. It means that the page fault handler
    now can retry the page fault for multiple times if necessary without the
    need to generate another page fault event. Meanwhile we still keep the
    FAULT_FLAG_TRIED flag so page fault handler can still identify whether a
    page fault is the first attempt or not.

    Then we'll have these combinations of fault flags (only considering
    ALLOW_RETRY flag and TRIED flag):

    - ALLOW_RETRY and !TRIED: this means the page fault allows to
    retry, and this is the first try

    - ALLOW_RETRY and TRIED: this means the page fault allows to
    retry, and this is not the first try

    - !ALLOW_RETRY and !TRIED: this means the page fault does not allow
    to retry at all

    - !ALLOW_RETRY and TRIED: this is forbidden and should never be used

    In the existing code we have multiple places that take special care of
    the first condition above by checking against (fault_flags &
    FAULT_FLAG_ALLOW_RETRY). This patch introduces a simple helper to detect
    the first attempt of a page fault by checking against both (fault_flags &
    FAULT_FLAG_ALLOW_RETRY) and !(fault_flags & FAULT_FLAG_TRIED), because now
    even the 2nd try will have ALLOW_RETRY set, then uses that helper in all
    the existing special paths. One example is in __lock_page_or_retry(): now
    we'll drop the mmap_sem only on the first attempt of the page fault and
    we'll keep it in follow-up retries, so the old locking behavior will be
    retained.
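
    The helper described above boils down to something like this (sketch):

    static inline bool fault_flag_allow_retry_first(unsigned int flags)
    {
            /* true only on the first attempt that is allowed to retry */
            return (flags & FAULT_FLAG_ALLOW_RETRY) &&
                   !(flags & FAULT_FLAG_TRIED);
    }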

    This will be a nice enhancement for the current code [2] and at the same
    time supporting material for the future userfaultfd-writeprotect work,
    since in that work there will always be an explicit userfault writeprotect
    retry for protected pages, and if that cannot resolve the page fault
    (e.g., when userfaultfd-writeprotect is used in conjunction with swapped
    pages) then we'll possibly need a 3rd retry of the page fault. It might
    also benefit other potential users who have a similar requirement, like
    userfault write-protection.

    GUP code is not touched yet and will be covered in follow up patch.

    Please read the thread below for more information.

    [1] https://lore.kernel.org/lkml/20171102193644.GB22686@redhat.com/
    [2] https://lore.kernel.org/lkml/20181230154648.GB9832@redhat.com/

    Suggested-by: Linus Torvalds
    Suggested-by: Andrea Arcangeli
    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Tested-by: Brian Geffon
    Cc: Bobby Powers
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Jerome Glisse
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Matthew Wilcox
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Pavel Emelyanov
    Link: http://lkml.kernel.org/r/20200220160246.9790-1-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • When backporting commit 9c4e6b1a7027 ("mm, mlock, vmscan: no more skipping
    pagevecs") to our 4.9 kernel, our test bench noticed around a 10% drop
    with a couple of vm-scalability's test cases (lru-file-readonce,
    lru-file-readtwice and lru-file-mmap-read). I didn't see that much of a
    drop on my VM (32c-64g-2nodes). It might be caused by the test
    configuration, which is 32c-256g with NUMA disabled, and the tests were
    run in the root memcg, so the tests actually stress only one inactive and
    active lru. That is not very usual in a modern production environment.

    That commit did two major changes:
    1. Call page_evictable()
    2. Use smp_mb to force the PG_lru set visible

    It looks like they contribute most of the overhead. page_evictable() is a
    function which does a function prologue and epilogue, and it was used by
    the page reclaim path only. However, lru add is a very hot path, so it is
    better to make it inline. It calls page_mapping(), which is not inlined
    either, but the disassembly shows it doesn't do push and pop operations,
    and it is not very straightforward to inline it.
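
    For reference, an inlined page_evictable() along the lines described
    would look roughly like this (a sketch, not necessarily the exact hunk):

    static inline bool page_evictable(struct page *page)
    {
            bool ret;

            /* Prevent address_space of inode and swap cache from being freed */
            rcu_read_lock();
            ret = !mapping_unevictable(page_mapping(page)) && !PageMlocked(page);
            rcu_read_unlock();
            return ret;
    }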

    Other than this, it seems smp_mb() is not necessary for x86, since
    SetPageLRU is atomic and already enforces a memory barrier; it is replaced
    with smp_mb__after_atomic() in the following patch.

    With the two fixes applied, the tests gain back around 5% on that test
    bench and return to normal on my VM. Since the test bench configuration
    is not that usual and I also saw around a 6% improvement on the latest
    upstream, this sounds good enough IMHO.

    The below is test data (lru-file-readtwice throughput) against the
    v5.6-rc4:

        mainline    w/ inline fix
        150MB       154MB

    With this patch the throughput goes up by 2.67%. The data with
    smp_mb__after_atomic() is shown in the following patch.

    Shakeel Butt did the below test:

    On a real machine, limiting the 'dd' to a single node and reading a 100
    GiB sparse file (less than a single node's memory), just a single instance
    was run so as not to cause lru lock contention. The cmdline used is "dd
    if=file-100GiB of=/dev/null bs=4k". The cmd was run 10 times with
    drop_caches in between, and the time it took was measured.

    Without patch: 56.64143 +- 0.672 sec

    With patches: 56.10 +- 0.21 sec

    [akpm@linux-foundation.org: move page_evictable() to internal.h]
    Fixes: 9c4e6b1a7027 ("mm, mlock, vmscan: no more skipping pagevecs")
    Signed-off-by: Yang Shi
    Signed-off-by: Andrew Morton
    Tested-by: Shakeel Butt
    Reviewed-by: Shakeel Butt
    Reviewed-by: Matthew Wilcox (Oracle)
    Acked-by: Vlastimil Babka
    Acked-by: Johannes Weiner
    Link: http://lkml.kernel.org/r/1584500541-46817-1-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Yang Shi
     

02 Dec, 2019

1 commit

  • Memory hotplug needs to be able to reset and reinit the pcpu allocator
    batch and high limits but this action is internal to the VM. Move the
    declaration to internal.h

    Link: http://lkml.kernel.org/r/20191021094808.28824-4-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Borislav Petkov
    Cc: Matt Fleming
    Cc: Qian Cai
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

01 Dec, 2019

3 commits

  • Now we use rb_parent to get next, but this is not necessary.

    When prev is NULL, it means vma should be the first element in the list.
    Then next should be the current first one (mm->mmap), no matter whether we
    have a parent or not.

    After removing it, the code shows the beauty of symmetry.
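
    In other words, next can be derived from prev alone, roughly:

    /* prev == NULL means vma becomes the first element of the list */
    if (prev)
            next = prev->vm_next;
    else
            next = mm->mmap;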

    Link: http://lkml.kernel.org/r/20190813032656.16625-1-richardw.yang@linux.intel.com
    Signed-off-by: Wei Yang
    Acked-by: Andrew Morton
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • Just make the code a little easier to read.

    Link: http://lkml.kernel.org/r/20191006012636.31521-3-richardw.yang@linux.intel.com
    Signed-off-by: Wei Yang
    Cc: Christoph Hellwig
    Cc: Matthew Wilcox (Oracle)
    Cc: Mel Gorman
    Cc: Oscar Salvador
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • One of our services is observing hanging ps/top/etc under heavy write
    IO, and the task states show this is an mmap_sem priority inversion:

    A write fault is holding the mmap_sem in read-mode and waiting for
    (heavily cgroup-limited) IO in balance_dirty_pages():

    balance_dirty_pages+0x724/0x905
    balance_dirty_pages_ratelimited+0x254/0x390
    fault_dirty_shared_page.isra.96+0x4a/0x90
    do_wp_page+0x33e/0x400
    __handle_mm_fault+0x6f0/0xfa0
    handle_mm_fault+0xe4/0x200
    __do_page_fault+0x22b/0x4a0
    page_fault+0x45/0x50

    Somebody tries to change the address space, contending for the mmap_sem in
    write-mode:

    call_rwsem_down_write_failed_killable+0x13/0x20
    do_mprotect_pkey+0xa8/0x330
    SyS_mprotect+0xf/0x20
    do_syscall_64+0x5b/0x100
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    The waiting writer locks out all subsequent readers to avoid lock
    starvation, and several threads can be seen hanging like this:

    call_rwsem_down_read_failed+0x14/0x30
    proc_pid_cmdline_read+0xa0/0x480
    __vfs_read+0x23/0x140
    vfs_read+0x87/0x130
    SyS_read+0x42/0x90
    do_syscall_64+0x5b/0x100
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    To fix this, do what we do for cache read faults already: drop the
    mmap_sem before calling into anything IO bound, in this case the
    balance_dirty_pages() function, and return VM_FAULT_RETRY.
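
    Schematically, the approach is (a hedged sketch, not the exact patch):

    /* drop the mmap_sem before the IO-bound throttle ... */
    up_read(&vma->vm_mm->mmap_sem);
    balance_dirty_pages_ratelimited(mapping);
    /* ... and have the fault retried once the writeback wait is over */
    return VM_FAULT_RETRY;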

    Link: http://lkml.kernel.org/r/20190924194238.GA29030@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Matthew Wilcox (Oracle)
    Acked-by: Kirill A. Shutemov
    Cc: Josef Bacik
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

26 Sep, 2019

1 commit

  • Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7.

    - Background

    The Android terminology used for forking a new process and starting an app
    from scratch is a cold start, while resuming an existing app is a hot
    start. While we continually try to improve the performance of cold
    starts, hot starts will always be significantly less power hungry as well
    as faster so we are trying to make hot start more likely than cold start.

    To make hot starts more likely, Android userspace manages the order in
    which apps should be killed in a process called ActivityManagerService.
    ActivityManagerService tracks every Android app or service that the user
    could be interacting with at any time and translates that into a ranked
    list for lmkd (the low memory killer daemon). They are likely to be killed by
    lmkd if the system has to reclaim memory. In that sense they are similar
    to entries in any other cache. Those apps are kept alive for
    opportunistic performance improvements but those performance improvements
    will vary based on the memory requirements of individual workloads.

    - Problem

    Naturally, cached apps were dominant consumers of memory on the system.
    However, they were not significant consumers of swap even though they are
    good candidates for swap. Under investigation, swapping out only begins
    once the low zone watermark is hit and kswapd wakes up, but the overall
    allocation rate in the system might trip lmkd thresholds and cause a
    cached process to be killed (we measured the performance of swapping out
    vs. zapping the memory by killing a process; unsurprisingly, zapping is
    10x faster even though we use zram, which is much faster than real
    storage), so a kill from lmkd will often satisfy the high zone watermark,
    resulting in very few pages actually being moved to swap.

    - Approach

    The approach we chose was to use a new interface to allow userspace to
    proactively reclaim entire processes by leveraging platform information.
    This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
    that are known to be cold from userspace and to avoid races with lmkd by
    reclaiming apps as soon as they entered the cached state. Additionally,
    it gives the platform many more chances to use the information it has to
    optimize memory efficiency.

    To achieve the goal, the patchset introduces two new options for madvise.
    One is MADV_COLD, which will deactivate activated pages, and the other is
    MADV_PAGEOUT, which will reclaim private pages instantly. These new
    options complement MADV_DONTNEED and MADV_FREE by adding non-destructive
    ways to gain some free memory space. MADV_PAGEOUT is similar to
    MADV_DONTNEED in that it hints to the kernel that the memory region is not
    currently needed and should be reclaimed immediately; MADV_COLD is similar
    to MADV_FREE in that it hints to the kernel that the memory region is not
    currently needed and should be reclaimed when memory pressure rises.

    This patch (of 5):

    When a process expects no accesses to a certain memory range, it can give
    a hint to the kernel that the pages can be reclaimed when memory pressure
    happens, but the data should be preserved for future use. This can reduce
    workingset eviction and so ends up increasing performance.

    This patch introduces the new MADV_COLD hint to madvise(2) syscall.
    MADV_COLD can be used by a process to mark a memory range as not expected
    to be used in the near future. The hint can help kernel in deciding which
    pages to evict early during memory pressure.

    It works for all LRU pages, like MADV_[DONTNEED|FREE]. IOW, it moves

    active file page -> inactive file LRU
    active anon page -> inactive anon LRU

    Unlike MADV_FREE, it doesn't move active anonymous pages to the inactive
    file LRU's head because MADV_COLD has slightly different semantics.
    MADV_FREE means it's okay to discard the page under memory pressure
    because the content of the page is *garbage*, so freeing such pages is
    almost zero overhead: we don't need to swap them out, and a later access
    causes just a minor fault. Thus, it makes sense to put those freeable
    pages on the inactive file LRU to compete with other used-once pages. It
    also makes sense from an implementation point of view because such a page
    is no longer swap-backed until it is re-dirtied. It could even be a bonus
    that they can be reclaimed on a swapless system. However, MADV_COLD
    doesn't mean garbage, so reclaiming those pages requires swap-out/in in
    the end, which is a bigger cost. Since we have designed VM LRU aging
    based on a cost model, cold anonymous pages are better positioned on the
    inactive anon LRU list, not the file LRU. Furthermore, this helps avoid
    unnecessary scanning if the system doesn't have a swap device. Let's
    start with the simpler way, without adding complexity at this moment.
    Keep in mind the caveat, though, that workloads with a lot of page cache
    are likely to effectively ignore MADV_COLD on anonymous memory because we
    rarely age anonymous LRU lists.

    * man-page material

    MADV_COLD (since Linux x.x)

    Pages in the specified regions will be treated as less-recently-accessed
    compared to pages in the system with similar access frequencies. In
    contrast to MADV_FREE, the contents of the region are preserved regardless
    of subsequent writes to pages.

    MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
    pages.
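
    A minimal usage sketch from userspace (assuming libc headers that define
    MADV_COLD):

    #include <sys/mman.h>

    /* Hint that [addr, addr + len) is cold: reclaim it first under memory
     * pressure, but preserve its contents (unlike MADV_FREE/MADV_DONTNEED). */
    int mark_cold(void *addr, size_t len)
    {
            return madvise(addr, len, MADV_COLD);
    }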

    [akpm@linux-foundation.org: resolve conflicts with hmm.git]
    Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Reported-by: kbuild test robot
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: James E.J. Bottomley
    Cc: Richard Henderson
    Cc: Ralf Baechle
    Cc: Chris Zankel
    Cc: Johannes Weiner
    Cc: Daniel Colascione
    Cc: Dave Hansen
    Cc: Hillf Danton
    Cc: Joel Fernandes (Google)
    Cc: Kirill A. Shutemov
    Cc: Oleksandr Natalenko
    Cc: Shakeel Butt
    Cc: Sonny Rao
    Cc: Suren Baghdasaryan
    Cc: Tim Murray
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

31 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 3029 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

06 Mar, 2019

9 commits

  • Compaction is inherently race-prone as a suitable page freed during
    compaction can be allocated by any parallel task. This patch uses a
    capture_control structure to isolate a page immediately when it is freed
    by a direct compactor in the slow path of the page allocator. The
    intent is to avoid redundant scanning.
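
    A hedged sketch of the capture mechanism described above:

    struct capture_control {
            struct compact_control *cc;
            struct page *page;  /* set by the free path, consumed by the compactor */
    };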

    5.0.0-rc1 5.0.0-rc1
    selective-v3r17 capture-v3r19
    Amean fault-both-1 0.00 ( 0.00%) 0.00 * 0.00%*
    Amean fault-both-3 2582.11 ( 0.00%) 2563.68 ( 0.71%)
    Amean fault-both-5 4500.26 ( 0.00%) 4233.52 ( 5.93%)
    Amean fault-both-7 5819.53 ( 0.00%) 6333.65 ( -8.83%)
    Amean fault-both-12 9321.18 ( 0.00%) 9759.38 ( -4.70%)
    Amean fault-both-18 9782.76 ( 0.00%) 10338.76 ( -5.68%)
    Amean fault-both-24 15272.81 ( 0.00%) 13379.55 * 12.40%*
    Amean fault-both-30 15121.34 ( 0.00%) 16158.25 ( -6.86%)
    Amean fault-both-32 18466.67 ( 0.00%) 18971.21 ( -2.73%)

    Latency is only moderately affected but the devil is in the details. A
    closer examination indicates that base page fault latency is reduced but
    the latency of huge pages is increased as it takes greater care to
    succeed. Part of the "problem" is that allocation success rates are close
    to 100% even when under pressure, and compaction gets harder.

    5.0.0-rc1 5.0.0-rc1
    selective-v3r17 capture-v3r19
    Percentage huge-3 96.70 ( 0.00%) 98.23 ( 1.58%)
    Percentage huge-5 96.99 ( 0.00%) 95.30 ( -1.75%)
    Percentage huge-7 94.19 ( 0.00%) 97.24 ( 3.24%)
    Percentage huge-12 94.95 ( 0.00%) 97.35 ( 2.53%)
    Percentage huge-18 96.74 ( 0.00%) 97.30 ( 0.58%)
    Percentage huge-24 97.07 ( 0.00%) 97.55 ( 0.50%)
    Percentage huge-30 95.69 ( 0.00%) 98.50 ( 2.95%)
    Percentage huge-32 96.70 ( 0.00%) 99.27 ( 2.65%)

    And scan rates are reduced as expected by 6% for the migration scanner
    and 29% for the free scanner indicating that there is less redundant
    work.

    Compaction migrate scanned 20815362 19573286
    Compaction free scanned 16352612 11510663

    [mgorman@techsingularity.net: remove redundant check]
    Link: http://lkml.kernel.org/r/20190201143853.GH9565@techsingularity.net
    Link: http://lkml.kernel.org/r/20190118175136.31341-23-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Dan Carpenter
    Cc: David Rientjes
    Cc: YueHaibing
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • As compaction proceeds and creates high-order blocks, the free list
    search gets less efficient as the larger blocks are used as compaction
    targets. Eventually, the larger blocks will be behind the migration
    scanner for partially migrated pageblocks and the search fails. This
    patch round-robins what orders are searched so that larger blocks can be
    ignored and find smaller blocks that can be used as migration targets.

    The overall impact was small on 1-socket but it avoids corner cases where
    the migration/free scanners meet prematurely or situations where many of
    the pageblocks encountered by the free scanner are almost full instead of
    being properly packed. Previous testing had indicated that there were
    occasional large spikes in the free scanner without this patch.

    [dan.carpenter@oracle.com: fix static checker warning]
    Link: http://lkml.kernel.org/r/20190118175136.31341-20-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: YueHaibing
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Pageblocks are marked for skip when no pages are isolated after a scan.
    However, it's possible to hit corner cases where the migration scanner
    gets stuck near the boundary between the source and target scanner. Due
    to pages being migrated in blocks of COMPACT_CLUSTER_MAX, pages that are
    migrated can be reallocated before the pageblock is complete. The
    pageblock is not necessarily skipped so it can be rescanned multiple
    times. Similarly, a pageblock with some dirty/writeback pages may fail
    to migrate and be rescanned until writeback completes which is wasteful.

    This patch tracks if a pageblock is being rescanned. If so, then the
    entire pageblock will be migrated as one operation. This narrows the
    race window during which pages can be reallocated during migration.
    Secondly, if there are pages that cannot be isolated then the pageblock
    will still be fully scanned and marked for skipping. On the second
    rescan, the pageblock skip is set and the migration scanner makes
    progress.

    5.0.0-rc1 5.0.0-rc1
    findfree-v3r16 norescan-v3r16
    Amean fault-both-1 0.00 ( 0.00%) 0.00 * 0.00%*
    Amean fault-both-3 3200.68 ( 0.00%) 3002.07 ( 6.21%)
    Amean fault-both-5 4847.75 ( 0.00%) 4684.47 ( 3.37%)
    Amean fault-both-7 6658.92 ( 0.00%) 6815.54 ( -2.35%)
    Amean fault-both-12 11077.62 ( 0.00%) 10864.02 ( 1.93%)
    Amean fault-both-18 12403.97 ( 0.00%) 12247.52 ( 1.26%)
    Amean fault-both-24 15607.10 ( 0.00%) 15683.99 ( -0.49%)
    Amean fault-both-30 18752.27 ( 0.00%) 18620.02 ( 0.71%)
    Amean fault-both-32 21207.54 ( 0.00%) 19250.28 * 9.23%*

    5.0.0-rc1 5.0.0-rc1
    findfree-v3r16 norescan-v3r16
    Percentage huge-3 96.86 ( 0.00%) 95.00 ( -1.91%)
    Percentage huge-5 93.72 ( 0.00%) 94.22 ( 0.53%)
    Percentage huge-7 94.31 ( 0.00%) 92.35 ( -2.08%)
    Percentage huge-12 92.66 ( 0.00%) 91.90 ( -0.82%)
    Percentage huge-18 91.51 ( 0.00%) 89.58 ( -2.11%)
    Percentage huge-24 90.50 ( 0.00%) 90.03 ( -0.52%)
    Percentage huge-30 91.57 ( 0.00%) 89.14 ( -2.65%)
    Percentage huge-32 91.00 ( 0.00%) 90.58 ( -0.46%)

    Negligible difference but this was likely a case when the specific
    corner case was not hit. A previous run of the same patch based on an
    earlier iteration of the series showed large differences where migration
    rates could be halved when the corner case was hit.

    The specific corner case where migration scan rates go through the roof,
    caused by a dirty/writeback pageblock located at the boundary of the
    migration/free scanners, did not happen in this case. When it does
    happen, the scan rates are multiplied by massive margins.

    Link: http://lkml.kernel.org/r/20190118175136.31341-13-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Dan Carpenter
    Cc: David Rientjes
    Cc: YueHaibing
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The migration scanner is a linear scan of a zone with a potentially large
    search space. Furthermore, many pageblocks are unusable, such as those
    filled with reserved pages or partially filled with pages that cannot
    migrate. These still get scanned in the common case of allocating a THP
    and the cost accumulates.

    The patch uses a partial search of the free lists to locate a migration
    source candidate that is marked as MOVABLE when allocating a THP. It
    prefers picking a block with a larger number of free pages already on
    the basis that there are fewer pages to migrate to free the entire
    block. The lowest PFN found during searches is tracked as the basis of
    the start for the linear search after the first search of the free list
    fails. After the search, the free list is shuffled so that the next
    search will not encounter the same page. If the search fails then the
    subsequent searches will be shorter and the linear scanner is used.

    If this search fails, or if the request is for a small or
    unmovable/reclaimable allocation then the linear scanner is still used.
    It is somewhat pointless to use the list search in those cases. Small
    free pages must be used for the search and there is no guarantee that
    movable pages are located within that block that are contiguous.

    5.0.0-rc1 5.0.0-rc1
    noboost-v3r10 findmig-v3r15
    Amean fault-both-3 3771.41 ( 0.00%) 3390.40 ( 10.10%)
    Amean fault-both-5 5409.05 ( 0.00%) 5082.28 ( 6.04%)
    Amean fault-both-7 7040.74 ( 0.00%) 7012.51 ( 0.40%)
    Amean fault-both-12 11887.35 ( 0.00%) 11346.63 ( 4.55%)
    Amean fault-both-18 16718.19 ( 0.00%) 15324.19 ( 8.34%)
    Amean fault-both-24 21157.19 ( 0.00%) 16088.50 * 23.96%*
    Amean fault-both-30 21175.92 ( 0.00%) 18723.42 * 11.58%*
    Amean fault-both-32 21339.03 ( 0.00%) 18612.01 * 12.78%*

    5.0.0-rc1 5.0.0-rc1
    noboost-v3r10 findmig-v3r15
    Percentage huge-3 86.50 ( 0.00%) 89.83 ( 3.85%)
    Percentage huge-5 92.52 ( 0.00%) 91.96 ( -0.61%)
    Percentage huge-7 92.44 ( 0.00%) 92.85 ( 0.44%)
    Percentage huge-12 92.98 ( 0.00%) 92.74 ( -0.25%)
    Percentage huge-18 91.70 ( 0.00%) 91.71 ( 0.02%)
    Percentage huge-24 91.59 ( 0.00%) 92.13 ( 0.60%)
    Percentage huge-30 90.14 ( 0.00%) 93.79 ( 4.04%)
    Percentage huge-32 90.03 ( 0.00%) 91.27 ( 1.37%)

    This shows an improvement in allocation latencies with similar
    allocation success rates. While not presented, there was a 31%
    reduction in migration scanning and a 8% reduction on system CPU usage.
    A 2-socket machine showed similar benefits.

    [mgorman@techsingularity.net: several fixes]
    Link: http://lkml.kernel.org/r/20190204120111.GL9565@techsingularity.net
    [vbabka@suse.cz: migrate block that was found-fast, some optimisations]
    Link: http://lkml.kernel.org/r/20190118175136.31341-10-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Dan Carpenter
    Cc: David Rientjes
    Cc: YueHaibing
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When compaction is finishing, it uses a flag to ensure the pageblock is
    complete but it makes sense to always complete migration of a pageblock.
    Minimally, skip information is based on a pageblock and partially
    scanned pageblocks may incur more scanning in the future. The pageblock
    skip handling also becomes more strict later in the series and the hint
    is more useful if a complete pageblock was always scanned.

    This potentially impacts latency as more scanning is done, but it's not a
    consistent win or loss as the scanning is not always a high percentage of
    the pageblock and is sometimes offset by future reductions in scanning.
    Hence, the results are not presented this time due to a misleading mix of
    gains/losses without any clear pattern. However, full scanning of the
    pageblock is important for later patches.

    Link: http://lkml.kernel.org/r/20190118175136.31341-8-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Dan Carpenter
    Cc: David Rientjes
    Cc: YueHaibing
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The last_migrated_pfn field is a bit dubious as to whether it really
    helps but either way, the information from it can be inferred without
    increasing the size of compact_control so remove the field.

    Link: http://lkml.kernel.org/r/20190118175136.31341-4-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Dan Carpenter
    Cc: David Rientjes
    Cc: YueHaibing
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • compact_control spans two cache lines with write-intensive lines on
    both. Rearrange so the most write-intensive fields are in the same
    cache line. This has a negligible impact on the overall performance of
    compaction and is more a tidying exercise than anything.

    Link: http://lkml.kernel.org/r/20190118175136.31341-3-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Dan Carpenter
    Cc: David Rientjes
    Cc: YueHaibing
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Patch series "Increase success rates and reduce latency of compaction", v3.

    This series reduces scan rates and increases success rates of compaction,
    primarily by using the free lists to shorten scans, better controlling
    skip information and whether multiple scanners can target the same block,
    and capturing pageblocks before they are stolen by parallel requests.
    The series is based on mmotm from January 9th, 2019 with the previous
    compaction series reverted.

    I'm mostly using thpscale to measure the impact of the series. The
    benchmark creates a large file, maps it, faults it, punches holes in the
    mapping so that the virtual address space is fragmented and then tries
    to allocate THP. It re-executes for different numbers of threads. From
    a fragmentation perspective, the workload is relatively benign but it
    does stress compaction.

    The overall impact on latencies for a 1-socket machine is

    baseline patches
    Amean fault-both-3 3832.09 ( 0.00%) 2748.56 * 28.28%*
    Amean fault-both-5 4933.06 ( 0.00%) 4255.52 ( 13.73%)
    Amean fault-both-7 7017.75 ( 0.00%) 6586.93 ( 6.14%)
    Amean fault-both-12 11610.51 ( 0.00%) 9162.34 * 21.09%*
    Amean fault-both-18 17055.85 ( 0.00%) 11530.06 * 32.40%*
    Amean fault-both-24 19306.27 ( 0.00%) 17956.13 ( 6.99%)
    Amean fault-both-30 22516.49 ( 0.00%) 15686.47 * 30.33%*
    Amean fault-both-32 23442.93 ( 0.00%) 16564.83 * 29.34%*

    The allocation success rates are much improved

    baseline patches
    Percentage huge-3 85.99 ( 0.00%) 97.96 ( 13.92%)
    Percentage huge-5 88.27 ( 0.00%) 96.87 ( 9.74%)
    Percentage huge-7 85.87 ( 0.00%) 94.53 ( 10.09%)
    Percentage huge-12 82.38 ( 0.00%) 98.44 ( 19.49%)
    Percentage huge-18 83.29 ( 0.00%) 99.14 ( 19.04%)
    Percentage huge-24 81.41 ( 0.00%) 97.35 ( 19.57%)
    Percentage huge-30 80.98 ( 0.00%) 98.05 ( 21.08%)
    Percentage huge-32 80.53 ( 0.00%) 97.06 ( 20.53%)

    That's a nearly perfect allocation success rate.

    The biggest impact is on the scan rates

    Compaction migrate scanned 55893379 19341254
    Compaction free scanned 474739990 11903963

    The number of pages scanned for migration was reduced by 65% and the
    free scanner was reduced by 97.5%. So much less work in exchange for
    lower latency and better success rates.

    The series was also evaluated using a workload that heavily fragments
    memory but the benefits there are also significant, albeit not
    presented.

    It was commented that we should be rethinking scanning entirely and to a
    large extent I agree. However, to achieve that you need a lot of this
    series in place first, so it's best to make the linear scanners as good
    as possible before ripping them out.

    This patch (of 22):

    The isolate and migrate scanners should never isolate more than a
    pageblock of pages, so unsigned int is sufficient, saving 8 bytes on a
    64-bit build.

    Link: http://lkml.kernel.org/r/20190118175136.31341-2-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: David Rientjes
    Cc: Andrea Arcangeli
    Cc: Dan Carpenter
    Cc: YueHaibing
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When pages are freed at a higher order, the time the buddy allocator
    spends coalescing pages can be reduced. With a section size of 256MB, the
    hot add latency of a single section improves from 50-60 ms to less than
    1 ms, i.e. the hot add latency improves by a factor of 60. Modify
    external providers of the online callback to align with the change.
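
    A hedged sketch of the interface change implied above, where the online
    callback gains an order so whole blocks are freed at once (simplified;
    accounting details omitted):

    typedef void (*online_page_callback_t)(struct page *page, unsigned int order);

    static void generic_online_page(struct page *page, unsigned int order)
    {
            __free_pages_core(page, order);     /* free 2^order pages in one go */
    }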

    [arunks@codeaurora.org: v11]
    Link: http://lkml.kernel.org/r/1547792588-18032-1-git-send-email-arunks@codeaurora.org
    [akpm@linux-foundation.org: remove unused local, per Arun]
    [akpm@linux-foundation.org: avoid return of void-returning __free_pages_core(), per Oscar]
    [akpm@linux-foundation.org: fix it for mm-convert-totalram_pages-and-totalhigh_pages-variables-to-atomic.patch]
    [arunks@codeaurora.org: v8]
    Link: http://lkml.kernel.org/r/1547032395-24582-1-git-send-email-arunks@codeaurora.org
    [arunks@codeaurora.org: v9]
    Link: http://lkml.kernel.org/r/1547098543-26452-1-git-send-email-arunks@codeaurora.org
    Link: http://lkml.kernel.org/r/1538727006-5727-1-git-send-email-arunks@codeaurora.org
    Signed-off-by: Arun KS
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Reviewed-by: Oscar Salvador
    Reviewed-by: Alexander Duyck
    Cc: K. Y. Srinivasan
    Cc: Haiyang Zhang
    Cc: Stephen Hemminger
    Cc: Boris Ostrovsky
    Cc: Juergen Gross
    Cc: Dan Williams
    Cc: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: Greg Kroah-Hartman
    Cc: Mathieu Malaterre
    Cc: "Kirill A. Shutemov"
    Cc: Souptick Joarder
    Cc: Mel Gorman
    Cc: Aaron Lu
    Cc: Srivatsa Vaddagiri
    Cc: Vinayak Menon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun KS
     

29 Dec, 2018

3 commits

  • This is a preparation patch that copies the GFP flag __GFP_KSWAPD_RECLAIM
    into alloc_flags, avoiding the need to pass gfp_mask through a long
    callchain in a future patch.

    Note that the setting in the fast path happens in alloc_flags_nofragment()
    and it may be claimed that this has nothing to do with ALLOC_NO_FRAGMENT.
    That is true in this patch but not later in the series, so it is done now
    for easier review and to show where the flag needs to be recorded.

    No functional change.
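
    Conceptually the copy is a single bit test and set in the fast path. A
    minimal stand-alone sketch (flag values and the helper name are invented
    for illustration; per the changelog, the real copy happens in
    alloc_flags_nofragment()):

    #include <stdio.h>

    /* Illustrative flag values only; the real definitions live in the kernel. */
    #define __GFP_KSWAPD_RECLAIM    0x01u   /* caller allows waking kswapd */
    #define ALLOC_KSWAPD            0x10u   /* internal allocator flag */

    /* Record the GFP bit in alloc_flags so later allocator stages do not
     * need the full gfp_mask passed down the callchain. */
    static unsigned int alloc_flags_from_gfp(unsigned int gfp_mask)
    {
            unsigned int alloc_flags = 0;

            if (gfp_mask & __GFP_KSWAPD_RECLAIM)
                    alloc_flags |= ALLOC_KSWAPD;

            return alloc_flags;
    }

    int main(void)
    {
            printf("alloc_flags = %#x\n", alloc_flags_from_gfp(__GFP_KSWAPD_RECLAIM));
            return 0;
    }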

    [mgorman@techsingularity.net: ALLOC_KSWAPD flag needs to be applied in the !CONFIG_ZONE_DMA32 case]
    Link: http://lkml.kernel.org/r/20181126143503.GO23260@techsingularity.net
    Link: http://lkml.kernel.org/r/20181123114528.28802-4-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Reviewed-by: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Zi Yan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Patch series "Fragmentation avoidance improvements", v5.

    It has been noted before that fragmentation avoidance (aka
    anti-fragmentation) is not perfect. Given sufficient time or an adverse
    workload, memory gets fragmented and the long-term success of high-order
    allocations degrades. This series defines an adverse workload, a
    definition of external fragmentation events (including serious ones),
    and a set of patches that reduces the level of those fragmentation
    events.

    The details of the workload and the consequences are described in more
    detail in the changelogs. However, from patch 1, this is a high-level
    summary of the adverse workload. The exact details are found in the
    mmtests implementation.

    The broad details of the workload are as follows;

    1. Create an XFS filesystem (not specified in the configuration but done
    as part of the testing for this patch)
    2. Start 4 fio threads that write a number of 64K files inefficiently.
    Inefficiently means that files are created on first access and not
    created in advance (fio parameter create_on_open=1) and fallocate
    is not used (fallocate=none). With multiple IO issuers this creates
    a mix of slab and page cache allocations over time. The total size
    of the files is 150% physical memory so that the slabs and page cache
    pages get mixed.
    3. Warm up a number of fio read-only threads accessing the same files
    created in step 2. This part runs for the same length of time it
    took to create the files. It'll fault back in old data and further
    interleave slab and page cache allocations. As it's now low on
    memory due to step 2, fragmentation occurs as pageblocks get
    stolen.
    4. While step 3 is still running, start a process that tries to allocate
    75% of memory as huge pages with a number of threads. The number of
    threads is based on (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
    threads contending with fio, any other threads or forcing cross-NUMA
    scheduling. Note that the test has not been used on a machine with fewer
    than 8 cores. The benchmark records whether huge pages were allocated
    and what the fault latency was in microseconds.
    5. Measure the number of events potentially causing external fragmentation,
    the fault latency and the huge page allocation success rate.
    6. Cleanup

    Overall the series reduces external fragmentation-causing events by over
    94% on 1- and 2-socket machines, which in turn impacts high-order
    allocation success rates over the long term. There are differences in
    latencies and high-order allocation success rates. Latencies are a mixed
    bag as they are sensitive to the exact system state and to whether
    allocations succeeded, so they are treated as a secondary metric.

    Patch 1 uses lower zones if they are populated and have free memory
    instead of fragmenting a higher zone. It's special-cased to
    handle a Normal->DMA32 fallback with the reasons explained
    in the changelog.

    Patches 2-4 boost watermarks temporarily when an external fragmentation
    event occurs. kswapd wakes to reclaim a small amount of old memory
    and then wakes kcompactd on completion to recover the system
    slightly. This introduces some overhead in the slowpath. The level
    of boosting can be tuned or disabled depending on the tolerance
    for fragmentation vs allocation latency (a simplified model of the
    boost idea is sketched after this summary).

    Patch 5 stalls some movable allocation requests to let kswapd from patch 4
    make some progress. The duration of the stalls is very low but it
    is possible to tune the system to avoid fragmentation events if
    larger stalls can be tolerated.

    The bulk of the improvement in fragmentation avoidance is from patches
    1-4 but patch 5 can deal with a rare corner case and provides the option
    of tuning a system for THP allocation success rates in exchange for
    some stalls to control fragmentation.
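
    A simplified, stand-alone model of the boost idea from patches 2-4
    (numbers, field names and helpers are invented for illustration): an
    external fragmentation event temporarily raises the watermark a zone
    must meet, and subsequent kswapd/kcompactd progress decays the boost
    back to zero:

    #include <stdio.h>

    /* Toy zone state; the real code tracks this per struct zone. */
    struct zone_model {
            unsigned long min_wmark;        /* baseline watermark, in pages */
            unsigned long boost;            /* temporary boost after an extfrag event */
            unsigned long free_pages;
    };

    /* On an external fragmentation event, raise the effective watermark so
     * reclaim/compaction run before the zone is squeezed further. */
    static void boost_on_extfrag_event(struct zone_model *z, unsigned long max_boost)
    {
            z->boost += z->min_wmark;       /* illustrative increment */
            if (z->boost > max_boost)
                    z->boost = max_boost;
    }

    /* kswapd/kcompactd progress gradually removes the boost. */
    static void decay_boost(struct zone_model *z, unsigned long reclaimed)
    {
            z->boost = (z->boost > reclaimed) ? z->boost - reclaimed : 0;
    }

    static int watermark_ok(const struct zone_model *z)
    {
            return z->free_pages >= z->min_wmark + z->boost;
    }

    int main(void)
    {
            struct zone_model z = { .min_wmark = 1024, .boost = 0, .free_pages = 1500 };

            printf("before event: ok=%d\n", watermark_ok(&z));
            boost_on_extfrag_event(&z, 4 * z.min_wmark);
            printf("after event:  ok=%d (boost=%lu)\n", watermark_ok(&z), z.boost);
            decay_boost(&z, 1024);
            printf("after decay:  ok=%d\n", watermark_ok(&z));
            return 0;
    }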

    This patch (of 5):

    The page allocator zone lists are iterated based on the watermarks of each
    zone which does not take anti-fragmentation into account. On x86, node 0
    may have multiple zones while other nodes have one zone. A consequence is
    that tasks running on node 0 may fragment ZONE_NORMAL even though
    ZONE_DMA32 has plenty of free memory. This patch special-cases the
    allocator fast path so that it tries an allocation from a lower local
    zone before fragmenting a higher zone. In this case, stealing of
    pageblocks or of orders larger than a pageblock is still allowed in the
    fast path as it is uninteresting from a fragmentation point of view.
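
    A stand-alone sketch of that fast-path decision (zone layout, fields and
    helper names are simplified assumptions, not the allocator's actual
    code): prefer a populated lower zone that still has free memory before
    stealing a pageblock in the preferred higher zone.

    #include <stdio.h>

    /* Toy model of a node with a higher (Normal) and a lower (DMA32) zone. */
    struct zone_info {
            const char *name;
            int populated;                  /* zone has managed pages */
            unsigned long free_pages;
            unsigned long watermark;
    };

    /* Return the zone to allocate from: if satisfying the request from the
     * preferred zone would fragment it (no suitably sized free page), fall
     * back to a populated lower zone that still has free memory instead. */
    static struct zone_info *pick_zone(struct zone_info *preferred,
                                       struct zone_info *lower,
                                       int would_fragment_preferred)
    {
            if (!would_fragment_preferred)
                    return preferred;

            if (lower && lower->populated && lower->free_pages > lower->watermark)
                    return lower;

            return preferred;       /* no better option; fragment the higher zone */
    }

    int main(void)
    {
            struct zone_info normal = { "Normal", 1, 100, 512 };
            struct zone_info dma32  = { "DMA32",  1, 4096, 512 };

            printf("chosen zone: %s\n", pick_zone(&normal, &dma32, 1)->name);
            return 0;
    }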

    This was evaluated using a benchmark designed to fragment memory before
    attempting THP allocations. It's implemented in mmtests as the following
    configurations

    configs/config-global-dhp__workload_thpfioscale
    configs/config-global-dhp__workload_thpfioscale-defrag
    configs/config-global-dhp__workload_thpfioscale-madvhugepage

    e.g. from mmtests
    ./run-mmtests.sh --run-monitor --config configs/config-global-dhp__workload_thpfioscale test-run-1

    The broad details of the workload are as follows;

    1. Create an XFS filesystem (not specified in the configuration but done
    as part of the testing for this patch).
    2. Start 4 fio threads that write a number of 64K files inefficiently.
    Inefficiently means that files are created on first access and not
    created in advance (fio parameter create_on_open=1) and fallocate
    is not used (fallocate=none). With multiple IO issuers this creates
    a mix of slab and page cache allocations over time. The total size
    of the files is 150% physical memory so that the slabs and page cache
    pages get mixed.
    3. Warm up a number of fio read-only processes accessing the same files
    created in step 2. This part runs for the same length of time it
    took to create the files. It'll refault old data and further
    interleave slab and page cache allocations. As it's now low on
    memory due to step 2, fragmentation occurs as pageblocks get
    stolen.
    4. While step 3 is still running, start a process that tries to allocate
    75% of memory as huge pages with a number of threads. The number of
    threads is based on (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
    threads contending with fio, any other threads or forcing cross-NUMA
    scheduling. Note that the test has not been used on a machine with fewer
    than 8 cores. The benchmark records whether huge pages were allocated
    and what the fault latency was in microseconds.
    5. Measure the number of events potentially causing external fragmentation,
    the fault latency and the huge page allocation success rate.
    6. Cleanup the test files.

    Note that due to the use of IO and page cache, this benchmark is not
    suitable for running on large machines where the time to fragment memory
    may be excessive. Also note that while this is one mix that generates
    fragmentation, it is not the only such mix. Workloads that are more
    slab-intensive, or configurations where SLUB is used with high-order
    pages, may yield different results.

    When the page allocator fragments memory, it records the event using the
    mm_page_alloc_extfrag ftrace event. If the fallback_order is smaller than
    the pageblock order (order-9 on 64-bit x86) then it's considered to be an
    "external fragmentation event" that may cause issues in the future.
    Hence, the primary metric here is the number of external fragmentation
    events that occur with order < 9. The secondary metrics are allocation
    latency and huge page allocation success rate, but note that differences
    in latencies and in the success rate can also affect the number of
    external fragmentation events, which is why they are secondary metrics.
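
    In other words, the metric counts fallbacks whose order is below the
    pageblock order. A hedged sketch of that classification (the constant
    assumes 64-bit x86 with 4K pages):

    #include <stdio.h>

    #define PAGEBLOCK_ORDER 9       /* assumption: order-9 pageblocks on 64-bit x86 */

    /* Classify an mm_page_alloc_extfrag-style event: a fallback that carves
     * up less than a full pageblock can leave mixed migratetypes behind and
     * is counted as an external fragmentation event. */
    static int is_external_fragmentation_event(int fallback_order)
    {
            return fallback_order < PAGEBLOCK_ORDER;
    }

    int main(void)
    {
            int order;

            for (order = 0; order <= 10; order++)
                    printf("fallback order %2d -> extfrag event: %s\n", order,
                           is_external_fragmentation_event(order) ? "yes" : "no");
            return 0;
    }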

    1-socket Skylake machine
    config-global-dhp__workload_thpfioscale XFS (no special madvise)
    4 fio threads, 1 THP allocating thread
    --------------------------------------

    4.20-rc3 extfrag events < order 9: 804694
    4.20-rc3+patch: 408912 (49% reduction)

    thpfioscale Fault Latencies
    4.20.0-rc3 4.20.0-rc3
    vanilla lowzone-v5r8
    Amean fault-base-1 662.92 ( 0.00%) 653.58 * 1.41%*
    Amean fault-huge-1 0.00 ( 0.00%) 0.00 ( 0.00%)

    4.20.0-rc3 4.20.0-rc3
    vanilla lowzone-v5r8
    Percentage huge-1 0.00 ( 0.00%) 0.00 ( 0.00%)

    Fault latencies are slightly reduced while allocation success rates remain
    at zero, as this configuration does not make any special effort to
    allocate THP and fio is heavily active at the time, either filling memory
    or keeping pages resident. However, a 49% reduction in serious
    fragmentation events reduces the chances of external fragmentation being
    a problem in the future.

    Vlastimil asked during review for a breakdown of the allocation types
    that are falling back.

    vanilla
    3816 MIGRATE_UNMOVABLE
    800845 MIGRATE_MOVABLE
    33 MIGRATE_RECLAIMABLE

    patch
    735 MIGRATE_UNMOVABLE
    408135 MIGRATE_MOVABLE
    42 MIGRATE_RECLAIMABLE

    The majority of the fallbacks are due to movable allocations, and this is
    consistent for the workload throughout the series, so the breakdown will
    not be presented again.

    Movable fallbacks are sometimes considered "ok" because the pages can be
    migrated. The problem is that they can fill an unmovable/reclaimable
    pageblock, causing those allocations to fall back later and polluting
    pageblocks with pages that cannot move. If there is a movable fallback,
    it is pretty much guaranteed to affect an unmovable/reclaimable
    pageblock, and while it might not be enough to actually cause an
    unmovable/reclaimable fallback in the future, we cannot know that in
    advance, so the patch takes the only option available to it. Hence, it's
    important to control movable fallbacks. This point is also consistent
    throughout the series and will not be repeated.

    1-socket Skylake machine
    global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
    -----------------------------------------------------------------

    4.20-rc3 extfrag events < order 9: 291392
    4.20-rc3+patch: 191187 (34% reduction)

    thpfioscale Fault Latencies
    4.20.0-rc3 4.20.0-rc3
    vanilla lowzone-v5r8
    Amean fault-base-1 1495.14 ( 0.00%) 1467.55 ( 1.85%)
    Amean fault-huge-1 1098.48 ( 0.00%) 1127.11 ( -2.61%)

    thpfioscale Percentage Faults Huge
    4.20.0-rc3 4.20.0-rc3
    vanilla lowzone-v5r8
    Percentage huge-1 78.57 ( 0.00%) 77.64 ( -1.18%)

    Fragmentation events were reduced quite a bit although this is known
    to be a little variable. The latencies and allocation success rates
    are similar but they were already quite high.

    2-socket Haswell machine
    config-global-dhp__workload_thpfioscale XFS (no special madvise)
    4 fio threads, 5 THP allocating threads
    ----------------------------------------------------------------

    4.20-rc3 extfrag events < order 9: 215698
    4.20-rc3+patch: 200210 (7% reduction)

    thpfioscale Fault Latencies
    4.20.0-rc3 4.20.0-rc3
    vanilla lowzone-v5r8
    Amean fault-base-5 1350.05 ( 0.00%) 1346.45 ( 0.27%)
    Amean fault-huge-5 4181.01 ( 0.00%) 3418.60 ( 18.24%)

    4.20.0-rc3 4.20.0-rc3
    vanilla lowzone-v5r8
    Percentage huge-5 1.15 ( 0.00%) 0.78 ( -31.88%)

    The reduction of external fragmentation events is slight and this is
    partially due to the removal of __GFP_THISNODE in commit ac5b2c18911f
    ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings") as THP
    allocations can now spill over to remote nodes instead of fragmenting
    local memory.

    2-socket Haswell machine
    global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
    -----------------------------------------------------------------

    4.20-rc3 extfrag events < order 9: 166352
    4.20-rc3+patch: 147463 (11% reduction)

    thpfioscale Fault Latencies
    4.20.0-rc3 4.20.0-rc3
    vanilla lowzone-v5r8
    Amean fault-base-5 6138.97 ( 0.00%) 6217.43 ( -1.28%)
    Amean fault-huge-5 2294.28 ( 0.00%) 3163.33 * -37.88%*

    thpfioscale Percentage Faults Huge
    4.20.0-rc3 4.20.0-rc3
    vanilla lowzone-v5r8
    Percentage huge-5 96.82 ( 0.00%) 95.14 ( -1.74%)

    There was a slight reduction in external fragmentation events although the
    latencies were higher. The allocation success rate is high enough that
    the system is under heavy pressure, with quite a lot of parallel reclaim
    and compaction activity. There is also a degree of luck in whether
    processes start on node 0 or not for this patch, but the relevance of
    that is reduced later in the series.

    Overall, the patch reduces the number of external fragmentation-causing
    events, so THP allocation success over long periods of time should be
    improved for this adverse workload.

    Link: http://lkml.kernel.org/r/20181123114528.28802-2-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: David Rientjes
    Cc: Andrea Arcangeli
    Cc: Zi Yan
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Commit fa5e084e43eb ("vmscan: do not unconditionally treat zones that
    fail zone_reclaim() as full") changed the return value of
    node_reclaim(): after that commit, a return value of 0 means
    NODE_RECLAIM_SOME.

    However, the return value of node_reclaim() when CONFIG_NUMA is n was not
    changed, which leads to zone_watermark_ok() being called again.

    This patch fixes the return value by adjusting it to NODE_RECLAIM_NOSCAN.
    Since node_reclaim() is only called from page_alloc.c, move it to
    mm/internal.h.
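
    A stand-alone sketch of the !CONFIG_NUMA stub before and after the fix
    (the return-code values follow the convention described above and are
    assumptions rather than quotations from mm/internal.h):

    #include <stdio.h>

    /* Assumed return codes following the changelog's description. */
    #define NODE_RECLAIM_NOSCAN     -2      /* reclaim was not attempted */
    #define NODE_RECLAIM_SOME       0       /* some pages were reclaimed */

    /* Old !CONFIG_NUMA stub: returning 0 now reads as NODE_RECLAIM_SOME,
     * so the caller pointlessly rechecks zone_watermark_ok(). */
    static int node_reclaim_old(void)
    {
            return 0;
    }

    /* Fixed stub: no scan happened, so say so explicitly. */
    static int node_reclaim_new(void)
    {
            return NODE_RECLAIM_NOSCAN;
    }

    int main(void)
    {
            printf("old stub: %d (reads as NODE_RECLAIM_SOME=%d)\n",
                   node_reclaim_old(), NODE_RECLAIM_SOME);
            printf("new stub: %d (NODE_RECLAIM_NOSCAN)\n", node_reclaim_new());
            return 0;
    }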

    Link: http://lkml.kernel.org/r/20181113080436.22078-1-richard.weiyang@gmail.com
    Signed-off-by: Wei Yang
    Acked-by: Michal Hocko
    Reviewed-by: Matthew Wilcox
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     

31 Oct, 2018

1 commit

  • The conversion is done using

    sed -i 's@__free_pages_bootmem@memblock_free_pages@' \
    $(git grep -l __free_pages_bootmem)

    Link: http://lkml.kernel.org/r/1536927045-23536-27-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Michal Hocko
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Kroah-Hartman
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: "James E.J. Bottomley"
    Cc: Jonas Bonn
    Cc: Jonathan Corbet
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Serge Semin
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

24 Aug, 2018

1 commit

  • Use the new return type vm_fault_t for the fault handler. For now, this
    is just documenting that the function returns a VM_FAULT value rather
    than an errno. Once all instances are converted, vm_fault_t will become a
    distinct type.

    Ref-> commit 1c8f422059ae ("mm: change return type to vm_fault_t")

    The aim is to change the return type of finish_fault() and
    handle_mm_fault() to vm_fault_t. As part of that cleanup, the return type
    of all other recursively called functions has been changed to vm_fault_t.

    The places from which handle_mm_fault() is invoked will be changed to the
    vm_fault_t type in a separate patch.

    vmf_error() is the inline function newly introduced in 4.17-rc6.
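
    A minimal stand-alone model of the conversion (the type and the
    vmf_error()-style helper use stand-in definitions here; the real ones
    live in the kernel headers): fault handlers return VM_FAULT_* codes
    rather than errnos, and an errno from a lower layer is translated at the
    boundary.

    #include <errno.h>
    #include <stdio.h>

    /* Stand-alone stand-ins for the kernel type and codes. */
    typedef unsigned int vm_fault_t;
    #define VM_FAULT_OOM    0x000001u
    #define VM_FAULT_SIGBUS 0x000002u

    /* Sketch of a vmf_error()-style helper: translate an errno from a
     * lower layer into a VM_FAULT_* code for the fault path. */
    static vm_fault_t vmf_error_model(int err)
    {
            if (err == -ENOMEM)
                    return VM_FAULT_OOM;
            return VM_FAULT_SIGBUS;
    }

    /* A fault handler now documents its contract in the return type. */
    static vm_fault_t example_fault_handler(int alloc_failed)
    {
            if (alloc_failed)
                    return vmf_error_model(-ENOMEM);
            return 0;       /* no error bits set: fault handled */
    }

    int main(void)
    {
            printf("fault result: %#x\n", example_fault_handler(1));
            return 0;
    }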

    [akpm@linux-foundation.org: don't shadow outer local `ret' in __do_huge_pmd_anonymous_page()]
    Link: http://lkml.kernel.org/r/20180604171727.GA20279@jordon-HP-15-Notebook-PC
    Signed-off-by: Souptick Joarder
    Reviewed-by: Matthew Wilcox
    Reviewed-by: Andrew Morton
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Souptick Joarder
     

23 Aug, 2018

1 commit

  • __paginginit is the same thing as __meminit except on platforms without
    sparsemem, where it is defined as __init.

    Remove __paginginit and use __meminit. Use __ref in the single function
    that merges __meminit and __init sections: setup_usemap().

    Link: http://lkml.kernel.org/r/20180801122348.21588-4-osalvador@techadventures.net
    Signed-off-by: Pavel Tatashin
    Signed-off-by: Oscar Salvador
    Reviewed-by: Oscar Salvador
    Cc: Pasha Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     

06 Jun, 2018

1 commit

  • Pull xfs updates from Darrick Wong:
    "New features this cycle include the ability to relabel mounted
    filesystems, support for fallocated swapfiles, and using FUA for pure
    data O_DSYNC directio writes. With this cycle we begin to integrate
    online filesystem repair and refactor the growfs code in preparation
    for eventual subvolume support, though the road ahead for both
    features is quite long.

    There are also numerous refactorings of the iomap code to remove
    unnecessary log overhead, to disentangle some of the quota code, and
    to prepare for buffer head removal in a future upstream kernel.

    Metadata validation continues to improve, both in the hot path
    verifiers and the online filesystem check code. I anticipate sending a
    second pull request in a few days with more metadata validation
    improvements.

    This series has been run through a full xfstests run over the weekend
    and through a quick xfstests run against this morning's master, with
    no major failures reported.

    Summary:

    - Strengthen inode number and structure validation when allocating
    inodes.

    - Reduce pointless buffer allocations during cache miss

    - Use FUA for pure data O_DSYNC directio writes

    - Various iomap refactorings

    - Strengthen quota metadata verification to avoid unfixable broken
    quota

    - Make AGFL block freeing a deferred operation to avoid blowing out
    transaction reservations when running complex operations

    - Get rid of the log item descriptors to reduce log overhead

    - Fix various reflink bugs where inodes were double-joined to
    transactions

    - Don't issue discards when trimming unwritten extents

    - Refactor incore dquot initialization and retrieval interfaces

    - Fix some locking problems in the quota scrub code

    - Strengthen btree structure checks in scrub code

    - Rewrite swapfile activation to use iomap and support unwritten
    extents

    - Make scrub exit to userspace sooner when corruptions or
    cross-referencing problems are found

    - Make scrub invoke the data fork scrubber directly on metadata
    inodes

    - Don't do background reclamation of post-eof and cow blocks when the
    fs is suspended

    - Fix secondary superblock buffer lifespan hinting

    - Refactor growfs to use table-dispatched functions instead of long
    stringy functions

    - Move growfs code to libxfs

    - Implement online fs label getting and setting

    - Introduce online filesystem repair (in a very limited capacity)

    - Fix unit conversion problems in the realtime freemap iteration
    functions

    - Various refactorings and cleanups in preparation to remove buffer
    heads in a future release

    - Reimplement the old bmap call with iomap

    - Remove direct buffer head accesses from seek hole/data

    - Various bug fixes"

    * tag 'xfs-4.18-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (121 commits)
    fs: use ->is_partially_uptodate in page_cache_seek_hole_data
    fs: remove the buffer_unwritten check in page_seek_hole_data
    fs: move page_cache_seek_hole_data to iomap.c
    xfs: use iomap_bmap
    iomap: add an iomap-based bmap implementation
    iomap: add a iomap_sector helper
    iomap: use __bio_add_page in iomap_dio_zero
    iomap: move IOMAP_F_BOUNDARY to gfs2
    iomap: fix the comment describing IOMAP_NOWAIT
    iomap: inline data should be an iomap type, not a flag
    mm: split ->readpages calls to avoid non-contiguous pages lists
    mm: return an unsigned int from __do_page_cache_readahead
    mm: give the 'ret' variable a better name __do_page_cache_readahead
    block: add a lower-level bio_add_page interface
    xfs: fix error handling in xfs_refcount_insert()
    xfs: fix xfs_rtalloc_rec units
    xfs: strengthen rtalloc query range checks
    xfs: xfs_rtbuf_get should check the bmapi_read results
    xfs: xfs_rtword_t should be unsigned, not signed
    dax: change bdev_dax_supported() to support boolean returns
    ...

    Linus Torvalds