15 Nov, 2020

2 commits

  • In isolate_migratepages_block, if we have too many isolated pages and
    nr_migratepages is not zero, we should try to migrate what we have
    without wasting time on isolating.

    In theory it's possible that multiple parallel compactions will cause
    too_many_isolated() to become true even if each has isolated fewer
    than COMPACT_CLUSTER_MAX pages, and then loop forever in the while
    loop. Bailing out immediately prevents that, as sketched below.
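
    A minimal sketch of that bail-out, using the names from this
    changelog (too_many_isolated(), cc->nr_migratepages); illustrative,
    not the verbatim diff:

        /* in isolate_migratepages_block() */
        while (unlikely(too_many_isolated(pgdat))) {
                /* stop isolation if there are pages not migrated yet */
                if (cc->nr_migratepages)
                        return 0;
                ...
        }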

    [vbabka@suse.cz: changelog addition]

    Fixes: 1da2f328fa64 ("mm,thp,compaction,cma: allow THP migration for CMA allocations")
    Suggested-by: Vlastimil Babka
    Signed-off-by: Zi Yan
    Signed-off-by: Andrew Morton
    Cc:
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: Yang Shi
    Link: https://lkml.kernel.org/r/20201030183809.3616803-2-zi.yan@sent.com
    Signed-off-by: Linus Torvalds

    Zi Yan
     
  • In isolate_migratepages_block, when cc->alloc_contig is true, we are
    able to isolate compound pages. But nr_migratepages and nr_isolated did
    not count compound pages correctly, causing us to isolate more pages
    than we thought.

    So count compound pages as the number of base pages they contain
    (see the sketch below). Otherwise, we can get trapped in the
    too_many_isolated() while loop, since the number of pages actually
    isolated can reach COMPACT_CLUSTER_MAX * 512 = 16384 base pages
    (COMPACT_CLUSTER_MAX is 32 and a THP spans 512 base pages), because
    isolation only stops after cc->nr_migratepages reaches
    COMPACT_CLUSTER_MAX.

    In addition, once the counting above is fixed, cc->nr_migratepages
    may never be exactly equal to COMPACT_CLUSTER_MAX when compound
    pages are isolated, so page isolation would not stop as intended.
    Change the isolation stop condition to '>='.
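
    A sketch of the two changes, assuming the compound_nr() helper
    (number of base pages in a compound page); simplified, not the
    verbatim diff:

        /* account every base page of an isolated THP, not just 1 */
        cc->nr_migratepages += compound_nr(page);   /* was: += 1 */
        nr_isolated += compound_nr(page);           /* was: += 1 */

        /* '>=': a compound page can step past the threshold in one go */
        if (cc->nr_migratepages >= COMPACT_CLUSTER_MAX)
                break;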

    The issue can be triggered as follows:

    In a system with 16GB memory and an 8GB CMA region reserved by
    hugetlb_cma, if we first allocate 10GB of THPs and mlock them (so
    some THPs are allocated in the CMA region and mlocked), reserving 6
    1GB hugetlb pages via
    /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages will get
    stuck (looping in the too_many_isolated() function) until we kill
    either task. With the patch applied, the OOM killer kills the
    application holding the 10GB of THPs and lets the hugetlb page
    reservation finish.

    [ziy@nvidia.com: v3]

    Link: https://lkml.kernel.org/r/20201030183809.3616803-1-zi.yan@sent.com
    Fixes: 1da2f328fa64 ("cmm,thp,compaction,cma: allow THP migration for CMA allocations")
    Signed-off-by: Zi Yan
    Signed-off-by: Andrew Morton
    Reviewed-by: Yang Shi
    Acked-by: Vlastimil Babka
    Cc: Rik van Riel
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc:
    Link: https://lkml.kernel.org/r/20201029200435.3386066-1-zi.yan@sent.com
    Signed-off-by: Linus Torvalds

    Zi Yan
     

17 Oct, 2020

1 commit

  • The current page_order() can only be called on pages in the buddy
    allocator. For compound pages, you have to use compound_order(). This is
    confusing and led to a bug, so rename page_order() to buddy_order().
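
    Illustrative contrast after the rename (hedged; exact call sites
    vary):

        /* page sitting on a buddy free list, zone->lock held: */
        unsigned int forder = buddy_order(page);    /* was page_order() */

        /* head page of an allocated compound page, e.g. a THP: */
        unsigned int corder = compound_order(page);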

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20201001152259.14932-2-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

14 Oct, 2020

1 commit

    The same code works for both 'zone->compact_considered > defer_limit'
    and 'zone->compact_considered >= defer_limit'. The latter needs one
    branch fewer, which is slightly better for performance; see the
    sketch below.
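
    Roughly, the change in compaction_deferred() (a sketch, not the
    verbatim diff):

        /* before: clamp, then test again -- two branches */
        if (++zone->compact_considered > defer_limit)
                zone->compact_considered = defer_limit;
        if (zone->compact_considered >= defer_limit)
                return false;

        /* after: one comparison does both jobs */
        if (++zone->compact_considered >= defer_limit) {
                zone->compact_considered = defer_limit;
                return false;
        }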

    Signed-off-by: Mateusz Nosek
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Joonsoo Kim
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: David Rientjes
    Link: https://lkml.kernel.org/r/20200913190448.28649-1-mateusznosek0@gmail.com
    Signed-off-by: Linus Torvalds

    Mateusz Nosek
     

15 Aug, 2020

1 commit

  • The thp prefix is more frequently used than hpage and we should be
    consistent between the various functions.

    [akpm@linux-foundation.org: fix mm/migrate.c]

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Cc: Mike Kravetz
    Cc: David Hildenbrand
    Cc: "Kirill A. Shutemov"
    Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

13 Aug, 2020

5 commits

  • Drop the repeated word "a".

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Zi Yan
    Link: http://lkml.kernel.org/r/20200801173822.14973-2-rdunlap@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
    There is no compact_defer_limit; the field actually in use is
    compact_defer_shift, so fix the comment accordingly. Also add an
    explanation of compact_order_failed.

    Signed-off-by: Alex Shi
    Signed-off-by: Andrew Morton
    Reviewed-by: Alexander Duyck
    Link: http://lkml.kernel.org/r/3bd60e1b-a74e-050d-ade4-6e8f54e00b92@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Alex Shi
     
    Proactive compaction uses a per-node/zone "fragmentation score" which
    is always in the range [0, 100], so use an unsigned type for these
    scores as well as for the related constants.

    Signed-off-by: Nitin Gupta
    Signed-off-by: Andrew Morton
    Reviewed-by: Baoquan He
    Cc: Luis Chamberlain
    Cc: Kees Cook
    Cc: Iurii Zaikin
    Cc: Vlastimil Babka
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20200618010319.13159-1-nigupta@nvidia.com
    Signed-off-by: Linus Torvalds

    Nitin Gupta
     
  • Fix compile error when COMPACTION_HPAGE_ORDER is assigned to
    HUGETLB_PAGE_ORDER. The correct way to check if this constant is defined
    is to check for CONFIG_HUGETLBFS.

    Reported-by: Nathan Chancellor
    Signed-off-by: Nitin Gupta
    Signed-off-by: Andrew Morton
    Tested-by: Nathan Chancellor
    Cc: Stephen Rothwell
    Link: http://lkml.kernel.org/r/20200623064544.25766-1-nigupta@nvidia.com
    Signed-off-by: Linus Torvalds

    Nitin Gupta
     
  • For some applications, we need to allocate almost all memory as hugepages.
    However, on a running system, higher-order allocations can fail if the
    memory is fragmented. Linux kernel currently does on-demand compaction as
    we request more hugepages, but this style of compaction incurs very high
    latency. Experiments with one-time full memory compaction (followed by
    hugepage allocations) show that the kernel is able to restore a
    highly fragmented memory state to a fairly compacted memory state
    within <1 sec for a 32G system. Such data suggests that a more
    proactive compaction can help us allocate a large fraction of memory
    as hugepages while keeping allocation latencies low.

    1. Kernel hugepage allocation latencies

    - With 5.6.0-rc3 + this patch, with proactiveness=20

    sysctl -w vm.compaction_proactiveness=20

    percentile  latency
    ----------  -------
             5        2
            10        2
            25        3
            30        3
            40        3
            50        4
            60        4
            75        4
            80        4
            90        5
            95      429

    Total 2M hugepages allocated = 384105 (750G worth of hugepages out of 762G
    total free => 98% of free memory could be allocated as hugepages)

    2. JAVA heap allocation

    In this test, we first fragment memory using the same method as for (1).

    Then, we start a Java process with a heap size set to 700G and request the
    heap to be allocated with THP hugepages. We also set THP to madvise to
    allow hugepage backing of this heap.

    /usr/bin/time
    java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch

    The above command allocates 700G of Java heap using hugepages.

    - With vanilla 5.6.0-rc3

    17.39user 1666.48system 27:37.89elapsed

    - With 5.6.0-rc3 + this patch, with proactiveness=20

    8.35user 194.58system 3:19.62elapsed

    Elapsed time remains around 3:15, as proactiveness is further increased.

    Note that proactive compaction happens throughout the runtime of
    these workloads. The situation of one-time compaction, sufficient to
    supply hugepages for the following allocation stream, can probably
    happen for more extreme proactiveness values, like 80 or 90.

    In the above Java workload, proactiveness is set to 20. The test
    starts with a node's score of 80 or higher, depending on the delay
    between the fragmentation step and starting the benchmark, which
    gives more-or-less time for the initial round of compaction. As the
    benchmark consumes hugepages, the node's score quickly rises above
    the high threshold (90) and proactive compaction starts again, which
    brings down the score to the low threshold level (80). Repeat.
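
    The proactiveness-to-threshold translation can be sketched as
    follows (function and sysctl names assumed from this changelog; the
    kernel may additionally cap the low mark):

        static unsigned int fragmentation_score_wmark(bool low)
        {
                /* proactiveness=20 => low=80, high=90, as above */
                unsigned int wmark_low =
                        100U - sysctl_compaction_proactiveness;

                return low ? wmark_low : min(wmark_low + 10U, 100U);
        }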

    bpftrace also confirms proactive compaction running 20+ times during
    the runtime of this Java benchmark. A kcompactd thread consumes 100%
    of one CPU while it tries to bring the node's score within the
    thresholds.

    Backoff behavior
    ================

    Above workloads produce a memory state which is easy to compact. However,
    if memory is filled with unmovable pages, proactive compaction should
    essentially back off. To test this aspect:

    - Created a kernel driver that allocates almost all memory as
    hugepages, followed by freeing the first 3/4 of each hugepage.
    - Set proactiveness=40
    - Noted that proactive_compact_node() is deferred the maximum number
    of times, with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each
    check (=> ~30 seconds between retries); a sketch of the backoff
    follows this list.
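
    A hedged sketch of that backoff in kcompactd
    (fragmentation_score_node() and COMPACT_MAX_DEFER_SHIFT are assumed
    names; proactive_compact_node() is from the list above):

        if (should_proactive_compact_node(pgdat)) {
                unsigned int prev_score, score;

                prev_score = fragmentation_score_node(pgdat);
                proactive_compact_node(pgdat);
                score = fragmentation_score_node(pgdat);
                /* defer further attempts if the score did not improve,
                 * e.g. because memory is filled with unmovable pages */
                if (score >= prev_score)
                        proactive_defer = 1 << COMPACT_MAX_DEFER_SHIFT;
        }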

    [1] https://patchwork.kernel.org/patch/11098289/
    [2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
    [3] https://lwn.net/Articles/817905/

    Signed-off-by: Nitin Gupta
    Signed-off-by: Andrew Morton
    Tested-by: Oleksandr Natalenko
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Khalid Aziz
    Reviewed-by: Oleksandr Natalenko
    Cc: Vlastimil Babka
    Cc: Khalid Aziz
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Matthew Wilcox
    Cc: Mike Kravetz
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Nitin Gupta
    Cc: Oleksandr Natalenko
    Link: http://lkml.kernel.org/r/20200616204527.19185-1-nigupta@nvidia.com
    Signed-off-by: Linus Torvalds

    Nitin Gupta
     

26 Jun, 2020

1 commit

  • Hugh reports:

    "While stressing compaction, one run oopsed on NULL capc->cc in
    __free_one_page()'s task_capc(zone): compact_zone_order() had been
    interrupted, and a page was being freed in the return from interrupt.

    Though you would not expect it from the source, both gccs I was using
    (4.8.1 and 7.5.0) had chosen to compile compact_zone_order() with the
    ".cc = &cc" implemented by mov %rbx,-0xb0(%rbp) immediately before
    callq compact_zone - long after the "current->capture_control =
    &capc". An interrupt in between those finds capc->cc NULL (zeroed by
    an earlier rep stos).

    This could presumably be fixed by a barrier() before setting
    current->capture_control in compact_zone_order(); but would also need
    more care on return from compact_zone(), in order not to risk leaking
    a page captured by interrupt just before capture_control is reset.

    Maybe that is the preferable fix, but I felt safer for task_capc() to
    exclude the rather surprising possibility of capture at interrupt
    time"

    I have checked that gcc10 also behaves the same.

    The advantage of the fix in compact_zone_order() is that we don't add
    another test to the page freeing hot path, and that it might prevent
    future problems if we stop exposing pointers to uninitialized
    structures in the current task.

    So this patch implements the suggestion for compact_zone_order(),
    using barrier() (and WRITE_ONCE() to prevent store tearing) when
    setting current->capture_control, and prevents page leaking with
    WRITE_ONCE()/READ_ONCE() in the proper order, as sketched below.
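
    The resulting ordering in compact_zone_order(), sketched:

        /* make sure capc is fully initialised before an interrupt
         * handler freeing a page can see it */
        barrier();
        WRITE_ONCE(current->capture_control, &capc);

        ret = compact_zone(&cc, &capc);

        /* hide the capture control before reading the captured page,
         * so an interrupt cannot capture (and leak) a page in between */
        WRITE_ONCE(current->capture_control, NULL);
        *capture = READ_ONCE(capc.page);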

    Link: http://lkml.kernel.org/r/20200616082649.27173-1-vbabka@suse.cz
    Fixes: 5e1f0f098b46 ("mm, compaction: capture a page under direct compaction")
    Signed-off-by: Vlastimil Babka
    Reported-by: Hugh Dickins
    Suggested-by: Hugh Dickins
    Acked-by: Hugh Dickins
    Cc: Alex Shi
    Cc: Li Wang
    Cc: Mel Gorman
    Cc: [5.1+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

05 Jun, 2020

1 commit


04 Jun, 2020

5 commits

  • Merge more updates from Andrew Morton:
    "More mm/ work, plenty more to come

    Subsystems affected by this patch series: slub, memcg, gup, kasan,
    pagealloc, hugetlb, vmscan, tools, mempolicy, memblock, hugetlbfs,
    thp, mmap, kconfig"

    * akpm: (131 commits)
    arm64: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
    x86: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
    riscv: support DEBUG_WX
    mm: add DEBUG_WX support
    drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup
    mm/thp: rename pmd_mknotpresent() as pmd_mkinvalid()
    powerpc/mm: drop platform defined pmd_mknotpresent()
    mm: thp: don't need to drain lru cache when splitting and mlocking THP
    hugetlbfs: get unmapped area below TASK_UNMAPPED_BASE for hugetlbfs
    sparc32: register memory occupied by kernel as memblock.memory
    include/linux/memblock.h: fix minor typo and unclear comment
    mm, mempolicy: fix up gup usage in lookup_node
    tools/vm/page_owner_sort.c: filter out unneeded line
    mm: swap: memcg: fix memcg stats for huge pages
    mm: swap: fix vmstats for huge pages
    mm: vmscan: limit the range of LRU type balancing
    mm: vmscan: reclaim writepage is IO cost
    mm: vmscan: determine anon/file pressure balance at the reclaim root
    mm: balance LRU lists based on relative thrashing
    mm: only count actual rotations as LRU reclaim cost
    ...

    Linus Torvalds
     
    Pageblock migrate type is encoded in GFP flags, just as zone_type and
    zonelist are.

    Currently we use gfp_zone() and gfp_zonelist() to extract the related
    information; it would be proper to use the same naming convention for
    the migrate type.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Reviewed-by: Pankaj Gupta
    Link: http://lkml.kernel.org/r/20200329080823.7735-1-richard.weiyang@gmail.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
    classzone_idx is just a different name for high_zoneidx now. So,
    integrate them and add a comment to struct alloc_context in order to
    reduce future confusion about the meaning of this variable.

    The accessor ac_classzone_idx() is also removed, since it isn't
    needed after the integration.

    In addition to the integration, this patch renames high_zoneidx to
    highest_zoneidx, since that conveys its meaning more precisely.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Reviewed-by: Baoquan He
    Acked-by: Vlastimil Babka
    Acked-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Ye Xiaolong
    Link: http://lkml.kernel.org/r/1587095923-7515-3-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • When called during boot the memmap_init_zone() function checks if each PFN
    is valid and actually belongs to the node being initialized using
    early_pfn_valid() and early_pfn_in_nid().

    Each such check may cost up to O(log(n)) where n is the number of memory
    banks, so for large amount of memory overall time spent in early_pfn*()
    becomes substantial.

    Since the information is anyway present in memblock, we can iterate
    over memblock memory regions in memmap_init() and only call
    memmap_init_zone() for PFN ranges that are known to be valid and in
    the appropriate node, as sketched below.
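
    A sketch of the resulting shape of memmap_init(), assuming the
    for_each_mem_pfn_range() memblock iterator; details approximate:

        void __init memmap_init(unsigned long size, int nid,
                                unsigned long zone,
                                unsigned long range_start_pfn)
        {
                unsigned long start_pfn, end_pfn;
                unsigned long range_end_pfn = range_start_pfn + size;
                int i;

                /* walk only PFN ranges memblock knows to be valid */
                for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
                        start_pfn = clamp(start_pfn, range_start_pfn,
                                          range_end_pfn);
                        end_pfn = clamp(end_pfn, range_start_pfn,
                                        range_end_pfn);
                        if (end_pfn > start_pfn)
                                memmap_init_zone(end_pfn - start_pfn, nid,
                                                 zone, start_pfn,
                                                 MEMMAP_EARLY, NULL);
                }
        }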

    [cai@lca.pw: fix a compilation warning from Clang]
    Link: http://lkml.kernel.org/r/CF6E407F-17DC-427C-8203-21979FB882EF@lca.pw
    [bhe@redhat.com: fix the incorrect hole in fast_isolate_freepages()]
    Link: http://lkml.kernel.org/r/8C537EB7-85EE-4DCF-943E-3CC0ED0DF56D@lca.pw
    Link: http://lkml.kernel.org/r/20200521014407.29690-1-bhe@redhat.com
    Signed-off-by: Baoquan He
    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Tested-by: Hoan Tran [arm64]
    Cc: Brian Cain
    Cc: Catalin Marinas
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Ungerer
    Cc: Guan Xuetao
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: "James E.J. Bottomley"
    Cc: Jonathan Corbet
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Hocko
    Cc: Michal Simek
    Cc: Nick Hu
    Cc: Paul Walmsley
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Cc: Qian Cai
    Link: http://lkml.kernel.org/r/20200412194859.12663-16-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Baoquan He
     
  • Pull networking updates from David Miller:

    1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
    Augusto von Dentz.

    2) Add GSO partial support to igc, from Sasha Neftin.

    3) Several cleanups and improvements to r8169 from Heiner Kallweit.

    4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
    device self-test. From Andrew Lunn.

    5) Start moving away from custom driver versions, use the globally
    defined kernel version instead, from Leon Romanovsky.

    6) Support GRO via gro_cells in DSA layer, from Alexander Lobakin.

    7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.

    8) Add sriov and vf support to hinic, from Luo bin.

    9) Support Media Redundancy Protocol (MRP) in the bridging code, from
    Horatiu Vultur.

    10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.

    11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
    Dubroca. Also add ipv6 support for espintcp.

    12) Lots of ReST conversions of the networking documentation, from Mauro
    Carvalho Chehab.

    13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
    from Doug Berger.

    14) Allow to dump cgroup id and filter by it in inet_diag code, from
    Dmitry Yakunin.

    15) Add infrastructure to export netlink attribute policies to
    userspace, from Johannes Berg.

    16) Several optimizations to sch_fq scheduler, from Eric Dumazet.

    17) Fallback to the default qdisc if qdisc init fails because otherwise
    a packet scheduler init failure will make a device inoperative. From
    Jesper Dangaard Brouer.

    18) Several RISCV bpf jit optimizations, from Luke Nelson.

    19) Correct the return type of the ->ndo_start_xmit() method in several
    drivers, it's netdev_tx_t but many drivers were using
    'int'. From Yunjian Wang.

    20) Add an ethtool interface for PHY master/slave config, from Oleksij
    Rempel.

    21) Add BPF iterators, from Yonghong Song.

    22) Add cable test infrastructure, including ethtool interfaces, from
    Andrew Lunn. Marvell PHY driver is the first to support this
    facility.

    23) Remove zero-length arrays all over, from Gustavo A. R. Silva.

    24) Calculate and maintain an explicit frame size in XDP, from Jesper
    Dangaard Brouer.

    25) Add CAP_BPF, from Alexei Starovoitov.

    26) Support terse dumps in the packet scheduler, from Vlad Buslov.

    27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.

    28) Add devm_register_netdev(), from Bartosz Golaszewski.

    29) Minimize qdisc resets, from Cong Wang.

    30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
    eliminate set_fs/get_fs calls. From Christoph Hellwig.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
    selftests: net: ip_defrag: ignore EPERM
    net_failover: fixed rollback in net_failover_open()
    Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
    Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
    vmxnet3: allow rx flow hash ops only when rss is enabled
    hinic: add set_channels ethtool_ops support
    selftests/bpf: Add a default $(CXX) value
    tools/bpf: Don't use $(COMPILE.c)
    bpf, selftests: Use bpf_probe_read_kernel
    s390/bpf: Use bcr 0,%0 as tail call nop filler
    s390/bpf: Maintain 8-byte stack alignment
    selftests/bpf: Fix verifier test
    selftests/bpf: Fix sample_cnt shared between two threads
    bpf, selftests: Adapt cls_redirect to call csum_level helper
    bpf: Add csum_level helper for fixing up csum levels
    bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
    sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
    crypto/chtls: IPv6 support for inline TLS
    Crypto/chcr: Fixes a coccinile check error
    Crypto/chcr: Fixes compilations warnings
    ...

    Linus Torvalds
     

28 May, 2020

1 commit

  • The various struct pagevec per CPU variables are protected by disabling
    either preemption or interrupts across the critical sections. Inside
    these sections spinlocks have to be acquired.

    These spinlocks are regular spinlock_t types which are converted to
    "sleeping" spinlocks on PREEMPT_RT enabled kernels. Obviously sleeping
    locks cannot be acquired in preemption or interrupt disabled sections.

    local locks provide a trivial way to substitute preempt and interrupt
    disable instances. On a non PREEMPT_RT enabled kernel local_lock() maps
    to preempt_disable() and local_lock_irq() to local_irq_disable().

    Create lru_rotate_pvecs containing the pagevec and the locallock.
    Create lru_pvecs containing the remaining pagevecs and the locallock.
    Add lru_add_drain_cpu_zone() which is used from compact_zone() to avoid
    exporting the pvec structure.

    Change the relevant call sites to acquire these locks instead of using
    preempt_disable() / get_cpu() / get_cpu_var() and local_irq_disable() /
    local_irq_save().

    There is neither a functional change nor a change in the generated
    binary code for non PREEMPT_RT enabled non-debug kernels.

    When lockdep is enabled, local locks have lockdep maps embedded.
    These allow lockdep to validate the protections, i.e. inappropriate
    usage of a preemption-only protected section would result in a
    lockdep warning, while the same problem would not be noticed with a
    plain preempt_disable() based protection.

    local locks also improve readability as they provide a named scope
    for the protections, while preempt/interrupt disable sections are
    opaque and scopeless.

    Finally local locks allow PREEMPT_RT to substitute them with real
    locking primitives to ensure the correctness of operation in a fully
    preemptible kernel.
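
    A minimal sketch of the pattern (local_lock_t lives in
    <linux/local_lock.h>; the pagevec layout is abridged):

        struct lru_pvecs {
                local_lock_t lock;
                struct pagevec lru_add;
                /* ... the remaining pagevecs ... */
        };
        static DEFINE_PER_CPU(struct lru_pvecs, lru_pvecs) = {
                .lock = INIT_LOCAL_LOCK(lock),
        };

        void lru_cache_add(struct page *page)
        {
                struct pagevec *pvec;

                /* preempt_disable() on !PREEMPT_RT, a real (sleeping)
                 * lock on PREEMPT_RT */
                local_lock(&lru_pvecs.lock);
                pvec = this_cpu_ptr(&lru_pvecs.lru_add);
                /* ... add the page, drain the pagevec if full ... */
                local_unlock(&lru_pvecs.lock);
        }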

    [ bigeasy: Adapted to use local_lock ]

    Signed-off-by: Ingo Molnar
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Ingo Molnar
    Acked-by: Peter Zijlstra
    Link: https://lore.kernel.org/r/20200527201119.1692513-4-bigeasy@linutronix.de

    Ingo Molnar
     

27 Apr, 2020

1 commit

    Instead of having all the sysctl handlers deal with user pointers,
    which is rather hairy in terms of the BPF interaction, copy the input
    to and from userspace in common code. This also means that the
    strings are always NUL-terminated by the common code, making the API
    a little bit safer.

    As most handlers just pass the data through to one of the common
    handlers, a lot of the changes are mechanical; the visible prototype
    change is sketched below.
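
    The visible effect on handler prototypes (sketched with
    proc_dointvec() as the example):

        /* before: every handler dealt with a user pointer */
        int proc_dointvec(struct ctl_table *table, int write,
                          void __user *buffer, size_t *lenp, loff_t *ppos);

        /* after: common code copies to/from userspace; handlers see a
         * plain, NUL-terminated kernel buffer */
        int proc_dointvec(struct ctl_table *table, int write,
                          void *buffer, size_t *lenp, loff_t *ppos);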

    Signed-off-by: Christoph Hellwig
    Acked-by: Andrey Ignatov
    Signed-off-by: Al Viro

    Christoph Hellwig
     

08 Apr, 2020

2 commits

    Sparse reports a warning at compact_lock_irqsave():

    warning: context imbalance in compact_lock_irqsave() - wrong count at exit

    The root cause is a missing annotation at compact_lock_irqsave().
    Add the missing __acquires(lock) annotation, as sketched below.
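
    With the annotation in place (body abridged):

        static bool compact_lock_irqsave(spinlock_t *lock,
                                         unsigned long *flags,
                                         struct compact_control *cc)
                __acquires(lock)
        {
                /* ... trylock in async mode, else ... */
                spin_lock_irqsave(lock, *flags);
                return true;
        }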

    Signed-off-by: Jules Irenge
    Signed-off-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200214204741.94112-6-jbi.octave@gmail.com
    Signed-off-by: Linus Torvalds

    Jules Irenge
     
    Some comments for MADV_FREE are revised and added to help people
    understand the MADV_FREE code, especially the page flag
    PG_swapbacked. This makes page_is_file_cache() inconsistent with its
    comments, so the function is renamed to page_is_file_lru() to make
    them consistent again. All of this is put in one patch as one
    logical change.

    Suggested-by: David Hildenbrand
    Suggested-by: Johannes Weiner
    Suggested-by: David Rientjes
    Signed-off-by: "Huang, Ying"
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Acked-by: Pankaj Gupta
    Acked-by: Vlastimil Babka
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/20200317100342.2730705-1-ying.huang@intel.com
    Signed-off-by: Linus Torvalds

    Huang Ying
     

03 Apr, 2020

4 commits

  • Previously 0 was assigned to variable 'last_migrated_pfn'. But the
    variable is not read after that, so the assignment can be removed.

    Signed-off-by: Mateusz Nosek
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Link: http://lkml.kernel.org/r/20200318174509.15021-1-mateusznosek0@gmail.com
    Signed-off-by: Linus Torvalds

    Mateusz Nosek
     
    Since commit 5bbe3547aa3ba ("mm: allow compaction of unevictable
    pages") it has been allowed to examine mlocked pages and compact them
    by default. On -RT even minor pagefaults are problematic because they
    may take a few hundred microseconds to resolve, and until then the
    task is blocked.

    Make compact_unevictable_allowed = 0 the default, and issue a warning
    on RT if it is changed.

    [bigeasy@linutronix.de: v5]
    Link: https://lore.kernel.org/linux-mm/20190710144138.qyn4tuttdq6h7kqx@linutronix.de/
    Link: http://lkml.kernel.org/r/20200319165536.ovi75tsr2seared4@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Thomas Gleixner
    Cc: Luis Chamberlain
    Cc: Kees Cook
    Cc: Iurii Zaikin
    Cc: Vlastimil Babka
    Link: https://lore.kernel.org/linux-mm/20190710144138.qyn4tuttdq6h7kqx@linutronix.de/
    Link: http://lkml.kernel.org/r/20200303202225.nhqc3v5gwlb7x6et@linutronix.de
    Signed-off-by: Linus Torvalds

    Sebastian Andrzej Siewior
     
  • Dan reports:

    The patch 5e1f0f098b46: "mm, compaction: capture a page under direct
    compaction" from Mar 5, 2019, leads to the following Smatch complaint:

    mm/compaction.c:2321 compact_zone_order()
    error: we previously assumed 'capture' could be null (see line 2313)

    mm/compaction.c
    2288 static enum compact_result compact_zone_order(struct zone *zone, int order,
    2289 gfp_t gfp_mask, enum compact_priority prio,
    2290 unsigned int alloc_flags, int classzone_idx,
    2291 struct page **capture)
    ^^^^^^^

    2313 if (capture)
    ^^^^^^^
    Check for NULL

    2314 current->capture_control = &capc;
    2315
    2316 ret = compact_zone(&cc, &capc);
    2317
    2318 VM_BUG_ON(!list_empty(&cc.freepages));
    2319 VM_BUG_ON(!list_empty(&cc.migratepages));
    2320
    2321 *capture = capc.page;
    ^^^^^^^^
    Unchecked dereference.

    2322 current->capture_control = NULL;
    2323

    In practice this is not an issue, as the only caller path passes non-NULL
    capture:

    __alloc_pages_direct_compact()
    struct page *page = NULL;
    try_to_compact_pages(capture = &page);
    compact_zone_order(capture = capture);

    So let's remove the unnecessary check, which should also make Smatch happy.

    Fixes: 5e1f0f098b46 ("mm, compaction: capture a page under direct compaction")
    Reported-by: Dan Carpenter
    Suggested-by: Andrew Morton
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Acked-by: Mel Gorman
    Link: http://lkml.kernel.org/r/18b0df3c-0589-d96c-23fa-040798fee187@suse.cz
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The code to implement THP migrations already exists, and the code for CMA
    to clear out a region of memory already exists.

    Only a few small tweaks are needed to allow CMA to move THP memory when
    attempting an allocation from alloc_contig_range.

    With these changes, migrating THPs from a CMA area works when allocating a
    1GB hugepage from CMA memory.

    [riel@surriel.com: fix hugetlbfs pages per Mike, cleanup per Vlastimil]
    Link: http://lkml.kernel.org/r/20200228104700.0af2f18d@imladris.surriel.com
    Signed-off-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Reviewed-by: Zi Yan
    Reviewed-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Andrea Arcangeli
    Cc: Mike Kravetz
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20200227213238.1298752-2-riel@surriel.com
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

15 Oct, 2019

1 commit

  • Florian and Dave reported [1] a NULL pointer dereference in
    __reset_isolation_pfn(). While the exact cause is unclear, staring at
    the code revealed two bugs, which might be related.

    One bug is that if a zone starts in the middle of a pageblock,
    block_page might correspond to a different pfn than block_pfn, and
    then the pfn_valid_within() checks will check different pfns than
    those accessed via struct page. This might result in accessing an
    uninitialized page in CONFIG_HOLES_IN_ZONE configs.

    The other bug is that end_page refers to the first page of the next
    pageblock rather than the last page of the current pageblock. The
    online and valid check is then wrong and, with sections, the
    while (page < end_page) loop might wander off the actual struct page
    arrays; the corrected boundary handling is sketched below.
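
    A sketch of the corrected boundary handling in
    __reset_isolation_pfn(), assuming pfn_to_online_page() and the
    pageblock helpers; simplified from the actual fix:

        /* start of the pageblock, clamped to the zone, must be online */
        block_pfn = max(pageblock_start_pfn(pfn), zone->zone_start_pfn);
        block_page = pfn_to_online_page(block_pfn);
        if (block_page) {
                page = block_page;
                pfn = block_pfn;
        }

        /* end: the LAST pfn of this pageblock, not the first pfn of
         * the next one */
        block_pfn = min(pageblock_end_pfn(pfn) - 1, zone_end_pfn(zone) - 1);
        end_page = pfn_to_online_page(block_pfn);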

    [1] https://lore.kernel.org/linux-xfs/87o8z1fvqu.fsf@mid.deneb.enyo.de/

    Link: http://lkml.kernel.org/r/20191008152915.24704-1-vbabka@suse.cz
    Fixes: 6b0868c820ff ("mm/compaction.c: correct zone boundary handling when resetting pageblock skip hints")
    Signed-off-by: Vlastimil Babka
    Reported-by: Florian Weimer
    Reported-by: Dave Chinner
    Acked-by: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

25 Sep, 2019

3 commits

  • Like commit 40cacbcb3240 ("mm, compaction: remove unnecessary zone
    parameter in some instances"), remove unnecessary zone parameter.

    No functional change.

    Link: http://lkml.kernel.org/r/20190806151616.21107-1-lpf.vector@gmail.com
    Signed-off-by: Pengfei Li
    Reviewed-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Qian Cai
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pengfei Li
     
    total_{migrate,free}_scanned will be added to COMPACTMIGRATE_SCANNED
    and COMPACTFREE_SCANNED in compact_zone(). We should clear them
    before scanning a new zone. In proc-triggered compaction, we forgot
    to clear them.

    [laoar.shao@gmail.com: introduce a helper compact_zone_counters_init()]
    Link: http://lkml.kernel.org/r/1563869295-25748-1-git-send-email-laoar.shao@gmail.com
    [akpm@linux-foundation.org: expand compact_zone_counters_init() into its single callsite, per mhocko]
    [vbabka@suse.cz: squash compact_zone() list_head init as well]
    Link: http://lkml.kernel.org/r/1fb6f7da-f776-9e42-22f8-bbb79b030b98@suse.cz
    [akpm@linux-foundation.org: kcompactd_do_work(): avoid unnecessary initialization of cc.zone]
    Link: http://lkml.kernel.org/r/1563789275-9639-1-git-send-email-laoar.shao@gmail.com
    Fixes: 7f354a548d1c ("mm, compaction: add vmstats for kcompactd work")
    Signed-off-by: Yafang Shao
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Vlastimil Babka
    Cc: David Rientjes
    Cc: Yafang Shao
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yafang Shao
     
  • Replace 1 << compound_order(page) with compound_nr(page). Minor
    improvements in readability.

    Link: http://lkml.kernel.org/r/20190721104612.19120-4-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Andrew Morton
    Reviewed-by: Ira Weiny
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

03 Aug, 2019

1 commit

  • "howaboutsynergy" reported via kernel buzilla number 204165 that
    compact_zone_order was consuming 100% CPU during a stress test for
    prolonged periods of time. Specifically the following command, which
    should exit in 10 seconds, was taking an excessive time to finish while
    the CPU was pegged at 100%.

    stress -m 220 --vm-bytes 1000000000 --timeout 10

    Tracing indicated a pattern as follows

    stress-3923 [007] 519.106208: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
    stress-3923 [007] 519.106212: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
    stress-3923 [007] 519.106216: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
    stress-3923 [007] 519.106219: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
    stress-3923 [007] 519.106223: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
    stress-3923 [007] 519.106227: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
    stress-3923 [007] 519.106231: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
    stress-3923 [007] 519.106235: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
    stress-3923 [007] 519.106238: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
    stress-3923 [007] 519.106242: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0

    Note that compaction is entered in rapid succession while scanning and
    isolating nothing. The problem is that when a task that is compacting
    receives a fatal signal, it retries indefinitely instead of exiting
    while making no progress as a fatal signal is pending.

    It's not easy to trigger this condition, although enabling zswap
    helps on the basis that the timing is altered. A very small window
    has to be hit for the problem to occur (signal delivered while
    compacting and isolating a PFN for migration that is not aligned to
    SWAP_CLUSTER_MAX).

    This was reproduced locally -- 16G single socket system, 8G swap, 30%
    zswap configured, vm-bytes 22000000000 using Colin King's stress-ng
    implementation from github running in a loop until the problem hit.
    Tracing recorded the problem occurring almost 200K times in a short
    window. With this patch, the problem hit 4 times but the task exited
    normally instead of consuming CPU.

    This problem has existed for some time but it was made worse by commit
    cf66f0700c8f ("mm, compaction: do not consider a need to reschedule as
    contention"). Before that commit, if the same condition was hit then
    locks would be quickly contended and compaction would exit that way.
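
    A hedged sketch of the fix in isolate_migratepages_block(): abort
    the whole scan, rather than just the current pageblock, when the
    periodic unlock notices a pending fatal signal (helper and label
    names as recalled from upstream):

        if (!(low_pfn % SWAP_CLUSTER_MAX) &&
            compact_unlock_should_abort(&pgdat->lru_lock, flags,
                                        &locked, cc)) {
                low_pfn = 0;    /* report no progress to the caller */
                goto fatal_pending;
        }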

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204165
    Link: http://lkml.kernel.org/r/20190718085708.GE24383@techsingularity.net
    Fixes: cf66f0700c8f ("mm, compaction: do not consider a need to reschedule as contention")
    Signed-off-by: Mel Gorman
    Reviewed-by: Vlastimil Babka
    Cc: [5.1+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

02 Jun, 2019

1 commit

    When we have holes in a normal memory zone, we could end up with
    cached_migrate_pfns which may not necessarily be valid, under heavy
    memory pressure with swapping enabled (via
    __reset_isolation_suitable(), triggered by kswapd).

    Later, if we fail to find a page via fast_isolate_freepages(), we may
    end up using the migrate_pfn we started the search with as a valid
    page. This could lead to NULL pointer dereferences like the one
    below, due to an invalid mem_section pointer.

    Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008 [47/1825]
    Mem abort info:
    ESR = 0x96000004
    Exception class = DABT (current EL), IL = 32 bits
    SET = 0, FnV = 0
    EA = 0, S1PTW = 0
    Data abort info:
    ISV = 0, ISS = 0x00000004
    CM = 0, WnR = 0
    user pgtable: 4k pages, 48-bit VAs, pgdp = 0000000082f94ae9
    [0000000000000008] pgd=0000000000000000
    Internal error: Oops: 96000004 [#1] SMP
    ...
    CPU: 10 PID: 6080 Comm: qemu-system-aar Not tainted 510-rc1+ #6
    Hardware name: AmpereComputing(R) OSPREY EV-883832-X3-0001/OSPREY, BIOS 4819 09/25/2018
    pstate: 60000005 (nZCv daif -PAN -UAO)
    pc : set_pfnblock_flags_mask+0x58/0xe8
    lr : compaction_alloc+0x300/0x950
    [...]
    Process qemu-system-aar (pid: 6080, stack limit = 0x0000000095070da5)
    Call trace:
    set_pfnblock_flags_mask+0x58/0xe8
    compaction_alloc+0x300/0x950
    migrate_pages+0x1a4/0xbb0
    compact_zone+0x750/0xde8
    compact_zone_order+0xd8/0x118
    try_to_compact_pages+0xb4/0x290
    __alloc_pages_direct_compact+0x84/0x1e0
    __alloc_pages_nodemask+0x5e0/0xe18
    alloc_pages_vma+0x1cc/0x210
    do_huge_pmd_anonymous_page+0x108/0x7c8
    __handle_mm_fault+0xdd4/0x1190
    handle_mm_fault+0x114/0x1c0
    __get_user_pages+0x198/0x3c0
    get_user_pages_unlocked+0xb4/0x1d8
    __gfn_to_pfn_memslot+0x12c/0x3b8
    gfn_to_pfn_prot+0x4c/0x60
    kvm_handle_guest_abort+0x4b0/0xcd8
    handle_exit+0x140/0x1b8
    kvm_arch_vcpu_ioctl_run+0x260/0x768
    kvm_vcpu_ioctl+0x490/0x898
    do_vfs_ioctl+0xc4/0x898
    ksys_ioctl+0x8c/0xa0
    __arm64_sys_ioctl+0x28/0x38
    el0_svc_common+0x74/0x118
    el0_svc_handler+0x38/0x78
    el0_svc+0x8/0xc
    Code: f8607840 f100001f 8b011401 9a801020 (f9400400)
    ---[ end trace af6a35219325a9b6 ]---

    The issue was reported on an arm64 server with 128GB with holes in the
    zone (e.g, [32GB@4GB, 96GB@544GB]), with a swap device enabled, while
    running 100 KVM guest instances.

    This patch fixes the issue by ensuring that the page belongs to a valid
    PFN when we fallback to using the lower limit of the scan range upon
    failure in fast_isolate_freepages().

    Link: http://lkml.kernel.org/r/1558711908-15688-1-git-send-email-suzuki.poulose@arm.com
    Fixes: 5a811889de10f1eb ("mm, compaction: use free lists to quickly locate a migration target")
    Signed-off-by: Suzuki K Poulose
    Reported-by: Marc Zyngier
    Reviewed-by: Mel Gorman
    Reviewed-by: Anshuman Khandual
    Cc: Michal Hocko
    Cc: Qian Cai
    Cc: Marc Zyngier
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Suzuki K Poulose
     

19 May, 2019

1 commit

  • syzbot reported the following error from a tree with a head commit of
    baf76f0c58ae ("slip: make slhc_free() silently accept an error pointer")

    BUG: unable to handle kernel paging request at ffffea0003348000
    #PF error: [normal kernel read fault]
    PGD 12c3f9067 P4D 12c3f9067 PUD 12c3f8067 PMD 0
    Oops: 0000 [#1] PREEMPT SMP KASAN
    CPU: 1 PID: 28916 Comm: syz-executor.2 Not tainted 5.1.0-rc6+ #89
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:constant_test_bit arch/x86/include/asm/bitops.h:314 [inline]
    RIP: 0010:PageCompound include/linux/page-flags.h:186 [inline]
    RIP: 0010:isolate_freepages_block+0x1c0/0xd40 mm/compaction.c:579
    Code: 01 d8 ff 4d 85 ed 0f 84 ef 07 00 00 e8 29 00 d8 ff 4c 89 e0 83 85 38 ff
    ff ff 01 48 c1 e8 03 42 80 3c 38 00 0f 85 31 0a 00 00 8b 2c 24 31 ff 49
    c1 ed 10 41 83 e5 01 44 89 ee e8 3a 01 d8 ff
    RSP: 0018:ffff88802b31eab8 EFLAGS: 00010246
    RAX: 1ffffd4000669000 RBX: 00000000000cd200 RCX: ffffc9000a235000
    RDX: 000000000001ca5e RSI: ffffffff81988cc7 RDI: 0000000000000001
    RBP: ffff88802b31ebd8 R08: ffff88805af700c0 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: ffffea0003348000
    R13: 0000000000000000 R14: ffff88802b31f030 R15: dffffc0000000000
    FS: 00007f61648dc700(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffea0003348000 CR3: 0000000037c64000 CR4: 00000000001426e0
    Call Trace:
    fast_isolate_around mm/compaction.c:1243 [inline]
    fast_isolate_freepages mm/compaction.c:1418 [inline]
    isolate_freepages mm/compaction.c:1438 [inline]
    compaction_alloc+0x1aee/0x22e0 mm/compaction.c:1550

    There is no reproducer and it is difficult to hit -- 1 crash every
    few days. The issue is very similar to the fix in commit 6b0868c820ff
    ("mm/compaction.c: correct zone boundary handling when resetting
    pageblock skip hints"). When isolating free pages around a target
    pageblock, the boundary handling is off by one and can stray into the
    next pageblock. Triggering the syzbot error requires that the end of
    the pageblock is section- or zone-aligned, and that the next section
    is unpopulated.

    A more subtle consequence of the bug is that pageblocks were being
    improperly used as migration targets, which potentially hurts
    fragmentation avoidance in the long term, one page at a time.

    A debugging patch revealed that it's definitely possible to stray outside
    of a pageblock which is not intended. While syzbot cannot be used to
    verify this patch, it was confirmed that the debugging warning no longer
    triggers with this patch applied. It has also been confirmed that the THP
    allocation stress tests are not degraded by this patch.

    Link: http://lkml.kernel.org/r/20190510182124.GI18914@techsingularity.net
    Fixes: e332f741a8dd ("mm, compaction: be selective about what pageblocks to clear skip hints")
    Signed-off-by: Mel Gorman
    Reported-by: syzbot+d84c80f9fe26a0f7a734@syzkaller.appspotmail.com
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Qian Cai
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: # v5.1+
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

15 May, 2019

2 commits

  • In preparation for runtime randomization of the zone lists, take all
    (well, most of) the list_*() functions in the buddy allocator and put
    them in helper functions. Provide a common control point for injecting
    additional behavior when freeing pages.

    [dan.j.williams@intel.com: fix buddy list helpers]
    Link: http://lkml.kernel.org/r/155033679702.1773410.13041474192173212653.stgit@dwillia2-desk3.amr.corp.intel.com
    [vbabka@suse.cz: remove del_page_from_free_area() migratetype parameter]
    Link: http://lkml.kernel.org/r/4672701b-6775-6efd-0797-b6242591419e@suse.cz
    Link: http://lkml.kernel.org/r/154899812264.3165233.5219320056406926223.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Signed-off-by: Vlastimil Babka
    Tested-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Dave Hansen
    Cc: Kees Cook
    Cc: Keith Busch
    Cc: Robert Elliott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
    In a low-memory situation, cc->fast_search_fail can keep increasing
    as fast_isolate_freepages() fails to find an available page to
    isolate. As a result, it could trigger the error below, so compare
    against the maximum number of bits that can be shifted first; see the
    sketch below.
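
    The clamped shift, roughly as described (a sketch):

        static unsigned int freelist_scan_limit(struct compact_control *cc)
        {
                unsigned short shift = BITS_PER_LONG - 1;

                /* never shift by >= the width of the type */
                return (COMPACT_CLUSTER_MAX >>
                        min(shift, cc->fast_search_fail)) + 1;
        }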

    UBSAN: Undefined behaviour in mm/compaction.c:1160:30
    shift exponent 64 is too large for 64-bit type 'unsigned long'
    CPU: 131 PID: 1308 Comm: kcompactd1 Kdump: loaded Tainted: G
    W L 5.0.0+ #17
    Call trace:
    dump_backtrace+0x0/0x450
    show_stack+0x20/0x2c
    dump_stack+0xc8/0x14c
    __ubsan_handle_shift_out_of_bounds+0x7e8/0x8c4
    compaction_alloc+0x2344/0x2484
    unmap_and_move+0xdc/0x1dbc
    migrate_pages+0x274/0x1310
    compact_zone+0x26ec/0x43bc
    kcompactd+0x15b8/0x1a24
    kthread+0x374/0x390
    ret_from_fork+0x10/0x18

    [akpm@linux-foundation.org: code cleanup]
    Link: http://lkml.kernel.org/r/20190320203338.53367-1-cai@lca.pw
    Fixes: 70b44595eafe ("mm, compaction: use free lists to quickly locate a migration source")
    Signed-off-by: Qian Cai
    Acked-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     

04 Apr, 2019

2 commits

    Running LTP oom01 in a tight loop, or memory stress testing that puts
    the system in a low-memory situation, could trigger random memory
    corruption like the page flag corruption below. In
    fast_isolate_freepages(), if isolation fails, next_search_order()
    does not abort the search immediately, which can lead to improper
    accesses.

    UBSAN: Undefined behaviour in ./include/linux/mm.h:1195:50
    index 7 is out of range for type 'zone [5]'
    Call Trace:
    dump_stack+0x62/0x9a
    ubsan_epilogue+0xd/0x7f
    __ubsan_handle_out_of_bounds+0x14d/0x192
    __isolate_free_page+0x52c/0x600
    compaction_alloc+0x886/0x25f0
    unmap_and_move+0x37/0x1e70
    migrate_pages+0x2ca/0xb20
    compact_zone+0x19cb/0x3620
    kcompactd_do_work+0x2df/0x680
    kcompactd+0x1d8/0x6c0
    kthread+0x32c/0x3f0
    ret_from_fork+0x35/0x40
    ------------[ cut here ]------------
    kernel BUG at mm/page_alloc.c:3124!
    invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
    RIP: 0010:__isolate_free_page+0x464/0x600
    RSP: 0000:ffff888b9e1af848 EFLAGS: 00010007
    RAX: 0000000030000000 RBX: ffff888c39fcf0f8 RCX: 0000000000000000
    RDX: 1ffff111873f9e25 RSI: 0000000000000004 RDI: ffffed1173c35ef6
    RBP: ffff888b9e1af898 R08: fffffbfff4fc2461 R09: fffffbfff4fc2460
    R10: fffffbfff4fc2460 R11: ffffffffa7e12303 R12: 0000000000000008
    R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000007
    FS: 0000000000000000(0000) GS:ffff888ba8e80000(0000)
    knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fc7abc00000 CR3: 0000000752416004 CR4: 00000000001606a0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    compaction_alloc+0x886/0x25f0
    unmap_and_move+0x37/0x1e70
    migrate_pages+0x2ca/0xb20
    compact_zone+0x19cb/0x3620
    kcompactd_do_work+0x2df/0x680
    kcompactd+0x1d8/0x6c0
    kthread+0x32c/0x3f0
    ret_from_fork+0x35/0x40

    Link: http://lkml.kernel.org/r/20190320192648.52499-1-cai@lca.pw
    Fixes: dbe2d4e4f12e ("mm, compaction: round-robin the order while searching the free lists for a target")
    Signed-off-by: Qian Cai
    Acked-by: Mel Gorman
    Cc: Daniel Jordan
    Cc: Mikhail Gavrilov
    Cc: Vlastimil Babka
    Cc: Pavel Tatashin
    Signed-off-by: Mel Gorman

    Qian Cai
     
    Mikhail Gavrilov reported the following bug being triggered in a
    Fedora kernel based on 5.1-rc1, but it is relevant to a vanilla
    kernel.

    kernel: page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
    kernel: ------------[ cut here ]------------
    kernel: kernel BUG at include/linux/mm.h:1021!
    kernel: invalid opcode: 0000 [#1] SMP NOPTI
    kernel: CPU: 6 PID: 116 Comm: kswapd0 Tainted: G C 5.1.0-0.rc1.git1.3.fc31.x86_64 #1
    kernel: Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 1201 12/07/2018
    kernel: RIP: 0010:__reset_isolation_pfn+0x244/0x2b0
    kernel: Code: fe 06 e8 0f 8e fc ff 44 0f b6 4c 24 04 48 85 c0 0f 85 dc fe ff ff e9 68 fe ff ff 48 c7 c6 58 b7 2e 8c 4c 89 ff e8 0c 75 00 00 0b 48 c7 c6 58 b7 2e 8c e8 fe 74 00 00 0f 0b 48 89 fa 41 b8 01
    kernel: RSP: 0018:ffff9e2d03f0fde8 EFLAGS: 00010246
    kernel: RAX: 0000000000000034 RBX: 000000000081f380 RCX: ffff8cffbddd6c20
    kernel: RDX: 0000000000000000 RSI: 0000000000000006 RDI: ffff8cffbddd6c20
    kernel: RBP: 0000000000000001 R08: 0000009898b94613 R09: 0000000000000000
    kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000100000
    kernel: R13: 0000000000100000 R14: 0000000000000001 R15: ffffca7de07ce000
    kernel: FS: 0000000000000000(0000) GS:ffff8cffbdc00000(0000) knlGS:0000000000000000
    kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    kernel: CR2: 00007fc1670e9000 CR3: 00000007f5276000 CR4: 00000000003406e0
    kernel: Call Trace:
    kernel: __reset_isolation_suitable+0x62/0x120
    kernel: reset_isolation_suitable+0x3b/0x40
    kernel: kswapd+0x147/0x540
    kernel: ? finish_wait+0x90/0x90
    kernel: kthread+0x108/0x140
    kernel: ? balance_pgdat+0x560/0x560
    kernel: ? kthread_park+0x90/0x90
    kernel: ret_from_fork+0x27/0x50

    He bisected it down to e332f741a8dd ("mm, compaction: be selective about
    what pageblocks to clear skip hints"). The problem is that the patch in
    question was sloppy with respect to the handling of zone boundaries. In
    some instances, it was possible for PFNs outside of a zone to be examined
    and if those were not properly initialised or poisoned then it would
    trigger the VM_BUG_ON. This patch corrects the zone boundary issues when
    resetting pageblock skip hints and Mikhail reported that the bug did not
    trigger after 30 hours of testing.

    Link: http://lkml.kernel.org/r/20190327085424.GL3189@techsingularity.net
    Fixes: e332f741a8dd ("mm, compaction: be selective about what pageblocks to clear skip hints")
    Reported-by: Mikhail Gavrilov
    Tested-by: Mikhail Gavrilov
    Cc: Daniel Jordan
    Cc: Qian Cai
    Cc: Vlastimil Babka
    Signed-off-by: Mel Gorman

    Mel Gorman
     

06 Mar, 2019

4 commits

    too_many_isolated() in mm/compaction.c looks only at node state, so
    it makes more sense to change its argument to pgdat instead of zone.

    Link: http://lkml.kernel.org/r/20190228083329.31892-3-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Acked-by: Vlastimil Babka
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: William Kucharski
    Cc: John Hubbard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
    We have a common pattern to access lru_lock from a page pointer:
    zone_lru_lock(page_zone(page))

    Which is silly, because it unfolds to this:
    &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)]->zone_pgdat->lru_lock
    while we can simply do
    &NODE_DATA(page_to_nid(page))->lru_lock

    Remove the zone_lru_lock() function, since it only complicates
    things. Use the 'page_pgdat(page)->lru_lock' pattern instead.

    [aryabinin@virtuozzo.com: a slightly better version of __split_huge_page()]
    Link: http://lkml.kernel.org/r/20190301121651.7741-1-aryabinin@virtuozzo.com
    Link: http://lkml.kernel.org/r/20190228083329.31892-2-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Acked-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: William Kucharski
    Cc: John Hubbard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • Compaction is inherently race-prone as a suitable page freed during
    compaction can be allocated by any parallel task. This patch uses a
    capture_control structure to isolate a page immediately when it is freed
    by a direct compactor in the slow path of the page allocator. The
    intent is to avoid redundant scanning.
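
    In outline (simplified; the real task_capc() also checks kthreads,
    the gfp mask and whether this is direct compaction):

        struct capture_control {
                struct compact_control *cc;
                struct page *page;      /* filled by the freeing side */
        };

        /* page free path: is the current task capturing for this zone? */
        static inline struct capture_control *task_capc(struct zone *zone)
        {
                struct capture_control *capc = current->capture_control;

                return capc && !capc->page && capc->cc->zone == zone ?
                        capc : NULL;
        }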

    5.0.0-rc1 5.0.0-rc1
    selective-v3r17 capture-v3r19
    Amean fault-both-1 0.00 ( 0.00%) 0.00 * 0.00%*
    Amean fault-both-3 2582.11 ( 0.00%) 2563.68 ( 0.71%)
    Amean fault-both-5 4500.26 ( 0.00%) 4233.52 ( 5.93%)
    Amean fault-both-7 5819.53 ( 0.00%) 6333.65 ( -8.83%)
    Amean fault-both-12 9321.18 ( 0.00%) 9759.38 ( -4.70%)
    Amean fault-both-18 9782.76 ( 0.00%) 10338.76 ( -5.68%)
    Amean fault-both-24 15272.81 ( 0.00%) 13379.55 * 12.40%*
    Amean fault-both-30 15121.34 ( 0.00%) 16158.25 ( -6.86%)
    Amean fault-both-32 18466.67 ( 0.00%) 18971.21 ( -2.73%)

    Latency is only moderately affected, but the devil is in the details.
    A closer examination indicates that base page fault latency is
    reduced but latency of huge pages is increased, as it takes greater
    care to succeed. Part of the "problem" is that allocation success
    rates are close to 100% even when under pressure, and compaction gets
    harder:

    5.0.0-rc1 5.0.0-rc1
    selective-v3r17 capture-v3r19
    Percentage huge-3 96.70 ( 0.00%) 98.23 ( 1.58%)
    Percentage huge-5 96.99 ( 0.00%) 95.30 ( -1.75%)
    Percentage huge-7 94.19 ( 0.00%) 97.24 ( 3.24%)
    Percentage huge-12 94.95 ( 0.00%) 97.35 ( 2.53%)
    Percentage huge-18 96.74 ( 0.00%) 97.30 ( 0.58%)
    Percentage huge-24 97.07 ( 0.00%) 97.55 ( 0.50%)
    Percentage huge-30 95.69 ( 0.00%) 98.50 ( 2.95%)
    Percentage huge-32 96.70 ( 0.00%) 99.27 ( 2.65%)

    And scan rates are reduced as expected by 6% for the migration scanner
    and 29% for the free scanner indicating that there is less redundant
    work.

    Compaction migrate scanned 20815362 19573286
    Compaction free scanned 16352612 11510663

    [mgorman@techsingularity.net: remove redundant check]
    Link: http://lkml.kernel.org/r/20190201143853.GH9565@techsingularity.net
    Link: http://lkml.kernel.org/r/20190118175136.31341-23-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Dan Carpenter
    Cc: David Rientjes
    Cc: YueHaibing
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Pageblock hints are cleared when compaction restarts or kswapd makes
    enough progress that it can sleep but it's over-eager in that the bit is
    cleared for migration sources with no LRU pages and migration targets
    with no free pages. As pageblock skip hint flushes are relatively rare
    and out-of-band with respect to kswapd, this patch makes a few more
    expensive checks to see if it's appropriate to even clear the bit.
    Every pageblock that is not cleared will avoid 512 pages being scanned
    unnecessarily on x86-64.

    The impact is variable with different workloads showing small
    differences in latency, success rates and scan rates. This is expected
    as clearing the hints is not that common but doing a small amount of
    work out-of-band to avoid a large amount of work in-band later is
    generally a good thing.

    Link: http://lkml.kernel.org/r/20190118175136.31341-22-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Signed-off-by: Qian Cai
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Dan Carpenter
    Cc: David Rientjes
    Cc: YueHaibing
    [cai@lca.pw: no stuck in __reset_isolation_pfn()]
    Link: http://lkml.kernel.org/r/20190206034732.75687-1-cai@lca.pw
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman