26 Oct, 2020

1 commit


17 Oct, 2020

1 commit

  • Use helper macro abs() to simplify the "x > t || x < -t" comparison.
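
    A minimal standalone illustration of the transformation (values made up;
    labs() stands in here for the kernel's type-generic abs()):

        #include <stdio.h>
        #include <stdlib.h>

        int main(void)
        {
                long x = -42, t = 10;

                int before = (x > t || x < -t); /* open-coded range check */
                int after = (labs(x) > t);      /* the same check via abs() */

                printf("%d %d\n", before, after); /* prints: 1 1 */
                return 0;
        }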

    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20200905084008.15748-1-linmiaohe@huawei.com
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     

07 Sep, 2020

1 commit


05 Sep, 2020

2 commits

  • Merge emailed patches from Peter Xu:
    "This is a small series that I picked up from Linus's suggestion to
    simplify cow handling (and also make it more strict) by checking
    against page refcounts rather than mapcounts.

    This makes uffd-wp work again (verified by running upmapsort)"

    Note: this is horrendously bad timing, and making this kind of
    fundamental vm change after -rc3 is not at all how things should work.
    The saving grace is that it really is a nice simplification:

    8 files changed, 29 insertions(+), 120 deletions(-)

    The reason for the bad timing is that it turns out that commit
    17839856fd58 ("gup: document and work around 'COW can break either way'
    issue") broke not just UFFD functionality (as Peter noticed), but Mikulas
    Patocka also reports that it caused issues for strace when running in a
    DAX environment with ext4 on a persistent memory setup.

    And we can't just revert that commit without re-introducing the original
    issue that is a potential security hole, so making COW stricter (and in
    the process much simpler) is a step to then undoing the forced COW that
    broke other uses.

    Link: https://lore.kernel.org/lkml/alpine.LRH.2.02.2009031328040.6929@file01.intranet.prod.int.rdu2.redhat.com/

    * emailed patches from Peter Xu:
    mm: Add PGREUSE counter
    mm/gup: Remove enfornced COW mechanism
    mm/ksm: Remove reuse_ksm_page()
    mm: do_wp_page() simplification
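
    A hedged sketch of the refcount-based reuse decision the series moves to
    (names from mm/memory.c; the actual patch differs in detail):

        /* In the write-protect fault path: reuse the page only when we hold
         * the sole reference; otherwise copy. No mapcount games. */
        if (PageAnon(page) && page_count(page) == 1)
                wp_page_reuse(vmf);     /* nobody else can see the page */
        else
                wp_page_copy(vmf);      /* someone else might: do the COW */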

    Linus Torvalds
     
  • Add a PGREUSE counter, which accounts for the wp_page_reuse() case, where
    we reuse a page for COW instead of copying it.
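
    The counter shows up in /proc/vmstat as "pgreuse". A small standalone
    reader (assuming a kernel that includes this patch):

        #include <stdio.h>
        #include <string.h>

        int main(void)
        {
                char name[128];
                unsigned long long val;
                FILE *f = fopen("/proc/vmstat", "r");

                if (!f)
                        return 1;
                while (fscanf(f, "%127s %llu", name, &val) == 2)
                        if (strcmp(name, "pgreuse") == 0)
                                printf("pgreuse = %llu\n", val);
                fclose(f);
                return 0;
        }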

    Signed-off-by: Peter Xu
    Signed-off-by: Linus Torvalds

    Peter Xu
     

17 Aug, 2020

1 commit


15 Aug, 2020

1 commit

  • This reverts commit 26e7deadaae175.

    Sonny reported that one of their tests started failing on the latest
    kernel on their Chrome OS platform. The root cause is that the above
    commit removed the lowmem reserve protection line for empty zones, while
    the parser used in the test relies on that line to mark the end of each
    zone.

    Let's revert it to avoid breaking userspace testing or applications.

    Fixes: 26e7deadaae175 ("mm/vmstat.c: do not show lowmem reserve protection information of empty zone")
    Reported-by: Sonny Rao
    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: David Rientjes
    Cc: [5.8.x]
    Link: http://lkml.kernel.org/r/20200811075412.12872-1-bhe@redhat.com
    Signed-off-by: Linus Torvalds

    Baoquan He
     

13 Aug, 2020

5 commits

  • …ernel/git/abelloni/linux") into android-mainline

    Steps on the way to 5.9-rc1.

    Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
    Change-Id: Iceded779988ff472863b7e1c54e22a9fa6383a30

    Greg Kroah-Hartman
     
  • Add following new vmstat events which will help in validating THP
    migration without split. Statistics reported through these new VM events
    will help in performance debugging.

    1. THP_MIGRATION_SUCCESS
    2. THP_MIGRATION_FAILURE
    3. THP_MIGRATION_SPLIT

    In addition, these new events also update normal page migration statistics
    appropriately via PGMIGRATE_SUCCESS and PGMIGRATE_FAILURE. While here,
    this updates current trace event 'mm_migrate_pages' to accommodate now
    available THP statistics.
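
    A hedged sketch of how the accounting fits together (count_vm_event() and
    count_vm_events() are the standard vmstat hooks; the control flow and
    nr_subpages are illustrative, and the event names are the ones listed
    above):

        static void account_thp_migration(bool migrated, bool split,
                                          int nr_subpages)
        {
                if (migrated) {
                        count_vm_event(THP_MIGRATION_SUCCESS);
                        count_vm_events(PGMIGRATE_SUCCESS, nr_subpages);
                } else if (split) {
                        count_vm_event(THP_MIGRATION_SPLIT);
                } else {
                        count_vm_event(THP_MIGRATION_FAILURE);
                        count_vm_events(PGMIGRATE_FAILURE, nr_subpages);
                }
        }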

    [akpm@linux-foundation.org: s/hpage_nr_pages/thp_nr_pages/]
    [ziy@nvidia.com: v2]
    Link: http://lkml.kernel.org/r/C5E3C65C-8253-4638-9D3C-71A61858BB8B@nvidia.com
    [anshuman.khandual@arm.com: s/thp_nr_pages/hpage_nr_pages/]
    Link: http://lkml.kernel.org/r/1594287583-16568-1-git-send-email-anshuman.khandual@arm.com

    Signed-off-by: Anshuman Khandual
    Signed-off-by: Zi Yan
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Cc: Zi Yan
    Cc: John Hubbard
    Cc: Naoya Horiguchi
    Link: http://lkml.kernel.org/r/1594080415-27924-1-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     
  • Proactive compaction uses a per-node/zone "fragmentation score" which is
    always in the range [0, 100], so use an unsigned type for these scores as
    well as for the related constants.

    Signed-off-by: Nitin Gupta
    Signed-off-by: Andrew Morton
    Reviewed-by: Baoquan He
    Cc: Luis Chamberlain
    Cc: Kees Cook
    Cc: Iurii Zaikin
    Cc: Vlastimil Babka
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20200618010319.13159-1-nigupta@nvidia.com
    Signed-off-by: Linus Torvalds

    Nitin Gupta
     
  • For some applications, we need to allocate almost all memory as hugepages.
    However, on a running system, higher-order allocations can fail if the
    memory is fragmented. The Linux kernel currently does on-demand compaction
    as we request more hugepages, but this style of compaction incurs very
    high latency. Experiments with one-time full memory compaction (followed
    by hugepage allocations) show that the kernel is able to restore a highly
    fragmented memory state to a fairly compacted memory state within <1 sec
    for a 32G system. Such data suggests that a more proactive compaction can
    help us allocate a large fraction of memory as hugepages while keeping
    allocation latencies low. This approach is largely based on ideas from
    Michal Hocko [2]; see also an earlier version of this patch [1] and the
    LWN article on proactive compaction [3].

    1. Hugepage allocation latencies

    Memory is first fragmented, then 2M hugepages are allocated and
    per-allocation latencies are recorded.

    - With 5.6.0-rc3 + this patch, with proactiveness=20

    sysctl -w vm.compaction_proactiveness=20

    percentile latency
    –––––––––– –––––––
    5 2
    10 2
    25 3
    30 3
    40 3
    50 4
    60 4
    75 4
    80 4
    90 5
    95 429

    Total 2M hugepages allocated = 384105 (750G worth of hugepages out of 762G
    total free => 98% of free memory could be allocated as hugepages)

    2. JAVA heap allocation

    In this test, we first fragment memory using the same method as for (1).

    Then, we start a Java process with a heap size set to 700G and request the
    heap to be allocated with THP hugepages. We also set THP to madvise to
    allow hugepage backing of this heap.

    /usr/bin/time
    java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch

    The above command allocates 700G of Java heap using hugepages.

    - With vanilla 5.6.0-rc3

    17.39user 1666.48system 27:37.89elapsed

    - With 5.6.0-rc3 + this patch, with proactiveness=20

    8.35user 194.58system 3:19.62elapsed

    Elapsed time remains around 3:15 as proactiveness is increased further.

    Note that proactive compaction happens throughout the runtime of these
    workloads. The situation of one-time compaction, sufficient to supply
    hugepages for following allocation stream, can probably happen for more
    extreme proactiveness values, like 80 or 90.

    In the above Java workload, proactiveness is set to 20. The test starts
    with a node's score of 80 or higher, depending on the delay between the
    fragmentation step and starting the benchmark, which gives more or less
    time for the initial round of compaction. As the benchmark consumes
    hugepages, the node's score quickly rises above the high threshold (90)
    and proactive compaction starts again, which brings the score back down
    to the low threshold level (80). Repeat.
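
    The threshold arithmetic implied by the numbers above, as a standalone
    sketch (the formula low = 100 - proactiveness, high = low + 10 is
    inferred from the 80/90 values quoted here):

        #include <stdio.h>

        int main(void)
        {
                unsigned int proactiveness = 20; /* vm.compaction_proactiveness */
                unsigned int low = 100 - proactiveness; /* compact down to this */
                unsigned int high = low + 10;  /* start compacting above this */

                printf("low=%u high=%u\n", low, high); /* low=80 high=90 */
                return 0;
        }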

    bpftrace also confirms proactive compaction running 20+ times during the
    runtime of this Java benchmark. kcompactd threads consume 100% of one of
    the CPUs while trying to bring a node's score within thresholds.

    Backoff behavior
    ================

    The above workloads produce a memory state which is easy to compact.
    However, if memory is filled with unmovable pages, proactive compaction
    should essentially back off. To test this aspect:

    - Created a kernel driver that allocates almost all memory as hugepages
    followed by freeing first 3/4 of each hugepage.
    - Set proactiveness=40
    - Note that proactive_compact_node() is deferred the maximum number of
    times with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
    (=> ~30 seconds between retries).

    [1] https://patchwork.kernel.org/patch/11098289/
    [2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
    [3] https://lwn.net/Articles/817905/

    Signed-off-by: Nitin Gupta
    Signed-off-by: Andrew Morton
    Tested-by: Oleksandr Natalenko
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Khalid Aziz
    Reviewed-by: Oleksandr Natalenko
    Cc: Vlastimil Babka
    Cc: Khalid Aziz
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Matthew Wilcox
    Cc: Mike Kravetz
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Nitin Gupta
    Cc: Oleksandr Natalenko
    Link: http://lkml.kernel.org/r/20200616204527.19185-1-nigupta@nvidia.com
    Signed-off-by: Linus Torvalds

    Nitin Gupta
     
  • To prepare the workingset detection for anon LRU, this patch splits
    workingset event counters for refault, activate and restore into anon and
    file variants, as well as the refaults counter in struct lruvec.
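
    A hedged sketch of the resulting counter layout (illustrative enum; the
    patch defines an anon and a file variant for each event):

        /* each workingset event now comes in an (anon, file) pair */
        enum workingset_event {
                WORKINGSET_REFAULT_ANON,  WORKINGSET_REFAULT_FILE,
                WORKINGSET_ACTIVATE_ANON, WORKINGSET_ACTIVATE_FILE,
                WORKINGSET_RESTORE_ANON,  WORKINGSET_RESTORE_FILE,
        };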

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Link: http://lkml.kernel.org/r/1595490560-15117-4-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

08 Aug, 2020

3 commits

  • …kernel/git/sre/linux-power-supply") into android-mainline

    Merges along the way to 5.9-rc1

    resolves conflicts in:
    Documentation/ABI/testing/sysfs-class-power
    drivers/power/supply/power_supply_sysfs.c
    fs/crypto/inline_crypt.c

    Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
    Change-Id: Ia087834f54fb4e5269d68c3c404747ceed240701

    Greg Kroah-Hartman
     
  • Currently the kernel stack is being accounted per-zone. There is no need
    to do that. In addition, due to being per-zone, memcg has to keep a
    separate MEMCG_KERNEL_STACK_KB. Make the stat per-node and deprecate
    MEMCG_KERNEL_STACK_KB, since memcg_stat_item is an extension of
    node_stat_item. Also localize the kernel stack stats updates to
    account_kernel_stack().
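
    A hedged sketch of the localized update (mod_lruvec_page_state() and
    NR_KERNEL_STACK_KB are existing kernel symbols; details vary with how the
    stack was allocated):

        static void account_kernel_stack(struct task_struct *tsk, int account)
        {
                /* one per-node counter in KiB: +1 stack on fork, -1 on free */
                mod_lruvec_page_state(virt_to_page(tsk->stack),
                                      NR_KERNEL_STACK_KB,
                                      account * (THREAD_SIZE / 1024));
        }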

    Signed-off-by: Shakeel Butt
    Signed-off-by: Andrew Morton
    Reviewed-by: Roman Gushchin
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200630161539.1759185-1-shakeelb@google.com
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     
  • To implement per-object slab memory accounting, we need to convert slab
    vmstat counters to bytes. Out of the four levels of counters (global,
    per-node, per-memcg and per-lruvec), only the last two require byte-sized
    counters: global and per-node counters count the number of slab pages,
    while per-memcg and per-lruvec counters count the amount of memory taken
    by charged slab objects.

    Converting all vmstat counters to bytes or even all slab counters to bytes
    would introduce an additional overhead. So instead let's store global and
    per-node counters in pages, and memcg and lruvec counters in bytes.

    To make the API clean all access helpers (both on the read and write
    sides) are dealing with bytes.

    To avoid back-and-forth conversions a new flavor of read-side helpers is
    introduced, which always returns values in pages: node_page_state_pages()
    and global_node_page_state_pages().

    The new helpers just read the raw values. The old helpers become simple
    wrappers, which complain on an attempt to read a byte-sized value, because
    at the moment no one actually needs bytes.
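
    A hedged sketch of that wrapper relationship (close to the actual
    helpers; vmstat_item_in_bytes() is the predicate distinguishing
    byte-sized items):

        unsigned long node_page_state_pages(struct pglist_data *pgdat,
                                            enum node_stat_item item)
        {
                return atomic_long_read(&pgdat->vm_stat[item]); /* raw pages */
        }

        unsigned long node_page_state(struct pglist_data *pgdat,
                                      enum node_stat_item item)
        {
                /* the old API must not be used for byte-sized items */
                VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));
                return node_page_state_pages(pgdat, item);
        }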

    Thanks to Johannes Weiner for the idea of having the byte-sized API on top
    of the page-sized internal storage.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Christoph Lameter
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-3-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

24 Jul, 2020

1 commit


05 Jun, 2020

1 commit

  • Use DEFINE_SEQ_ATTRIBUTE macro to simplify the code.
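
    A hedged before/after illustration (the "foo" names are made up; the
    macro, from include/linux/seq_file.h, expects a foo_sops seq_operations
    table):

        /* before: hand-rolled open handler and file_operations */
        static int foo_open(struct inode *inode, struct file *file)
        {
                return seq_open(file, &foo_sops);
        }

        static const struct file_operations foo_fops = {
                .open    = foo_open,
                .read    = seq_read,
                .llseek  = seq_lseek,
                .release = seq_release,
        };

        /* after: one line generates an equivalent foo_open/foo_fops pair */
        DEFINE_SEQ_ATTRIBUTE(foo);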

    Signed-off-by: Kefeng Wang
    Signed-off-by: Andrew Morton
    Cc: Anil S Keshavamurthy
    Cc: "David S. Miller"
    Cc: Greg KH
    Cc: Ingo Molnar
    Cc: Masami Hiramatsu
    Cc: Al Viro
    Link: http://lkml.kernel.org/r/20200509064031.181091-3-wangkefeng.wang@huawei.com
    Signed-off-by: Linus Torvalds

    Kefeng Wang
     

04 Jun, 2020

4 commits

  • Merge more updates from Andrew Morton:
    "More mm/ work, plenty more to come

    Subsystems affected by this patch series: slub, memcg, gup, kasan,
    pagealloc, hugetlb, vmscan, tools, mempolicy, memblock, hugetlbfs,
    thp, mmap, kconfig"

    * akpm: (131 commits)
    arm64: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
    x86: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
    riscv: support DEBUG_WX
    mm: add DEBUG_WX support
    drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup
    mm/thp: rename pmd_mknotpresent() as pmd_mkinvalid()
    powerpc/mm: drop platform defined pmd_mknotpresent()
    mm: thp: don't need to drain lru cache when splitting and mlocking THP
    hugetlbfs: get unmapped area below TASK_UNMAPPED_BASE for hugetlbfs
    sparc32: register memory occupied by kernel as memblock.memory
    include/linux/memblock.h: fix minor typo and unclear comment
    mm, mempolicy: fix up gup usage in lookup_node
    tools/vm/page_owner_sort.c: filter out unneeded line
    mm: swap: memcg: fix memcg stats for huge pages
    mm: swap: fix vmstats for huge pages
    mm: vmscan: limit the range of LRU type balancing
    mm: vmscan: reclaim writepage is IO cost
    mm: vmscan: determine anon/file pressure balance at the reclaim root
    mm: balance LRU lists based on relative thrashing
    mm: only count actual rotations as LRU reclaim cost
    ...

    Linus Torvalds
     
  • Having statistics on pages scanned and pages reclaimed for both anon and
    file pages makes it easier to evaluate changes to LRU balancing.

    While at it, clean up the stat-keeping mess for isolation, putback,
    reclaim stats etc. a bit: first the physical LRU operation (isolation and
    putback), followed by vmstats, reclaim_stats, and then vm events.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/20200520232525.798933-3-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The lowmem reserve protection of a zone can't tell us anything if the
    zone is empty; it only adds one more line to /proc/zoneinfo.

    Let's remove it from the output for such zones.

    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200402140113.3696-4-bhe@redhat.com
    Signed-off-by: Linus Torvalds

    Baoquan He
     
  • Pull networking updates from David Miller:

    1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
    Augusto von Dentz.

    2) Add GSO partial support to igc, from Sasha Neftin.

    3) Several cleanups and improvements to r8169 from Heiner Kallweit.

    4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
    device self-test. From Andrew Lunn.

    5) Start moving away from custom driver versions, use the globally
    defined kernel version instead, from Leon Romanovsky.

    6) Support GRO via gro_cells in DSA layer, from Alexander Lobakin.

    7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.

    8) Add sriov and vf support to hinic, from Luo bin.

    9) Support Media Redundancy Protocol (MRP) in the bridging code, from
    Horatiu Vultur.

    10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.

    11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
    Dubroca. Also add ipv6 support for espintcp.

    12) Lots of ReST conversions of the networking documentation, from Mauro
    Carvalho Chehab.

    13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
    from Doug Berger.

    14) Allow to dump cgroup id and filter by it in inet_diag code, from
    Dmitry Yakunin.

    15) Add infrastructure to export netlink attribute policies to
    userspace, from Johannes Berg.

    16) Several optimizations to sch_fq scheduler, from Eric Dumazet.

    17) Fallback to the default qdisc if qdisc init fails because otherwise
    a packet scheduler init failure will make a device inoperative. From
    Jesper Dangaard Brouer.

    18) Several RISCV bpf jit optimizations, from Luke Nelson.

    19) Correct the return type of the ->ndo_start_xmit() method in several
    drivers, it's netdev_tx_t but many drivers were using
    'int'. From Yunjian Wang.

    20) Add an ethtool interface for PHY master/slave config, from Oleksij
    Rempel.

    21) Add BPF iterators, from Yonghong Song.

    22) Add cable test infrastructure, including ethtool interfaces, from
    Andrew Lunn. Marvell PHY driver is the first to support this
    facility.

    23) Remove zero-length arrays all over, from Gustavo A. R. Silva.

    24) Calculate and maintain an explicit frame size in XDP, from Jesper
    Dangaard Brouer.

    25) Add CAP_BPF, from Alexei Starovoitov.

    26) Support terse dumps in the packet scheduler, from Vlad Buslov.

    27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.

    28) Add devm_register_netdev(), from Bartosz Golaszewski.

    29) Minimize qdisc resets, from Cong Wang.

    30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
    eliminate set_fs/get_fs calls. From Christoph Hellwig.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
    selftests: net: ip_defrag: ignore EPERM
    net_failover: fixed rollback in net_failover_open()
    Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
    Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
    vmxnet3: allow rx flow hash ops only when rss is enabled
    hinic: add set_channels ethtool_ops support
    selftests/bpf: Add a default $(CXX) value
    tools/bpf: Don't use $(COMPILE.c)
    bpf, selftests: Use bpf_probe_read_kernel
    s390/bpf: Use bcr 0,%0 as tail call nop filler
    s390/bpf: Maintain 8-byte stack alignment
    selftests/bpf: Fix verifier test
    selftests/bpf: Fix sample_cnt shared between two threads
    bpf, selftests: Adapt cls_redirect to call csum_level helper
    bpf: Add csum_level helper for fixing up csum levels
    bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
    sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
    crypto/chtls: IPv6 support for inline TLS
    Crypto/chcr: Fixes a coccinile check error
    Crypto/chcr: Fixes compilations warnings
    ...

    Linus Torvalds
     

03 Jun, 2020

2 commits

  • Merge updates from Andrew Morton:
    "A few little subsystems and a start of a lot of MM patches.

    Subsystems affected by this patch series: squashfs, ocfs2, parisc,
    vfs. With mm subsystems: slab-generic, slub, debug, pagecache, gup,
    swap, memcg, pagemap, memory-failure, vmalloc, kasan"

    * emailed patches from Andrew Morton: (128 commits)
    kasan: move kasan_report() into report.c
    mm/mm_init.c: report kasan-tag information stored in page->flags
    ubsan: entirely disable alignment checks under UBSAN_TRAP
    kasan: fix clang compilation warning due to stack protector
    x86/mm: remove vmalloc faulting
    mm: remove vmalloc_sync_(un)mappings()
    x86/mm/32: implement arch_sync_kernel_mappings()
    x86/mm/64: implement arch_sync_kernel_mappings()
    mm/ioremap: track which page-table levels were modified
    mm/vmalloc: track which page-table levels were modified
    mm: add functions to track page directory modifications
    s390: use __vmalloc_node in stack_alloc
    powerpc: use __vmalloc_node in alloc_vm_stack
    arm64: use __vmalloc_node in arch_alloc_vmap_stack
    mm: remove vmalloc_user_node_flags
    mm: switch the test_vmalloc module to use __vmalloc_node
    mm: remove __vmalloc_node_flags_caller
    mm: remove both instances of __vmalloc_node_flags
    mm: remove the prot argument to __vmalloc_node
    mm: remove the pgprot argument to __vmalloc
    ...

    Linus Torvalds
     
  • After an NFS page has been written it is considered "unstable" until a
    COMMIT request succeeds. If the COMMIT fails, the page will be
    re-written.

    These "unstable" pages are currently accounted as "reclaimable", either
    in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a
    'reclaimable' count. This might have made sense when sending the COMMIT
    required a separate action by the VFS/MM (e.g. releasepage() used to
    send a COMMIT). However now that all writes generated by ->writepages()
    will automatically be followed by a COMMIT (since commit 919e3bd9a875
    ("NFS: Ensure we commit after writeback is complete")) it makes more
    sense to treat them as writeback pages.

    So this patch removes NR_UNSTABLE_NFS and accounts unstable pages in
    NR_WRITEBACK and WB_WRITEBACK.

    A particular effect of this change is that when
    wb_check_background_flush() calls wb_over_bg_threshold(), the latter
    will report 'true' a lot less often as the 'unstable' pages are no
    longer considered 'dirty' (as there is nothing that writeback can do
    about them anyway).

    Currently wb_check_background_flush() will trigger writeback to NFS even
    when there are relatively few dirty pages (if there are lots of unstable
    pages), this can result in small writes going to the server (10s of
    Kilobytes rather than a Megabyte) which hurts throughput. With this
    patch, there are fewer writes which are each larger on average.

    Where the NR_UNSTABLE_NFS count was included in statistics
    virtual-files, the entry is retained, but the value is hard-coded as
    zero. Static trace points and warning printks which mentioned this
    counter no longer report it.
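
    For instance, in /proc/meminfo the line survives with a pinned value
    (hedged sketch; show_val_kb() is the existing meminfo helper):

        /* NR_UNSTABLE_NFS is gone; keep the line for ABI compatibility */
        show_val_kb(m, "NFS_Unstable:   ", 0);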

    [akpm@linux-foundation.org: re-layout comment]
    [akpm@linux-foundation.org: fix printk warning]
    Signed-off-by: NeilBrown
    Signed-off-by: Andrew Morton
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Acked-by: Trond Myklebust
    Acked-by: Michal Hocko [mm]
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Link: http://lkml.kernel.org/r/87d06j7gqa.fsf@notabene.neil.brown.name
    Signed-off-by: Linus Torvalds

    NeilBrown
     

15 May, 2020

1 commit


27 Apr, 2020

1 commit

  • Instead of having all the sysctl handlers deal with user pointers, which
    is rather hairy in terms of the BPF interaction, copy the input to and
    from userspace in common code. This also means that the strings are
    always NUL-terminated by the common code, making the API a little bit
    safer.

    As most handlers just pass the data through to one of the common
    handlers, a lot of the changes are mechanical.
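
    After the conversion, handlers receive plain kernel pointers. A hedged
    sketch of the resulting common-handler signature:

        /* before: void __user *buffer; after: a plain kernel buffer */
        int proc_dointvec(struct ctl_table *table, int write,
                          void *buffer, size_t *lenp, loff_t *ppos);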

    Signed-off-by: Christoph Hellwig
    Acked-by: Andrey Ignatov
    Signed-off-by: Al Viro

    Christoph Hellwig
     

08 Apr, 2020

2 commits

  • The thp_fault_fallback and thp_file_fallback vmstats are incremented if
    either the hugepage allocation fails through the page allocator or the
    hugepage charge fails through mem cgroup.

    This patch leaves this field untouched but adds two new fields,
    thp_{fault,file}_fallback_charge, which are incremented only when the mem
    cgroup charge fails.

    This distinguishes between attempted hugepage allocations that fail due to
    fragmentation (or low memory conditions) and those that fail due to mem
    cgroup limits. That can be used to determine the impact of fragmentation
    on the system by excluding faults that failed due to memcg usage.
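
    A hedged sketch of where the two counters diverge in the fault path
    (illustrative control flow, not the literal patch; alloc_thp() is a
    hypothetical helper, and mem_cgroup_charge() stands in for whatever
    charge call the kernel of that era used):

        static vm_fault_t thp_fault_sketch(struct vm_area_struct *vma,
                                           unsigned long haddr, gfp_t gfp)
        {
                struct page *page = alloc_thp(vma, haddr); /* hypothetical */

                if (!page) {
                        count_vm_event(THP_FAULT_FALLBACK); /* alloc failed */
                        return VM_FAULT_FALLBACK;
                }
                if (mem_cgroup_charge(page, vma->vm_mm, gfp)) {
                        count_vm_event(THP_FAULT_FALLBACK);        /* as before */
                        count_vm_event(THP_FAULT_FALLBACK_CHARGE); /* new */
                        put_page(page);
                        return VM_FAULT_FALLBACK;
                }
                return 0;
        }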

    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Reviewed-by: Yang Shi
    Acked-by: Kirill A. Shutemov
    Cc: Mike Rapoport
    Cc: Jeremy Cline
    Cc: Andrea Arcangeli
    Cc: Mike Kravetz
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/alpine.DEB.2.21.2003061422070.7412@chino.kir.corp.google.com
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The existing thp_fault_fallback indicates when thp attempts to allocate a
    hugepage but fails, or if the hugepage cannot be charged to the mem cgroup
    hierarchy.

    Extend this to shmem as well. Add a new thp_file_fallback counter, which
    complements thp_file_alloc and is incremented when a hugepage allocation
    is attempted but fails, or when the page cannot be charged to the mem
    cgroup hierarchy.

    Additionally, remove the check for CONFIG_TRANSPARENT_HUGE_PAGECACHE from
    shmem_alloc_hugepage() since it is only called with this configuration
    option.

    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Reviewed-by: Yang Shi
    Acked-by: Kirill A. Shutemov
    Cc: Mike Rapoport
    Cc: Jeremy Cline
    Cc: Andrea Arcangeli
    Cc: Mike Kravetz
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/alpine.DEB.2.21.2003061421240.7412@chino.kir.corp.google.com
    Signed-off-by: Linus Torvalds

    David Rientjes
     

03 Apr, 2020

1 commit

  • Now that pages are "DMA-pinned" via pin_user_page*(), and unpinned via
    unpin_user_pages*(), we need some visibility into whether all of this is
    working correctly.

    Add two new fields to /proc/vmstat:

    nr_foll_pin_acquired
    nr_foll_pin_released

    These are documented in Documentation/core-api/pin_user_pages.rst. They
    represent the number of pages (since boot time) that have been pinned
    ("nr_foll_pin_acquired") and unpinned ("nr_foll_pin_released"), via
    pin_user_pages*() and unpin_user_pages*().

    In the absence of long-running DMA or RDMA operations that hold pages
    pinned, the above two fields will normally be equal to each other.
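
    A small standalone check of that invariant, reporting the number of pages
    currently pinned as the difference of the two counters:

        #include <stdio.h>
        #include <string.h>

        int main(void)
        {
                char name[128];
                unsigned long long val, acquired = 0, released = 0;
                FILE *f = fopen("/proc/vmstat", "r");

                if (!f)
                        return 1;
                while (fscanf(f, "%127s %llu", name, &val) == 2) {
                        if (!strcmp(name, "nr_foll_pin_acquired"))
                                acquired = val;
                        else if (!strcmp(name, "nr_foll_pin_released"))
                                released = val;
                }
                fclose(f);
                printf("currently pinned: %llu pages\n", acquired - released);
                return 0;
        }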

    Also: update Documentation/core-api/pin_user_pages.rst, to remove an
    earlier (now confirmed untrue) claim about a performance problem with
    /proc/vmstat.

    Also: update Documentation/core-api/pin_user_pages.rst to rename the new
    /proc/vmstat entries to the names listed here.

    Signed-off-by: John Hubbard
    Signed-off-by: Andrew Morton
    Reviewed-by: Jan Kara
    Acked-by: Kirill A. Shutemov
    Cc: Ira Weiny
    Cc: Jérôme Glisse
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Jason Gunthorpe
    Cc: Jonathan Corbet
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Shuah Khan
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/20200211001536.1027652-9-jhubbard@nvidia.com
    Signed-off-by: Linus Torvalds

    John Hubbard
     

05 Dec, 2019

2 commits

  • Use common names from the vmstat array when possible. This makes little
    difference in code size for now, but should help in keeping interfaces
    consistent.

    add/remove: 0/2 grow/shrink: 2/0 up/down: 70/-72 (-2)
    Function old new delta
    memory_stat_format 984 1050 +66
    memcg_stat_show 957 961 +4
    memcg1_event_names 32 - -32
    mem_cgroup_lru_names 40 - -40
    Total: Before=14485337, After=14485335, chg -0.00%

    Link: http://lkml.kernel.org/r/157113012508.453.80391533767219371.stgit@buzz
    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Statistics in vmstat are combined from counters with different
    structures, but their names are merged into one array.

    This patch adds trivial helpers to get name for each item:

    const char *zone_stat_name(enum zone_stat_item item);
    const char *numa_stat_name(enum numa_stat_item item);
    const char *node_stat_name(enum node_stat_item item);
    const char *writeback_stat_name(enum writeback_stat_item item);
    const char *vm_event_name(enum vm_event_item item);

    Names for enum writeback_stat_item are folded into the middle of
    vmstat_text, so this patch moves the declaration into a header to
    calculate the offsets of the following items.

    This patch also reuses a piece of the node stat names for the lru list
    names:

    const char *lru_list_name(enum lru_list lru);

    This returns common lru list names: "inactive_anon", "active_anon",
    "inactive_file", "active_file", "unevictable".

    [khlebnikov@yandex-team.ru: do not use size of vmstat_text as count of /proc/vmstat items]
    Link: http://lkml.kernel.org/r/157152151769.4139.15423465513138349343.stgit@buzz
    Link: https://lore.kernel.org/linux-mm/cd1c42ae-281f-c8a8-70ac-1d01d417b2e1@infradead.org/T/#u
    Link: http://lkml.kernel.org/r/157113012325.453.562783073839432766.stgit@buzz
    Signed-off-by: Konstantin Khlebnikov
    Reviewed-by: Andrew Morton
    Cc: Randy Dunlap
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: YueHaibing
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

07 Nov, 2019

2 commits

  • pagetypeinfo_showfree_print is called with zone->lock held in IRQ mode.
    This is not really nice because it blocks both any interrupts on that
    cpu and the page allocator. On large machines this might even trigger
    the hard lockup detector.

    Considering pagetypeinfo is a debugging tool, we do not really need
    exact numbers here. The primary reason to look at the output is to see
    how pageblocks are spread among different migratetypes, and a low number
    of pages is much more interesting; therefore, putting a bound on the
    number of pages on the free_list sounds like a reasonable tradeoff.

    The new output will simply report
    [...]
    Node 6, zone Normal, type Movable >100000 >100000 >100000 >100000 41019 31560 23996 10054 3229 983 648

    instead of
    Node 6, zone Normal, type Movable 399568 294127 221558 102119 41019 31560 23996 10054 3229 983 648

    The limit has been chosen arbitrarily and is subject to future change
    should there be a need for that.

    While we are at it, also drop the zone lock after each free_list
    iteration, which will help IRQ and page allocator responsiveness even
    further, as the lock hold time is now always bounded by those 100k
    pages.
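
    A hedged sketch of the bounded walk with the periodic lock drop
    (illustrative fragment; the 100k cap is the one discussed above):

        list_for_each(curr, &area->free_list[mtype]) {
                if (++freecount >= 100000) {
                        overflow = true;
                        break;
                }
        }
        seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);

        /* give IRQs and the page allocator a chance between free_lists */
        spin_unlock_irq(&zone->lock);
        cond_resched();
        spin_lock_irq(&zone->lock);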

    [akpm@linux-foundation.org: tweak comment text, per David Hildenbrand]
    Link: http://lkml.kernel.org/r/20191025072610.18526-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Suggested-by: Andrew Morton
    Reviewed-by: Waiman Long
    Acked-by: Vlastimil Babka
    Acked-by: David Hildenbrand
    Acked-by: Rafael Aquini
    Acked-by: David Rientjes
    Reviewed-by: Andrew Morton
    Cc: Greg Kroah-Hartman
    Cc: Jann Horn
    Cc: Johannes Weiner
    Cc: Konstantin Khlebnikov
    Cc: Mel Gorman
    Cc: Roman Gushchin
    Cc: Song Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • /proc/pagetypeinfo is a debugging tool for examining internal page
    allocator state with respect to fragmentation. It is not very useful for
    anything else, so normal users really do not need to read this file.

    Waiman Long has noticed that reading this file can have negative side
    effects, because zone->lock is necessary for gathering the data, and that
    a) interferes with the page allocator and its users and b) can lead to
    hard lockups on large machines which have very long free_lists.

    Reduce both issues by simply not exporting the file to regular users.
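
    A hedged sketch of the change (restricting the proc entry to mode 0400 so
    only root can read it):

        proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op);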

    Link: http://lkml.kernel.org/r/20191025072610.18526-2-mhocko@kernel.org
    Fixes: 467c996c1e19 ("Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo")
    Signed-off-by: Michal Hocko
    Reported-by: Waiman Long
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Waiman Long
    Acked-by: Rafael Aquini
    Acked-by: David Rientjes
    Reviewed-by: Andrew Morton
    Cc: David Hildenbrand
    Cc: Johannes Weiner
    Cc: Roman Gushchin
    Cc: Konstantin Khlebnikov
    Cc: Jann Horn
    Cc: Song Liu
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

25 Sep, 2019

1 commit

  • In preparation for non-shmem THP, this patch adds a few stats and exposes
    them in /proc/meminfo, /sys/bus/node/devices//meminfo, and
    /proc//task//smaps.

    This patch is mostly a rewrite of Kirill A. Shutemov's earlier version:
    https://lkml.kernel.org/r/20170126115819.58875-5-kirill.shutemov@linux.intel.com/

    Link: http://lkml.kernel.org/r/20190801184244.3169074-5-songliubraving@fb.com
    Signed-off-by: Song Liu
    Acked-by: Rik van Riel
    Acked-by: Kirill A. Shutemov
    Acked-by: Johannes Weiner
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: William Kucharski
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only
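
    Concretely, each affected file gains this as its first line:

        // SPDX-License-Identifier: GPL-2.0-only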

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

20 Apr, 2019

1 commit

  • Commit 58bc4c34d249 ("mm/vmstat.c: skip NR_TLB_REMOTE_FLUSH* properly")
    depends on skipping vmstat entries with empty name introduced in
    7aaf77272358 ("mm: don't show nr_indirectly_reclaimable in
    /proc/vmstat") but reverted in b29940c1abd7 ("mm: rename and change
    semantics of nr_indirectly_reclaimable_bytes").

    So skipping no longer works and /proc/vmstat has misformatted lines " 0".

    This patch simply shows debug counters "nr_tlb_remote_*" for UP.

    Link: http://lkml.kernel.org/r/155481488468.467.4295519102880913454.stgit@buzz
    Fixes: 58bc4c34d249 ("mm/vmstat.c: skip NR_TLB_REMOTE_FLUSH* properly")
    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Vlastimil Babka
    Cc: Roman Gushchin
    Cc: Jann Horn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

06 Mar, 2019

1 commit

  • When calling debugfs functions, there is no need to ever check the
    return value. The function can work or not, but the code logic should
    never do something different based on this.
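
    A hedged before/after sketch, using vmstat's own "extfrag" debugfs
    directory as the example:

        /* before: error handling that changes nothing */
        struct dentry *d = debugfs_create_dir("extfrag", NULL);
        if (!d)
                return -ENOMEM;

        /* after: just call it; a debugfs failure is not fatal */
        debugfs_create_dir("extfrag", NULL);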

    Link: http://lkml.kernel.org/r/20190122152151.16139-14-gregkh@linuxfoundation.org
    Signed-off-by: Greg Kroah-Hartman
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Laura Abbott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Kroah-Hartman
     

29 Dec, 2018

1 commit

  • totalram_pages, zone->managed_pages and totalhigh_pages updates are
    protected by managed_page_count_lock, but readers never care about it.
    Convert these variables to atomic to avoid readers potentially seeing a
    store tear.

    This patch converts zone->managed_pages. Subsequent patches will convert
    totalram_pages, totalhigh_pages, and eventually managed_page_count_lock
    will be removed.

    The main motivation was that managed_page_count_lock handling was
    complicating things. It was discussed at length here,
    https://lore.kernel.org/patchwork/patch/995739/#1181785 - so it seems
    better to remove the lock and convert the variables to atomic, with
    preventing potential store-to-read tearing as a bonus.
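
    A hedged sketch of the reader side after the conversion (matching the
    style of the zone_managed_pages() helper):

        /* was: unsigned long managed_pages; now: atomic_long_t managed_pages */
        static inline unsigned long zone_managed_pages(struct zone *zone)
        {
                return (unsigned long)atomic_long_read(&zone->managed_pages);
        }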

    Link: http://lkml.kernel.org/r/1542090790-21750-3-git-send-email-arunks@codeaurora.org
    Signed-off-by: Arun KS
    Suggested-by: Michal Hocko
    Suggested-by: Vlastimil Babka
    Reviewed-by: Konstantin Khlebnikov
    Reviewed-by: David Hildenbrand
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Reviewed-by: Pavel Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun KS
     

19 Nov, 2018

1 commit

  • Scan through the whole array to see if an update is needed. While we're
    at it, use sizeof() to be safe against any possible type changes in the
    future.

    The bug here is that we wouldn't sync per-cpu counters into global ones
    if there was an update of numa_stats for higher cpus. This is highly
    theoretical, though, because it is much more probable that zone_stats
    are updated, so we would refresh anyway. So I wouldn't bother to mark
    this for stable, yet it is something nice to fix.
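
    A hedged sketch of the fixed scan (illustrative; the real check is in
    vmstat's need_update(), and memchr_inv() returns non-NULL if any byte
    differs from the given value):

        /* scan both per-cpu diff arrays in full; sizeof() keeps the scan
         * correct even if the element type changes later */
        if (memchr_inv(p->vm_stat_diff, 0, sizeof(p->vm_stat_diff)) ||
            memchr_inv(p->vm_numa_stat_diff, 0, sizeof(p->vm_numa_stat_diff)))
                return true;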

    [mhocko@suse.com: changelog enhancement]
    Link: http://lkml.kernel.org/r/1541601517-17282-1-git-send-email-janne.huttunen@nokia.com
    Fixes: 1d90ca897cb0 ("mm: update NUMA counter threshold size")
    Signed-off-by: Janne Huttunen
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Janne Huttunen
     

27 Oct, 2018

2 commits

  • Having two gigantic arrays that must manually be kept in sync, including
    ifdefs, isn't exactly robust. To make it easier to catch such issues in
    the future, add a BUILD_BUG_ON().
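
    A hedged sketch of such a check (the real one sums the item counts of
    every category that feeds vmstat_text):

        BUILD_BUG_ON(ARRAY_SIZE(vmstat_text) <
                     NR_VM_ZONE_STAT_ITEMS + NR_VM_NUMA_STAT_ITEMS +
                     NR_VM_NODE_STAT_ITEMS + NR_VM_WRITEBACK_STAT_ITEMS +
                     NR_VM_EVENT_ITEMS);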

    Link: http://lkml.kernel.org/r/20181001143138.95119-3-jannh@google.com
    Signed-off-by: Jann Horn
    Reviewed-by: Kees Cook
    Reviewed-by: Andrew Morton
    Acked-by: Roman Gushchin
    Acked-by: Michal Hocko
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Christoph Lameter
    Cc: Kemi Wang
    Cc: Andy Lutomirski
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • Make it easier to catch bugs in the shadow node shrinker by adding a
    counter for the shadow nodes in circulation.

    [akpm@linux-foundation.org: assert that irqs are disabled, for __inc_lruvec_page_state()]
    [akpm@linux-foundation.org: s/WARN_ON_ONCE/VM_WARN_ON_ONCE/, per Johannes]
    Link: http://lkml.kernel.org/r/20181009184732.762-4-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Andrew Morton
    Acked-by: Peter Zijlstra (Intel)
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner