16 Dec, 2020
2 commits
-
ARM is the only architecture that defines CONFIG_ARCH_HAS_HOLES_MEMORYMODEL
which in turn enables memmap_valid_within() function that is intended to
verify existence of struct page associated with a pfn when there are holes
in the memory map.

However, the ARCH_HAS_HOLES_MEMORYMODEL also enables HAVE_ARCH_PFN_VALID
and arch-specific pfn_valid() implementation that also deals with the holes
in the memory map.

The only two users of memmap_valid_within() call this function after
a call to pfn_valid() so the memmap_valid_within() check becomes redundant.

Remove CONFIG_ARCH_HAS_HOLES_MEMORYMODEL and memmap_valid_within() and rely
entirely on ARM's implementation of pfn_valid() that is now enabled
unconditionally.

Link: https://lkml.kernel.org/r/20201101170454.9567-9-rppt@kernel.org
Signed-off-by: Mike Rapoport
Cc: Alexey Dobriyan
Cc: Catalin Marinas
Cc: Geert Uytterhoeven
Cc: Greg Ungerer
Cc: John Paul Adrian Glaubitz
Cc: Jonathan Corbet
Cc: Matt Turner
Cc: Meelis Roos
Cc: Michael Schmitz
Cc: Russell King
Cc: Tony Luck
Cc: Vineet Gupta
Cc: Will Deacon
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
For many workloads, pagetable consumption is significant and it makes
sense to expose it in the memory.stat for the memory cgroups. However at
the moment, the pagetables are accounted per-zone. Converting them to
per-node and using the right interface will correctly account for the
memory cgroups as well.

[akpm@linux-foundation.org: export __mod_lruvec_page_state to modules for arch/mips/kvm/]
Link: https://lkml.kernel.org/r/20201130212541.2781790-3-shakeelb@google.com
Signed-off-by: Shakeel Butt
Acked-by: Johannes Weiner
Acked-by: Roman Gushchin
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
17 Oct, 2020
1 commit
-
Use the abs() helper macro to simplify the "x > t || x < -t" comparison.
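For illustration, a schematic before/after of the pattern (the variable
and function names here are placeholders, not the exact mm/vmstat.c hunk):

    /* before: open-coded symmetric range check */
    if (x > t || x < -t)
            do_sync();

    /* after: the same test via the abs() helper */
    if (abs(x) > t)
            do_sync();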
Signed-off-by: Miaohe Lin
Signed-off-by: Andrew Morton
Reviewed-by: Andrew Morton
Link: https://lkml.kernel.org/r/20200905084008.15748-1-linmiaohe@huawei.com
Signed-off-by: Linus Torvalds
05 Sep, 2020
2 commits
-
Merge emailed patches from Peter Xu:
"This is a small series that I picked up from Linus's suggestion to
simplify cow handling (and also make it more strict) by checking
against page refcounts rather than mapcounts.

This makes uffd-wp work again (verified by running upmapsort)"
Note: this is horrendously bad timing, and making this kind of
fundamental vm change after -rc3 is not at all how things should work.
The saving grace is that it really is a nice simplification:

 8 files changed, 29 insertions(+), 120 deletions(-)
The reason for the bad timing is that it turns out that commit
17839856fd58 ("gup: document and work around 'COW can break either way'
issue" broke not just UFFD functionality (as Peter noticed), but Mikulas
Patocka also reports that it caused issues for strace when running in a
DAX environment with ext4 on a persistent memory setup.

And we can't just revert that commit without re-introducing the original
issue that is a potential security hole, so making COW stricter (and in
the process much simpler) is a step to then undoing the forced COW that
broke other uses.

Link: https://lore.kernel.org/lkml/alpine.LRH.2.02.2009031328040.6929@file01.intranet.prod.int.rdu2.redhat.com/
* emailed patches from Peter Xu:
mm: Add PGREUSE counter
mm/gup: Remove enfornced COW mechanism
mm/ksm: Remove reuse_ksm_page()
mm: do_wp_page() simplification
-
This accounts for the wp_page_reuse() case, where we reused a page for COW.
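As a minimal sketch of where the event is counted (simplified from
wp_page_reuse() in mm/memory.c; the surrounding PTE handling is elided):

    static inline void wp_page_reuse(struct vm_fault *vmf)
    {
            /* ... re-validate the PTE and mark it writable ... */
            count_vm_event(PGREUSE);
    }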
Signed-off-by: Peter Xu
Signed-off-by: Linus Torvalds
15 Aug, 2020
1 commit
-
This reverts commit 26e7deadaae175.
Sonny reported that one of their tests started failing on the latest
kernel on their Chrome OS platform. The root cause is that the above
commit removed the protection line of empty zone, while the parser used in
the test relies on the protection line to mark the end of each zone.

Let's revert it to avoid breaking userspace testing or applications.
Fixes: 26e7deadaae175 ("mm/vmstat.c: do not show lowmem reserve protection information of empty zone")
Reported-by: Sonny Rao
Signed-off-by: Baoquan He
Signed-off-by: Andrew Morton
Reviewed-by: David Hildenbrand
Acked-by: David Rientjes
Cc: [5.8.x]
Link: http://lkml.kernel.org/r/20200811075412.12872-1-bhe@redhat.com
Signed-off-by: Linus Torvalds
13 Aug, 2020
4 commits
-
Add following new vmstat events which will help in validating THP
migration without split. Statistics reported through these new VM events
will help in performance debugging.

1. THP_MIGRATION_SUCCESS
2. THP_MIGRATION_FAILURE
3. THP_MIGRATION_SPLIT

In addition, these new events also update normal page migration statistics
appropriately via PGMIGRATE_SUCCESS and PGMIGRATE_FAILURE. While here,
this updates current trace event 'mm_migrate_pages' to accommodate now
available THP statistics.

[akpm@linux-foundation.org: s/hpage_nr_pages/thp_nr_pages/]
[ziy@nvidia.com: v2]
Link: http://lkml.kernel.org/r/C5E3C65C-8253-4638-9D3C-71A61858BB8B@nvidia.com
[anshuman.khandual@arm.com: s/thp_nr_pages/hpage_nr_pages/]
Link: http://lkml.kernel.org/r/1594287583-16568-1-git-send-email-anshuman.khandual@arm.com

Signed-off-by: Anshuman Khandual
Signed-off-by: Zi Yan
Signed-off-by: Andrew Morton
Reviewed-by: Daniel Jordan
Cc: Hugh Dickins
Cc: Matthew Wilcox
Cc: Zi Yan
Cc: John Hubbard
Cc: Naoya Horiguchi
Link: http://lkml.kernel.org/r/1594080415-27924-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Linus Torvalds
-
Proactive compaction uses per-node/zone "fragmentation score" which is
always in range [0, 100], so use an unsigned type for these scores as well
as for the related constants.

Signed-off-by: Nitin Gupta
Signed-off-by: Andrew Morton
Reviewed-by: Baoquan He
Cc: Luis Chamberlain
Cc: Kees Cook
Cc: Iurii Zaikin
Cc: Vlastimil Babka
Cc: Joonsoo Kim
Link: http://lkml.kernel.org/r/20200618010319.13159-1-nigupta@nvidia.com
Signed-off-by: Linus Torvalds
-
For some applications, we need to allocate almost all memory as hugepages.
However, on a running system, higher-order allocations can fail if the
memory is fragmented. Linux kernel currently does on-demand compaction as
we request more hugepages, but this style of compaction incurs very high
latency. Experiments with one-time full memory compaction (followed by
hugepage allocations) show that kernel is able to restore a highly
fragmented memory state to a fairly compacted memory state within a very
short time.

1. Kernel hugepage allocation latencies

- With vanilla 5.6.0-rc3: [the percentile latency table was lost in
  extraction; its total likewise showed that 98% of free memory could be
  allocated as hugepages]

- With 5.6.0-rc3 + this patch, with proactiveness=20
sysctl -w vm.compaction_proactiveness=20
percentile latency
–––––––––– –––––––
5 2
10 2
25 3
30 3
40 3
50 4
60 4
75 4
80 4
90 5
95 429

Total 2M hugepages allocated = 384105 (750G worth of hugepages out of 762G
total free => 98% of free memory could be allocated as hugepages)

2. JAVA heap allocation
In this test, we first fragment memory using the same method as for (1).
Then, we start a Java process with a heap size set to 700G and request the
heap to be allocated with THP hugepages. We also set THP to madvise to
allow hugepage backing of this heap.

/usr/bin/time java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch

The above command allocates 700G of Java heap using hugepages.
- With vanilla 5.6.0-rc3
17.39user 1666.48system 27:37.89elapsed
- With 5.6.0-rc3 + this patch, with proactiveness=20
8.35user 194.58system 3:19.62elapsed
Elapsed time remains around 3:15, as proactiveness is further increased.
Note that proactive compaction happens throughout the runtime of these
workloads. The situation of one-time compaction, sufficient to supply
hugepages for following allocation stream, can probably happen for more
extreme proactiveness values, like 80 or 90.

In the above Java workload, proactiveness is set to 20. The test starts
with a node's score of 80 or higher, depending on the delay between the
fragmentation step and starting the benchmark, which gives more-or-less
time for the initial round of compaction. As the benchmark consumes
hugepages, node's score quickly rises above the high threshold (90) and
proactive compaction starts again, which brings down the score to the low
threshold level (80). Repeat.

bpftrace also confirms proactive compaction running 20+ times during the
runtime of this Java benchmark. kcompactd threads consume 100% of one of
the CPUs while it tries to bring a node's score within thresholds.

Backoff behavior
================

Above workloads produce a memory state which is easy to compact. However,
if memory is filled with unmovable pages, proactive compaction should
essentially back off. To test this aspect:

- Created a kernel driver that allocates almost all memory as hugepages
followed by freeing first 3/4 of each hugepage.
- Set proactiveness=40
- Note that proactive_compact_node() is deferred maximum number of times
with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
(=> ~30 seconds between retries).

[1] https://patchwork.kernel.org/patch/11098289/
[2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
[3] https://lwn.net/Articles/817905/

Signed-off-by: Nitin Gupta
Signed-off-by: Andrew Morton
Tested-by: Oleksandr Natalenko
Reviewed-by: Vlastimil Babka
Reviewed-by: Khalid Aziz
Reviewed-by: Oleksandr Natalenko
Cc: Vlastimil Babka
Cc: Khalid Aziz
Cc: Michal Hocko
Cc: Mel Gorman
Cc: Matthew Wilcox
Cc: Mike Kravetz
Cc: Joonsoo Kim
Cc: David Rientjes
Cc: Nitin Gupta
Cc: Oleksandr Natalenko
Link: http://lkml.kernel.org/r/20200616204527.19185-1-nigupta@nvidia.com
Signed-off-by: Linus Torvalds
-
To prepare the workingset detection for anon LRU, this patch splits
workingset event counters for refault, activate and restore into anon and
file variants, as well as the refaults counter in struct lruvec.

Signed-off-by: Joonsoo Kim
Signed-off-by: Andrew Morton
Acked-by: Johannes Weiner
Acked-by: Vlastimil Babka
Cc: Hugh Dickins
Cc: Matthew Wilcox
Cc: Mel Gorman
Cc: Michal Hocko
Cc: Minchan Kim
Link: http://lkml.kernel.org/r/1595490560-15117-4-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds
08 Aug, 2020
2 commits
-
Currently the kernel stack is being accounted per-zone. There is no need
to do that. In addition due to being per-zone, memcg has to keep a
separate MEMCG_KERNEL_STACK_KB. Make the stat per-node and deprecate
MEMCG_KERNEL_STACK_KB as memcg_stat_item is an extension of
node_stat_item. In addition localize the kernel stack stats updates to
account_kernel_stack().

Signed-off-by: Shakeel Butt
Signed-off-by: Andrew Morton
Reviewed-by: Roman Gushchin
Cc: Johannes Weiner
Cc: Michal Hocko
Link: http://lkml.kernel.org/r/20200630161539.1759185-1-shakeelb@google.com
Signed-off-by: Linus Torvalds
-
To implement per-object slab memory accounting, we need to convert slab
vmstat counters to bytes. Actually, out of 4 levels of counters: global,
per-node, per-memcg and per-lruvec only two last levels will require
byte-sized counters. It's because global and per-node counters will be
counting the number of slab pages, and per-memcg and per-lruvec will be
counting the amount of memory taken by charged slab objects.

Converting all vmstat counters to bytes or even all slab counters to bytes
would introduce an additional overhead. So instead let's store global and
per-node counters in pages, and memcg and lruvec counters in bytes.

To make the API clean, all access helpers (both on the read and write
sides) deal with bytes.

To avoid back-and-forth conversions, a new flavor of read-side helpers is
introduced, which always returns values in pages: node_page_state_pages()
and global_node_page_state_pages().

Actually, the new helpers just read raw values. The old helpers are simple
wrappers, which will complain on an attempt to read a byte value, because
at the moment no one actually needs bytes.

Thanks to Johannes Weiner for the idea of having the byte-sized API on top
of the page-sized internal storage.

Signed-off-by: Roman Gushchin
Signed-off-by: Andrew Morton
Reviewed-by: Vlastimil Babka
Reviewed-by: Shakeel Butt
Acked-by: Johannes Weiner
Cc: Christoph Lameter
Cc: Michal Hocko
Cc: Tejun Heo
Link: http://lkml.kernel.org/r/20200623174037.3951353-3-guro@fb.com
Signed-off-by: Linus Torvalds
05 Jun, 2020
1 commit
-
Use DEFINE_SEQ_ATTRIBUTE macro to simplify the code.
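Roughly, the macro folds a hand-written open helper and file_operations
into one line. A sketch, assuming a seq_operations named fragmentation_sops
as in mm/vmstat.c:

    /* before */
    static int fragmentation_open(struct inode *inode, struct file *file)
    {
            return seq_open(file, &fragmentation_sops);
    }

    static const struct file_operations fragmentation_fops = {
            .open    = fragmentation_open,
            .read    = seq_read,
            .llseek  = seq_lseek,
            .release = seq_release,
    };

    /* after: generates an equivalent fragmentation_fops */
    DEFINE_SEQ_ATTRIBUTE(fragmentation);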
Signed-off-by: Kefeng Wang
Signed-off-by: Andrew Morton
Cc: Anil S Keshavamurthy
Cc: "David S. Miller"
Cc: Greg KH
Cc: Ingo Molnar
Cc: Masami Hiramatsu
Cc: Al Viro
Link: http://lkml.kernel.org/r/20200509064031.181091-3-wangkefeng.wang@huawei.com
Signed-off-by: Linus Torvalds
04 Jun, 2020
4 commits
-
Merge more updates from Andrew Morton:
"More mm/ work, plenty more to comeSubsystems affected by this patch series: slub, memcg, gup, kasan,
pagealloc, hugetlb, vmscan, tools, mempolicy, memblock, hugetlbfs,
thp, mmap, kconfig"* akpm: (131 commits)
arm64: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
x86: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
riscv: support DEBUG_WX
mm: add DEBUG_WX support
drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup
mm/thp: rename pmd_mknotpresent() as pmd_mkinvalid()
powerpc/mm: drop platform defined pmd_mknotpresent()
mm: thp: don't need to drain lru cache when splitting and mlocking THP
hugetlbfs: get unmapped area below TASK_UNMAPPED_BASE for hugetlbfs
sparc32: register memory occupied by kernel as memblock.memory
include/linux/memblock.h: fix minor typo and unclear comment
mm, mempolicy: fix up gup usage in lookup_node
tools/vm/page_owner_sort.c: filter out unneeded line
mm: swap: memcg: fix memcg stats for huge pages
mm: swap: fix vmstats for huge pages
mm: vmscan: limit the range of LRU type balancing
mm: vmscan: reclaim writepage is IO cost
mm: vmscan: determine anon/file pressure balance at the reclaim root
mm: balance LRU lists based on relative thrashing
mm: only count actual rotations as LRU reclaim cost
...
-
Having statistics on pages scanned and pages reclaimed for both anon and
file pages makes it easier to evaluate changes to LRU balancing.

While at it, clean up the stat-keeping mess for isolation, putback,
reclaim stats etc. a bit: first the physical LRU operation (isolation and
putback), followed by vmstats, reclaim_stats, and then vm events.

Signed-off-by: Johannes Weiner
Signed-off-by: Andrew Morton
Cc: Joonsoo Kim
Cc: Michal Hocko
Cc: Minchan Kim
Cc: Rik van Riel
Link: http://lkml.kernel.org/r/20200520232525.798933-3-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds
-
The lowmem reserve protection of a zone can't tell us anything if the zone
is empty; it only adds one more line to /proc/zoneinfo.

Let's remove it from the output for such zones.
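Schematically (an illustrative sketch, not the exact hunk; the helper name
is hypothetical):

    /* emit the "protection:" line only when the zone has pages */
    if (populated_zone(zone))
            show_lowmem_reserve(m, zone);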
Signed-off-by: Baoquan He
Signed-off-by: Andrew Morton
Link: http://lkml.kernel.org/r/20200402140113.3696-4-bhe@redhat.com
Signed-off-by: Linus Torvalds
-
Pull networking updates from David Miller:
1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
Augusto von Dentz.

2) Add GSO partial support to igc, from Sasha Neftin.
3) Several cleanups and improvements to r8169 from Heiner Kallweit.
4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
device self-test. From Andrew Lunn.

5) Start moving away from custom driver versions, use the globally
defined kernel version instead, from Leon Romanovsky.

6) Support GRO via gro_cells in DSA layer, from Alexander Lobakin.
7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.
8) Add sriov and vf support to hinic, from Luo bin.
9) Support Media Redundancy Protocol (MRP) in the bridging code, from
Horatiu Vultur.

10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.
11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
Dubroca. Also add ipv6 support for espintcp.

12) Lots of ReST conversions of the networking documentation, from Mauro
Carvalho Chehab.

13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
from Doug Berger.

14) Allow to dump cgroup id and filter by it in inet_diag code, from
Dmitry Yakunin.

15) Add infrastructure to export netlink attribute policies to
userspace, from Johannes Berg.

16) Several optimizations to sch_fq scheduler, from Eric Dumazet.
17) Fallback to the default qdisc if qdisc init fails because otherwise
a packet scheduler init failure will make a device inoperative. From
Jesper Dangaard Brouer.

18) Several RISCV bpf jit optimizations, from Luke Nelson.
19) Correct the return type of the ->ndo_start_xmit() method in several
drivers, it's netdev_tx_t but many drivers were using
'int'. From Yunjian Wang.

20) Add an ethtool interface for PHY master/slave config, from Oleksij
Rempel.

21) Add BPF iterators, from Yonghong Song.
22) Add cable test infrastructure, including ethtool interfaces, from
Andrew Lunn. Marvell PHY driver is the first to support this
facility.

23) Remove zero-length arrays all over, from Gustavo A. R. Silva.
24) Calculate and maintain an explicit frame size in XDP, from Jesper
Dangaard Brouer.

25) Add CAP_BPF, from Alexei Starovoitov.
26) Support terse dumps in the packet scheduler, from Vlad Buslov.
27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.
28) Add devm_register_netdev(), from Bartosz Golaszewski.
29) Minimize qdisc resets, from Cong Wang.
30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
eliminate set_fs/get_fs calls. From Christoph Hellwig.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
selftests: net: ip_defrag: ignore EPERM
net_failover: fixed rollback in net_failover_open()
Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
vmxnet3: allow rx flow hash ops only when rss is enabled
hinic: add set_channels ethtool_ops support
selftests/bpf: Add a default $(CXX) value
tools/bpf: Don't use $(COMPILE.c)
bpf, selftests: Use bpf_probe_read_kernel
s390/bpf: Use bcr 0,%0 as tail call nop filler
s390/bpf: Maintain 8-byte stack alignment
selftests/bpf: Fix verifier test
selftests/bpf: Fix sample_cnt shared between two threads
bpf, selftests: Adapt cls_redirect to call csum_level helper
bpf: Add csum_level helper for fixing up csum levels
bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
crypto/chtls: IPv6 support for inline TLS
Crypto/chcr: Fixes a coccinile check error
Crypto/chcr: Fixes compilations warnings
...
03 Jun, 2020
2 commits
-
Merge updates from Andrew Morton:
"A few little subsystems and a start of a lot of MM patches.Subsystems affected by this patch series: squashfs, ocfs2, parisc,
vfs. With mm subsystems: slab-generic, slub, debug, pagecache, gup,
swap, memcg, pagemap, memory-failure, vmalloc, kasan"* emailed patches from Andrew Morton : (128 commits)
kasan: move kasan_report() into report.c
mm/mm_init.c: report kasan-tag information stored in page->flags
ubsan: entirely disable alignment checks under UBSAN_TRAP
kasan: fix clang compilation warning due to stack protector
x86/mm: remove vmalloc faulting
mm: remove vmalloc_sync_(un)mappings()
x86/mm/32: implement arch_sync_kernel_mappings()
x86/mm/64: implement arch_sync_kernel_mappings()
mm/ioremap: track which page-table levels were modified
mm/vmalloc: track which page-table levels were modified
mm: add functions to track page directory modifications
s390: use __vmalloc_node in stack_alloc
powerpc: use __vmalloc_node in alloc_vm_stack
arm64: use __vmalloc_node in arch_alloc_vmap_stack
mm: remove vmalloc_user_node_flags
mm: switch the test_vmalloc module to use __vmalloc_node
mm: remove __vmalloc_node_flags_caller
mm: remove both instances of __vmalloc_node_flags
mm: remove the prot argument to __vmalloc_node
mm: remove the pgprot argument to __vmalloc
...
-
After an NFS page has been written it is considered "unstable" until a
COMMIT request succeeds. If the COMMIT fails, the page will be
re-written.

These "unstable" pages are currently accounted as "reclaimable", either
in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a
'reclaimable' count. This might have made sense when sending the COMMIT
required a separate action by the VFS/MM (e.g. releasepage() used to
send a COMMIT). However now that all writes generated by ->writepages()
will automatically be followed by a COMMIT (since commit 919e3bd9a875
("NFS: Ensure we commit after writeback is complete")) it makes more
sense to treat them as writeback pages.

So this patch removes NR_UNSTABLE_NFS and accounts unstable pages in
NR_WRITEBACK and WB_WRITEBACK.

A particular effect of this change is that when
wb_check_background_flush() calls wb_over_bg_threshold(), the latter
will report 'true' a lot less often as the 'unstable' pages are no
longer considered 'dirty' (as there is nothing that writeback can do
about them anyway).

Currently wb_check_background_flush() will trigger writeback to NFS even
when there are relatively few dirty pages (if there are lots of unstable
pages), this can result in small writes going to the server (10s of
Kilobytes rather than a Megabyte) which hurts throughput. With this
patch, there are fewer writes which are each larger on average.

Where the NR_UNSTABLE_NFS count was included in statistics
virtual-files, the entry is retained, but the value is hard-coded as
zero. Static trace points and warning printks which mentioned this
counter no longer report it.

[akpm@linux-foundation.org: re-layout comment]
[akpm@linux-foundation.org: fix printk warning]
Signed-off-by: NeilBrown
Signed-off-by: Andrew Morton
Reviewed-by: Jan Kara
Reviewed-by: Christoph Hellwig
Acked-by: Trond Myklebust
Acked-by: Michal Hocko [mm]
Cc: Christoph Hellwig
Cc: Chuck Lever
Link: http://lkml.kernel.org/r/87d06j7gqa.fsf@notabene.neil.brown.name
Signed-off-by: Linus Torvalds
15 May, 2020
1 commit
-
This change adds accounting for the memory allocated for shadow stacks.
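A sketch of the accounting (simplified from kernel/scs.c; it assumes the
NR_KERNEL_SCS_KB counter this change introduces, called with account = 1
on allocation and -1 on free):

    static void scs_account(struct task_struct *tsk, int account)
    {
            struct page *page = virt_to_page(task_scs(tsk));

            mod_zone_page_state(page_zone(page), NR_KERNEL_SCS_KB,
                                account * (SCS_SIZE / SZ_1K));
    }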
Signed-off-by: Sami Tolvanen
Reviewed-by: Kees Cook
Acked-by: Will Deacon
Signed-off-by: Will Deacon
27 Apr, 2020
1 commit
-
Instead of having all the sysctl handlers deal with user pointers, which
is rather hairy in terms of the BPF interaction, copy the input to and
from userspace in common code. This also means that the strings are
always NUL-terminated by the common code, making the API a little bit
safer.

As most handlers just pass the data through to one of the common handlers,
a lot of the changes are mechanical.

Signed-off-by: Christoph Hellwig
Acked-by: Andrey Ignatov
Signed-off-by: Al Viro
08 Apr, 2020
2 commits
-
The thp_fault_fallback and thp_file_fallback vmstats are incremented if
either the hugepage allocation fails through the page allocator or the
hugepage charge fails through mem cgroup.

This patch leaves this field untouched but adds two new fields,
thp_{fault,file}_fallback_charge, which is incremented only when the mem
cgroup charge fails.

This distinguishes between attempted hugepage allocations that fail due to
fragmentation (or low memory conditions) and those that fail due to mem
cgroup limits. That can be used to determine the impact of fragmentation
on the system by excluding faults that failed due to memcg usage.

Signed-off-by: David Rientjes
Signed-off-by: Andrew Morton
Reviewed-by: Yang Shi
Acked-by: Kirill A. Shutemov
Cc: Mike Rapoport
Cc: Jeremy Cline
Cc: Andrea Arcangeli
Cc: Mike Kravetz
Cc: Michal Hocko
Cc: Vlastimil Babka
Link: http://lkml.kernel.org/r/alpine.DEB.2.21.2003061422070.7412@chino.kir.corp.google.com
Signed-off-by: Linus Torvalds
-
The existing thp_fault_fallback indicates when thp attempts to allocate a
hugepage but fails, or if the hugepage cannot be charged to the mem cgroup
hierarchy.

Extend this to shmem as well. Adds a new thp_file_fallback to complement
thp_file_alloc that gets incremented when a hugepage is attempted to be
allocated but fails, or if it cannot be charged to the mem cgroup
hierarchy.

Additionally, remove the check for CONFIG_TRANSPARENT_HUGE_PAGECACHE from
shmem_alloc_hugepage() since it is only called with this configuration
option.

Signed-off-by: David Rientjes
Signed-off-by: Andrew Morton
Reviewed-by: Yang Shi
Acked-by: Kirill A. Shutemov
Cc: Mike Rapoport
Cc: Jeremy Cline
Cc: Andrea Arcangeli
Cc: Mike Kravetz
Cc: Michal Hocko
Cc: Vlastimil Babka
Link: http://lkml.kernel.org/r/alpine.DEB.2.21.2003061421240.7412@chino.kir.corp.google.com
Signed-off-by: Linus Torvalds
03 Apr, 2020
1 commit
-
Now that pages are "DMA-pinned" via pin_user_page*(), and unpinned via
unpin_user_pages*(), we need some visibility into whether all of this is
working correctly.

Add two new fields to /proc/vmstat:
nr_foll_pin_acquired
nr_foll_pin_released

These are documented in Documentation/core-api/pin_user_pages.rst. They
represent the number of pages (since boot time) that have been pinned
("nr_foll_pin_acquired") and unpinned ("nr_foll_pin_released"), via
pin_user_pages*() and unpin_user_pages*().

In the absence of long-running DMA or RDMA operations that hold pages
pinned, the above two fields will normally be equal to each other.

Also: update Documentation/core-api/pin_user_pages.rst, to remove an
earlier (now confirmed untrue) claim about a performance problem with
/proc/vmstat.

Also: update Documentation/core-api/pin_user_pages.rst to rename the new
/proc/vmstat entries, to the names listed here.

Signed-off-by: John Hubbard
Signed-off-by: Andrew Morton
Reviewed-by: Jan Kara
Acked-by: Kirill A. Shutemov
Cc: Ira Weiny
Cc: Jérôme Glisse
Cc: "Matthew Wilcox (Oracle)"
Cc: Al Viro
Cc: Christoph Hellwig
Cc: Dan Williams
Cc: Dave Chinner
Cc: Jason Gunthorpe
Cc: Jonathan Corbet
Cc: Michal Hocko
Cc: Mike Kravetz
Cc: Shuah Khan
Cc: Vlastimil Babka
Link: http://lkml.kernel.org/r/20200211001536.1027652-9-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds
05 Dec, 2019
2 commits
-
Use common names from the vmstat array when possible. This makes little
difference in code size for now, but should help in keeping interfaces
consistent.

add/remove: 0/2 grow/shrink: 2/0 up/down: 70/-72 (-2)
Function old new delta
memory_stat_format 984 1050 +66
memcg_stat_show 957 961 +4
memcg1_event_names 32 - -32
mem_cgroup_lru_names 40 - -40
Total: Before=14485337, After=14485335, chg -0.00%

Link: http://lkml.kernel.org/r/157113012508.453.80391533767219371.stgit@buzz
Signed-off-by: Konstantin Khlebnikov
Acked-by: Andrew Morton
Cc: Michal Hocko
Cc: Vladimir Davydov
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Statistics in vmstat is combined from counters with different structure,
but names for them are merged into one array.

This patch adds trivial helpers to get the name for each item:
const char *zone_stat_name(enum zone_stat_item item);
const char *numa_stat_name(enum numa_stat_item item);
const char *node_stat_name(enum node_stat_item item);
const char *writeback_stat_name(enum writeback_stat_item item);
const char *vm_event_name(enum vm_event_item item);

Names for enum writeback_stat_item are folded in the middle of
vmstat_text so this patch moves declaration into header to calculate
offset of the following items.

Also this patch reuses a piece of the node stat names for the lru list names:
const char *lru_list_name(enum lru_list lru);
This returns common lru list names: "inactive_anon", "active_anon",
"inactive_file", "active_file", "unevictable".[khlebnikov@yandex-team.ru: do not use size of vmstat_text as count of /proc/vmstat items]
Link: http://lkml.kernel.org/r/157152151769.4139.15423465513138349343.stgit@buzz
Link: https://lore.kernel.org/linux-mm/cd1c42ae-281f-c8a8-70ac-1d01d417b2e1@infradead.org/T/#u
Link: http://lkml.kernel.org/r/157113012325.453.562783073839432766.stgit@buzz
Signed-off-by: Konstantin Khlebnikov
Reviewed-by: Andrew Morton
Cc: Randy Dunlap
Cc: Michal Hocko
Cc: Vladimir Davydov
Cc: Johannes Weiner
Cc: YueHaibing
Cc: Stephen Rothwell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
07 Nov, 2019
2 commits
-
pagetypeinfo_showfree_print() is called with zone->lock held in IRQ mode.
This is not really nice because it blocks both any interrupts on that
cpu and the page allocator. On large machines this might even trigger
the hard lockup detector.

Considering the pagetypeinfo is a debugging tool, we do not really need
exact numbers here. The primary reason to look at the output is to see
how pageblocks are spread among different migratetypes and low number of
pages is much more interesting therefore putting a bound on the number
of pages on the free_list sounds like a reasonable tradeoff.

The new output will simply tell
[...]
Node 6, zone Normal, type Movable >100000 >100000 >100000 >100000 41019 31560 23996 10054 3229 983 648

instead of
Node 6, zone Normal, type Movable 399568 294127 221558 102119 41019 31560 23996 10054 3229 983 648

The limit has been chosen arbitrarily and is subject to future change
should there be a need for that.

While we are at it, also drop the zone lock after each free_list
iteration which will help with the IRQ and page allocator responsiveness
even further as the IRQ lock held time is always bound to those 100k
pages.

[akpm@linux-foundation.org: tweak comment text, per David Hildenbrand]
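The bounded walk looks roughly like this (simplified from
pagetypeinfo_showfree_print() in mm/vmstat.c):

    list_for_each(curr, &area->free_list[mtype]) {
            /*
             * Cap the iteration: the free_list can be huge and we
             * hold zone->lock, so settle for reporting ">100000".
             */
            if (++freecount >= 100000) {
                    overflow = true;
                    break;
            }
    }
    seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
    spin_unlock_irq(&zone->lock);
    cond_resched();
    spin_lock_irq(&zone->lock);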
Link: http://lkml.kernel.org/r/20191025072610.18526-3-mhocko@kernel.org
Signed-off-by: Michal Hocko
Suggested-by: Andrew Morton
Reviewed-by: Waiman Long
Acked-by: Vlastimil Babka
Acked-by: David Hildenbrand
Acked-by: Rafael Aquini
Acked-by: David Rientjes
Reviewed-by: Andrew Morton
Cc: Greg Kroah-Hartman
Cc: Jann Horn
Cc: Johannes Weiner
Cc: Konstantin Khlebnikov
Cc: Mel Gorman
Cc: Roman Gushchin
Cc: Song Liu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
/proc/pagetypeinfo is a debugging tool to examine internal page
allocator state with respect to fragmentation. It is not very useful for
anything else, so normal users really do not need to read this file.

Waiman Long has noticed that reading this file can have negative side
effects because zone->lock is necessary for gathering data and that a)
interferes with the page allocator and its users and b) can lead to hard
lockups on large machines which have very long free_list.

Reduce both issues by simply not exporting the file to regular users.
Link: http://lkml.kernel.org/r/20191025072610.18526-2-mhocko@kernel.org
Fixes: 467c996c1e19 ("Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo")
Signed-off-by: Michal Hocko
Reported-by: Waiman Long
Acked-by: Mel Gorman
Acked-by: Vlastimil Babka
Acked-by: Waiman Long
Acked-by: Rafael Aquini
Acked-by: David Rientjes
Reviewed-by: Andrew Morton
Cc: David Hildenbrand
Cc: Johannes Weiner
Cc: Roman Gushchin
Cc: Konstantin Khlebnikov
Cc: Jann Horn
Cc: Song Liu
Cc: Greg Kroah-Hartman
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
25 Sep, 2019
1 commit
-
In preparation for non-shmem THP, this patch adds a few stats and exposes
them in /proc/meminfo, /sys/bus/node/devices/<node>/meminfo, and
/proc/<pid>/task/<tid>/smaps.

This patch is mostly a rewrite of Kirill A. Shutemov's earlier version:
https://lkml.kernel.org/r/20170126115819.58875-5-kirill.shutemov@linux.intel.com/

Link: http://lkml.kernel.org/r/20190801184244.3169074-5-songliubraving@fb.com
Signed-off-by: Song Liu
Acked-by: Rik van Riel
Acked-by: Kirill A. Shutemov
Acked-by: Johannes Weiner
Cc: Hillf Danton
Cc: Hugh Dickins
Cc: William Kucharski
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
21 May, 2019
1 commit
-
Add SPDX license identifiers to all files which:
- Have no license information of any form
- Have EXPORT_.*_SYMBOL_GPL inside which was used in the
initial scan/conversion to ignore the file

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

GPL-2.0-only
Signed-off-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman
20 Apr, 2019
1 commit
-
Commit 58bc4c34d249 ("mm/vmstat.c: skip NR_TLB_REMOTE_FLUSH* properly")
depends on skipping vmstat entries with empty name introduced in
7aaf77272358 ("mm: don't show nr_indirectly_reclaimable in
/proc/vmstat") but reverted in b29940c1abd7 ("mm: rename and change
semantics of nr_indirectly_reclaimable_bytes").So skipping no longer works and /proc/vmstat has misformatted lines " 0".
This patch simply shows debug counters "nr_tlb_remote_*" for UP.
Link: http://lkml.kernel.org/r/155481488468.467.4295519102880913454.stgit@buzz
Fixes: 58bc4c34d249 ("mm/vmstat.c: skip NR_TLB_REMOTE_FLUSH* properly")
Signed-off-by: Konstantin Khlebnikov
Acked-by: Vlastimil Babka
Cc: Roman Gushchin
Cc: Jann Horn
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
06 Mar, 2019
1 commit
-
When calling debugfs functions, there is no need to ever check the
return value. The function can work or not, but the code logic should
never do something different based on this.

Link: http://lkml.kernel.org/r/20190122152151.16139-14-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman
Cc: Michal Hocko
Cc: Vlastimil Babka
Cc: David Rientjes
Cc: Laura Abbott
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
29 Dec, 2018
1 commit
-
totalram_pages, zone->managed_pages and totalhigh_pages updates are
protected by managed_page_count_lock, but readers never care about it.
Convert these variables to atomic to avoid readers potentially seeing a
store tear.

This patch converts zone->managed_pages. Subsequent patches will convert
totalram_pages, totalhigh_pages and eventually managed_page_count_lock
will be removed.

The main motivation was that managed_page_count_lock handling was
complicating things. It was discussed at length here,
https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemed
better to remove the lock and convert the variables to atomic, with
preventing potential store-to-read tearing as a bonus.

Link: http://lkml.kernel.org/r/1542090790-21750-3-git-send-email-arunks@codeaurora.org
Signed-off-by: Arun KS
Suggested-by: Michal Hocko
Suggested-by: Vlastimil Babka
Reviewed-by: Konstantin Khlebnikov
Reviewed-by: David Hildenbrand
Acked-by: Michal Hocko
Acked-by: Vlastimil Babka
Reviewed-by: Pavel Tatashin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
19 Nov, 2018
1 commit
-
Scan through the whole array to see if an update is needed. While we're
at it, use sizeof() to be safe against any possible type changes in the
future.

The bug here is that we wouldn't sync per-cpu counters into global ones
if there was an update of numa_stats for higher cpus. Highly
theoretical one though because it is much more probable that zone_stats
are updated so we would refresh anyway. So I wouldn't bother to mark
this for stable, yet something nice to fix.

[mhocko@suse.com: changelog enhancement]
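The fixed check looks roughly like this (simplified from need_update() in
mm/vmstat.c): scan both per-cpu diff arrays in full, with lengths taken
from sizeof():

    if (memchr_inv(p->vm_stat_diff, 0, sizeof(p->vm_stat_diff)))
            return true;
    #ifdef CONFIG_NUMA
    if (memchr_inv(p->vm_numa_stat_diff, 0,
                   sizeof(p->vm_numa_stat_diff)))
            return true;
    #endif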
Link: http://lkml.kernel.org/r/1541601517-17282-1-git-send-email-janne.huttunen@nokia.com
Fixes: 1d90ca897cb0 ("mm: update NUMA counter threshold size")
Signed-off-by: Janne Huttunen
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
27 Oct, 2018
4 commits
-
Having two gigantic arrays that must manually be kept in sync, including
ifdefs, isn't exactly robust. To make it easier to catch such issues in
the future, add a BUILD_BUG_ON().

Link: http://lkml.kernel.org/r/20181001143138.95119-3-jannh@google.com
Signed-off-by: Jann Horn
Reviewed-by: Kees Cook
Reviewed-by: Andrew Morton
Acked-by: Roman Gushchin
Acked-by: Michal Hocko
Cc: Davidlohr Bueso
Cc: Oleg Nesterov
Cc: Christoph Lameter
Cc: Kemi Wang
Cc: Andy Lutomirski
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Make it easier to catch bugs in the shadow node shrinker by adding a
counter for the shadow nodes in circulation.

[akpm@linux-foundation.org: assert that irqs are disabled, for __inc_lruvec_page_state()]
[akpm@linux-foundation.org: s/WARN_ON_ONCE/VM_WARN_ON_ONCE/, per Johannes]
Link: http://lkml.kernel.org/r/20181009184732.762-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner
Reviewed-by: Andrew Morton
Acked-by: Peter Zijlstra (Intel)
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Refaults happen during transitions between workingsets as well as in-place
thrashing. Knowing the difference between the two has a range of
applications, including measuring the impact of memory shortage on the
system performance, as well as the ability to smarter balance pressure
between the filesystem cache and the swap-backed workingset.

During workingset transitions, inactive cache refaults and pushes out
established active cache. When that active cache isn't stale, however,
and also ends up refaulting, that's bona fide thrashing.

Introduce a new page flag that tells on eviction whether the page has been
active or not in its lifetime. This bit is then stored in the shadow
entry, to classify refaults as transitioning or thrashing.

How many page->flags does this leave us with on 32-bit?
20 bits are always page flags
21 if you have an MMU
23 with the zone bits for DMA, Normal, HighMem, Movable
29 with the sparsemem section bits
30 if PAE is enabled
31 with this patch.
So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes. If
that's not enough, the system can switch to discontigmem and re-gain the 6
or 7 sparsemem section bits.

Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner
Acked-by: Peter Zijlstra (Intel)
Tested-by: Daniel Drake
Tested-by: Suren Baghdasaryan
Cc: Christopher Lameter
Cc: Ingo Molnar
Cc: Johannes Weiner
Cc: Mike Galbraith
Cc: Peter Enderborg
Cc: Randy Dunlap
Cc: Shakeel Butt
Cc: Tejun Heo
Cc: Vinayak Menon
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The vmstat counter NR_INDIRECTLY_RECLAIMABLE_BYTES was introduced by
commit eb59254608bc ("mm: introduce NR_INDIRECTLY_RECLAIMABLE_BYTES") with
the goal of accounting objects that can be reclaimed, but cannot be
allocated via a SLAB_RECLAIM_ACCOUNT cache. This is now possible via
kmalloc() with __GFP_RECLAIMABLE flag, and the dcache external names user
is converted.

The counter is however still useful for accounting direct page allocations
(i.e. not slab) with a shrinker, such as the ION page pool. So keep it,
and:

- change granularity to pages to be more like other counters; sub-page
allocations should be able to use kmalloc
- rename the counter to NR_KERNEL_MISC_RECLAIMABLE
- expose the counter again in vmstat as "nr_kernel_misc_reclaimable"; we can
again remove the check for not printing "hidden" countersLink: http://lkml.kernel.org/r/20180731090649.16028-5-vbabka@suse.cz
Signed-off-by: Vlastimil Babka
Acked-by: Christoph Lameter
Acked-by: Roman Gushchin
Cc: Vijayanand Jitta
Cc: Laura Abbott
Cc: Sumit Semwal
Cc: David Rientjes
Cc: Johannes Weiner
Cc: Joonsoo Kim
Cc: Matthew Wilcox
Cc: Mel Gorman
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
06 Oct, 2018
2 commits
-
5dd0b16cdaff ("mm/vmstat: Make NR_TLB_REMOTE_FLUSH_RECEIVED available even
on UP") made the availability of the NR_TLB_REMOTE_FLUSH* counters inside
the kernel unconditional to reduce #ifdef soup, but (either to avoid
showing dummy zero counters to userspace, or because that code was missed)
didn't update the vmstat_text array, meaning that all following counters
would be shown with incorrect values.

This only affects kernel builds with
CONFIG_VM_EVENT_COUNTERS=y && CONFIG_DEBUG_TLBFLUSH=y && CONFIG_SMP=n.

Link: http://lkml.kernel.org/r/20181001143138.95119-2-jannh@google.com
Fixes: 5dd0b16cdaff ("mm/vmstat: Make NR_TLB_REMOTE_FLUSH_RECEIVED available even on UP")
Signed-off-by: Jann Horn
Reviewed-by: Kees Cook
Reviewed-by: Andrew Morton
Acked-by: Michal Hocko
Acked-by: Roman Gushchin
Cc: Davidlohr Bueso
Cc: Oleg Nesterov
Cc: Christoph Lameter
Cc: Kemi Wang
Cc: Andy Lutomirski
Cc: Ingo Molnar
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Greg Kroah-Hartman -
7a9cdebdcc17 ("mm: get rid of vmacache_flush_all() entirely") removed the
VMACACHE_FULL_FLUSHES statistics, but didn't remove the corresponding
entry in vmstat_text. This causes an out-of-bounds access in
vmstat_show().

Luckily this only affects kernels with CONFIG_DEBUG_VM_VMACACHE=y, which
is probably very rare.

Link: http://lkml.kernel.org/r/20181001143138.95119-1-jannh@google.com
Fixes: 7a9cdebdcc17 ("mm: get rid of vmacache_flush_all() entirely")
Signed-off-by: Jann Horn
Reviewed-by: Kees Cook
Reviewed-by: Andrew Morton
Acked-by: Michal Hocko
Acked-by: Roman Gushchin
Cc: Davidlohr Bueso
Cc: Oleg Nesterov
Cc: Christoph Lameter
Cc: Kemi Wang
Cc: Andy Lutomirski
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Greg Kroah-Hartman