27 Apr, 2022
1 commit
-
commit 9b3016154c913b2e7ec5ae5c9a42eb9e732d86aa upstream.
Daniel Dao has reported [1] a regression on workloads that may trigger a
lot of refaults (anon and file). The underlying issue is that flushing
rstat is expensive. Although rstat flushes are batched with (nr_cpus *
MEMCG_BATCH) stat updates, it seems like there are workloads which
genuinely do stat updates larger than the batch value within a short
amount of time. Since the rstat flush can happen in the performance
critical codepaths like page faults, such workloads can suffer greatly.

This patch fixes this regression by making the rstat flushing
conditional in the performance critical codepaths. More specifically,
the kernel relies on the async periodic rstat flusher to flush the stats
and only if the periodic flusher is delayed by more than twice the
amount of its normal time window then the kernel allows rstat flushing
from the performance critical codepaths.

Now the question: what are the side-effects of this change? The worst
that can happen is the refault codepath will see 4sec old lruvec stats
and may cause false (or missed) activations of the refaulted page which
may under- or overestimate the workingset size. Though that is not very
concerning as the kernel can already miss or do false activations.

There are two more codepaths whose flushing behavior is not changed by
this patch and we may need to come back to them in the future. One is the
writeback stats used by dirty throttling and the second is the deactivation
heuristic in the reclaim. For now we are keeping an eye on them and if there
is a report of a regression due to these codepaths, we will reevaluate then.
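As a rough sketch of the timing rule described above (a standalone model with made-up names; the 2-second window comes from the periodic flusher mentioned in the text, not from the upstream helper):

#include <stdbool.h>
#include <stdint.h>

/* The async flusher normally runs every FLUSH_PERIOD_MS (model value). */
#define FLUSH_PERIOD_MS 2000ULL

/* When the async flusher is next expected to run (maintained elsewhere). */
static uint64_t next_flush_ms;

/*
 * Hot paths such as the refault codepath call this instead of flushing
 * unconditionally: a synchronous flush is allowed only when the periodic
 * flusher is more than twice its normal window behind, i.e. the stats
 * could be roughly 4 seconds stale.
 */
static bool hot_path_should_flush(uint64_t now_ms)
{
	return now_ms > next_flush_ms + 2 * FLUSH_PERIOD_MS;
}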
Link: https://lore.kernel.org/all/CA+wXwBSyO87ZX5PVwdHm-=dBjZYECGmfnydUicUyrQqndgX2MQ@mail.gmail.com [1]
Link: https://lkml.kernel.org/r/20220304184040.1304781-1-shakeelb@google.com
Fixes: 1f828223b799 ("memcg: flush lruvec stats in the refault")
Signed-off-by: Shakeel Butt
Reported-by: Daniel Dao
Tested-by: Ivan Babrou
Cc: Michal Hocko
Cc: Roman Gushchin
Cc: Johannes Weiner
Cc: Michal Koutný
Cc: Frank Hofmann
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
24 Sep, 2021
1 commit
-
Prior to the commit 7e1c0d6f5820 ("memcg: switch lruvec stats to rstat")
and the commit aa48e47e3906 ("memcg: infrastructure to flush memcg
stats"), each lruvec memcg stats can be off by (nr_cgroups * nr_cpus *
32) at worst and for unbounded amount of time. The commit aa48e47e3906
moved the lruvec stats to rstat infrastructure and the commit
7e1c0d6f5820 bounded the error for all the lruvec stats to (nr_cpus *
32) at worst for at most 2 seconds. More specifically, it decoupled the
number of stats and the number of cgroups from the error rate.

However, this reduction in error comes with the cost of triggering the
slowpath of stats update more frequently. Previously in the slowpath
the kernel adds the stats up the memcg tree. After aa48e47e3906, the
kernel triggers the async lruvec stats flush through queue_work(). This
causes regression reports from 0day kernel bot [1] as well as from
the Phoronix test suite [2].

We tried two options to fix the regression:
1) Increase the threshold to trigger the slowpath in lruvec stats
update codepath from 32 to 512.

2) Remove the slowpath from lruvec stats update codepath and instead
flush the stats in the page refault codepath. The assumption is that
the kernel flushes the stats in a timely manner, so the update tree would be
small in the refault codepath and not cause a performance impact.
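A minimal sketch of option-2 at the refault call site (illustrative only; the flush helper name matches the description of aa48e47e3906 but may differ between kernel versions, and it is declared here as an empty stub so the sketch stands alone):

struct page;

/* Stub standing in for the flush helper added by the stats infrastructure. */
static void mem_cgroup_flush_stats(void) { /* cgroup_rstat_flush(...) in the kernel */ }

static void workingset_refault_sketch(struct page *page, void *shadow)
{
	/*
	 * Option-2: flush pending memcg/lruvec stats here, before the
	 * refault distance is computed, instead of flushing from the
	 * lruvec stats update slowpath.
	 */
	mem_cgroup_flush_stats();

	/* ... unpack the shadow entry and compute the refault distance ... */
	(void)page;
	(void)shadow;
}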
Following are the results of will-it-scale/page_fault[1|2|3] benchmark
on four settings i.e. (1) 5.15-rc1 as baseline (2) 5.15-rc1 with
aa48e47e3906 and 7e1c0d6f5820 reverted (3) 5.15-rc1 with option-1
(4) 5.15-rc1 with option-2.

test    (1)       (2)                (3)                (4)
pg_f1   368563    406277 (10.23%)    399693 (8.44%)     416398 (12.97%)
pg_f2   338399    372133 (9.96%)     369180 (9.09%)     381024 (12.59%)
pg_f3   500853    575399 (14.88%)    570388 (13.88%)    576083 (15.02%)

From the above results, it seems like option-2 not only solves the
regression but also improves the performance for at least these
benchmarks.

Feng Tang (Intel) ran the aim7 benchmark with these two options and
confirms that option-1 reduces the regression but option-2 removes the
regression.

Michael Larabel (Phoronix) ran multiple benchmarks with these options
and reported the results at [3] and it shows for most benchmarks
option-2 removes the regression introduced by the commit aa48e47e3906
("memcg: infrastructure to flush memcg stats").Based on the experiment results, this patch proposed the option-2 as the
solution to resolve the regression.

Link: https://lore.kernel.org/all/20210726022421.GB21872@xsang-OptiPlex-9020 [1]
Link: https://www.phoronix.com/scan.php?page=article&item=linux515-compile-regress [2]
Link: https://openbenchmarking.org/result/2109226-DEBU-LINUX5104 [3]
Fixes: aa48e47e3906 ("memcg: infrastructure to flush memcg stats")
Signed-off-by: Shakeel Butt
Tested-by: Michael Larabel
Cc: Johannes Weiner
Cc: Roman Gushchin
Cc: Feng Tang
Cc: Michal Hocko
Cc: Hillf Danton
Cc: Michal Koutný
Cc: Andrew Morton
Signed-off-by: Linus Torvalds
09 Sep, 2021
1 commit
-
Use the documented kernel-doc format to prevent kernel-doc warnings.
mm/workingset.c:256: warning: No description found for return value of 'workingset_eviction'
mm/workingset.c:285: warning: Function parameter or member 'folio' not described in 'workingset_refault'
mm/workingset.c:285: warning: Excess function parameter 'page' description in 'workingset_refault'

Link: https://lkml.kernel.org/r/20210808203153.10678-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap
Cc: Matthew Wilcox (Oracle)
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
01 Jul, 2021
1 commit
-
The magic number 1 is used in several places in workingset.c. Define a
macro WORKINGSET_SHIFT for it to improve code readability.
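A sketch of what the change amounts to in mm/workingset.c (paraphrased; the exact shift layout may vary by kernel version):

/* Sketch: the magic "1" becomes a named shift. */
#define WORKINGSET_SHIFT 1
#define EVICTION_SHIFT	((BITS_PER_LONG - BITS_PER_XA_VALUE) + \
			 WORKINGSET_SHIFT + NODES_SHIFT + MEM_CGROUP_ID_SHIFT)

/* e.g. in pack_shadow(): eviction = (eviction << WORKINGSET_SHIFT) | workingset; */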
Link: https://lkml.kernel.org/r/20210624122307.1759342-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
30 Jun, 2021
1 commit
-
All the callers of mem_cgroup_page_lruvec() just pass page_pgdat(page) as
the 2nd parameter to it (except isolate_migratepages_block()). But for
isolate_migratepages_block(), the page_pgdat(page) is also equal to the
local variable @pgdat. So mem_cgroup_page_lruvec() does not need the
pgdat parameter. Just remove it to simplify the code.
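Roughly, the interface change looks like this (sketch of the declarations, not the full diff):

/* Before: every caller passed page_pgdat(page) (or a local pgdat) in. */
struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgdat);

/* After: the helper looks the pgdat up itself via page_pgdat(page). */
struct lruvec *mem_cgroup_page_lruvec(struct page *page);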
Link: https://lkml.kernel.org/r/20210417043538.9793-4-songmuchun@bytedance.com
Signed-off-by: Muchun Song
Acked-by: Johannes Weiner
Reviewed-by: Shakeel Butt
Acked-by: Roman Gushchin
Acked-by: Michal Hocko
Cc: Vladimir Davydov
Cc: Xiongchun Duan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
06 May, 2021
1 commit
-
We no longer need to keep track of how many shadow entries are present in
a mapping. This saves a few writes to the inode and memory barriers.

Link: https://lkml.kernel.org/r/20201026151849.24232-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle)
Tested-by: Vishal Verma
Acked-by: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
25 Feb, 2021
2 commits
-
The premise of the refault distance is that it can be seen as a deficit of
the inactive list space, so that if the inactive list would have had (R -
E) more slots, the page would not have been evicted but promoted to the
active list instead.

However, the way the code is ordered right now sets us off by one, so
the real number of slots would be (R - E) + 1. I stumbled upon this when
trying to understand the code and it puzzled me that the comments did not
match what the code did.

This is not an issue at all since evictions and refaults tend to happen in
a number large enough that being off-by-one does not have any impact - and
since the compiler and CPUs are free to rearrange the execution sequence
anyway.

But as Johannes says, it is better to re-arrange the code in the proper
order since it would otherwise be misleading to somebody who is actively
reading and trying to understand the logic of the code - like it happened
to me.

Link: https://lkml.kernel.org/r/20210201060651.3781-1-osalvador@suse.de
Signed-off-by: Oscar Salvador
Acked-by: Johannes Weiner
Acked-by: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
If list_lru_shrink_count is 0, we always return SHRINK_EMPTY regardless of
the value of max_nodes. So we can return early if nodes == 0 to save some
cpu cycles of approximating a reasonable limit for the nodes.
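In mm/workingset.c this is essentially (sketch of count_shadow_nodes(), details elided):

	nodes = list_lru_shrink_count(&shadow_nodes, sc);
	if (!nodes)
		return SHRINK_EMPTY;	/* skip approximating max_nodes entirely */

	/* ... otherwise estimate max_nodes and compare against it ... */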
Link: https://lkml.kernel.org/r/20210123073825.46709-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin
Reviewed-by: Andrew Morton
Acked-by: Vlastimil Babka
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
16 Dec, 2020
4 commits
-
Merge more updates from Andrew Morton:
"More MM work: a memcg scalability improvememt"* emailed patches from Andrew Morton :
mm/lru: revise the comments of lru_lock
mm/lru: introduce relock_page_lruvec()
mm/lru: replace pgdat lru_lock with lruvec lock
mm/swap.c: serialize memcg changes in pagevec_lru_move_fn
mm/compaction: do page isolation first in compaction
mm/lru: introduce TestClearPageLRU()
mm/mlock: remove __munlock_isolate_lru_page()
mm/mlock: remove lru_lock on TestClearPageMlocked
mm/vmscan: remove lruvec reget in move_pages_to_lru
mm/lru: move lock into lru_note_cost
mm/swap.c: fold vm event PGROTATED into pagevec_move_tail_fn
mm/memcg: add debug checking in lock_page_memcg
mm: page_idle_get_page() does not need lru_lock
mm/rmap: stop store reordering issue on page->mapping
mm/vmscan: remove unnecessary lruvec adding
mm/thp: narrow lru locking
mm/thp: simplify lru_add_page_tail()
mm/thp: use head for head page in lru_add_page_tail()
mm/thp: move lru_add_page_tail() to huge_memory.c
-
We have to move lru_lock into lru_note_cost, since it cycles up the memcg
tree, in preparation for the future per-lruvec lru_lock replacement. It's
a bit ugly and may cost a bit more locking, but the benefit from
finer-grained per-memcg locking should cover the loss.

Link: https://lkml.kernel.org/r/1604566549-62481-11-git-send-email-alex.shi@linux.alibaba.com
Signed-off-by: Alex Shi
Acked-by: Hugh Dickins
Acked-by: Johannes Weiner
Cc: Johannes Weiner
Cc: Alexander Duyck
Cc: Andrea Arcangeli
Cc: Andrey Ryabinin
Cc: "Chen, Rong A"
Cc: Daniel Jordan
Cc: "Huang, Ying"
Cc: Jann Horn
Cc: Joonsoo Kim
Cc: Kirill A. Shutemov
Cc: Kirill A. Shutemov
Cc: Konstantin Khlebnikov
Cc: Matthew Wilcox (Oracle)
Cc: Mel Gorman
Cc: Michal Hocko
Cc: Michal Hocko
Cc: Mika Penttilä
Cc: Minchan Kim
Cc: Shakeel Butt
Cc: Tejun Heo
Cc: Thomas Gleixner
Cc: Vladimir Davydov
Cc: Vlastimil Babka
Cc: Wei Yang
Cc: Yang Shi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Pull networking updates from Jakub Kicinski:
"Core:
- support "prefer busy polling" NAPI operation mode, where we defer
softirq for some time expecting applications to periodically busy
poll
- AF_XDP: improve efficiency by more batching and hindering the
adjacency cache prefetcher
- af_packet: make packet_fanout.arr size configurable up to 64K
- tcp: optimize TCP zero copy receive in presence of partial or
unaligned reads making zero copy a performance win for much smaller
messages
- XDP: add bulk APIs for returning / freeing frames
- sched: support fragmenting IP packets as they come out of conntrack
- net: allow virtual netdevs to forward UDP L4 and fraglist GSO skbs
BPF:
- BPF switch from crude rlimit-based to memcg-based memory accounting
- BPF type format information for kernel modules and related tracing
enhancements
- BPF implement task local storage for BPF LSM
- allow the FENTRY/FEXIT/RAW_TP tracing programs to use
bpf_sk_storage
Protocols:
- mptcp: improve multiple xmit streams support, memory accounting and
many smaller improvements
- TLS: support CHACHA20-POLY1305 cipher
- seg6: add support for SRv6 End.DT4/DT6 behavior
- sctp: Implement RFC 6951: UDP Encapsulation of SCTP
- ppp_generic: add ability to bridge channels directly
- bridge: Connectivity Fault Management (CFM) support as is defined
in IEEE 802.1Q section 12.14.
Drivers:
- mlx5: make use of the new auxiliary bus to organize the driver
internals
- mlx5: more accurate port TX timestamping support
- mlxsw:
- improve the efficiency of offloaded next hop updates by using
the new nexthop object API
- support blackhole nexthops
- support IEEE 802.1ad (Q-in-Q) bridging
- rtw88: major bluetooth co-existence improvements
- iwlwifi: support new 6 GHz frequency band
- ath11k: Fast Initial Link Setup (FILS)
- mt7915: dual band concurrent (DBDC) support
- net: ipa: add basic support for IPA v4.5
Refactor:
- a few pieces of in_interrupt() cleanup work from Sebastian Andrzej
Siewior
- phy: add support for shared interrupts; get rid of multiple driver
APIs and have the drivers write a full IRQ handler, slight growth
of driver code should be compensated by the simpler API which also
allows shared IRQs
- add common code for handling netdev per-cpu counters
- move TX packet re-allocation from Ethernet switch tag drivers to a
central place
- improve efficiency and rename nla_strlcpy
- number of W=1 warning cleanups as we now catch those in a patchwork
build bot

Old code removal:
- wan: delete the DLCI / SDLA drivers
- wimax: move to staging
- wifi: remove old WDS wifi bridging support"
* tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1922 commits)
net: hns3: fix expression that is currently always true
net: fix proc_fs init handling in af_packet and tls
nfc: pn533: convert comma to semicolon
af_vsock: Assign the vsock transport considering the vsock address flags
af_vsock: Set VMADDR_FLAG_TO_HOST flag on the receive path
vsock_addr: Check for supported flag values
vm_sockets: Add VMADDR_FLAG_TO_HOST vsock flag
vm_sockets: Add flags field in the vsock address data structure
net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled
tcp: Add logic to check for SYN w/ data in tcp_simple_retransmit
net: mscc: ocelot: install MAC addresses in .ndo_set_rx_mode from process context
nfc: s3fwrn5: Release the nfc firmware
net: vxget: clean up sparse warnings
mlxsw: spectrum_router: Use eXtended mezzanine to offload IPv4 router
mlxsw: spectrum: Set KVH XLT cache mode for Spectrum2/3
mlxsw: spectrum_router_xm: Introduce basic XM cache flushing
mlxsw: reg: Add Router LPM Cache Enable Register
mlxsw: reg: Add Router LPM Cache ML Delete Register
mlxsw: spectrum_router_xm: Implement L-value tracking for M-index
mlxsw: reg: Add XM Router M Table Register
...
-
The *_lruvec_slab_state functions are also suitable for pages allocated
from the buddy allocator, not just for slab objects. But the function name
suggests that only slab objects are applicable. So rename the "slab"
keyword to "kmem".

Link: https://lkml.kernel.org/r/20201117085249.24319-1-songmuchun@bytedance.com
Signed-off-by: Muchun Song
Acked-by: Roman Gushchin
Reviewed-by: Shakeel Butt
Acked-by: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
03 Dec, 2020
1 commit
-
Patch series "mm: allow mapping accounted kernel pages to userspace", v6.
Currently a non-slab kernel page which has been charged to a memory cgroup
can't be mapped to userspace. The underlying reason is simple: PageKmemcg
flag is defined as a page type (like buddy, offline, etc), so it takes a
bit from a page->mapped counter. Pages with a type set can't be mapped to
userspace.

But in general the kmemcg flag has nothing to do with mapping to
userspace. It only means that the page has been accounted by the page
allocator, so it has to be properly uncharged on release.Some bpf maps are mapping the vmalloc-based memory to userspace, and their
memory can't be accounted because of this implementation detail.This patchset removes this limitation by moving the PageKmemcg flag into
one of the free bits of the page->mem_cgroup pointer. Also it formalizes
accesses to the page->mem_cgroup and page->obj_cgroups using new helpers,
adds several checks and removes a couple of obsolete functions. As the
result the code became more robust with fewer open-coded bit tricks.

This patch (of 4):
Currently there are many open-coded reads of the page->mem_cgroup pointer,
as well as a couple of read helpers, which are barely used.

It creates an obstacle on the way to reusing some bits of the pointer for
storing additional bits of information. In fact, we already do this for
slab pages, where the last bit indicates that a pointer has an attached
vector of objcg pointers instead of a regular memcg pointer.

This commit uses 2 existing helpers and introduces a new helper to
convert all read sides to calls of these helpers:
struct mem_cgroup *page_memcg(struct page *page);
struct mem_cgroup *page_memcg_rcu(struct page *page);
struct mem_cgroup *page_memcg_check(struct page *page);

page_memcg_check() is intended to be used in cases when the page can be a
slab page and have a memcg pointer pointing at objcg vector. It does
check the lowest bit, and if set, returns NULL. page_memcg() contains a
VM_BUG_ON_PAGE() check for the page not being a slab page.

To make sure nobody uses a direct access, struct page's
mem_cgroup/obj_cgroups is converted to unsigned long memcg_data.
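A simplified standalone model of the lowest-bit convention described above (the real helpers live in the kernel headers and additionally handle RCU and debug checks):

#include <stddef.h>

struct mem_cgroup;

/* Model of the new field: a memcg pointer, or an objcg vector with the
 * lowest bit set to mark it. */
struct page_model {
	unsigned long memcg_data;
};

/* page_memcg_check() model: safe on slab pages; returns NULL when the
 * data is an objcg vector rather than a memcg pointer. */
static struct mem_cgroup *page_memcg_check_model(const struct page_model *page)
{
	unsigned long data = page->memcg_data;

	if (data & 1UL)
		return NULL;
	return (struct mem_cgroup *)data;
}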
Signed-off-by: Roman Gushchin
Signed-off-by: Andrew Morton
Signed-off-by: Alexei Starovoitov
Reviewed-by: Shakeel Butt
Acked-by: Johannes Weiner
Acked-by: Michal Hocko
Link: https://lkml.kernel.org/r/20201027001657.3398190-1-guro@fb.com
Link: https://lkml.kernel.org/r/20201027001657.3398190-2-guro@fb.com
Link: https://lore.kernel.org/bpf/20201201215900.3569844-2-guro@fb.com
21 Oct, 2020
1 commit
-
Pull XArray updates from Matthew Wilcox:
- Fix the test suite after introduction of the local_lock
- Fix a bug in the IDA spotted by Coverity
- Change the API that allows the workingset code to delete a node
- Fix xas_reload() when dealing with entries that occupy multiple
indices
- Add a few more tests to the test suite
- Fix an unsigned int being shifted into an unsigned long
* tag 'xarray-5.9' of git://git.infradead.org/users/willy/xarray:
XArray: Fix xas_create_range for ranges above 4 billion
radix-tree: fix the comment of radix_tree_next_slot()
XArray: Fix xas_reload for multi-index entries
XArray: Add private interface for workingset node deletion
XArray: Fix xas_for_each_conflict documentation
XArray: Test marked multiorder iterations
XArray: Test two more things about xa_cmpxchg
ida: Free allocated bitmap in error path
radix tree test suite: Fix compilation
17 Oct, 2020
1 commit
-
Fix the following warnings caused by a mismatch between function parameters
and comments.

mm/workingset.c:228: warning: Function parameter or member 'lruvec' not described in 'workingset_age_nonresident'
mm/workingset.c:228: warning: Excess function parameter 'memcg' description in 'workingset_age_nonresident'

Signed-off-by: Xiaofei Tan
Signed-off-by: Andrew Morton
Link: https://lkml.kernel.org/r/1600485913-11192-1-git-send-email-tanxiaofei@huawei.com
Signed-off-by: Linus Torvalds
13 Oct, 2020
1 commit
-
Move the tricky bits of dealing with the XArray from the workingset
code to the XArray. Make it clear in the documentation that this is a
private interface, and only export it for the benefit of the test suite.

Signed-off-by: Matthew Wilcox (Oracle)
15 Aug, 2020
1 commit
-
The thp prefix is more frequently used than hpage and we should be
consistent between the various functions.

[akpm@linux-foundation.org: fix mm/migrate.c]
Signed-off-by: Matthew Wilcox (Oracle)
Signed-off-by: Andrew Morton
Reviewed-by: William Kucharski
Reviewed-by: Zi Yan
Cc: Mike Kravetz
Cc: David Hildenbrand
Cc: "Kirill A. Shutemov"
Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
Signed-off-by: Linus Torvalds
13 Aug, 2020
2 commits
-
This patch implements workingset detection for anonymous LRU. All the
infrastructure is implemented by the previous patches so this patch just
activates the workingset detection by installing/retrieving the shadow
entry and adding refault calculation.

Signed-off-by: Joonsoo Kim
Signed-off-by: Andrew Morton
Acked-by: Johannes Weiner
Acked-by: Vlastimil Babka
Cc: Hugh Dickins
Cc: Matthew Wilcox
Cc: Mel Gorman
Cc: Michal Hocko
Cc: Minchan Kim
Link: http://lkml.kernel.org/r/1595490560-15117-6-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds
-
To prepare the workingset detection for anon LRU, this patch splits
workingset event counters for refault, activate and restore into anon and
file variants, as well as the refaults counter in struct lruvec.
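The split roughly doubles the relevant node stat items, e.g. (sketch of the affected enum entries; excerpt only):

	/* enum node_stat_item (excerpt, anon/file variants per this patch) */
	WORKINGSET_REFAULT_ANON,
	WORKINGSET_REFAULT_FILE,
	WORKINGSET_ACTIVATE_ANON,
	WORKINGSET_ACTIVATE_FILE,
	WORKINGSET_RESTORE_ANON,
	WORKINGSET_RESTORE_FILE,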
Signed-off-by: Joonsoo Kim
Signed-off-by: Andrew Morton
Acked-by: Johannes Weiner
Acked-by: Vlastimil Babka
Cc: Hugh Dickins
Cc: Matthew Wilcox
Cc: Mel Gorman
Cc: Michal Hocko
Cc: Minchan Kim
Link: http://lkml.kernel.org/r/1595490560-15117-4-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds
08 Aug, 2020
1 commit
-
In order to prepare for per-object slab memory accounting, convert
NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat items to bytes.

To make it obvious, rename them to NR_SLAB_RECLAIMABLE_B and
NR_SLAB_UNRECLAIMABLE_B (similar to NR_KERNEL_STACK_KB).

Internally global and per-node counters are stored in pages, however memcg
and lruvec counters are stored in bytes. This scheme may look weird, but
only for now. As soon as slab pages will be shared between multiple
cgroups, global and node counters will reflect the total number of slab
pages. However memcg and lruvec counters will be used for per-memcg slab
memory tracking, which will take separate kernel objects in the account.
Keeping global and node counters in pages helps to avoid additional
overhead.

The size of slab memory shouldn't exceed 4Gb on 32-bit machines, so it
will fit into atomic_long_t we use for vmstats.

Signed-off-by: Roman Gushchin
Signed-off-by: Andrew Morton
Reviewed-by: Shakeel Butt
Acked-by: Johannes Weiner
Acked-by: Vlastimil Babka
Cc: Christoph Lameter
Cc: Michal Hocko
Cc: Tejun Heo
Link: http://lkml.kernel.org/r/20200623174037.3951353-4-guro@fb.com
Signed-off-by: Linus Torvalds
26 Jun, 2020
1 commit
-
Patch series "fix for "mm: balance LRU lists based on relative
thrashing" patchset"This patchset fixes some problems of the patchset, "mm: balance LRU
lists based on relative thrashing", which is now merged on the mainline.Patch "mm: workingset: let cache workingset challenge anon fix" is the
result of discussion with Johannes. See following link.http://lkml.kernel.org/r/20200520232525.798933-6-hannes@cmpxchg.org
And, the other two are minor things which were found when I tried to rebase
my patchset.

This patch (of 3):
After ("mm: workingset: let cache workingset challenge anon fix"), we
compare refault distances to active_file + anon. But age of the
non-resident information is only driven by the file LRU. As a result,
we may overestimate the recency of any incoming refaults and activate
them too eagerly, causing unnecessary LRU churn in certain situations.

Make anon aging drive nonresident age as well to address that.
Link: http://lkml.kernel.org/r/1592288204-27734-1-git-send-email-iamjoonsoo.kim@lge.com
Link: http://lkml.kernel.org/r/1592288204-27734-2-git-send-email-iamjoonsoo.kim@lge.com
Fixes: 34e58cac6d8f2a ("mm: workingset: let cache workingset challenge anon")
Reported-by: Joonsoo Kim
Signed-off-by: Johannes Weiner
Signed-off-by: Joonsoo Kim
Cc: Rik van Riel
Cc: Minchan Kim
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
04 Jun, 2020
3 commits
-
The VM tries to balance reclaim pressure between anon and file so as to
reduce the amount of IO incurred due to the memory shortage. It already
counts refaults and swapins, but in addition it should also count
writepage calls during reclaim.

For swap, this is obvious: it's IO that wouldn't have occurred if the
anonymous memory hadn't been under memory pressure. From a relative
balancing point of view this makes sense as well: even if anon is cold and
reclaimable, a cache that isn't thrashing may have equally cold pages that
don't require IO to reclaim.

For file writeback, it's trickier: some of the reclaim writepage IO would
have likely occurred anyway due to dirty expiration. But not all of it -
premature writeback reduces batching and generates additional writes.
Since the flushers are already woken up by the time the VM starts writing
cache pages one by one, let's assume that we're likely causing writes that
wouldn't have happened without memory pressure. In addition, the per-page
cost of IO would have probably been much cheaper if written in larger
batches from the flusher thread rather than the single-page-writes from
kswapd.

For our purposes - getting the trend right to accelerate convergence on a
stable state that doesn't require paging at all - this is sufficiently
accurate. If we later wanted to optimize for sustained thrashing, we can
still refine the measurements.

Count all writepage calls from kswapd as IO cost toward the LRU that the
page belongs to.

Why do this dynamically? Don't we know in advance that anon pages require
IO to reclaim, and so could build in a static bias?

First, scanning is not the same as reclaiming. If all the anon pages are
referenced, we may not swap for a while just because we're scanning the
anon list. During this time, however, it's important that we age
anonymous memory and the page cache at the same rate so that their
hot-cold gradients are comparable. Everything else being equal, we still
want to reclaim the coldest memory overall.

Second, we keep copies in swap unless the page changes. If there is
swap-backed data that's mostly read (tmpfs file) and has been swapped out
before, we can reclaim it without incurring additional IO.

Signed-off-by: Johannes Weiner
Signed-off-by: Andrew Morton
Cc: Joonsoo Kim
Cc: Michal Hocko
Cc: Minchan Kim
Cc: Rik van Riel
Link: http://lkml.kernel.org/r/20200520232525.798933-14-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds
-
Since the LRUs were split into anon and file lists, the VM has been
balancing between page cache and anonymous pages based on per-list ratios
of scanned vs. rotated pages. In most cases that tips page reclaim
towards the list that is easier to reclaim and has the fewest actively
used pages, but there are a few problems with it:

1. Refaults and LRU rotations are weighted the same way, even though
one costs IO and the other costs a bit of CPU.

2. The less we scan an LRU list based on already observed rotations,
the more we increase the sampling interval for new references, and
rotations become even more likely on that list. This can enter a
death spiral in which we stop looking at one list completely until
the other one is all but annihilated by page reclaim.

Since commit a528910e12ec ("mm: thrash detection-based file cache sizing")
we have refault detection for the page cache. Along with swapin events,
they are good indicators of when the file or anon list, respectively, is
too small for its workingset and needs to grow.

For example, if the page cache is thrashing, the cache pages need more
time in memory, while there may be colder pages on the anonymous list.
Likewise, if swapped pages are faulting back in, it indicates that we
reclaim anonymous pages too aggressively and should back off.

Replace LRU rotations with refaults and swapins as the basis for relative
reclaim cost of the two LRUs. This will have the VM target list balances
that incur the least amount of IO on aggregate.

Signed-off-by: Johannes Weiner
Signed-off-by: Andrew Morton
Cc: Joonsoo Kim
Cc: Michal Hocko
Cc: Minchan Kim
Cc: Rik van Riel
Link: http://lkml.kernel.org/r/20200520232525.798933-12-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds
-
We activate cache refaults with reuse distances in pages smaller than the
size of the total cache. This allows new pages with competitive access
frequencies to establish themselves, as well as challenge and potentially
displace pages on the active list that have gone cold.

However, that assumes that active cache can only replace other active
cache in a competition for the hottest memory. This is not a great
default assumption. The page cache might be thrashing while there are
enough completely cold and unused anonymous pages sitting around that we'd
only have to write to swap once to stop all IO from the cache.

Activate cache refaults when their reuse distance in pages is smaller than
the total userspace workingset, including anonymous pages.

Reclaim can still decide how to balance pressure among the two LRUs
depending on the IO situation. Rotational drives will prefer avoiding
random IO from swap and go harder after cache. But fundamentally, hot
cache should be able to compete with anon pages for a place in RAM.
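A standalone model of the new activation test (illustrative; the kernel reads these counts from the node/lruvec stats and only includes the anon lists when swap is available):

#include <stdbool.h>

static bool activate_on_refault(unsigned long refault_distance,
				unsigned long active_file,
				unsigned long inactive_anon,
				unsigned long active_anon,
				bool swap_available)
{
	/*
	 * Previously only the active file cache was counted here; now the
	 * anon lists are included (when swap is available) so cache refaults
	 * can challenge cold anonymous pages as well.
	 */
	unsigned long workingset_size = active_file;

	if (swap_available)
		workingset_size += inactive_anon + active_anon;

	return refault_distance <= workingset_size;
}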
Signed-off-by: Johannes Weiner
Signed-off-by: Andrew Morton
Cc: Joonsoo Kim
Cc: Michal Hocko
Cc: Minchan Kim
Cc: Rik van Riel
Link: http://lkml.kernel.org/r/20200520232525.798933-6-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds
02 Dec, 2019
2 commits
-
We use refault information to determine whether the cache workingset is
stable or transitioning, and dynamically adjust the inactive:active file
LRU ratio so as to maximize protection from one-off cache during stable
periods, and minimize IO during transitions.

With cgroups and their nested LRU lists, we currently don't do this
correctly. While recursive cgroup reclaim establishes a relative LRU
order among the pages of all involved cgroups, refaults only affect the
local LRU order in the cgroup in which they are occurring. As a result,
cache transitions can take longer in a cgrouped system as the active pages
of sibling cgroups aren't challenged when they should be.

[ Right now, this is somewhat theoretical, because the siblings, under
continued regular reclaim pressure, should eventually run out of
inactive pages - and since inactive:active *size* balancing is also
done on a cgroup-local level, we will challenge the active pages
eventually in most cases. But the next patch will move that relative
size enforcement to the reclaim root as well, and then this patch
here will be necessary to propagate refault pressure to siblings. ]

This patch moves refault detection to the root of reclaim. Instead of
remembering the cgroup owner of an evicted page, remember the cgroup that
caused the reclaim to happen. When refaults later occur, they'll
correctly influence the cross-cgroup LRU order that reclaim follows.

I.e. if global reclaim kicked out pages in some subgroup A/B/C, the
refault of those pages will challenge the global LRU order, and not just
the local order down inside C.

[hannes@cmpxchg.org: use page_memcg() instead of another lookup]
Link: http://lkml.kernel.org/r/20191115160722.GA309754@cmpxchg.org
Link: http://lkml.kernel.org/r/20191107205334.158354-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner
Reviewed-by: Suren Baghdasaryan
Cc: Andrey Ryabinin
Cc: Michal Hocko
Cc: Rik van Riel
Cc: Shakeel Butt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
There is a per-memcg lruvec and a NUMA node lruvec. Which one is being
used is somewhat confusing right now, and it's easy to make mistakes -
especially when it comes to global reclaim.

How it works: when memory cgroups are enabled, we always use the
root_mem_cgroup's per-node lruvecs. When memory cgroups are not compiled
in or disabled at runtime, we use pgdat->lruvec.

Document that in a comment.
Due to the way the reclaim code is generalized, all lookups use the
mem_cgroup_lruvec() helper function, and nobody should have to find the
right lruvec manually right now. But to avoid future mistakes, rename the
pgdat->lruvec member to pgdat->__lruvec and delete the convenience wrapper
that suggests it's a commonly accessed member.

While in this area, swap the mem_cgroup_lruvec() argument order. The name
suggests a memcg operation, yet it takes a pgdat first and a memcg second.
I have to double take every time I call this. Fix that.
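The argument-order change reads roughly like this (sketch of the declaration):

/* Before: pgdat first, memcg second -- reads like a pgdat operation. */
struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat, struct mem_cgroup *memcg);

/* After: memcg first, matching what the name suggests. */
struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg, struct pglist_data *pgdat);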
Link: http://lkml.kernel.org/r/20191022144803.302233-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner
Acked-by: Michal Hocko
Reviewed-by: Shakeel Butt
Cc: Roman Gushchin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
14 Aug, 2019
1 commit
-
Memcg counters for shadow nodes are broken because the memcg pointer is
obtained in a wrong way. The following approach is used:
virt_to_page(xa_node)->mem_cgroup

Since commit 4d96ba353075 ("mm: memcg/slab: stop setting
page->mem_cgroup pointer for slab pages") page->mem_cgroup pointer isn't
set for slab pages, so memcg_from_slab_page() should be used instead.

Also I doubt that it ever worked correctly: virt_to_head_page() should
be used instead of virt_to_page(). Otherwise objects residing on tail
pages are not accounted, because only the head page contains a valid
mem_cgroup pointer. That has been the case since the introduction of these
counters by the commit 68d48e6a2df5 ("mm: workingset: add vmstat counter
for shadow nodes").Link: http://lkml.kernel.org/r/20190801233532.138743-1-guro@fb.com
Fixes: 4d96ba353075 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
Signed-off-by: Roman Gushchin
Acked-by: Johannes Weiner
Cc: Vladimir Davydov
Cc: Shakeel Butt
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
15 May, 2019
2 commits
-
Patch series "mm: memcontrol: memory.stat cost & correctness".
The cgroup memory.stat file holds recursive statistics for the entire
subtree. The current implementation does this tree walk on-demand
whenever the file is read. This is giving us problems in production.

1. The cost of aggregating the statistics on-demand is high. A lot of
system service cgroups are mostly idle and their stats don't change
between reads, yet we always have to check them. There are also always
some lazily-dying cgroups sitting around that are pinned by a handful
of remaining page cache; the same applies to them.

In an application that periodically monitors memory.stat in our
fleet, we have seen the aggregation consume up to 5% CPU time.

2. When cgroups die and disappear from the cgroup tree, so do their
accumulated vm events. The result is that the event counters at
higher-level cgroups can go backwards and confuse some of our
automation, let alone people looking at the graphs over time.

To address both issues, this patch series changes the stat
implementation to spill counts upwards when the counters change.

The upward spilling is batched using the existing per-cpu cache. In a
sparse file stress test with 5 level cgroup nesting, the additional cost
of the flushing was negligible (a little under 1% of CPU at 100% CPU
utilization, compared to the 5% of reading memory.stat during regular
operation).

This patch (of 4):
memcg_page_state(), lruvec_page_state(), memcg_sum_events() are
currently returning the state of the local memcg or lruvec, not the
recursive state.

In practice there is a demand for both versions, although the callers
that want the recursive counts currently sum them up by hand.

By default, cgroups are considered recursive entities and generally we
expect more users of the recursive counters, with the local counts being
special cases. To reflect that in the name, add a _local suffix to the
current implementations.

The following patch will re-incarnate these functions with recursive
semantics, but with an O(1) implementation.

[hannes@cmpxchg.org: fix bisection hole]
Link: http://lkml.kernel.org/r/20190417160347.GC23013@cmpxchg.org
Link: http://lkml.kernel.org/r/20190412151507.2769-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner
Reviewed-by: Shakeel Butt
Reviewed-by: Roman Gushchin
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
mem_cgroup_node_nr_lru_pages() is just a convenience wrapper around
lruvec_page_state() that takes bitmasks of lru indexes and aggregates the
counts for those.

Replace callsites where the bitmask is simple enough with direct
lruvec_page_state() calls.

This removes the last extern user of mem_cgroup_node_nr_lru_pages(), so
make that function private again, too.

Link: http://lkml.kernel.org/r/20190228163020.24100-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner
Reviewed-by: Roman Gushchin
Cc: Michal Hocko
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
06 Mar, 2019
1 commit
-
workingset_eviction() doesn't use and never did use the @mapping
argument. Remove it.

Link: http://lkml.kernel.org/r/20190228083329.31892-1-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin
Acked-by: Johannes Weiner
Acked-by: Rik van Riel
Acked-by: Vlastimil Babka
Acked-by: Mel Gorman
Cc: Michal Hocko
Cc: William Kucharski
Cc: John Hubbard
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
29 Dec, 2018
1 commit
-
totalram_pages and totalhigh_pages are made static inline functions.

The main motivation was that managed_page_count_lock handling was
complicating things. It was discussed at length here,
https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemed
better to remove the lock and convert the variables to atomic, preventing
potential store-to-read tearing as a bonus.
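A standalone model of the conversion (the accessor names mirror the kernel's, but C11 atomics are used here so the sketch stands alone):

#include <stdatomic.h>

static atomic_long _totalram_pages;

/* Reads go through an accessor instead of touching a bare variable. */
static inline unsigned long totalram_pages(void)
{
	return (unsigned long)atomic_load(&_totalram_pages);
}

/* Hot-add/remove paths update the counter atomically, no lock needed. */
static inline void totalram_pages_add(long count)
{
	atomic_fetch_add(&_totalram_pages, count);
}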
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org
Signed-off-by: Arun KS
Suggested-by: Michal Hocko
Suggested-by: Vlastimil Babka
Reviewed-by: Konstantin Khlebnikov
Reviewed-by: Pavel Tatashin
Acked-by: Michal Hocko
Acked-by: Vlastimil Babka
Cc: David Hildenbrand
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
29 Oct, 2018
1 commit
-
Pull XArray conversion from Matthew Wilcox:
"The XArray provides an improved interface to the radix tree data
structure, providing locking as part of the API, specifying GFP flags
at allocation time, eliminating preloading, less re-walking the tree,
more efficient iterations and not exposing RCU-protected pointers to
its users.

This patch set
1. Introduces the XArray implementation
2. Converts the pagecache to use it
3. Converts memremap to use it
The page cache is the most complex and important user of the radix
tree, so converting it was most important. Converting the memremap
code removes the only other user of the multiorder code, which allows
us to remove the radix tree code that supported it.

I have 40+ followup patches to convert many other users of the radix
tree over to the XArray, but I'd like to get this part in first. The
other conversions haven't been in linux-next and aren't suitable for
applying yet, but you can see them in the xarray-conv branch if you're
interested"* 'xarray' of git://git.infradead.org/users/willy/linux-dax: (90 commits)
radix tree: Remove multiorder support
radix tree test: Convert multiorder tests to XArray
radix tree tests: Convert item_delete_rcu to XArray
radix tree tests: Convert item_kill_tree to XArray
radix tree tests: Move item_insert_order
radix tree test suite: Remove multiorder benchmarking
radix tree test suite: Remove __item_insert
memremap: Convert to XArray
xarray: Add range store functionality
xarray: Move multiorder_check to in-kernel tests
xarray: Move multiorder_shrink to kernel tests
xarray: Move multiorder account test in-kernel
radix tree test suite: Convert iteration test to XArray
radix tree test suite: Convert tag_tagged_items to XArray
radix tree: Remove radix_tree_clear_tags
radix tree: Remove radix_tree_maybe_preload_order
radix tree: Remove split/join code
radix tree: Remove radix_tree_update_node_t
page cache: Finish XArray conversion
dax: Convert page fault handlers to XArray
...
27 Oct, 2018
5 commits
-
The page cache and most shrinkable slab caches hold data that has been
read from disk, but there are some caches that only cache CPU work, such
as the dentry and inode caches of procfs and sysfs, as well as the subset
of radix tree nodes that track non-resident page cache.

Currently, all these are shrunk at the same rate: using DEFAULT_SEEKS for
the shrinker's seeks setting tells the reclaim algorithm that for every
two page cache pages scanned it should scan one slab object.

This is a bogus setting. A virtual inode that required no IO to create is
not twice as valuable as a page cache page; shadow cache entries with
eviction distances beyond the size of memory aren't either.

In most cases, the behavior in practice is still fine. Such virtual
caches don't tend to grow and assert themselves aggressively, and usually
get picked up before they cause problems. But there are scenarios where
that's not true.

Our database workloads suffer from two of those. For one, their file
workingset is several times bigger than available memory, which has the
kernel aggressively create shadow page cache entries for the non-resident
parts of it. The workingset code does tell the VM that most of these are
expendable, but the VM ends up balancing them 2:1 to cache pages as per
the seeks setting. This is a huge waste of memory.

These workloads also deal with tens of thousands of open files and use
/proc for introspection, which ends up growing the proc_inode_cache to
absurdly large sizes - again at the cost of valuable cache space, which
isn't a reasonable trade-off, given that proc inodes can be re-created
without involving the disk.

This patch implements a "zero-seek" setting for shrinkers that results in
a target ratio of 0:1 between their objects and IO-backed caches. This
allows such virtual caches to grow when memory is available (they do
cache/avoid CPU work after all), but effectively disables them as soon as
IO-backed objects are under pressure.

It then switches the shrinkers for procfs and sysfs metadata, as well as
excess page cache shadow nodes, to the new zero-seek setting.
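Concretely, a zero-seek shrinker just sets .seeks to 0 instead of DEFAULT_SEEKS; for the shadow-node shrinker that looks roughly like this (sketch, field values approximated from that era's mm/workingset.c):

static struct shrinker workingset_shadow_shrinker = {
	.count_objects = count_shadow_nodes,
	.scan_objects = scan_shadow_nodes,
	.seeks = 0,		/* was DEFAULT_SEEKS; shrink only under IO-cache pressure */
	.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE,
};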
Link: http://lkml.kernel.org/r/20181009184732.762-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner
Reported-by: Domas Mituzas
Reviewed-by: Andrew Morton
Reviewed-by: Rik van Riel
Acked-by: Peter Zijlstra (Intel)
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Make it easier to catch bugs in the shadow node shrinker by adding a
counter for the shadow nodes in circulation.

[akpm@linux-foundation.org: assert that irqs are disabled, for __inc_lruvec_page_state()]
[akpm@linux-foundation.org: s/WARN_ON_ONCE/VM_WARN_ON_ONCE/, per Johannes]
Link: http://lkml.kernel.org/r/20181009184732.762-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner
Reviewed-by: Andrew Morton
Acked-by: Peter Zijlstra (Intel)
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
No need to use the preemption-safe lruvec state function inside the
reclaim region that has irqs disabled.

Link: http://lkml.kernel.org/r/20181009184732.762-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner
Reviewed-by: Andrew Morton
Reviewed-by: Rik van Riel
Acked-by: Peter Zijlstra (Intel)
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Refaults happen during transitions between workingsets as well as in-place
thrashing. Knowing the difference between the two has a range of
applications, including measuring the impact of memory shortage on the
system performance, as well as the ability to smarter balance pressure
between the filesystem cache and the swap-backed workingset.

During workingset transitions, inactive cache refaults and pushes out
established active cache. When that active cache isn't stale, however,
and also ends up refaulting, that's bona fide thrashing.

Introduce a new page flag that tells on eviction whether the page has been
active or not in its lifetime. This bit is then stored in the shadow
entry, to classify refaults as transitioning or thrashing.

How many page->flags does this leave us with on 32-bit?
20 bits are always page flags
21 if you have an MMU
23 with the zone bits for DMA, Normal, HighMem, Movable
29 with the sparsemem section bits
30 if PAE is enabled
31 with this patch.
So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes. If
that's not enough, the system can switch to discontigmem and re-gain the 6
or 7 sparsemem section bits.

Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner
Acked-by: Peter Zijlstra (Intel)
Tested-by: Daniel Drake
Tested-by: Suren Baghdasaryan
Cc: Christopher Lameter
Cc: Ingo Molnar
Cc: Johannes Weiner
Cc: Mike Galbraith
Cc: Peter Enderborg
Cc: Randy Dunlap
Cc: Shakeel Butt
Cc: Tejun Heo
Cc: Vinayak Menon
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Patch series "psi: pressure stall information for CPU, memory, and IO", v4.
Overview
PSI reports the overall wallclock time in which the tasks in a system (or
cgroup) wait for (contended) hardware resources.

This helps users understand the resource pressure their workloads are
under, which allows them to rootcause and fix throughput and latency
problems caused by overcommitting, underprovisioning, suboptimal job
placement in a grid; as well as anticipate major disruptions like OOM.

Real-world applications
We're using the data collected by PSI (and its previous incarnation,
memdelay) quite extensively at Facebook, and with several success stories.

One usecase is avoiding OOM hangs/livelocks. The reason these happen is
because the OOM killer is triggered by reclaim not being able to free
pages, but with fast flash devices there is *always* some clean and
uptodate cache to reclaim; the OOM killer never kicks in, even as tasks
spend 90% of the time thrashing the cache pages of their own executables.
There is no situation where this ever makes sense in practice. We wrote a

Acked-by: Peter Zijlstra (Intel)
Reviewed-by: Rik van Riel
Tested-by: Daniel Drake
Tested-by: Suren Baghdasaryan
Cc: Ingo Molnar
Cc: Tejun Heo
Cc: Vinayak Menon
Cc: Christopher Lameter
Cc: Peter Enderborg
Cc: Shakeel Butt
Cc: Mike Galbraith
Cc: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
21 Oct, 2018
2 commits
-
We construct an XA_STATE and use it to delete the node with
xas_store() rather than adding a special function for this unique
use case. Includes a test that simulates this usage for the
test suite.

Signed-off-by: Matthew Wilcox
-
This is a direct replacement for struct radix_tree_node. A couple of
struct members have changed name, so convert those. Use a #define so
that radix tree users continue to work without change.

Signed-off-by: Matthew Wilcox
Reviewed-by: Josef Bacik
30 Sep, 2018
1 commit
-
Introduce xarray value entries and tagged pointers to replace radix
tree exceptional entries. This is a slight change in encoding to allow
the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a
value entry). It is also a change in emphasis; exceptional entries are
intimidating and different. As the comment explains, you can choose
to store values or pointers in the xarray and they are both first-class
citizens.
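A standalone model of the encoding (the real helpers are xa_mk_value(), xa_is_value() and xa_to_value(); this just mirrors the bit trick described above):

#include <stdbool.h>

/* A value entry stores an integer of up to BITS_PER_LONG - 1 bits by
 * shifting it left one and setting the low bit, so it can never collide
 * with a naturally aligned pointer. */
static void *mk_value(unsigned long v)
{
	return (void *)((v << 1) | 1);
}

static bool is_value(const void *entry)
{
	return ((unsigned long)entry & 1) != 0;
}

static unsigned long to_value(const void *entry)
{
	return (unsigned long)entry >> 1;
}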
Signed-off-by: Matthew Wilcox
Reviewed-by: Josef Bacik