27 Apr, 2022

1 commit

  • commit 9b3016154c913b2e7ec5ae5c9a42eb9e732d86aa upstream.

    Daniel Dao has reported [1] a regression on workloads that may trigger
    a lot of refaults (anon and file). The underlying issue is that
    flushing rstat is expensive. Although rstat flushes are batched with
    (nr_cpus * MEMCG_BATCH) stat updates, it seems like there are workloads
    which genuinely do stat updates larger than the batch value within a
    short amount of time. Since the rstat flush can happen in performance
    critical codepaths like page faults, such workloads can suffer greatly.

    This patch fixes this regression by making the rstat flushing
    conditional in the performance critical codepaths. More specifically,
    the kernel relies on the async periodic rstat flusher to flush the
    stats, and only if the periodic flusher is delayed by more than twice
    its normal time window does the kernel allow rstat flushing from the
    performance critical codepaths.
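
    As a rough illustration of the approach (a sketch only - the helper
    name, the flush window constant and the bookkeeping variable are
    illustrative, not the exact upstream code):

        /* Periodic flusher window; sync flush only if it is overdue. */
        #define FLUSH_TIME (2UL * HZ)

        static u64 flush_next_time;  /* advanced by the periodic worker */

        static void memcg_flush_stats_if_delayed(void)
        {
                /*
                 * If the async worker has fallen more than one extra
                 * window behind, allow a synchronous flush from the
                 * performance critical caller (e.g. the refault path).
                 */
                if (time_after64(get_jiffies_64(), READ_ONCE(flush_next_time)))
                        mem_cgroup_flush_stats();
        }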

    Now the question: what are the side-effects of this change? The worst
    that can happen is that the refault codepath will see 4-second-old
    lruvec stats and may cause false (or missed) activations of the
    refaulted page, which may under- or overestimate the workingset size.
    That is not very concerning, though, as the kernel can already miss or
    do false activations.

    There are two more codepaths whose flushing behavior is not changed by
    this patch, and we may need to come back to them in the future. One is
    the writeback stats used by dirty throttling and the second is the
    deactivation heuristic in reclaim. For now we are keeping an eye on
    them; if there are reports of regressions due to these codepaths, we
    will reevaluate then.

    Link: https://lore.kernel.org/all/CA+wXwBSyO87ZX5PVwdHm-=dBjZYECGmfnydUicUyrQqndgX2MQ@mail.gmail.com [1]
    Link: https://lkml.kernel.org/r/20220304184040.1304781-1-shakeelb@google.com
    Fixes: 1f828223b799 ("memcg: flush lruvec stats in the refault")
    Signed-off-by: Shakeel Butt
    Reported-by: Daniel Dao
    Tested-by: Ivan Babrou
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Johannes Weiner
    Cc: Michal Koutný
    Cc: Frank Hofmann
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Shakeel Butt
     

24 Sep, 2021

1 commit

  • Prior to the commit 7e1c0d6f5820 ("memcg: switch lruvec stats to rstat")
    and the commit aa48e47e3906 ("memcg: infrastructure to flush memcg
    stats"), each lruvec memcg stat could be off by (nr_cgroups * nr_cpus *
    32) at worst, and for an unbounded amount of time. The commit
    aa48e47e3906 moved the lruvec stats to the rstat infrastructure and the
    commit 7e1c0d6f5820 bounded the error for all the lruvec stats to
    (nr_cpus * 32) at worst, for at most 2 seconds. More specifically, it
    decoupled the number of stats and the number of cgroups from the error
    rate.
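
    To make the bounds concrete with illustrative numbers: on a 32-CPU
    machine running 100 cgroups, the old worst-case error per stat was
    100 * 32 * 32 = 102,400, whereas after these two commits it is capped
    at 32 * 32 = 1,024 regardless of the number of cgroups, and only for
    up to 2 seconds.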

    However, this reduction in error comes at the cost of triggering the
    slowpath of stats updates more frequently. Previously, in the slowpath
    the kernel added the stats up the memcg tree. After aa48e47e3906, the
    kernel triggers the async lruvec stats flush through queue_work().
    This caused regression reports from the 0day kernel bot [1] as well as
    from the phoronix test suite [2].

    We tried two options to fix the regression:

    1) Increase the threshold to trigger the slowpath in the lruvec stats
    update codepath from 32 to 512.

    2) Remove the slowpath from the lruvec stats update codepath and
    instead flush the stats in the page refault codepath. The assumption
    is that the kernel flushes the stats in a timely manner, so the update
    tree seen in the refault codepath stays small enough not to cause a
    performance impact (see the sketch below).
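
    A sketch of what option-2 amounts to in the refault path (illustrative
    shape only, not the exact upstream diff; the flush helper is the one
    provided by the memcg stats flushing infrastructure):

        /*
         * Option-2 sketch: no slowpath in the stat update; instead, make
         * the refault path flush pending memcg stats once so that the
         * lruvec counters it reads below are reasonably fresh.
         */
        void workingset_refault(struct folio *folio, void *shadow)
        {
                /* ... unpack the shadow entry, find the eviction lruvec ... */

                mem_cgroup_flush_stats();

                /* ... compute the refault distance from lruvec stats ... */
        }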

    Following are the results of will-it-scale/page_fault[1|2|3] benchmark
    on four settings i.e. (1) 5.15-rc1 as baseline (2) 5.15-rc1 with
    aa48e47e3906 and 7e1c0d6f5820 reverted (3) 5.15-rc1 with option-1
    (4) 5.15-rc1 with option-2.

    test    (1)       (2)                (3)                (4)
    pg_f1   368563    406277 (10.23%)    399693 (8.44%)     416398 (12.97%)
    pg_f2   338399    372133 (9.96%)     369180 (9.09%)     381024 (12.59%)
    pg_f3   500853    575399 (14.88%)    570388 (13.88%)    576083 (15.02%)

    From the above results, it seems like option-2 not only solves the
    regression but also improves the performance for at least these
    benchmarks.

    Feng Tang (Intel) ran the aim7 benchmark with these two options and
    confirmed that option-1 reduces the regression while option-2 removes
    it.

    Michael Larabel (Phoronix) ran multiple benchmarks with these options
    and reported the results at [3]; they show that for most benchmarks
    option-2 removes the regression introduced by the commit aa48e47e3906
    ("memcg: infrastructure to flush memcg stats").

    Based on the experiment results, this patch proposes option-2 as the
    solution to resolve the regression.

    Link: https://lore.kernel.org/all/20210726022421.GB21872@xsang-OptiPlex-9020 [1]
    Link: https://www.phoronix.com/scan.php?page=article&item=linux515-compile-regress [2]
    Link: https://openbenchmarking.org/result/2109226-DEBU-LINUX5104 [3]
    Fixes: aa48e47e3906 ("memcg: infrastructure to flush memcg stats")
    Signed-off-by: Shakeel Butt
    Tested-by: Michael Larabel
    Cc: Johannes Weiner
    Cc: Roman Gushchin
    Cc: Feng Tang
    Cc: Michal Hocko
    Cc: Hillf Danton
    Cc: Michal Koutný
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     

09 Sep, 2021

1 commit

  • Use the documented kernel-doc format to prevent kernel-doc warnings.

    mm/workingset.c:256: warning: No description found for return value of 'workingset_eviction'
    mm/workingset.c:285: warning: Function parameter or member 'folio' not described in 'workingset_refault'
    mm/workingset.c:285: warning: Excess function parameter 'page' description in 'workingset_refault'
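
    For reference, the kernel-doc shape these warnings ask for looks
    roughly like this (illustrative wording, not the exact comment text
    added by the patch):

        /**
         * workingset_refault - Evaluate the refault of a previously
         *                      evicted folio.
         * @folio: The freshly allocated replacement folio.
         * @shadow: Shadow entry of the evicted folio.
         *
         * Evaluates the refault in the context of the node and the memcg
         * whose memory pressure caused the eviction.
         */

    workingset_eviction() additionally needs a "Return:" line describing
    the shadow entry it hands back.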

    Link: https://lkml.kernel.org/r/20210808203153.10678-1-rdunlap@infradead.org
    Signed-off-by: Randy Dunlap
    Cc: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

01 Jul, 2021

1 commit

  • The magic number 1 is used in several places in workingset.c. Define a
    macro WORKINGSET_SHIFT for it to improve code readability.
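
    A minimal sketch of the change (illustrative; the real shadow-entry
    packing in mm/workingset.c carries more fields):

        #define WORKINGSET_SHIFT 1

        /* before: a bare magic number for the workingset bit */
        eviction = (eviction << 1) | workingset;

        /* after: the same bit, but with a self-documenting name */
        eviction = (eviction << WORKINGSET_SHIFT) | workingset;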

    Link: https://lkml.kernel.org/r/20210624122307.1759342-1-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     

30 Jun, 2021

1 commit

  • All the callers of mem_cgroup_page_lruvec() just pass page_pgdat(page)
    as the 2nd parameter to it (except isolate_migratepages_block()). But
    for isolate_migratepages_block(), page_pgdat(page) is also equal to the
    local variable @pgdat. So mem_cgroup_page_lruvec() does not need the
    pgdat parameter. Just remove it to simplify the code.
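
    Roughly, the simplification is just a signature change (sketch):

        /* before: callers had to supply the node explicitly */
        struct lruvec *mem_cgroup_page_lruvec(struct page *page,
                                              struct pglist_data *pgdat);
        lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));

        /* after: the node is derived from the page itself */
        struct lruvec *mem_cgroup_page_lruvec(struct page *page);
        lruvec = mem_cgroup_page_lruvec(page);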

    Link: https://lkml.kernel.org/r/20210417043538.9793-4-songmuchun@bytedance.com
    Signed-off-by: Muchun Song
    Acked-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Acked-by: Roman Gushchin
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Xiongchun Duan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Muchun Song
     

06 May, 2021

1 commit

  • We no longer need to keep track of how many shadow entries are present in
    a mapping. This saves a few writes to the inode and memory barriers.

    Link: https://lkml.kernel.org/r/20201026151849.24232-3-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Tested-by: Vishal Verma
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

25 Feb, 2021

2 commits

  • The premise of the refault distance is that it can be seen as a deficit of
    the inactive list space, so that if the inactive list would have had (R -
    E) more slots, the page would not have been evicted but promoted to the
    active list instead.

    However, the way the code is currently ordered sets us off by one, so
    the real number of slots would be (R - E) + 1. I stumbled upon this
    when trying to understand the code, and it puzzled me that the
    comments did not match what the code did.

    This is not an issue in practice, since evictions and refaults tend to
    happen in numbers large enough that being off by one does not have any
    impact - and since the compiler and CPUs are free to rearrange the
    execution sequence anyway.

    But as Johannes says, it is better to re-arrange the code in the
    proper order, since otherwise it would be misleading to somebody who
    is actively reading and trying to understand the logic of the code -
    as happened to me.

    Link: https://lkml.kernel.org/r/20210201060651.3781-1-osalvador@suse.de
    Signed-off-by: Oscar Salvador
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     
  • If list_lru_shrink_count is 0, we always return SHRINK_EMPTY regardless of
    the value of max_nodes. So we can return early if nodes == 0 to save some
    cpu cycles of approximating a reasonable limit for the nodes.
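
    A sketch of the early return in the shadow-node shrinker's count
    callback (simplified; approximate_max_nodes() is a hypothetical
    stand-in for the real estimation logic):

        static unsigned long count_shadow_nodes(struct shrinker *shrinker,
                                                struct shrink_control *sc)
        {
                unsigned long nodes, max_nodes;

                nodes = list_lru_shrink_count(&shadow_nodes, sc);
                if (!nodes)
                        return SHRINK_EMPTY;  /* skip the limit estimate */

                /* hypothetical placeholder for the max_nodes heuristic */
                max_nodes = approximate_max_nodes(sc);

                if (nodes <= max_nodes)
                        return 0;
                return nodes - max_nodes;
        }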

    Link: https://lkml.kernel.org/r/20210123073825.46709-1-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin
    Reviewed-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     

16 Dec, 2020

4 commits

  • Merge more updates from Andrew Morton:
    "More MM work: a memcg scalability improvememt"

    * emailed patches from Andrew Morton :
    mm/lru: revise the comments of lru_lock
    mm/lru: introduce relock_page_lruvec()
    mm/lru: replace pgdat lru_lock with lruvec lock
    mm/swap.c: serialize memcg changes in pagevec_lru_move_fn
    mm/compaction: do page isolation first in compaction
    mm/lru: introduce TestClearPageLRU()
    mm/mlock: remove __munlock_isolate_lru_page()
    mm/mlock: remove lru_lock on TestClearPageMlocked
    mm/vmscan: remove lruvec reget in move_pages_to_lru
    mm/lru: move lock into lru_note_cost
    mm/swap.c: fold vm event PGROTATED into pagevec_move_tail_fn
    mm/memcg: add debug checking in lock_page_memcg
    mm: page_idle_get_page() does not need lru_lock
    mm/rmap: stop store reordering issue on page->mapping
    mm/vmscan: remove unnecessary lruvec adding
    mm/thp: narrow lru locking
    mm/thp: simplify lru_add_page_tail()
    mm/thp: use head for head page in lru_add_page_tail()
    mm/thp: move lru_add_page_tail() to huge_memory.c

    Linus Torvalds
     
  • We have to move lru_lock into lru_note_cost, since it cycles up the
    memcg tree, in preparation for the future per-lruvec lru_lock
    replacement. It's a bit ugly and may cost a bit more locking, but the
    benefit from finer-grained per-memcg locking should cover the loss.

    Link: https://lkml.kernel.org/r/1604566549-62481-11-git-send-email-alex.shi@linux.alibaba.com
    Signed-off-by: Alex Shi
    Acked-by: Hugh Dickins
    Acked-by: Johannes Weiner
    Cc: Johannes Weiner
    Cc: Alexander Duyck
    Cc: Andrea Arcangeli
    Cc: Andrey Ryabinin
    Cc: "Chen, Rong A"
    Cc: Daniel Jordan
    Cc: "Huang, Ying"
    Cc: Jann Horn
    Cc: Joonsoo Kim
    Cc: Kirill A. Shutemov
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Matthew Wilcox (Oracle)
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Michal Hocko
    Cc: Mika Penttilä
    Cc: Minchan Kim
    Cc: Shakeel Butt
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Vladimir Davydov
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: Yang Shi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Shi
     
  • Pull networking updates from Jakub Kicinski:
    "Core:

    - support "prefer busy polling" NAPI operation mode, where we defer
    softirq for some time expecting applications to periodically busy
    poll

    - AF_XDP: improve efficiency by more batching and hindering the
    adjacency cache prefetcher

    - af_packet: make packet_fanout.arr size configurable up to 64K

    - tcp: optimize TCP zero copy receive in presence of partial or
    unaligned reads making zero copy a performance win for much smaller
    messages

    - XDP: add bulk APIs for returning / freeing frames

    - sched: support fragmenting IP packets as they come out of conntrack

    - net: allow virtual netdevs to forward UDP L4 and fraglist GSO skbs

    BPF:

    - BPF switch from crude rlimit-based to memcg-based memory accounting

    - BPF type format information for kernel modules and related tracing
    enhancements

    - BPF implement task local storage for BPF LSM

    - allow the FENTRY/FEXIT/RAW_TP tracing programs to use
    bpf_sk_storage

    Protocols:

    - mptcp: improve multiple xmit streams support, memory accounting and
    many smaller improvements

    - TLS: support CHACHA20-POLY1305 cipher

    - seg6: add support for SRv6 End.DT4/DT6 behavior

    - sctp: Implement RFC 6951: UDP Encapsulation of SCTP

    - ppp_generic: add ability to bridge channels directly

    - bridge: Connectivity Fault Management (CFM) support as is defined
    in IEEE 802.1Q section 12.14.

    Drivers:

    - mlx5: make use of the new auxiliary bus to organize the driver
    internals

    - mlx5: more accurate port TX timestamping support

    - mlxsw:
    - improve the efficiency of offloaded next hop updates by using
    the new nexthop object API
    - support blackhole nexthops
    - support IEEE 802.1ad (Q-in-Q) bridging

    - rtw88: major bluetooth co-existence improvements

    - iwlwifi: support new 6 GHz frequency band

    - ath11k: Fast Initial Link Setup (FILS)

    - mt7915: dual band concurrent (DBDC) support

    - net: ipa: add basic support for IPA v4.5

    Refactor:

    - a few pieces of in_interrupt() cleanup work from Sebastian Andrzej
    Siewior

    - phy: add support for shared interrupts; get rid of multiple driver
    APIs and have the drivers write a full IRQ handler, slight growth
    of driver code should be compensated by the simpler API which also
    allows shared IRQs

    - add common code for handling netdev per-cpu counters

    - move TX packet re-allocation from Ethernet switch tag drivers to a
    central place

    - improve efficiency and rename nla_strlcpy

    - number of W=1 warning cleanups as we now catch those in a patchwork
    build bot

    Old code removal:

    - wan: delete the DLCI / SDLA drivers

    - wimax: move to staging

    - wifi: remove old WDS wifi bridging support"

    * tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1922 commits)
    net: hns3: fix expression that is currently always true
    net: fix proc_fs init handling in af_packet and tls
    nfc: pn533: convert comma to semicolon
    af_vsock: Assign the vsock transport considering the vsock address flags
    af_vsock: Set VMADDR_FLAG_TO_HOST flag on the receive path
    vsock_addr: Check for supported flag values
    vm_sockets: Add VMADDR_FLAG_TO_HOST vsock flag
    vm_sockets: Add flags field in the vsock address data structure
    net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled
    tcp: Add logic to check for SYN w/ data in tcp_simple_retransmit
    net: mscc: ocelot: install MAC addresses in .ndo_set_rx_mode from process context
    nfc: s3fwrn5: Release the nfc firmware
    net: vxget: clean up sparse warnings
    mlxsw: spectrum_router: Use eXtended mezzanine to offload IPv4 router
    mlxsw: spectrum: Set KVH XLT cache mode for Spectrum2/3
    mlxsw: spectrum_router_xm: Introduce basic XM cache flushing
    mlxsw: reg: Add Router LPM Cache Enable Register
    mlxsw: reg: Add Router LPM Cache ML Delete Register
    mlxsw: spectrum_router_xm: Implement L-value tracking for M-index
    mlxsw: reg: Add XM Router M Table Register
    ...

    Linus Torvalds
     
  • The *_lruvec_slab_state functions are also suitable for pages allocated
    from the buddy allocator, not just for slab objects. But the function
    names seem to tell us that only slab objects are applicable. So rename
    the 'slab' keyword to 'kmem'.

    Link: https://lkml.kernel.org/r/20201117085249.24319-1-songmuchun@bytedance.com
    Signed-off-by: Muchun Song
    Acked-by: Roman Gushchin
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Muchun Song
     

03 Dec, 2020

1 commit

  • Patch series "mm: allow mapping accounted kernel pages to userspace", v6.

    Currently a non-slab kernel page which has been charged to a memory cgroup
    can't be mapped to userspace. The underlying reason is simple: PageKmemcg
    flag is defined as a page type (like buddy, offline, etc), so it takes a
    bit from a page->mapped counter. Pages with a type set can't be mapped to
    userspace.

    But in general the kmemcg flag has nothing to do with mapping to
    userspace. It only means that the page has been accounted by the page
    allocator, so it has to be properly uncharged on release.

    Some bpf maps are mapping the vmalloc-based memory to userspace, and their
    memory can't be accounted because of this implementation detail.

    This patchset removes this limitation by moving the PageKmemcg flag into
    one of the free bits of the page->mem_cgroup pointer. Also it formalizes
    accesses to the page->mem_cgroup and page->obj_cgroups using new helpers,
    adds several checks and removes a couple of obsolete functions. As the
    result the code became more robust with fewer open-coded bit tricks.

    This patch (of 4):

    Currently there are many open-coded reads of the page->mem_cgroup pointer,
    as well as a couple of read helpers, which are barely used.

    It creates an obstacle on a way to reuse some bits of the pointer for
    storing additional bits of information. In fact, we already do this for
    slab pages, where the last bit indicates that a pointer has an attached
    vector of objcg pointers instead of a regular memcg pointer.

    This commit uses 2 existing helpers and introduces a new helper,
    converting all read sides to calls of these helpers:
    struct mem_cgroup *page_memcg(struct page *page);
    struct mem_cgroup *page_memcg_rcu(struct page *page);
    struct mem_cgroup *page_memcg_check(struct page *page);

    page_memcg_check() is intended to be used in cases when the page can be a
    slab page and have a memcg pointer pointing at objcg vector. It does
    check the lowest bit, and if set, returns NULL. page_memcg() contains a
    VM_BUG_ON_PAGE() check for the page not being a slab page.

    To make sure nobody uses a direct access, struct page's
    mem_cgroup/obj_cgroups is converted to unsigned long memcg_data.
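
    A simplified sketch of how the check variant can tell the two cases
    apart via the lowest bit of memcg_data (illustrative only; the real
    helpers mask out additional flag bits):

        #define MEMCG_DATA_OBJCGS 0x1UL  /* low bit: objcg vector, not memcg */

        static inline struct mem_cgroup *page_memcg_check(struct page *page)
        {
                unsigned long memcg_data = READ_ONCE(page->memcg_data);

                if (memcg_data & MEMCG_DATA_OBJCGS)
                        return NULL;  /* slab page carrying an objcg vector */

                return (struct mem_cgroup *)memcg_data;
        }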

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Link: https://lkml.kernel.org/r/20201027001657.3398190-1-guro@fb.com
    Link: https://lkml.kernel.org/r/20201027001657.3398190-2-guro@fb.com
    Link: https://lore.kernel.org/bpf/20201201215900.3569844-2-guro@fb.com

    Roman Gushchin
     

21 Oct, 2020

1 commit

  • Pull XArray updates from Matthew Wilcox:

    - Fix the test suite after introduction of the local_lock

    - Fix a bug in the IDA spotted by Coverity

    - Change the API that allows the workingset code to delete a node

    - Fix xas_reload() when dealing with entries that occupy multiple
    indices

    - Add a few more tests to the test suite

    - Fix an unsigned int being shifted into an unsigned long

    * tag 'xarray-5.9' of git://git.infradead.org/users/willy/xarray:
    XArray: Fix xas_create_range for ranges above 4 billion
    radix-tree: fix the comment of radix_tree_next_slot()
    XArray: Fix xas_reload for multi-index entries
    XArray: Add private interface for workingset node deletion
    XArray: Fix xas_for_each_conflict documentation
    XArray: Test marked multiorder iterations
    XArray: Test two more things about xa_cmpxchg
    ida: Free allocated bitmap in error path
    radix tree test suite: Fix compilation

    Linus Torvalds
     

17 Oct, 2020

1 commit

  • Fix the following warnings caused by a mismatch between function
    parameters and comments.

    mm/workingset.c:228: warning: Function parameter or member 'lruvec' not described in 'workingset_age_nonresident'
    mm/workingset.c:228: warning: Excess function parameter 'memcg' description in 'workingset_age_nonresident'

    Signed-off-by: Xiaofei Tan
    Signed-off-by: Andrew Morton
    Link: https://lkml.kernel.org/r/1600485913-11192-1-git-send-email-tanxiaofei@huawei.com
    Signed-off-by: Linus Torvalds

    Xiaofei Tan
     

15 Aug, 2020

1 commit

  • The thp prefix is more frequently used than hpage and we should be
    consistent between the various functions.

    [akpm@linux-foundation.org: fix mm/migrate.c]

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Cc: Mike Kravetz
    Cc: David Hildenbrand
    Cc: "Kirill A. Shutemov"
    Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

13 Aug, 2020

2 commits

  • This patch implements workingset detection for anonymous LRU. All the
    infrastructure is implemented by the previous patches so this patch just
    activates the workingset detection by installing/retrieving the shadow
    entry and adding refault calculation.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Link: http://lkml.kernel.org/r/1595490560-15117-6-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • To prepare the workingset detection for anon LRU, this patch splits
    workingset event counters for refault, activate and restore into anon and
    file variants, as well as the refaults counter in struct lruvec.
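
    Conceptually, each combined counter becomes an anon and a file
    variant, along the lines of (sketch; the real enum has many more
    entries):

        enum node_stat_item {
                /* ... */
                WORKINGSET_REFAULT_ANON,
                WORKINGSET_REFAULT_FILE,
                WORKINGSET_ACTIVATE_ANON,
                WORKINGSET_ACTIVATE_FILE,
                WORKINGSET_RESTORE_ANON,
                WORKINGSET_RESTORE_FILE,
                /* ... */
        };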

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Link: http://lkml.kernel.org/r/1595490560-15117-4-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

08 Aug, 2020

1 commit

  • In order to prepare for per-object slab memory accounting, convert
    NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat items to bytes.

    To make it obvious, rename them to NR_SLAB_RECLAIMABLE_B and
    NR_SLAB_UNRECLAIMABLE_B (similar to NR_KERNEL_STACK_KB).

    Internally global and per-node counters are stored in pages, however memcg
    and lruvec counters are stored in bytes. This scheme may look weird, but
    only for now. As soon as slab pages will be shared between multiple
    cgroups, global and node counters will reflect the total number of slab
    pages. However memcg and lruvec counters will be used for per-memcg slab
    memory tracking, which will take separate kernel objects in the account.
    Keeping global and node counters in pages helps to avoid additional
    overhead.

    The size of slab memory shouldn't exceed 4Gb on 32-bit machines, so it
    will fit into the atomic_long_t we use for vmstats.
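
    In practice this means memcg/lruvec updates for the _B items pass a
    delta in bytes, e.g. for a 256-byte slab object (sketch, not the exact
    upstream call sites):

        /* charge one 256-byte reclaimable slab object to the lruvec */
        mod_lruvec_state(lruvec, NR_SLAB_RECLAIMABLE_B, 256);

        /* and uncharge it again on free */
        mod_lruvec_state(lruvec, NR_SLAB_RECLAIMABLE_B, -256);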

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-4-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

26 Jun, 2020

1 commit

  • Patch series "fix for "mm: balance LRU lists based on relative
    thrashing" patchset"

    This patchset fixes some problems of the patchset, "mm: balance LRU
    lists based on relative thrashing", which is now merged on the mainline.

    Patch "mm: workingset: let cache workingset challenge anon fix" is the
    result of discussion with Johannes. See following link.

    http://lkml.kernel.org/r/20200520232525.798933-6-hannes@cmpxchg.org

    And, the other two are minor things which are found when I try to rebase
    my patchset.

    This patch (of 3):

    After ("mm: workingset: let cache workingset challenge anon fix"), we
    compare refault distances to active_file + anon. But age of the
    non-resident information is only driven by the file LRU. As a result,
    we may overestimate the recency of any incoming refaults and activate
    them too eagerly, causing unnecessary LRU churn in certain situations.

    Make anon aging drive nonresident age as well to address that.

    Link: http://lkml.kernel.org/r/1592288204-27734-1-git-send-email-iamjoonsoo.kim@lge.com
    Link: http://lkml.kernel.org/r/1592288204-27734-2-git-send-email-iamjoonsoo.kim@lge.com
    Fixes: 34e58cac6d8f2a ("mm: workingset: let cache workingset challenge anon")
    Reported-by: Joonsoo Kim
    Signed-off-by: Johannes Weiner
    Signed-off-by: Joonsoo Kim
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

04 Jun, 2020

3 commits

  • The VM tries to balance reclaim pressure between anon and file so as to
    reduce the amount of IO incurred due to the memory shortage. It already
    counts refaults and swapins, but in addition it should also count
    writepage calls during reclaim.

    For swap, this is obvious: it's IO that wouldn't have occurred if the
    anonymous memory hadn't been under memory pressure. From a relative
    balancing point of view this makes sense as well: even if anon is cold and
    reclaimable, a cache that isn't thrashing may have equally cold pages that
    don't require IO to reclaim.

    For file writeback, it's trickier: some of the reclaim writepage IO would
    have likely occurred anyway due to dirty expiration. But not all of it -
    premature writeback reduces batching and generates additional writes.
    Since the flushers are already woken up by the time the VM starts writing
    cache pages one by one, let's assume that we're likely causing writes that
    wouldn't have happened without memory pressure. In addition, the per-page
    cost of IO would have probably been much cheaper if written in larger
    batches from the flusher thread rather than the single-page-writes from
    kswapd.

    For our purposes - getting the trend right to accelerate convergence on a
    stable state that doesn't require paging at all - this is sufficiently
    accurate. If we later wanted to optimize for sustained thrashing, we can
    still refine the measurements.

    Count all writepage calls from kswapd as IO cost toward the LRU that the
    page belongs to.

    Why do this dynamically? Don't we know in advance that anon pages require
    IO to reclaim, and so could build in a static bias?

    First, scanning is not the same as reclaiming. If all the anon pages are
    referenced, we may not swap for a while just because we're scanning the
    anon list. During this time, however, it's important that we age
    anonymous memory and the page cache at the same rate so that their
    hot-cold gradients are comparable. Everything else being equal, we still
    want to reclaim the coldest memory overall.

    Second, we keep copies in swap unless the page changes. If there is
    swap-backed data that's mostly read (tmpfs file) and has been swapped out
    before, we can reclaim it without incurring additional IO.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/20200520232525.798933-14-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Since the LRUs were split into anon and file lists, the VM has been
    balancing between page cache and anonymous pages based on per-list ratios
    of scanned vs. rotated pages. In most cases that tips page reclaim
    towards the list that is easier to reclaim and has the fewest actively
    used pages, but there are a few problems with it:

    1. Refaults and LRU rotations are weighted the same way, even though
    one costs IO and the other costs a bit of CPU.

    2. The less we scan an LRU list based on already observed rotations,
    the more we increase the sampling interval for new references, and
    rotations become even more likely on that list. This can enter a
    death spiral in which we stop looking at one list completely until
    the other one is all but annihilated by page reclaim.

    Since commit a528910e12ec ("mm: thrash detection-based file cache sizing")
    we have refault detection for the page cache. Along with swapin events,
    they are good indicators of when the file or anon list, respectively, is
    too small for its workingset and needs to grow.

    For example, if the page cache is thrashing, the cache pages need more
    time in memory, while there may be colder pages on the anonymous list.
    Likewise, if swapped pages are faulting back in, it indicates that we
    reclaim anonymous pages too aggressively and should back off.

    Replace LRU rotations with refaults and swapins as the basis for relative
    reclaim cost of the two LRUs. This will have the VM target list balances
    that incur the least amount of IO on aggregate.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/20200520232525.798933-12-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • We activate cache refaults with reuse distances in pages smaller than the
    size of the total cache. This allows new pages with competitive access
    frequencies to establish themselves, as well as challenge and potentially
    displace pages on the active list that have gone cold.

    However, that assumes that active cache can only replace other active
    cache in a competition for the hottest memory. This is not a great
    default assumption. The page cache might be thrashing while there are
    enough completely cold and unused anonymous pages sitting around that we'd
    only have to write to swap once to stop all IO from the cache.

    Activate cache refaults when their reuse distance in pages is smaller than
    the total userspace workingset, including anonymous pages.
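
    In pseudo-C, the activation check changes roughly as follows (a
    sketch of the condition using the lruvec counters as stand-ins, not
    the exact code):

        /* before: only the active file cache could be challenged */
        workingset = refault_distance <=
                     lruvec_page_state(lruvec, NR_ACTIVE_FILE);

        /* after: challenge the whole userspace workingset, anon included */
        workingset_size = lruvec_page_state(lruvec, NR_ACTIVE_FILE) +
                          lruvec_page_state(lruvec, NR_INACTIVE_ANON) +
                          lruvec_page_state(lruvec, NR_ACTIVE_ANON);
        workingset = refault_distance <= workingset_size;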

    Reclaim can still decide how to balance pressure among the two LRUs
    depending on the IO situation. Rotational drives will prefer avoiding
    random IO from swap and go harder after cache. But fundamentally, hot
    cache should be able to compete with anon pages for a place in RAM.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/20200520232525.798933-6-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

02 Dec, 2019

2 commits

  • We use refault information to determine whether the cache workingset is
    stable or transitioning, and dynamically adjust the inactive:active file
    LRU ratio so as to maximize protection from one-off cache during stable
    periods, and minimize IO during transitions.

    With cgroups and their nested LRU lists, we currently don't do this
    correctly. While recursive cgroup reclaim establishes a relative LRU
    order among the pages of all involved cgroups, refaults only affect the
    local LRU order in the cgroup in which they are occurring. As a result,
    cache transitions can take longer in a cgrouped system as the active pages
    of sibling cgroups aren't challenged when they should be.

    [ Right now, this is somewhat theoretical, because the siblings, under
    continued regular reclaim pressure, should eventually run out of
    inactive pages - and since inactive:active *size* balancing is also
    done on a cgroup-local level, we will challenge the active pages
    eventually in most cases. But the next patch will move that relative
    size enforcement to the reclaim root as well, and then this patch
    here will be necessary to propagate refault pressure to siblings. ]

    This patch moves refault detection to the root of reclaim. Instead of
    remembering the cgroup owner of an evicted page, remember the cgroup that
    caused the reclaim to happen. When refaults later occur, they'll
    correctly influence the cross-cgroup LRU order that reclaim follows.

    I.e. if global reclaim kicked out pages in some subgroup A/B/C, the
    refault of those pages will challenge the global LRU order, and not just
    the local order down inside C.

    [hannes@cmpxchg.org: use page_memcg() instead of another lookup]
    Link: http://lkml.kernel.org/r/20191115160722.GA309754@cmpxchg.org
    Link: http://lkml.kernel.org/r/20191107205334.158354-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Suren Baghdasaryan
    Cc: Andrey Ryabinin
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • There is a per-memcg lruvec and a NUMA node lruvec. Which one is being
    used is somewhat confusing right now, and it's easy to make mistakes -
    especially when it comes to global reclaim.

    How it works: when memory cgroups are enabled, we always use the
    root_mem_cgroup's per-node lruvecs. When memory cgroups are not compiled
    in or disabled at runtime, we use pgdat->lruvec.

    Document that in a comment.

    Due to the way the reclaim code is generalized, all lookups use the
    mem_cgroup_lruvec() helper function, and nobody should have to find the
    right lruvec manually right now. But to avoid future mistakes, rename the
    pgdat->lruvec member to pgdat->__lruvec and delete the convenience wrapper
    that suggests it's a commonly accessed member.

    While in this area, swap the mem_cgroup_lruvec() argument order. The
    name suggests a memcg operation, yet it takes a pgdat first and a memcg
    second. I have to do a double take every time I call this. Fix that.
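
    The argument swap boils down to this signature change (sketch):

        /* before: reads like a pgdat operation */
        struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
                                         struct mem_cgroup *memcg);

        /* after: memcg first, matching what the name suggests */
        struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
                                         struct pglist_data *pgdat);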

    Link: http://lkml.kernel.org/r/20191022144803.302233-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Reviewed-by: Shakeel Butt
    Cc: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

14 Aug, 2019

1 commit

  • Memcg counters for shadow nodes are broken because the memcg pointer is
    obtained in a wrong way. The following approach is used:
    virt_to_page(xa_node)->mem_cgroup

    Since commit 4d96ba353075 ("mm: memcg/slab: stop setting
    page->mem_cgroup pointer for slab pages") page->mem_cgroup pointer isn't
    set for slab pages, so memcg_from_slab_page() should be used instead.

    Also I doubt that it ever worked correctly: virt_to_head_page() should
    be used instead of virt_to_page(). Otherwise objects residing on tail
    pages are not accounted, because only the head page contains a valid
    mem_cgroup pointer. That has been the case since the introduction of
    these counters by commit 68d48e6a2df5 ("mm: workingset: add vmstat
    counter for shadow nodes").

    Link: http://lkml.kernel.org/r/20190801233532.138743-1-guro@fb.com
    Fixes: 4d96ba353075 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
    Signed-off-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Shakeel Butt
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

15 May, 2019

2 commits

  • Patch series "mm: memcontrol: memory.stat cost & correctness".

    The cgroup memory.stat file holds recursive statistics for the entire
    subtree. The current implementation does this tree walk on-demand
    whenever the file is read. This is giving us problems in production.

    1. The cost of aggregating the statistics on-demand is high. A lot of
    system service cgroups are mostly idle and their stats don't change
    between reads, yet we always have to check them. There are also always
    some lazily-dying cgroups sitting around that are pinned by a handful
    of remaining page cache; the same applies to them.

    In an application that periodically monitors memory.stat in our
    fleet, we have seen the aggregation consume up to 5% CPU time.

    2. When cgroups die and disappear from the cgroup tree, so do their
    accumulated vm events. The result is that the event counters at
    higher-level cgroups can go backwards and confuse some of our
    automation, let alone people looking at the graphs over time.

    To address both issues, this patch series changes the stat
    implementation to spill counts upwards when the counters change.

    The upward spilling is batched using the existing per-cpu cache. In a
    sparse file stress test with 5 level cgroup nesting, the additional cost
    of the flushing was negligible (a little under 1% of CPU at 100% CPU
    utilization, compared to the 5% of reading memory.stat during regular
    operation).

    This patch (of 4):

    memcg_page_state(), lruvec_page_state(), memcg_sum_events() are
    currently returning the state of the local memcg or lruvec, not the
    recursive state.

    In practice there is a demand for both versions, although the callers
    that want the recursive counts currently sum them up by hand.

    By default, cgroups are considered recursive entities and generally we
    expect more users of the recursive counters, with the local counts
    being special cases. To reflect that in the name, add a _local suffix
    to the current implementations.

    The following patch will re-incarnate these functions with recursive
    semantics, but with an O(1) implementation.

    [hannes@cmpxchg.org: fix bisection hole]
    Link: http://lkml.kernel.org/r/20190417160347.GC23013@cmpxchg.org
    Link: http://lkml.kernel.org/r/20190412151507.2769-2-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • mem_cgroup_node_nr_lru_pages() is just a convenience wrapper around
    lruvec_page_state() that takes bitmasks of lru indexes and aggregates the
    counts for those.

    Replace callsites where the bitmask is simple enough with direct
    lruvec_page_state() calls.

    This removes the last extern user of mem_cgroup_node_nr_lru_pages(), so
    make that function private again, too.

    Link: http://lkml.kernel.org/r/20190228163020.24100-5-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

06 Mar, 2019

1 commit

  • workingset_eviction() doesn't use and never did use the @mapping
    argument. Remove it.

    Link: http://lkml.kernel.org/r/20190228083329.31892-1-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Acked-by: Johannes Weiner
    Acked-by: Rik van Riel
    Acked-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Cc: Michal Hocko
    Cc: William Kucharski
    Cc: John Hubbard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

29 Dec, 2018

1 commit

  • totalram_pages and totalhigh_pages are made static inline functions.

    The main motivation was that managed_page_count_lock handling was
    complicating things. It was discussed at length here,
    https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemed
    better to remove the lock and convert the variables to atomic,
    preventing potential store-to-read tearing as a bonus.
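
    The resulting accessor is roughly of this shape (sketch):

        extern atomic_long_t _totalram_pages;

        static inline unsigned long totalram_pages(void)
        {
                return (unsigned long)atomic_long_read(&_totalram_pages);
        }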

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org
    Signed-off-by: Arun KS
    Suggested-by: Michal Hocko
    Suggested-by: Vlastimil Babka
    Reviewed-by: Konstantin Khlebnikov
    Reviewed-by: Pavel Tatashin
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun KS
     

29 Oct, 2018

1 commit

  • Pull XArray conversion from Matthew Wilcox:
    "The XArray provides an improved interface to the radix tree data
    structure, providing locking as part of the API, specifying GFP flags
    at allocation time, eliminating preloading, less re-walking the tree,
    more efficient iterations and not exposing RCU-protected pointers to
    its users.

    This patch set

    1. Introduces the XArray implementation

    2. Converts the pagecache to use it

    3. Converts memremap to use it

    The page cache is the most complex and important user of the radix
    tree, so converting it was most important. Converting the memremap
    code removes the only other user of the multiorder code, which allows
    us to remove the radix tree code that supported it.

    I have 40+ followup patches to convert many other users of the radix
    tree over to the XArray, but I'd like to get this part in first. The
    other conversions haven't been in linux-next and aren't suitable for
    applying yet, but you can see them in the xarray-conv branch if you're
    interested"

    * 'xarray' of git://git.infradead.org/users/willy/linux-dax: (90 commits)
    radix tree: Remove multiorder support
    radix tree test: Convert multiorder tests to XArray
    radix tree tests: Convert item_delete_rcu to XArray
    radix tree tests: Convert item_kill_tree to XArray
    radix tree tests: Move item_insert_order
    radix tree test suite: Remove multiorder benchmarking
    radix tree test suite: Remove __item_insert
    memremap: Convert to XArray
    xarray: Add range store functionality
    xarray: Move multiorder_check to in-kernel tests
    xarray: Move multiorder_shrink to kernel tests
    xarray: Move multiorder account test in-kernel
    radix tree test suite: Convert iteration test to XArray
    radix tree test suite: Convert tag_tagged_items to XArray
    radix tree: Remove radix_tree_clear_tags
    radix tree: Remove radix_tree_maybe_preload_order
    radix tree: Remove split/join code
    radix tree: Remove radix_tree_update_node_t
    page cache: Finish XArray conversion
    dax: Convert page fault handlers to XArray
    ...

    Linus Torvalds
     

27 Oct, 2018

5 commits

  • The page cache and most shrinkable slab caches hold data that has been
    read from disk, but there are some caches that only cache CPU work, such
    as the dentry and inode caches of procfs and sysfs, as well as the subset
    of radix tree nodes that track non-resident page cache.

    Currently, all these are shrunk at the same rate: using DEFAULT_SEEKS for
    the shrinker's seeks setting tells the reclaim algorithm that for every
    two page cache pages scanned it should scan one slab object.

    This is a bogus setting. A virtual inode that required no IO to create is
    not twice as valuable as a page cache page; shadow cache entries with
    eviction distances beyond the size of memory aren't either.

    In most cases, the behavior in practice is still fine. Such virtual
    caches don't tend to grow and assert themselves aggressively, and usually
    get picked up before they cause problems. But there are scenarios where
    that's not true.

    Our database workloads suffer from two of those. For one, their file
    workingset is several times bigger than available memory, which has the
    kernel aggressively create shadow page cache entries for the non-resident
    parts of it. The workingset code does tell the VM that most of these are
    expendable, but the VM ends up balancing them 2:1 to cache pages as per
    the seeks setting. This is a huge waste of memory.

    These workloads also deal with tens of thousands of open files and use
    /proc for introspection, which ends up growing the proc_inode_cache to
    absurdly large sizes - again at the cost of valuable cache space, which
    isn't a reasonable trade-off, given that proc inodes can be re-created
    without involving the disk.

    This patch implements a "zero-seek" setting for shrinkers that results in
    a target ratio of 0:1 between their objects and IO-backed caches. This
    allows such virtual caches to grow when memory is available (they do
    cache/avoid CPU work after all), but effectively disables them as soon as
    IO-backed objects are under pressure.

    It then switches the shrinkers for procfs and sysfs metadata, as well as
    excess page cache shadow nodes, to the new zero-seek setting.
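
    For the shadow-node shrinker this boils down to declaring zero seeks
    in its shrinker definition, roughly (sketch; fields other than .seeks
    are shown for context and may differ from the actual definition):

        static struct shrinker workingset_shadow_shrinker = {
                .count_objects = count_shadow_nodes,
                .scan_objects  = scan_shadow_nodes,
                .seeks         = 0,  /* zero-seek: was DEFAULT_SEEKS (2) */
                .flags         = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE,
        };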

    Link: http://lkml.kernel.org/r/20181009184732.762-5-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reported-by: Domas Mituzas
    Reviewed-by: Andrew Morton
    Reviewed-by: Rik van Riel
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Make it easier to catch bugs in the shadow node shrinker by adding a
    counter for the shadow nodes in circulation.

    [akpm@linux-foundation.org: assert that irqs are disabled, for __inc_lruvec_page_state()]
    [akpm@linux-foundation.org: s/WARN_ON_ONCE/VM_WARN_ON_ONCE/, per Johannes]
    Link: http://lkml.kernel.org/r/20181009184732.762-4-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Andrew Morton
    Acked-by: Peter Zijlstra (Intel)
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • No need to use the preemption-safe lruvec state function inside the
    reclaim region that has irqs disabled.

    Link: http://lkml.kernel.org/r/20181009184732.762-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Andrew Morton
    Reviewed-by: Rik van Riel
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Refaults happen during transitions between workingsets as well as in-place
    thrashing. Knowing the difference between the two has a range of
    applications, including measuring the impact of memory shortage on the
    system performance, as well as the ability to smarter balance pressure
    between the filesystem cache and the swap-backed workingset.

    During workingset transitions, inactive cache refaults and pushes out
    established active cache. When that active cache isn't stale, however,
    and also ends up refaulting, that's bonafide thrashing.

    Introduce a new page flag that tells on eviction whether the page has been
    active or not in its lifetime. This bit is then stored in the shadow
    entry, to classify refaults as transitioning or thrashing.

    How many page->flags does this leave us with on 32-bit?

    20 bits are always page flags

    21 if you have an MMU

    23 with the zone bits for DMA, Normal, HighMem, Movable

    29 with the sparsemem section bits

    30 if PAE is enabled

    31 with this patch.

    So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes. If
    that's not enough, the system can switch to discontigmem and re-gain the 6
    or 7 sparsemem section bits.

    Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Peter Zijlstra (Intel)
    Tested-by: Daniel Drake
    Tested-by: Suren Baghdasaryan
    Cc: Christopher Lameter
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Mike Galbraith
    Cc: Peter Enderborg
    Cc: Randy Dunlap
    Cc: Shakeel Butt
    Cc: Tejun Heo
    Cc: Vinayak Menon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Patch series "psi: pressure stall information for CPU, memory, and IO", v4.

    Overview

    PSI reports the overall wallclock time in which the tasks in a system (or
    cgroup) wait for (contended) hardware resources.

    This helps users understand the resource pressure their workloads are
    under, which allows them to rootcause and fix throughput and latency
    problems caused by overcommitting, underprovisioning, suboptimal job
    placement in a grid; as well as anticipate major disruptions like OOM.

    Real-world applications

    We're using the data collected by PSI (and its previous incarnation,
    memdelay) quite extensively at Facebook, and with several success stories.

    One usecase is avoiding OOM hangs/livelocks. The reason these happen is
    because the OOM killer is triggered by reclaim not being able to free
    pages, but with fast flash devices there is *always* some clean and
    uptodate cache to reclaim; the OOM killer never kicks in, even as tasks
    spend 90% of the time thrashing the cache pages of their own executables.
    There is no situation where this ever makes sense in practice. We wrote a

    Acked-by: Peter Zijlstra (Intel)
    Reviewed-by: Rik van Riel
    Tested-by: Daniel Drake
    Tested-by: Suren Baghdasaryan
    Cc: Ingo Molnar
    Cc: Tejun Heo
    Cc: Vinayak Menon
    Cc: Christopher Lameter
    Cc: Peter Enderborg
    Cc: Shakeel Butt
    Cc: Mike Galbraith
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

21 Oct, 2018

2 commits

  • We construct an XA_STATE and use it to delete the node with
    xas_store() rather than adding a special function for this unique
    use case. Includes a test that simulates this usage for the
    test suite.
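
    In terms of the public XArray API, the pattern looks roughly like
    this (a sketch; the workingset shrinker uses a private variant that
    starts from an already-known node):

        XA_STATE(xas, &mapping->i_pages, index);

        xas_lock_irq(&xas);
        xas_store(&xas, NULL);  /* erase the entry; empty nodes get freed */
        xas_unlock_irq(&xas);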

    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     
  • This is a direct replacement for struct radix_tree_node. A couple of
    struct members have changed name, so convert those. Use a #define so
    that radix tree users continue to work without change.

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Josef Bacik

    Matthew Wilcox
     

30 Sep, 2018

1 commit

  • Introduce xarray value entries and tagged pointers to replace radix
    tree exceptional entries. This is a slight change in encoding to allow
    the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a
    value entry). It is also a change in emphasis; exceptional entries are
    intimidating and different. As the comment explains, you can choose
    to store values or pointers in the xarray and they are both first-class
    citizens.
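
    A sketch of how an integer payload (up to BITS_PER_LONG - 1 bits)
    round-trips through a value entry:

        void *entry = xa_mk_value(0x1234);  /* tag an integer as a value */
        unsigned long v;

        if (xa_is_value(entry))
                v = xa_to_value(entry);     /* v == 0x1234 */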

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Josef Bacik

    Matthew Wilcox