27 Apr, 2022

1 commit

  • commit 2dfe63e61cc31ee59ce951672b0850b5229cd5b0 upstream.

    Calling kmem_obj_info() via kmem_dump_obj() on KFENCE objects has been
    producing garbage data due to the object not actually being maintained
    by SLAB or SLUB.

    Fix this by implementing __kfence_obj_info() that copies relevant
    information to struct kmem_obj_info when the object was allocated by
    KFENCE; this is called by a common kmem_obj_info(), which also calls the
    slab/slub/slob specific variant now called __kmem_obj_info().

    For completeness, kmem_dump_obj() now displays if the object was
    allocated by KFENCE.
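
    A minimal sketch of the resulting dispatch (function names are from the
    commit; the exact parameter types vary between kernel versions, so treat
    this as an illustration rather than the verbatim upstream code):

    /* Common entry point: let KFENCE fill in the info for its own objects. */
    void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab)
    {
            if (__kfence_obj_info(kpp, object, slab))
                    return;                         /* KFENCE object: info filled in */
            __kmem_obj_info(kpp, object, slab);     /* slab/slub/slob specific variant */
    }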

    Link: https://lore.kernel.org/all/20220323090520.GG16885@xsang-OptiPlex-9020/
    Link: https://lkml.kernel.org/r/20220406131558.3558585-1-elver@google.com
    Fixes: b89fb5ef0ce6 ("mm, kfence: insert KFENCE hooks for SLUB")
    Fixes: d3fb45f370d9 ("mm, kfence: insert KFENCE hooks for SLAB")
    Signed-off-by: Marco Elver
    Reviewed-by: Hyeonggon Yoo
    Reported-by: kernel test robot
    Acked-by: Vlastimil Babka [slab]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Marco Elver
     

25 Nov, 2021

1 commit

  • commit 34dbc3aaf5d9e89ba6cc5e24add9458c21ab1950 upstream.

    When kmemleak is enabled for SLOB, the system does not boot and does not
    print anything to the console. At a very early stage in the boot
    process we hit infinite recursion from kmemleak_init() and eventually the
    kernel crashes.

    kmemleak_init() specifies SLAB_NOLEAKTRACE for KMEM_CACHE(), but
    kmem_cache_create_usercopy() removes it because CACHE_CREATE_MASK is not
    valid for SLOB.

    Let's fix CACHE_CREATE_MASK and make kmemleak work with SLOB.
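
    The fix amounts to letting the SLOB branch of the cache-flag mask in
    mm/slab.h keep SLAB_NOLEAKTRACE, so that CACHE_CREATE_MASK no longer
    strips it. A sketch of the idea (macro names are from mm/slab.h, but this
    is simplified, not the verbatim diff):

    #ifdef CONFIG_SLOB
    #define SLAB_CACHE_FLAGS (SLAB_NOLEAKTRACE)     /* previously effectively empty */
    #endif

    /* Common flags available with the current configuration */
    #define CACHE_CREATE_MASK (SLAB_CORE_FLAGS | SLAB_DEBUG_FLAGS | SLAB_CACHE_FLAGS)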

    Link: https://lkml.kernel.org/r/20211115020850.3154366-1-rkovhaev@gmail.com
    Fixes: d8843922fba4 ("slab: Ignore internal flags in cache creation")
    Signed-off-by: Rustam Kovhaev
    Acked-by: Vlastimil Babka
    Reviewed-by: Muchun Song
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Catalin Marinas
    Cc: Greg Kroah-Hartman
    Cc: Glauber Costa
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Rustam Kovhaev
     

31 Jul, 2021

1 commit

  • When I use kfree_rcu() to free a large block of memory allocated by
    kmalloc_node(), the following dump occurs.

    BUG: kernel NULL pointer dereference, address: 0000000000000020
    [...]
    Oops: 0000 [#1] SMP
    [...]
    Workqueue: events kfree_rcu_work
    RIP: 0010:__obj_to_index include/linux/slub_def.h:182 [inline]
    RIP: 0010:obj_to_index include/linux/slub_def.h:191 [inline]
    RIP: 0010:memcg_slab_free_hook+0x120/0x260 mm/slab.h:363
    [...]
    Call Trace:
    kmem_cache_free_bulk+0x58/0x630 mm/slub.c:3293
    kfree_bulk include/linux/slab.h:413 [inline]
    kfree_rcu_work+0x1ab/0x200 kernel/rcu/tree.c:3300
    process_one_work+0x207/0x530 kernel/workqueue.c:2276
    worker_thread+0x320/0x610 kernel/workqueue.c:2422
    kthread+0x13d/0x160 kernel/kthread.c:313
    ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294

    When kmalloc_node() allocates a large amount of memory, a page is
    allocated rather than a slab, so when that memory is freed via
    kfree_rcu(), it should not be passed to memcg_slab_free_hook(), because
    memcg_slab_free_hook() is only meant for slab objects.

    Use page_objcgs_check() instead of page_objcgs() in memcg_slab_free_hook()
    to fix this bug.
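
    A sketch of the fixed check (simplified; the surrounding mm/slab.h code
    differs between kernel versions):

    static inline void memcg_slab_free_hook(struct kmem_cache *s, void **p, int objects)
    {
            struct obj_cgroup **objcgs;
            struct page *page;
            int i;

            for (i = 0; i < objects; i++) {
                    page = virt_to_head_page(p[i]);
                    /* returns NULL for non-slab pages, e.g. large kmalloc allocations */
                    objcgs = page_objcgs_check(page);
                    if (!objcgs)
                            continue;
                    /* ... look up the object's obj_cgroup and uncharge it ... */
            }
    }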

    Link: https://lkml.kernel.org/r/20210728145655.274476-1-wanghai38@huawei.com
    Fixes: 270c6a71460e ("mm: memcontrol/slab: Use helpers to access slab page's memcg_data")
    Signed-off-by: Wang Hai
    Reviewed-by: Shakeel Butt
    Acked-by: Michal Hocko
    Acked-by: Roman Gushchin
    Reviewed-by: Kefeng Wang
    Reviewed-by: Muchun Song
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Alexei Starovoitov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Hai
     

16 Jul, 2021

1 commit

  • Move the helper to check slub_debug_enabled, so that we can confine the
    use of #ifdef outside slub.c as well.
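
    A sketch of the helper as it might sit in mm/slab.h after the move (the
    __slub_debug_enabled() name is from this series; the static key is
    declared elsewhere in the header):

    #ifdef CONFIG_SLUB_DEBUG
    static inline bool __slub_debug_enabled(void)
    {
            return static_branch_unlikely(&slub_debug_enabled);
    }
    #else
    static inline bool __slub_debug_enabled(void)
    {
            return false;
    }
    #endif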

    Link: https://lkml.kernel.org/r/20210705103229.8505-2-yee.lee@mediatek.com
    Signed-off-by: Marco Elver
    Signed-off-by: Yee Lee
    Suggested-by: Matthew Wilcox
    Cc: Alexander Potapenko
    Cc: Andrey Konovalov
    Cc: Andrey Ryabinin
    Cc: Chinwen Chang
    Cc: Dmitry Vyukov
    Cc: Kuan-Ying Lee
    Cc: Nicholas Tang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marco Elver
     

05 Jul, 2021

1 commit

  • …git/paulmck/linux-rcu

    Pull RCU updates from Paul McKenney:

    - Bitmap parsing support for "all" as an alias for all bits

    - Documentation updates

    - Miscellaneous fixes, including some that overlap into mm and lockdep

    - kvfree_rcu() updates

    - mem_dump_obj() updates, with acks from one of the slab-allocator
    maintainers

    - RCU NOCB CPU updates, including limited deoffloading

    - SRCU updates

    - Tasks-RCU updates

    - Torture-test updates

    * 'core-rcu-2021.07.04' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (78 commits)
    tasks-rcu: Make show_rcu_tasks_gp_kthreads() be static inline
    rcu-tasks: Make ksoftirqd provide RCU Tasks quiescent states
    rcu: Add missing __releases() annotation
    rcu: Remove obsolete rcu_read_unlock() deadlock commentary
    rcu: Improve comments describing RCU read-side critical sections
    rcu: Create an unrcu_pointer() to remove __rcu from a pointer
    srcu: Early test SRCU polling start
    rcu: Fix various typos in comments
    rcu/nocb: Unify timers
    rcu/nocb: Prepare for fine-grained deferred wakeup
    rcu/nocb: Only cancel nocb timer if not polling
    rcu/nocb: Delete bypass_timer upon nocb_gp wakeup
    rcu/nocb: Cancel nocb_timer upon nocb_gp wakeup
    rcu/nocb: Allow de-offloading rdp leader
    rcu/nocb: Directly call __wake_nocb_gp() from bypass timer
    rcu: Don't penalize priority boosting when there is nothing to boost
    rcu: Point to documentation of ordering guarantees
    rcu: Make rcu_gp_cleanup() be noinline for tracing
    rcu: Restrict RCU_STRICT_GRACE_PERIOD to at most four CPUs
    rcu: Make show_rcu_gp_kthreads() dump rcu_node structures blocking GP
    ...

    Linus Torvalds
     

30 Jun, 2021

4 commits

  • Patch series "mm: memcg/slab: Fix objcg pointer array handling problem", v4.

    Since the merging of the new slab memory controller in v5.9, the page
    structure stores a pointer to objcg pointer array for slab pages. When
    the slab has no used objects, it can be freed in free_slab() which will
    call kfree() to free the objcg pointer array in
    memcg_alloc_page_obj_cgroups(). If it happens that the objcg pointer
    array is the last used object in its slab, that slab may then be freed
    which may cause kfree() to be called again.

    With the right workload, the slab cache may be set up in a way that allows
    the recursive kfree() calling loop to nest deep enough to cause a kernel
    stack overflow and panic the system. In fact, we have a reproducer that
    can cause kernel stack overflow on a s390 system involving kmalloc-rcl-256
    and kmalloc-rcl-128 slabs with the following kfree() loop recursively
    called 74 times:

    [ 285.520739]  [] kfree+0x4bc/0x560
    [ 285.520740]  [] __free_slab+0xc6/0x228
    [ 285.520741]  [] __slab_free+0x3c2/0x3e0
    [ 285.520742]  [] kfree+0x4bc/0x560
     :

    While investigating this issue, I also found an issue on the allocation
    side. If the objcg pointer array happens to come from the same slab, or a
    circular dependency linkage is formed with multiple slabs, those affected
    slabs can never be freed again.

    This patch series addresses these two issues by introducing a new set of
    kmalloc-cg-<n> caches split from the kmalloc-<n> caches. The new set will
    only contain non-reclaimable and non-dma objects that are accounted in
    memory cgroups, whereas the old set is now for unaccounted objects only.
    By making this split, all the objcg pointer arrays will come from the
    kmalloc-<n> caches, but those caches will never hold any objcg pointer
    array. As a result, the deeply nested kfree() calls and the unfreeable
    slab problems are now gone.

    This patch (of 4):

    Since the merging of the new slab memory controller in v5.9, the page
    structure may store a pointer to obj_cgroup pointer array for slab pages.
    Currently, only the __GFP_ACCOUNT bit is masked off. However, the array
    is not readily reclaimable and doesn't need to come from the DMA buffer.
    So those GFP bits should be masked off as well.

    Do the flag bit clearing at memcg_alloc_page_obj_cgroups() to make sure
    that it is consistently applied no matter where it is called.
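
    A sketch of the idea (the mask and helper names below follow mainline
    mm/memcontrol.c, but the body is simplified):

    /* GFP bits that make no sense for the objcg pointer array itself */
    #define OBJCGS_CLEAR_MASK   (__GFP_DMA | __GFP_RECLAIMABLE | __GFP_ACCOUNT)

    int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s,
                                     gfp_t gfp, bool new_page)
    {
            unsigned int objects = objs_per_slab_page(s, page);
            void *vec;

            gfp &= ~OBJCGS_CLEAR_MASK;      /* done here so every caller gets it */
            vec = kcalloc_node(objects, sizeof(struct obj_cgroup *), gfp,
                               page_to_nid(page));
            if (!vec)
                    return -ENOMEM;
            /* ... install vec into page->memcg_data ... */
            return 0;
    }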

    Link: https://lkml.kernel.org/r/20210505200610.13943-1-longman@redhat.com
    Link: https://lkml.kernel.org/r/20210505200610.13943-2-longman@redhat.com
    Fixes: 286e04b8ed7a ("mm: memcg/slab: allocate obj_cgroups for non-root slab pages")
    Signed-off-by: Waiman Long
    Reviewed-by: Shakeel Butt
    Acked-by: Roman Gushchin
    Reviewed-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Waiman Long
     
  • Patch series "mm/memcg: Reduce kmemcache memory accounting overhead", v6.

    With the recent introduction of the new slab memory controller, we
    eliminate the need for having separate kmemcaches for each memory cgroup
    and reduce overall kernel memory usage. However, we also add additional
    memory accounting overhead to each call of kmem_cache_alloc() and
    kmem_cache_free().

    For workloads that require a lot of kmemcache allocations and
    de-allocations, they may experience performance regression as illustrated
    in [1] and [2].

    A simple kernel module that performs a repeated loop of 100,000,000
    kmem_cache_alloc() and kmem_cache_free() calls on either a small 32-byte
    object or a big 4k object at module init time, with a batch size of 4 (4
    kmalloc's followed by 4 kfree's), is used for benchmarking. The
    benchmarking tool was run on a kernel based on linux-next-20210419. The
    test was run on a CascadeLake server with turbo-boosting disabled to
    reduce run-to-run variation.
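
    A hypothetical sketch of such a benchmark module (not the actual tool
    used for the numbers below; object size, loop count and batch size follow
    the description above):

    #include <linux/module.h>
    #include <linux/slab.h>

    #define NR_LOOPS        100000000UL
    #define BATCH           4

    static int __init slab_bench_init(void)
    {
            void *objs[BATCH];
            unsigned long i;
            int j;

            for (i = 0; i < NR_LOOPS; i += BATCH) {
                    for (j = 0; j < BATCH; j++)
                            objs[j] = kmalloc(32, GFP_KERNEL | __GFP_ACCOUNT);
                    for (j = 0; j < BATCH; j++)
                            kfree(objs[j]);
            }
            return 0;
    }

    static void __exit slab_bench_exit(void) { }

    module_init(slab_bench_init);
    module_exit(slab_bench_exit);
    MODULE_LICENSE("GPL");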

    The small object test exercises mainly the object stock charging and
    vmstat update code paths. The large object test also exercises the
    refill_obj_stock() and __memcg_kmem_charge()/__memcg_kmem_uncharge() code
    paths.

    With memory accounting disabled, the run time was 3.130s for both the
    small object and the big object tests.

    With memory accounting enabled, both cgroup v1 and v2 showed similar
    results in the small object test. The performance results of the large
    object test, however, differed between cgroup v1 and v2.

    The execution times with the application of various patches in the
    patchset were:

    Applied patches   Run time   Accounting overhead   %age 1   %age 2
    ---------------   --------   -------------------   ------   ------

    Small 32-byte object:
    None              11.634s     8.504s               100.0%   271.7%
    1-2                9.425s     6.295s                74.0%   201.1%
    1-3                9.708s     6.578s                77.4%   210.2%
    1-4                8.062s     4.932s                58.0%   157.6%

    Large 4k object (v2):
    None              22.107s    18.977s               100.0%   606.3%
    1-2               20.960s    17.830s                94.0%   569.6%
    1-3               14.238s    11.108s                58.5%   354.9%
    1-4               11.329s     8.199s                43.2%   261.9%

    Large 4k object (v1):
    None              36.807s    33.677s               100.0%  1075.9%
    1-2               36.648s    33.518s                99.5%  1070.9%
    1-3               22.345s    19.215s                57.1%   613.9%
    1-4               18.662s    15.532s                46.1%   496.2%

    N.B. %age 1 = overhead/unpatched overhead
         %age 2 = overhead/accounting-disabled time

    Patch 2 (vmstat data stock caching) helps in both the small object test
    and the large v2 object test. It doesn't help much in the v1 big object
    test.

    Patch 3 (refill_obj_stock improvement) doesn't help much in the small
    object test, but offers a significant performance improvement for the
    large object test (both v1 and v2).

    Patch 4 (eliminating irq disable/enable) helps in all test cases.

    To test for the extreme case, a multi-threaded kmalloc/kfree
    microbenchmark was run on the 2-socket 48-core 96-thread system with
    96 testing threads in the same memcg doing kmalloc+kfree of a 4k object
    with accounting enabled for 10s. The total number of kmalloc+kfree done
    in kilo operations per second (kops/s) were as follows:

    Applied patches   v1 kops/s   v1 change   v2 kops/s   v2 change
    ---------------   ---------   ---------   ---------   ---------
    None                  3,520       1.00X       6,242       1.00X
    1-2                   4,304       1.22X       8,478       1.36X
    1-3                   4,731       1.34X     418,142      66.99X
    1-4                   4,587       1.30X     438,838      70.30X

    With memory accounting disabled, the kmalloc/kfree rate was 1,481,291
    kops/s. This test shows how significant the memory accounting overhead
    can be in some extreme situations.

    For this multithreaded test, the improvement from patch 2 mainly
    comes from the conditional atomic xchg of objcg->nr_charged_bytes in
    mod_objcg_state(). By using an unconditional xchg, the operation rates
    were similar to the unpatched kernel.

    Patch 3 eliminates the single highly contended cacheline of
    objcg->nr_charged_bytes for cgroup v2, leading to a huge performance
    improvement. Cgroup v1, however, still has another highly contended
    cacheline in the shared page counter &memcg->kmem, so the improvement
    there is only modest.

    Patch 4 helps in cgroup v2, but performs worse in cgroup v1 as
    eliminating the irq_disable/irq_enable overhead seems to aggravate the
    cacheline contention.

    [1] https://lore.kernel.org/linux-mm/20210408193948.vfktg3azh2wrt56t@gabell/T/#u
    [2] https://lore.kernel.org/lkml/20210114025151.GA22932@xsang-OptiPlex-9020/

    This patch (of 4):

    mod_objcg_state() is moved from mm/slab.h to mm/memcontrol.c so that
    further optimization can be done to it in later patches without exposing
    unnecessary details to other mm components.

    Link: https://lkml.kernel.org/r/20210506150007.16288-1-longman@redhat.com
    Link: https://lkml.kernel.org/r/20210506150007.16288-2-longman@redhat.com
    Signed-off-by: Waiman Long
    Acked-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Acked-by: Roman Gushchin
    Cc: Alex Shi
    Cc: Chris Down
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Masayoshi Mizuma
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Muchun Song
    Cc: Pekka Enberg
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: Xing Zhengjun
    Cc: Yafang Shao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Waiman Long
     
  • The alloc_calls and free_calls implementation in sysfs has two issues:
    one is the PAGE_SIZE limitation of sysfs, and the other is that it does
    not adhere to the "one value per file" rule.

    To overcome these issues, move the alloc_calls and free_calls
    implementation to debugfs.

    The debugfs directory for a cache is created if the SLAB_STORE_USER flag
    is set.

    Rename alloc_calls/free_calls to alloc_traces/free_traces, to be in line
    with what they do.

    [faiyazm@codeaurora.org: fix the leak of alloc/free traces debugfs interface]
    Link: https://lkml.kernel.org/r/1624248060-30286-1-git-send-email-faiyazm@codeaurora.org

    Link: https://lkml.kernel.org/r/1623438200-19361-1-git-send-email-faiyazm@codeaurora.org
    Signed-off-by: Faiyaz Mohammed
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Greg Kroah-Hartman
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Faiyaz Mohammed
     
  • SLUB has a resiliency_test() function, which is hidden behind an #ifdef
    SLUB_RESILIENCY_TEST that is not part of Kconfig, so nobody runs it.
    KUnit should be a proper replacement for it.

    Try changing a byte in the redzone after allocation, and changing the
    pointer to the next free node, the first byte, the 50th byte and a
    redzone byte. Check whether validation finds the errors.

    There are several differences from the original resiliency test: the
    tests create their own caches with a known state instead of corrupting
    the shared kmalloc caches.

    The freepointer corruption uses the correct offset; the original
    resiliency test was broken by the freepointer changes.

    The "change a random byte" test is dropped, because it has no meaning in
    this form, where we need deterministic results.

    Add a new option, CONFIG_SLUB_KUNIT_TEST, in Kconfig. The next_pointer,
    first_word and clobber_50th_byte tests do not run with the KASAN option
    on, because the tests deliberately modify non-allocated objects.

    Use kunit_resource to count errors in the cache and silence bug reports.
    An error is counted whenever slab_bug() or slab_fix() is called, or when
    the count of pages is wrong.
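
    A sketch of one such test case, modelled on the description above (the
    validate_slab_cache() call and the slab_errors counter are internal
    test/SLUB plumbing and are assumed here):

    static void test_clobber_redzone(struct kunit *test)
    {
            struct kmem_cache *s = kmem_cache_create("TestSlub_RZ", 64, 0,
                                                     SLAB_RED_ZONE, NULL);
            u8 *p = kmem_cache_alloc(s, GFP_KERNEL);

            kasan_disable_current();        /* the test touches out-of-object bytes */
            p[64] = 0x12;                   /* corrupt the first redzone byte */

            validate_slab_cache(s);
            KUNIT_EXPECT_EQ(test, 2, slab_errors);  /* validation must report it */

            kasan_enable_current();
            kmem_cache_free(s, p);
            kmem_cache_destroy(s);
    }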

    [glittao@gmail.com: remove unused function test_exit(), from SLUB KUnit test]
    Link: https://lkml.kernel.org/r/20210512140656.12083-1-glittao@gmail.com
    [akpm@linux-foundation.org: export kasan_enable/disable_current to modules]

    Link: https://lkml.kernel.org/r/20210511150734.3492-2-glittao@gmail.com
    Signed-off-by: Oliver Glitta
    Reviewed-by: Vlastimil Babka
    Acked-by: Daniel Latypov
    Acked-by: Marco Elver
    Cc: Brendan Higgins
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oliver Glitta
     

11 May, 2021

1 commit

  • This commit enables a stack dump for the last free of an object:

    slab kmalloc-64 start c8ab0140 data offset 64 pointer offset 0 size 64 allocated at meminfo_proc_show+0x40/0x4fc
    [ 20.192078] meminfo_proc_show+0x40/0x4fc
    [ 20.192263] seq_read_iter+0x18c/0x4c4
    [ 20.192430] proc_reg_read_iter+0x84/0xac
    [ 20.192617] generic_file_splice_read+0xe8/0x17c
    [ 20.192816] splice_direct_to_actor+0xb8/0x290
    [ 20.193008] do_splice_direct+0xa0/0xe0
    [ 20.193185] do_sendfile+0x2d0/0x438
    [ 20.193345] sys_sendfile64+0x12c/0x140
    [ 20.193523] ret_fast_syscall+0x0/0x58
    [ 20.193695] 0xbeeacde4
    [ 20.193822] Free path:
    [ 20.193935] meminfo_proc_show+0x5c/0x4fc
    [ 20.194115] seq_read_iter+0x18c/0x4c4
    [ 20.194285] proc_reg_read_iter+0x84/0xac
    [ 20.194475] generic_file_splice_read+0xe8/0x17c
    [ 20.194685] splice_direct_to_actor+0xb8/0x290
    [ 20.194870] do_splice_direct+0xa0/0xe0
    [ 20.195014] do_sendfile+0x2d0/0x438
    [ 20.195174] sys_sendfile64+0x12c/0x140
    [ 20.195336] ret_fast_syscall+0x0/0x58
    [ 20.195491] 0xbeeacde4

    Acked-by: Vlastimil Babka
    Co-developed-by: Vaneet Narang
    Signed-off-by: Vaneet Narang
    Signed-off-by: Maninder Singh
    Signed-off-by: Paul E. McKenney

    Maninder Singh
     

01 May, 2021

1 commit

  • This change uses the previously added memory initialization feature of
    HW_TAGS KASAN routines for slab memory when init_on_alloc is enabled.

    With this change, memory initialization memset() is no longer called when
    both HW_TAGS KASAN and init_on_alloc are enabled. Instead, memory is
    initialized in KASAN runtime.

    The memory initialization memset() is moved into slab_post_alloc_hook()
    that currently directly follows the initialization loop. A new argument
    is added to slab_post_alloc_hook() that indicates whether to initialize
    the memory or not.

    To avoid discrepancies with which memory gets initialized that can be
    caused by future changes, both KASAN hook and initialization memset() are
    put together and a warning comment is added.

    Combining setting allocation tags with memory initialization improves
    HW_TAGS KASAN performance when init_on_alloc is enabled.
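
    A sketch of the resulting hook (simplified from mm/slab.h of this era;
    kmemleak and memcg details omitted, and kasan_has_integrated_init() is
    the helper added earlier in the same series):

    static inline void slab_post_alloc_hook(struct kmem_cache *s,
                                            struct obj_cgroup *objcg, gfp_t flags,
                                            size_t size, void **p, bool init)
    {
            size_t i;

            /*
             * As memory initialization might be integrated into KASAN,
             * kasan_slab_alloc() and the initialization memset() must be
             * kept together to avoid discrepancies in which memory gets
             * initialized.
             */
            for (i = 0; i < size; i++) {
                    p[i] = kasan_slab_alloc(s, p[i], flags, init);
                    if (p[i] && init && !kasan_has_integrated_init())
                            memset(p[i], 0, s->object_size);
            }
            /* ... kmemleak and memcg post-alloc hooks follow ... */
    }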

    Link: https://lkml.kernel.org/r/c1292aeb5d519da221ec74a0684a949b027d7720.1615296150.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Reviewed-by: Marco Elver
    Cc: Alexander Potapenko
    Cc: Andrey Ryabinin
    Cc: Branislav Rankov
    Cc: Catalin Marinas
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Dmitry Vyukov
    Cc: Evgenii Stepanov
    Cc: Joonsoo Kim
    Cc: Kevin Brodsky
    Cc: Pekka Enberg
    Cc: Peter Collingbourne
    Cc: Vincenzo Frascino
    Cc: Vlastimil Babka
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     

29 Apr, 2021

1 commit

  • Pull RCU updates from Ingo Molnar:

    - Support for "N" as alias for last bit in bitmap parsing library (eg
    using syntax like "nohz_full=2-N")

    - kvfree_rcu updates

    - mem_dump_obj() updates. (One of these is to mm, but was suggested by
    Andrew Morton.)

    - RCU callback offloading update

    - Polling RCU grace-period interfaces

    - Realtime-related RCU updates

    - Tasks-RCU updates

    - Torture-test updates

    - Torture-test scripting updates

    - Miscellaneous fixes

    * tag 'core-rcu-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (77 commits)
    rcutorture: Test start_poll_synchronize_rcu() and poll_state_synchronize_rcu()
    rcu: Provide polling interfaces for Tiny RCU grace periods
    torture: Fix kvm.sh --datestamp regex check
    torture: Consolidate qemu-cmd duration editing into kvm-transform.sh
    torture: Print proper vmlinux path for kvm-again.sh runs
    torture: Make TORTURE_TRUST_MAKE available in kvm-again.sh environment
    torture: Make kvm-transform.sh update jitter commands
    torture: Add --duration argument to kvm-again.sh
    torture: Add kvm-again.sh to rerun a previous torture-test
    torture: Create a "batches" file for build reuse
    torture: De-capitalize TORTURE_SUITE
    torture: Make upper-case-only no-dot no-slash scenario names official
    torture: Rename SRCU-t and SRCU-u to avoid lowercase characters
    torture: Remove no-mpstat error message
    torture: Record kvm-test-1-run.sh and kvm-test-1-run-qemu.sh PIDs
    torture: Record jitter start/stop commands
    torture: Extract kvm-test-1-run-qemu.sh from kvm-test-1-run.sh
    torture: Record TORTURE_KCONFIG_GDB_ARG in qemu-cmd
    torture: Abstract jitter.sh start/stop into scripts
    rcu: Provide polling interfaces for Tree RCU grace periods
    ...

    Linus Torvalds
     

08 Apr, 2021

1 commit

  • The state of CONFIG_INIT_ON_ALLOC_DEFAULT_ON (and ...ON_FREE...) did not
    change the assembly ordering of the static branches: they were always out
    of line. Use the new jump_label macros to check the CONFIG settings to
    default to the "expected" state, which slightly optimizes the resulting
    assembly code.
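
    A sketch of the pattern (want_init_on_alloc() and the *_MAYBE jump-label
    macros exist in mainline; shown simplified):

    /* include/linux/mm.h (sketch) */
    DECLARE_STATIC_KEY_MAYBE(CONFIG_INIT_ON_ALLOC_DEFAULT_ON, init_on_alloc);

    static inline bool want_init_on_alloc(gfp_t flags)
    {
            if (static_branch_maybe(CONFIG_INIT_ON_ALLOC_DEFAULT_ON,
                                    &init_on_alloc))
                    return true;
            return flags & __GFP_ZERO;
    }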

    Signed-off-by: Kees Cook
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Alexander Potapenko
    Acked-by: Vlastimil Babka
    Link: https://lore.kernel.org/r/20210401232347.2791257-3-keescook@chromium.org

    Kees Cook
     

09 Mar, 2021

1 commit

  • The mem_dump_obj() functionality adds a few hundred bytes, which is a
    small price to pay. Except on kernels built with CONFIG_PRINTK=n, in
    which mem_dump_obj() messages will be suppressed. This commit therefore
    makes mem_dump_obj() be a static inline empty function on kernels built
    with CONFIG_PRINTK=n and excludes all of its support functions as well.
    This avoids kernel bloat on systems that cannot use mem_dump_obj().
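
    A sketch of the resulting arrangement (simplified):

    /* include/linux/mm.h (sketch) */
    #ifdef CONFIG_PRINTK
    void mem_dump_obj(void *object);
    #else
    static inline void mem_dump_obj(void *object) {}
    #endif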

    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc:
    Suggested-by: Andrew Morton
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

25 Feb, 2021

2 commits

  • In general it's unknown in advance if a slab page will contain accounted
    objects or not. In order to avoid memory waste, an obj_cgroup vector is
    allocated dynamically when the need to account a new object arises. Such
    an approach is memory efficient, but requires an expensive cmpxchg() to
    set up the memcg/objcgs pointer, because an allocation can race with a
    different allocation on another cpu.

    But in some common cases it's known for sure that a slab page will contain
    accounted objects: if the page belongs to a slab cache with a SLAB_ACCOUNT
    flag set. This includes such popular objects as vm_area_struct, anon_vma,
    task_struct, etc.

    In such cases we can pre-allocate the objcgs vector and simply assign it
    to the page without any atomic operations, because at this early stage
    the page is not visible to anyone else.

    A very simplistic benchmark (allocating 10000000 64-byte objects in a
    row) shows a ~15% win. In real life it seems that most workloads are
    not very sensitive to the speed of (accounted) slab allocations.
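
    A sketch of the resulting allocation-side fast path (simplified; based on
    the description above, not the exact upstream code):

    static __always_inline void account_slab_page(struct page *page, int order,
                                                  struct kmem_cache *s, gfp_t gfp)
    {
            /*
             * SLAB_ACCOUNT caches will definitely need the objcg vector, so
             * allocate it up front while the page is still private to this
             * CPU -- no cmpxchg() needed.
             */
            if (memcg_kmem_enabled() && (s->flags & SLAB_ACCOUNT))
                    memcg_alloc_page_obj_cgroups(page, s, gfp, true);

            mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
                                PAGE_SIZE << order);
    }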

    [guro@fb.com: open-code set_page_objcgs() and add some comments, by Johannes]
    Link: https://lkml.kernel.org/r/20201113001926.GA2934489@carbon.dhcp.thefacebook.com
    [akpm@linux-foundation.org: fix it for mm-slub-call-account_slab_page-after-slab-page-initialization-fix.patch]

    Link: https://lkml.kernel.org/r/20201110195753.530157-2-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • This argument hasn't been used since e153362a50a3 ("slub: Remove objsize
    check in kmem_cache_flags()") so simply remove it.

    Link: https://lkml.kernel.org/r/20210126095733.974665-1-nborisov@suse.com
    Signed-off-by: Nikolay Borisov
    Reviewed-by: Miaohe Lin
    Reviewed-by: Vlastimil Babka
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nikolay Borisov
     

23 Jan, 2021

1 commit

  • There are kernel facilities such as per-CPU reference counts that give
    error messages in generic handlers or callbacks, whose messages are
    unenlightening. In the case of per-CPU reference-count underflow, this
    is not a problem when creating a new use of this facility because in that
    case the bug is almost certainly in the code implementing that new use.
    However, trouble arises when deploying across many systems, which might
    exercise corner cases that were not seen during development and testing.
    Here, it would be really nice to get some kind of hint as to which of
    several uses the underflow was caused by.

    This commit therefore exposes a mem_dump_obj() function that takes
    a pointer to memory (which must still be allocated if it has been
    dynamically allocated) and prints available information on where that
    memory came from. This pointer can reference the middle of the block as
    well as the beginning of the block, as needed by things like RCU callback
    functions and timer handlers that might not know where the beginning of
    the memory block is. These functions and handlers can use mem_dump_obj()
    to print out better hints as to where the problem might lie.

    The information printed can depend on kernel configuration. For example,
    the allocation return address can be printed only for slab and slub,
    and even then only when the necessary debug has been enabled. For slab,
    build with CONFIG_DEBUG_SLAB=y, and either use sizes with ample space
    to the next power of two or pass SLAB_STORE_USER when creating the
    kmem_cache structure. For slub, build with CONFIG_SLUB_DEBUG=y and
    boot with slub_debug=U, or pass SLAB_STORE_USER to kmem_cache_create()
    if more focused use is desired. Also for slub, use CONFIG_STACKTRACE
    to enable printing of the allocation-time stack trace.
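
    A sketch of the intended use from a callback that only has the pointer
    (report_bad_ref() is a hypothetical helper; mem_dump_obj() is the API
    added by this commit):

    #include <linux/mm.h>

    static void report_bad_ref(void *obj)
    {
            pr_err("unexpected refcount underflow for object %px\n", obj);
            mem_dump_obj(obj);      /* prints slab/vmalloc/page provenance, if known */
    }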

    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrew Morton
    Cc:
    Reported-by: Andrii Nakryiko
    [ paulmck: Convert to printing and change names per Joonsoo Kim. ]
    [ paulmck: Move slab definition per Stephen Rothwell and kbuild test robot. ]
    [ paulmck: Handle CONFIG_MMU=n case where vmalloc() is kmalloc(). ]
    [ paulmck: Apply Vlastimil Babka feedback on slab.c kmem_provenance(). ]
    [ paulmck: Extract more info from !SLUB_DEBUG per Joonsoo Kim. ]
    [ paulmck: Explicitly check for small pointers per Naresh Kamboju. ]
    Acked-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Tested-by: Naresh Kamboju
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

16 Dec, 2020

3 commits

  • Pull networking updates from Jakub Kicinski:
    "Core:

    - support "prefer busy polling" NAPI operation mode, where we defer
    softirq for some time expecting applications to periodically busy
    poll

    - AF_XDP: improve efficiency by more batching and hindering the
    adjacency cache prefetcher

    - af_packet: make packet_fanout.arr size configurable up to 64K

    - tcp: optimize TCP zero copy receive in presence of partial or
    unaligned reads making zero copy a performance win for much smaller
    messages

    - XDP: add bulk APIs for returning / freeing frames

    - sched: support fragmenting IP packets as they come out of conntrack

    - net: allow virtual netdevs to forward UDP L4 and fraglist GSO skbs

    BPF:

    - BPF switch from crude rlimit-based to memcg-based memory accounting

    - BPF type format information for kernel modules and related tracing
    enhancements

    - BPF implement task local storage for BPF LSM

    - allow the FENTRY/FEXIT/RAW_TP tracing programs to use
    bpf_sk_storage

    Protocols:

    - mptcp: improve multiple xmit streams support, memory accounting and
    many smaller improvements

    - TLS: support CHACHA20-POLY1305 cipher

    - seg6: add support for SRv6 End.DT4/DT6 behavior

    - sctp: Implement RFC 6951: UDP Encapsulation of SCTP

    - ppp_generic: add ability to bridge channels directly

    - bridge: Connectivity Fault Management (CFM) support as is defined
    in IEEE 802.1Q section 12.14.

    Drivers:

    - mlx5: make use of the new auxiliary bus to organize the driver
    internals

    - mlx5: more accurate port TX timestamping support

    - mlxsw:
    - improve the efficiency of offloaded next hop updates by using
    the new nexthop object API
    - support blackhole nexthops
    - support IEEE 802.1ad (Q-in-Q) bridging

    - rtw88: major bluetooth co-existence improvements

    - iwlwifi: support new 6 GHz frequency band

    - ath11k: Fast Initial Link Setup (FILS)

    - mt7915: dual band concurrent (DBDC) support

    - net: ipa: add basic support for IPA v4.5

    Refactor:

    - a few pieces of in_interrupt() cleanup work from Sebastian Andrzej
    Siewior

    - phy: add support for shared interrupts; get rid of multiple driver
    APIs and have the drivers write a full IRQ handler, slight growth
    of driver code should be compensated by the simpler API which also
    allows shared IRQs

    - add common code for handling netdev per-cpu counters

    - move TX packet re-allocation from Ethernet switch tag drivers to a
    central place

    - improve efficiency and rename nla_strlcpy

    - number of W=1 warning cleanups as we now catch those in a patchwork
    build bot

    Old code removal:

    - wan: delete the DLCI / SDLA drivers

    - wimax: move to staging

    - wifi: remove old WDS wifi bridging support"

    * tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1922 commits)
    net: hns3: fix expression that is currently always true
    net: fix proc_fs init handling in af_packet and tls
    nfc: pn533: convert comma to semicolon
    af_vsock: Assign the vsock transport considering the vsock address flags
    af_vsock: Set VMADDR_FLAG_TO_HOST flag on the receive path
    vsock_addr: Check for supported flag values
    vm_sockets: Add VMADDR_FLAG_TO_HOST vsock flag
    vm_sockets: Add flags field in the vsock address data structure
    net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled
    tcp: Add logic to check for SYN w/ data in tcp_simple_retransmit
    net: mscc: ocelot: install MAC addresses in .ndo_set_rx_mode from process context
    nfc: s3fwrn5: Release the nfc firmware
    net: vxget: clean up sparse warnings
    mlxsw: spectrum_router: Use eXtended mezzanine to offload IPv4 router
    mlxsw: spectrum: Set KVH XLT cache mode for Spectrum2/3
    mlxsw: spectrum_router_xm: Introduce basic XM cache flushing
    mlxsw: reg: Add Router LPM Cache Enable Register
    mlxsw: reg: Add Router LPM Cache ML Delete Register
    mlxsw: spectrum_router_xm: Implement L-value tracking for M-index
    mlxsw: reg: Add XM Router M Table Register
    ...

    Linus Torvalds
     
  • Extracted from slab.h, which seems to have the most complete version
    including the correct might_sleep() check. Roll it out to slob.c.

    Motivated by a discussion with Paul about possibly changing call_rcu
    behaviour to allocate memory, but only roughly every 500th call.

    There are a lot fewer places in the kernel that care about whether
    allocating memory is allowed or not (due to deadlocks with reclaim code)
    than places that care whether sleeping is allowed. But debugging these
    also tends to be a lot harder, so nice descriptive checks could come in
    handy. I might have some use eventually for annotations in drivers/gpu.

    Note that unlike fs_reclaim_acquire/release gfpflags_allow_blocking does
    not consult the PF_MEMALLOC flags. But there is no flag equivalent for
    GFP_NOWAIT, hence this check can't go wrong due to
    memalloc_no*_save/restore contexts. Willy is working on a patch series
    which might change this:

    https://lore.kernel.org/linux-mm/20200625113122.7540-7-willy@infradead.org/

    I think best would be if that updates gfpflags_allow_blocking(), since
    there's a ton of callers all over the place for that already.
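
    The extracted helper itself is tiny; roughly (might_alloc() in mainline's
    <linux/sched/mm.h>):

    static inline void might_alloc(gfp_t gfp_mask)
    {
            fs_reclaim_acquire(gfp_mask);
            fs_reclaim_release(gfp_mask);

            might_sleep_if(gfpflags_allow_blocking(gfp_mask));
    }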

    Link: https://lkml.kernel.org/r/20201125162532.1299794-3-daniel.vetter@ffwll.ch
    Signed-off-by: Daniel Vetter
    Acked-by: Vlastimil Babka
    Acked-by: Paul E. McKenney
    Reviewed-by: Jason Gunthorpe
    Cc: Randy Dunlap
    Cc: Paul E. McKenney
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Vlastimil Babka
    Cc: Mathieu Desnoyers
    Cc: Sebastian Andrzej Siewior
    Cc: Michel Lespinasse
    Cc: Daniel Vetter
    Cc: Waiman Long
    Cc: Thomas Gleixner
    Cc: Randy Dunlap
    Cc: Dave Chinner
    Cc: Qian Cai
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Christian König
    Cc: Ingo Molnar
    Cc: Jason Gunthorpe
    Cc: Maarten Lankhorst
    Cc: Thomas Hellström (Intel)
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Vetter
     
  • Since commit 991e7673859e ("mm: memcontrol: account kernel stack per
    node") there is no user of the mod_memcg_obj_state(). So just remove
    it.

    Also rework type of the idx parameter of the mod_objcg_state() from int
    to enum node_stat_item.

    Link: https://lkml.kernel.org/r/20201013153504.92602-1-songmuchun@bytedance.com
    Signed-off-by: Muchun Song
    Acked-by: Roman Gushchin
    Acked-by: David Rientjes
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Christopher Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc: Vlastimil Babka
    Cc: Yafang Shao
    Cc: Chris Down
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Muchun Song
     

12 Dec, 2020

1 commit


07 Dec, 2020

1 commit

  • Commit 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches
    for all allocations") introduced a regression into the handling of the
    obj_cgroup_charge() return value. If a non-zero value is returned
    (indicating that one of the memory.max limits has been exceeded), the
    allocation should fail instead of falling back to non-accounted mode.

    To make the code more readable, move memcg_slab_pre_alloc_hook() and
    memcg_slab_post_alloc_hook() calling conditions into bodies of these
    hooks.
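
    A sketch of the fixed pre-alloc hook (simplified from mm/slab.h; the
    point is that a failed charge now propagates as an allocation failure
    instead of silently falling back to unaccounted allocation):

    static inline bool memcg_slab_pre_alloc_hook(struct kmem_cache *s,
                                                 struct obj_cgroup **objcgp,
                                                 size_t objects, gfp_t flags)
    {
            struct obj_cgroup *objcg;

            if (!memcg_kmem_enabled())
                    return true;
            if (!(flags & __GFP_ACCOUNT) && !(s->flags & SLAB_ACCOUNT))
                    return true;

            objcg = get_obj_cgroup_from_current();
            if (!objcg)
                    return true;

            if (obj_cgroup_charge(objcg, flags, objects * obj_full_size(s))) {
                    obj_cgroup_put(objcg);
                    return false;   /* charge failed: fail the allocation */
            }

            *objcgp = objcg;
            return true;
    }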

    Fixes: 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches for all allocations")
    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc:
    Link: https://lkml.kernel.org/r/20201127161828.GD840171@carbon.dhcp.thefacebook.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

03 Dec, 2020

2 commits

  • To gather all direct accesses to struct page's memcg_data field in one
    place, let's introduce 3 new helpers to use in the slab accounting code:

    struct obj_cgroup **page_objcgs(struct page *page);
    struct obj_cgroup **page_objcgs_check(struct page *page);
    bool set_page_objcgs(struct page *page, struct obj_cgroup **objcgs);

    They are similar to the corresponding API for generic pages, except that
    the setter can return false, indicating that the value has already been
    set from a different thread.
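
    A sketch of the reader side (simplified; MEMCG_DATA_OBJCGS and the flags
    mask are the low bits carried in page->memcg_data):

    static inline struct obj_cgroup **page_objcgs(struct page *page)
    {
            unsigned long memcg_data = READ_ONCE(page->memcg_data);

            /* the caller must know this is a slab page carrying an objcg vector */
            VM_BUG_ON_PAGE(memcg_data && !(memcg_data & MEMCG_DATA_OBJCGS), page);

            return (struct obj_cgroup **)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
    }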

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Link: https://lkml.kernel.org/r/20201027001657.3398190-3-guro@fb.com
    Link: https://lore.kernel.org/bpf/20201201215900.3569844-3-guro@fb.com

    Roman Gushchin
     
  • Patch series "mm: allow mapping accounted kernel pages to userspace", v6.

    Currently a non-slab kernel page which has been charged to a memory cgroup
    can't be mapped to userspace. The underlying reason is simple: PageKmemcg
    flag is defined as a page type (like buddy, offline, etc), so it takes a
    bit from a page->mapped counter. Pages with a type set can't be mapped to
    userspace.

    But in general the kmemcg flag has nothing to do with mapping to
    userspace. It only means that the page has been accounted by the page
    allocator, so it has to be properly uncharged on release.

    Some bpf maps are mapping the vmalloc-based memory to userspace, and their
    memory can't be accounted because of this implementation detail.

    This patchset removes this limitation by moving the PageKmemcg flag into
    one of the free bits of the page->mem_cgroup pointer. Also it formalizes
    accesses to the page->mem_cgroup and page->obj_cgroups using new helpers,
    adds several checks and removes a couple of obsolete functions. As the
    result the code became more robust with fewer open-coded bit tricks.

    This patch (of 4):

    Currently there are many open-coded reads of the page->mem_cgroup pointer,
    as well as a couple of read helpers, which are barely used.

    It creates an obstacle on a way to reuse some bits of the pointer for
    storing additional bits of information. In fact, we already do this for
    slab pages, where the last bit indicates that a pointer has an attached
    vector of objcg pointers instead of a regular memcg pointer.

    This commit uses 2 existing helpers and introduces a new helper,
    converting all read sides to calls of these helpers:
    struct mem_cgroup *page_memcg(struct page *page);
    struct mem_cgroup *page_memcg_rcu(struct page *page);
    struct mem_cgroup *page_memcg_check(struct page *page);

    page_memcg_check() is intended to be used in cases when the page can be a
    slab page and have a memcg pointer pointing at objcg vector. It does
    check the lowest bit, and if set, returns NULL. page_memcg() contains a
    VM_BUG_ON_PAGE() check for the page not being a slab page.

    To make sure nobody uses a direct access, struct page's
    mem_cgroup/obj_cgroups is converted to unsigned long memcg_data.
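
    A sketch of the lowest-bit check described above (simplified from
    include/linux/memcontrol.h):

    static inline struct mem_cgroup *page_memcg_check(struct page *page)
    {
            unsigned long memcg_data = READ_ONCE(page->memcg_data);

            /* lowest bit set: this is an objcg vector, not a memcg pointer */
            if (memcg_data & MEMCG_DATA_OBJCGS)
                    return NULL;

            return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
    }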

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Link: https://lkml.kernel.org/r/20201027001657.3398190-1-guro@fb.com
    Link: https://lkml.kernel.org/r/20201027001657.3398190-2-guro@fb.com
    Link: https://lore.kernel.org/bpf/20201201215900.3569844-2-guro@fb.com

    Roman Gushchin
     

19 Oct, 2020

1 commit

  • Patch series "mm: kmem: kernel memory accounting in an interrupt context".

    This patchset implements memcg-based memory accounting of allocations made
    from an interrupt context.

    Historically, such allocations were left unaccounted, mostly because
    charging the memory cgroup of the current process wasn't an option.
    Performance was likely a reason too.

    The remote charging API allows to temporarily overwrite the currently
    active memory cgroup, so that all memory allocations are accounted towards
    some specified memory cgroup instead of the memory cgroup of the current
    process.

    This patchset extends the remote charging API so that it can be used from
    an interrupt context. Then it removes the fence that prevented the
    accounting of allocations made from an interrupt context. It also
    contains a couple of optimizations/code refactorings.

    This patchset doesn't directly enable accounting for any specific
    allocations, but prepares the code base for it. The bpf memory accounting
    will likely be the first user of it: a typical example is a bpf program
    parsing an incoming network packet, which allocates an entry in hashmap
    map to store some information.

    This patch (of 4):

    Currently memcg_kmem_bypass() is called before obtaining the current
    memory/obj cgroup using get_mem/obj_cgroup_from_current(). Moving
    memcg_kmem_bypass() into get_mem/obj_cgroup_from_current() reduces the
    number of call sites and allows further code simplifications.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200827225843.1270629-1-guro@fb.com
    Link: http://lkml.kernel.org/r/20200827225843.1270629-2-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

17 Oct, 2020

1 commit

  • Remove duplicate header which is included twice.

    Signed-off-by: YueHaibing
    Signed-off-by: Andrew Morton
    Reviewed-by: Pekka Enberg
    Link: http://lkml.kernel.org/r/20200818114323.58156-1-yuehaibing@huawei.com
    Signed-off-by: Linus Torvalds

    YueHaibing
     

14 Oct, 2020

1 commit

  • Object cgroup charging is done for all the objects during allocation, but
    during freeing, uncharging ends up happening for only one object in the
    case of bulk allocation/freeing.

    Fix this by having a separate call to uncharge all the objects from
    kmem_cache_free_bulk() and by modifying memcg_slab_free_hook() to take
    care of bulk uncharging.

    Fixes: 964d4bd370d5 ("mm: memcg/slab: save obj_cgroup for non-root slab objects")
    Signed-off-by: Bharata B Rao
    Signed-off-by: Andrew Morton
    Acked-by: Roman Gushchin
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vlastimil Babka
    Cc: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc:
    Link: https://lkml.kernel.org/r/20201009060423.390479-1-bharata@linux.ibm.com
    Signed-off-by: Linus Torvalds

    Bharata B Rao
     

08 Aug, 2020

13 commits

  • charge_slab_page() and uncharge_slab_page() are not related anymore to
    memcg charging and uncharging. In order to make their names less
    confusing, let's rename them to account_slab_page() and
    unaccount_slab_page() respectively.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Link: http://lkml.kernel.org/r/20200707173612.124425-2-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • charge_slab_page() is not using the gfp argument anymore,
    remove it.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200707173612.124425-1-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Instead of having two sets of kmem_caches: one for system-wide and
    non-accounted allocations and the second one shared by all accounted
    allocations, we can use just one.

    The idea is simple: space for obj_cgroup metadata can be allocated on
    demand and filled only for accounted allocations.

    It allows to remove a bunch of code which is required to handle kmem_cache
    clones for accounted allocations. There is no more need to create them,
    accumulate statistics, propagate attributes, etc. It's a quite
    significant simplification.

    Also, because the total number of slab_caches is reduced almost twice (not
    all kmem_caches have a memcg clone), some additional memory savings are
    expected. On my devvm it additionally saves about 3.5% of slab memory.

    [guro@fb.com: fix build on MIPS]
    Link: http://lkml.kernel.org/r/20200717214810.3733082-1-guro@fb.com

    Suggested-by: Johannes Weiner
    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Naresh Kamboju
    Link: http://lkml.kernel.org/r/20200623174037.3951353-18-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Currently there are two lists of kmem_caches:
    1) slab_caches, which contains all kmem_caches,
    2) slab_root_caches, which contains only root kmem_caches.

    And there is some preprocessor magic to have a single list if
    CONFIG_MEMCG_KMEM isn't enabled.

    It was required earlier because the number of non-root kmem_caches was
    proportional to the number of memory cgroups and could reach really big
    values. Now, when it cannot exceed the number of root kmem_caches, there
    is really no reason to maintain two lists.

    We never iterate over the slab_root_caches list on any hot paths, so it's
    perfectly fine to iterate over slab_caches and filter out non-root
    kmem_caches.

    It allows to remove a lot of config-dependent code and two pointers from
    the kmem_cache structure.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-16-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • The memcg_kmem_get_cache() function became really trivial, so let's just
    inline it into the single call point: memcg_slab_pre_alloc_hook().

    It will make the code less bulky and can also help the compiler
    generate better code.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-15-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Because the number of non-root kmem_caches doesn't depend on the number of
    memory cgroups anymore and is generally not very big, there is no more
    need for a dedicated workqueue.

    Also, as there is no more need to pass any arguments to the
    memcg_create_kmem_cache() except the root kmem_cache, it's possible to
    just embed the work structure into the kmem_cache and avoid the dynamic
    allocation of the work structure.

    This will also simplify the synchronization: for each root kmem_cache
    there is only one work. So there will be no more concurrent attempts to
    create a non-root kmem_cache for a root kmem_cache: the second and all
    following attempts to queue the work will fail.

    On the kmem_cache destruction path there is no more need to call the
    expensive flush_workqueue() and wait for all pending works to be finished.
    Instead, cancel_work_sync() can be used to cancel/wait for only one work.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-14-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • This is fairly big but mostly red patch, which makes all accounted slab
    allocations use a single set of kmem_caches instead of creating a separate
    set for each memory cgroup.

    Because the number of non-root kmem_caches is now capped by the number of
    root kmem_caches, there is no need to shrink or destroy them prematurely.
    They can be perfectly destroyed together with their root counterparts.
    This allows to dramatically simplify the management of non-root
    kmem_caches and delete a ton of code.

    This patch performs the following changes:
    1) introduces memcg_params.memcg_cache pointer to represent the
    kmem_cache which will be used for all non-root allocations
    2) reuses the existing memcg kmem_cache creation mechanism
    to create memcg kmem_cache on the first allocation attempt
    3) memcg kmem_caches are named "<cache-name>-memcg",
    e.g. dentry-memcg
    4) simplifies memcg_kmem_get_cache() to just return the memcg kmem_cache
    or schedule its creation and return the root cache
    5) removes almost all non-root kmem_cache management code
    (separate refcounter, reparenting, shrinking, etc)
    6) makes slab debugfs display the root_mem_cgroup css id and never
    show the :dead and :deact flags in the memcg_slabinfo attribute.

    Following patches in the series will simplify the kmem_cache creation.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-13-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Switch to per-object accounting of non-root slab objects.

    Charging is performed using the obj_cgroup API in the pre_alloc hook.
    The obj_cgroup is charged with the size of the object plus the size of
    the per-object metadata (for now, the size of an obj_cgroup pointer). If
    the amount of memory has been charged successfully, the actual allocation
    code is executed. Otherwise, -ENOMEM is returned.

    In the post_alloc hook if the actual allocation succeeded, corresponding
    vmstats are bumped and the obj_cgroup pointer is saved. Otherwise, the
    charge is canceled.

    On the free path obj_cgroup pointer is obtained and used to uncharge the
    size of the releasing object.

    Memcg and lruvec counters are now representing only memory used by active
    slab objects and do not include the free space. The free space is shared
    and doesn't belong to any specific cgroup.

    Global per-node slab vmstats are still modified from
    (un)charge_slab_page() functions. The idea is to keep all slab pages
    accounted as slab pages on system level.
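
    A sketch of the per-object charge size implied above (obj_full_size() is
    the mm/slab.h helper; simplified):

    /* account the object itself plus its obj_cgroup pointer slot */
    static inline size_t obj_full_size(struct kmem_cache *s)
    {
            return s->size + sizeof(struct obj_cgroup *);
    }

    The pre_alloc hook charges objects * obj_full_size(s) up front, and the
    free path uncharges obj_full_size(s) per released object.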

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-10-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Store the obj_cgroup pointer in the corresponding place of
    page->obj_cgroups for each allocated non-root slab object. Make sure that
    each allocated object holds a reference to obj_cgroup.

    Objcg pointer is obtained from the memcg->objcg dereferencing in
    memcg_kmem_get_cache() and passed from pre_alloc_hook to post_alloc_hook.
    Then in case of successful allocation(s) it's getting stored in the
    page->obj_cgroups vector.

    The objcg obtaining part looks a bit bulky now, but it will be simplified
    by the next commits in the series.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-9-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Allocate and release memory to store obj_cgroup pointers for each non-root
    slab page. Reuse page->mem_cgroup pointer to store a pointer to the
    allocated space.

    This commit temporarily increases the memory footprint of the kernel memory
    accounting. To store obj_cgroup pointers we'll need a place for an
    objcg_pointer for each allocated object. However, the following patches
    in the series will enable sharing of slab pages between memory cgroups,
    which will dramatically increase the total slab utilization. And the final
    memory footprint will be significantly smaller than before.

    To distinguish between obj_cgroups and memcg pointers in case when it's
    not obvious which one is used (as in page_cgroup_ino()), let's always set
    the lowest bit in the obj_cgroup case. The original obj_cgroups
    pointer is marked to be ignored by kmemleak, which otherwise would
    report a memory leak for each allocated vector.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-8-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • The reference counting of a memcg is currently coupled directly to how
    many 4k pages are charged to it. This doesn't work well with Roman's new
    slab controller, which maintains pools of objects and doesn't want to keep
    an extra balance sheet for the pages backing those objects.

    This unusual refcounting design (reference counts usually track pointers
    to an object) is only for historical reasons: memcg used to not take any
    css references and simply stalled offlining until all charges had been
    reparented and the page counters had dropped to zero. When we got rid of
    the reparenting requirement, the simple mechanical translation was to take
    a reference for every charge.

    More historical context can be found in commit e8ea14cc6ead ("mm:
    memcontrol: take a css reference for each charged page"), commit
    64f219938941 ("mm: memcontrol: remove obsolete kmemcg pinning tricks") and
    commit b2052564e66d ("mm: memcontrol: continue cache reclaim from offlined
    groups").

    The new slab controller exposes the limitations in this scheme, so let's
    switch it to a more idiomatic reference counting model based on actual
    kernel pointers to the memcg:

    - The per-cpu stock holds a reference to the memcg it's caching

    - User pages hold a reference for their page->mem_cgroup. Transparent
    huge pages will no longer acquire tail references in advance, we'll
    get them if needed during the split.

    - Kernel pages hold a reference for their page->mem_cgroup

    - Pages allocated in the root cgroup will acquire and release css
    references for simplicity. css_get() and css_put() optimize that.

    - The current memcg_charge_slab() already hacked around the per-charge
    references; this change gets rid of that as well.

    - tcp accounting will handle reference in mem_cgroup_sk_{alloc,free}

    Roman:
    1) Rebased on top of the current mm tree: added css_get() in
    mem_cgroup_charge(), dropped mem_cgroup_try_charge() part
    2) I've reformatted commit references in the commit log to make
    checkpatch.pl happy.

    [hughd@google.com: remove css_put_many() from __mem_cgroup_clear_mc()]
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2007302011450.2347@eggly.anvils

    Signed-off-by: Johannes Weiner
    Signed-off-by: Roman Gushchin
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Roman Gushchin
    Acked-by: Michal Hocko
    Cc: Christoph Lameter
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/20200623174037.3951353-6-guro@fb.com
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • In order to prepare for per-object slab memory accounting, convert
    NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat items to bytes.

    To make it obvious, rename them to NR_SLAB_RECLAIMABLE_B and
    NR_SLAB_UNRECLAIMABLE_B (similar to NR_KERNEL_STACK_KB).

    Internally global and per-node counters are stored in pages, however memcg
    and lruvec counters are stored in bytes. This scheme may look weird, but
    only for now. As soon as slab pages will be shared between multiple
    cgroups, global and node counters will reflect the total number of slab
    pages. However memcg and lruvec counters will be used for per-memcg slab
    memory tracking, which will take separate kernel objects into account.
    Keeping global and node counters in pages helps to avoid additional
    overhead.

    The size of slab memory shouldn't exceed 4Gb on 32-bit machines, so it
    will fit into atomic_long_t we use for vmstats.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-4-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • cache_from_obj() was added by commit b9ce5ef49f00 ("sl[au]b: always get
    the cache from its page in kmem_cache_free()") to support kmemcg, where
    per-memcg cache can be different from the root one, so we can't use the
    kmem_cache pointer given to kmem_cache_free().

    Prior to that commit, SLUB already had debugging check+warning that could
    be enabled to compare the given kmem_cache pointer to one referenced by
    the slab page where the object-to-be-freed resides. This check was moved
    to cache_from_obj(). Later the check was also enabled for
    SLAB_FREELIST_HARDENED configs by commit 598a0717a816 ("mm/slab: validate
    cache membership under freelist hardening").

    These checks and warnings can be useful especially for the debugging,
    which can be improved. Commit 598a0717a816 changed the pr_err() with
    WARN_ON_ONCE() to WARN_ONCE() so only the first hit is now reported,
    others are silent. This patch changes it to WARN() so that all errors are
    reported.

    It's also useful to print SLUB allocation/free tracking info for the
    offending object, if tracking is enabled. Thus, export the SLUB
    print_tracking() function and provide an empty one for SLAB.

    For SLUB we can also benefit from the static key check in
    kmem_cache_debug_flags(), but we need to move this function to slab.h and
    declare the static key there.
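
    A sketch of the resulting check (close to the mm/slab.h version of this
    era, but simplified):

    static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
    {
            struct kmem_cache *cachep;

            if (!IS_ENABLED(CONFIG_SLAB_FREELIST_HARDENED) &&
                !kmem_cache_debug_flags(s, SLAB_CONSISTENCY_CHECKS))
                    return s;

            cachep = virt_to_cache(x);
            if (WARN(cachep && cachep != s,
                     "%s: Wrong slab cache. %s but object is from %s\n",
                     __func__, s->name, cachep->name))
                    print_tracking(cachep, x);      /* alloc/free traces, if enabled */
            return cachep;
    }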

    [1] https://lore.kernel.org/r/20200608230654.828134-18-guro@fb.com

    [vbabka@suse.cz: avoid bogus WARN()]
    Link: https://lore.kernel.org/r/20200623090213.GW5535@shao2-debian
    Link: http://lkml.kernel.org/r/b33e0fa7-cd28-4788-9e54-5927846329ef@suse.cz

    Signed-off-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Acked-by: Kees Cook
    Acked-by: Roman Gushchin
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Matthew Garrett
    Cc: Jann Horn
    Cc: Vijayanand Jitta
    Cc: Vinayak Menon
    Link: http://lkml.kernel.org/r/afeda7ac-748b-33d8-a905-56b708148ad5@suse.cz
    Signed-off-by: Linus Torvalds

    Vlastimil Babka