07 Dec, 2020

1 commit

  • Commit 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches
    for all allocations") introduced a regression into the handling of the
    obj_cgroup_charge() return value. If a non-zero value is returned
    (indicating that one of the memory.max limits has been exceeded), the
    allocation should fail instead of falling back to non-accounted mode.

    To make the code more readable, move the memcg_slab_pre_alloc_hook() and
    memcg_slab_post_alloc_hook() calling conditions into the bodies of these
    hooks (sketched below).

    Fixes: 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches for all allocations")
    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc:
    Link: https://lkml.kernel.org/r/20201127161828.GD840171@carbon.dhcp.thefacebook.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
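
    A minimal sketch of the fixed control flow, not the exact mm/slab.h code:
    the real hook takes slightly different arguments, and obj_full_size() here
    stands in for "object size plus per-object metadata". The key point is that
    a non-zero obj_cgroup_charge() return now propagates as a failure instead
    of silently dropping to non-accounted mode.

      /* Simplified sketch of the fixed pre-alloc hook. */
      static inline bool memcg_slab_pre_alloc_hook(struct kmem_cache *s,
                                                   struct obj_cgroup **objcgp,
                                                   size_t objects, gfp_t flags)
      {
              struct obj_cgroup *objcg;

              /* The bypass conditions now live inside the hook itself. */
              if (!memcg_kmem_enabled())
                      return true;
              if (!(flags & __GFP_ACCOUNT) && !(s->flags & SLAB_ACCOUNT))
                      return true;

              objcg = get_obj_cgroup_from_current();
              if (!objcg)
                      return true;            /* allocation stays unaccounted */

              if (obj_cgroup_charge(objcg, flags, objects * obj_full_size(s))) {
                      obj_cgroup_put(objcg);
                      return false;           /* over memory.max: fail the allocation */
              }

              *objcgp = objcg;
              return true;
      }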
     

19 Oct, 2020

1 commit

  • Patch series "mm: kmem: kernel memory accounting in an interrupt context".

    This patchset implements memcg-based memory accounting of allocations made
    from an interrupt context.

    Historically, such allocations went unaccounted, mostly because charging
    the memory cgroup of the current process wasn't a meaningful option;
    performance was likely a consideration as well.

    The remote charging API allows temporarily overriding the currently active
    memory cgroup, so that all memory allocations are accounted towards the
    specified memory cgroup instead of the memory cgroup of the current
    process (sketched below).

    This patchset extends the remote charging API so that it can be used from
    an interrupt context. Then it removes the fence that prevented the
    accounting of allocations made from an interrupt context. It also
    contains a couple of optimizations/code refactorings.

    This patchset doesn't directly enable accounting for any specific
    allocations, but prepares the code base for it. The bpf memory accounting
    will likely be its first user: a typical example is a bpf program parsing
    an incoming network packet and allocating an entry in a hash map to store
    some information.

    This patch (of 4):

    Currently memcg_kmem_bypass() is called before obtaining the current
    memory/obj cgroup using get_mem/obj_cgroup_from_current(). Moving
    memcg_kmem_bypass() into get_mem/obj_cgroup_from_current() reduces the
    number of call sites and allows further code simplifications.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200827225843.1270629-1-guro@fb.com
    Link: http://lkml.kernel.org/r/20200827225843.1270629-2-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
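
    A hedged usage sketch of the remote charging API this series extends. The
    helper name is version-dependent (memalloc_use_memcg()/memalloc_unuse_memcg()
    around this time, set_active_memcg() in later kernels), and alloc_for_memcg()
    is a made-up example function:

      /* Illustrative only: charge an allocation to a given memcg instead of the
       * current task's; making this bookkeeping work outside task context is
       * what enables accounting from an interrupt context. */
      static void *alloc_for_memcg(struct mem_cgroup *memcg, size_t size, gfp_t gfp)
      {
              struct mem_cgroup *old;
              void *p;

              old = set_active_memcg(memcg);          /* override the active memcg */
              p = kmalloc(size, gfp | __GFP_ACCOUNT); /* charged to 'memcg' */
              set_active_memcg(old);                  /* restore the previous one */

              return p;
      }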
     

17 Oct, 2020

1 commit

  • Remove duplicate header which is included twice.

    Signed-off-by: YueHaibing
    Signed-off-by: Andrew Morton
    Reviewed-by: Pekka Enberg
    Link: http://lkml.kernel.org/r/20200818114323.58156-1-yuehaibing@huawei.com
    Signed-off-by: Linus Torvalds

    YueHaibing
     

14 Oct, 2020

1 commit

  • Object cgroup charging is done for all the objects during allocation, but
    during freeing, uncharging ends up happening for only one object in the
    case of bulk allocation/freeing.

    Fix this by having a separate call to uncharge all the objects from
    kmem_cache_free_bulk() and by modifying memcg_slab_free_hook() to take
    care of bulk uncharging (sketched below).

    Fixes: 964d4bd370d5 ("mm: memcg/slab: save obj_cgroup for non-root slab objects")
    Signed-off-by: Bharata B Rao
    Signed-off-by: Andrew Morton
    Acked-by: Roman Gushchin
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vlastimil Babka
    Cc: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc:
    Link: https://lkml.kernel.org/r/20201009060423.390479-1-bharata@linux.ibm.com
    Signed-off-by: Linus Torvalds

    Bharata B Rao
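
    A simplified sketch of the bulk-uncharge idea, with hypothetical helpers:
    objcg_for_object() stands in for the page->obj_cgroups lookup and
    obj_full_size() for "object size plus metadata"; the real
    memcg_slab_free_hook() also clears the stored pointer and updates vmstats.

      /* Sketch: uncharge every object passed to kmem_cache_free_bulk(), not
       * just the first one. */
      static void memcg_bulk_uncharge(struct kmem_cache *s, void **p, int objects)
      {
              int i;

              for (i = 0; i < objects; i++) {
                      struct obj_cgroup *objcg = objcg_for_object(p[i]); /* hypothetical lookup */

                      if (!objcg)
                              continue;

                      obj_cgroup_uncharge(objcg, obj_full_size(s));
                      obj_cgroup_put(objcg);  /* drop the per-object reference */
              }
      }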
     

08 Aug, 2020

15 commits

  • charge_slab_page() and uncharge_slab_page() are not related anymore to
    memcg charging and uncharging. In order to make their names less
    confusing, let's rename them to account_slab_page() and
    unaccount_slab_page() respectively.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Link: http://lkml.kernel.org/r/20200707173612.124425-2-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • charge_slab_page() is not using the gfp argument anymore,
    remove it.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200707173612.124425-1-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Instead of having two sets of kmem_caches: one for system-wide and
    non-accounted allocations and the second one shared by all accounted
    allocations, we can use just one.

    The idea is simple: space for obj_cgroup metadata can be allocated on
    demand and filled only for accounted allocations.

    This allows removing a bunch of code which is required to handle kmem_cache
    clones for accounted allocations. There is no more need to create them,
    accumulate statistics, propagate attributes, etc. It's quite a significant
    simplification.

    Also, because the total number of slab_caches is roughly halved (not all
    kmem_caches have a memcg clone), some additional memory savings are
    expected. On my devvm it additionally saves about 3.5% of slab memory.

    [guro@fb.com: fix build on MIPS]
    Link: http://lkml.kernel.org/r/20200717214810.3733082-1-guro@fb.com

    Suggested-by: Johannes Weiner
    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Naresh Kamboju
    Link: http://lkml.kernel.org/r/20200623174037.3951353-18-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Currently there are two lists of kmem_caches:
    1) slab_caches, which contains all kmem_caches,
    2) slab_root_caches, which contains only root kmem_caches.

    And there is some preprocessor magic to have a single list if
    CONFIG_MEMCG_KMEM isn't enabled.

    It was required earlier because the number of non-root kmem_caches was
    proportional to the number of memory cgroups and could reach really big
    values. Now that it cannot exceed the number of root kmem_caches, there
    is really no reason to maintain two lists.

    We never iterate over the slab_root_caches list on any hot paths, so it's
    perfectly fine to iterate over slab_caches and filter out non-root
    kmem_caches (sketched below).

    This allows removing a lot of config-dependent code and two pointers from
    the kmem_cache structure.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-16-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
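
    A sketch of what the root-only walkers turn into; is_root_cache() was the
    then-existing mm/slab.h helper, and the wrapper name here is purely
    illustrative:

      /* With a single slab_caches list, callers that only care about root
       * caches simply skip the (few) non-root ones. */
      static void for_each_root_cache_example(void (*fn)(struct kmem_cache *))
      {
              struct kmem_cache *s;

              mutex_lock(&slab_mutex);
              list_for_each_entry(s, &slab_caches, list) {
                      if (!is_root_cache(s))
                              continue;
                      fn(s);          /* operate on the root kmem_cache */
              }
              mutex_unlock(&slab_mutex);
      }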
     
  • The memcg_kmem_get_cache() function became really trivial, so let's just
    inline it into the single call point: memcg_slab_pre_alloc_hook().

    It will make the code less bulky and can also help the compiler generate
    better code.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-15-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Because the number of non-root kmem_caches doesn't depend on the number of
    memory cgroups anymore and is generally not very big, there is no more
    need for a dedicated workqueue.

    Also, as there is no more need to pass any arguments to
    memcg_create_kmem_cache() except the root kmem_cache, it's possible to
    just embed the work structure into the kmem_cache and avoid the dynamic
    allocation of the work structure.

    This will also simplify the synchronization: for each root kmem_cache
    there is only one work. So there will be no more concurrent attempts to
    create a non-root kmem_cache for a root kmem_cache: the second and all
    following attempts to queue the work will fail.

    On the kmem_cache destruction path there is no more need to call the
    expensive flush_workqueue() and wait for all pending works to be finished.
    Instead, cancel_work_sync() can be used to cancel/wait for only one work.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-14-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • This is fairly big but mostly red patch, which makes all accounted slab
    allocations use a single set of kmem_caches instead of creating a separate
    set for each memory cgroup.

    Because the number of non-root kmem_caches is now capped by the number of
    root kmem_caches, there is no need to shrink or destroy them prematurely.
    They can be perfectly destroyed together with their root counterparts.
    This makes it possible to dramatically simplify the management of non-root
    kmem_caches and delete a ton of code.

    This patch performs the following changes:
    1) introduces memcg_params.memcg_cache pointer to represent the
    kmem_cache which will be used for all non-root allocations
    2) reuses the existing memcg kmem_cache creation mechanism
    to create memcg kmem_cache on the first allocation attempt
    3) memcg kmem_caches are named <kmem_cache name>-memcg,
    e.g. dentry-memcg
    4) simplifies memcg_kmem_get_cache() to just return the memcg kmem_cache
    or schedule its creation and return the root cache
    5) removes almost all non-root kmem_cache management code
    (separate refcounter, reparenting, shrinking, etc)
    6) makes slab debugfs display the root_mem_cgroup css id and never
    show :dead and :deact flags in the memcg_slabinfo attribute.

    Following patches in the series will simplify the kmem_cache creation.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-13-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Switch to per-object accounting of non-root slab objects.

    Charging is performed using the obj_cgroup API in the pre_alloc hook. The
    obj_cgroup is charged with the size of the object plus the size of its
    metadata, which for now is the size of an obj_cgroup pointer (sketched
    below). If the amount of memory has been charged successfully, the actual
    allocation code is executed. Otherwise, -ENOMEM is returned.

    In the post_alloc hook if the actual allocation succeeded, corresponding
    vmstats are bumped and the obj_cgroup pointer is saved. Otherwise, the
    charge is canceled.

    On the free path the obj_cgroup pointer is obtained and used to uncharge
    the size of the object being released.

    Memcg and lruvec counters are now representing only memory used by active
    slab objects and do not include the free space. The free space is shared
    and doesn't belong to any specific cgroup.

    Global per-node slab vmstats are still modified from
    (un)charge_slab_page() functions. The idea is to keep all slab pages
    accounted as slab pages on system level.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-10-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
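
    A short sketch of the per-object charge size; this mirrors the
    obj_full_size() helper the series introduces (object size plus room for the
    obj_cgroup pointer kept as metadata), with the call sites shown only as
    comments:

      static inline size_t obj_full_size(struct kmem_cache *s)
      {
              /* Each accounted object pays for itself plus its obj_cgroup pointer. */
              return s->size + sizeof(struct obj_cgroup *);
      }

      /* pre_alloc: obj_cgroup_charge(objcg, flags, objects * obj_full_size(s));
       * free:      obj_cgroup_uncharge(objcg, obj_full_size(s));               */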
     
  • Store the obj_cgroup pointer in the corresponding place of
    page->obj_cgroups for each allocated non-root slab object. Make sure that
    each allocated object holds a reference to obj_cgroup.

    The objcg pointer is obtained by dereferencing memcg->objcg in
    memcg_kmem_get_cache() and is passed from the pre_alloc hook to the
    post_alloc hook. In case of successful allocation(s), it is then stored in
    the page->obj_cgroups vector.

    The objcg-obtaining part looks a bit bulky now, but it will be simplified
    by the next commits in the series.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-9-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Allocate and release memory to store obj_cgroup pointers for each non-root
    slab page. Reuse page->mem_cgroup pointer to store a pointer to the
    allocated space.

    This commit temporarily increases the memory footprint of the kernel memory
    accounting: to store the obj_cgroup pointers we need room for one pointer
    per allocated object. However, the following patches in the series will
    enable sharing of slab pages between memory cgroups, which will dramatically
    increase the total slab utilization, and the final memory footprint will be
    significantly smaller than before.

    To distinguish between obj_cgroup and memcg pointers in cases where it's not
    obvious which one is used (as in page_cgroup_ino()), let's always set the
    lowest bit in the obj_cgroup case (sketched below). The original obj_cgroups
    pointer is marked to be ignored by kmemleak, which otherwise would report a
    memory leak for each allocated vector.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-8-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
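
    A sketch close to the allocation side of this change: the vector reuses the
    page->mem_cgroup storage (the obj_cgroups union member), the pointer is
    tagged with the lowest bit, and kmemleak is told to ignore the vector;
    objs_per_slab_page() was the then-existing helper for the number of objects
    on the page.

      static inline int memcg_alloc_page_obj_cgroups(struct page *page,
                                                     struct kmem_cache *s, gfp_t gfp)
      {
              unsigned int objects = objs_per_slab_page(s, page);
              void *vec;

              vec = kcalloc_node(objects, sizeof(struct obj_cgroup *), gfp,
                                 page_to_nid(page));
              if (!vec)
                      return -ENOMEM;

              kmemleak_not_leak(vec); /* only reachable via the page, not a leak */
              /* Low bit set = "obj_cgroups vector", not a plain memcg pointer. */
              page->obj_cgroups = (struct obj_cgroup **)((unsigned long)vec | 0x1UL);
              return 0;
      }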
     
  • The reference counting of a memcg is currently coupled directly to how
    many 4k pages are charged to it. This doesn't work well with Roman's new
    slab controller, which maintains pools of objects and doesn't want to keep
    an extra balance sheet for the pages backing those objects.

    This unusual refcounting design (reference counts usually track pointers
    to an object) is only for historical reasons: memcg used to not take any
    css references and simply stalled offlining until all charges had been
    reparented and the page counters had dropped to zero. When we got rid of
    the reparenting requirement, the simple mechanical translation was to take
    a reference for every charge.

    More historical context can be found in commit e8ea14cc6ead ("mm:
    memcontrol: take a css reference for each charged page"), commit
    64f219938941 ("mm: memcontrol: remove obsolete kmemcg pinning tricks") and
    commit b2052564e66d ("mm: memcontrol: continue cache reclaim from offlined
    groups").

    The new slab controller exposes the limitations in this scheme, so let's
    switch it to a more idiomatic reference counting model based on actual
    kernel pointers to the memcg:

    - The per-cpu stock holds a reference to the memcg it's caching

    - User pages hold a reference for their page->mem_cgroup. Transparent
    huge pages will no longer acquire tail references in advance, we'll
    get them if needed during the split.

    - Kernel pages hold a reference for their page->mem_cgroup

    - Pages allocated in the root cgroup will acquire and release css
    references for simplicity. css_get() and css_put() optimize that.

    - The current memcg_charge_slab() already hacked around the per-charge
    references; this change gets rid of that as well.

    - tcp accounting will handle reference in mem_cgroup_sk_{alloc,free}

    Roman:
    1) Rebased on top of the current mm tree: added css_get() in
    mem_cgroup_charge(), dropped mem_cgroup_try_charge() part
    2) I've reformatted commit references in the commit log to make
    checkpatch.pl happy.

    [hughd@google.com: remove css_put_many() from __mem_cgroup_clear_mc()]
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2007302011450.2347@eggly.anvils

    Signed-off-by: Johannes Weiner
    Signed-off-by: Roman Gushchin
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Roman Gushchin
    Acked-by: Michal Hocko
    Cc: Christoph Lameter
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/20200623174037.3951353-6-guro@fb.com
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • In order to prepare for per-object slab memory accounting, convert
    NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat items to bytes.

    To make it obvious, rename them to NR_SLAB_RECLAIMABLE_B and
    NR_SLAB_UNRECLAIMABLE_B (similar to NR_KERNEL_STACK_KB).

    Internally, global and per-node counters are stored in pages, while memcg
    and lruvec counters are stored in bytes (sketched below). This scheme may
    look weird, but only for now: once slab pages are shared between multiple
    cgroups, global and node counters will reflect the total number of slab
    pages, whereas memcg and lruvec counters will be used for per-memcg slab
    memory tracking, which takes individual kernel objects into account.
    Keeping the global and node counters in pages helps to avoid additional
    overhead.

    The size of slab memory shouldn't exceed 4 GB on 32-bit machines, so it
    will fit into the atomic_long_t we use for vmstats.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-4-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
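
    A sketch of the byte/page split described above, close to the helper this
    patch adds: only the two _B slab items are byte-based, and the node-level
    update shifts the byte delta down to pages before applying it.

      static __always_inline bool vmstat_item_in_bytes(int idx)
      {
              return idx == NR_SLAB_RECLAIMABLE_B || idx == NR_SLAB_UNRECLAIMABLE_B;
      }

      /* Node-level updates then convert the byte delta back to pages, e.g.:
       *
       *     if (vmstat_item_in_bytes(item))
       *             delta >>= PAGE_SHIFT;
       *
       * while the memcg and lruvec counters keep the delta in bytes. */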
     
  • cache_from_obj() was added by commit b9ce5ef49f00 ("sl[au]b: always get
    the cache from its page in kmem_cache_free()") to support kmemcg, where
    per-memcg cache can be different from the root one, so we can't use the
    kmem_cache pointer given to kmem_cache_free().

    Prior to that commit, SLUB already had debugging check+warning that could
    be enabled to compare the given kmem_cache pointer to one referenced by
    the slab page where the object-to-be-freed resides. This check was moved
    to cache_from_obj(). Later the check was also enabled for
    SLAB_FREELIST_HARDENED configs by commit 598a0717a816 ("mm/slab: validate
    cache membership under freelist hardening").

    These checks and warnings can be especially useful for debugging, so let's
    improve them. Commit 598a0717a816 changed the pr_err() with WARN_ON_ONCE()
    to WARN_ONCE(), so only the first hit is now reported and the others are
    silent. This patch changes it to WARN() so that all errors are reported
    (sketched below).

    It's also useful to print SLUB allocation/free tracking info for the
    offending object, if tracking is enabled. Thus, export the SLUB
    print_tracking() function and provide an empty one for SLAB.

    For SLUB we can also benefit from the static key check in
    kmem_cache_debug_flags(), but we need to move this function to slab.h and
    declare the static key there.

    [1] https://lore.kernel.org/r/20200608230654.828134-18-guro@fb.com

    [vbabka@suse.cz: avoid bogus WARN()]
    Link: https://lore.kernel.org/r/20200623090213.GW5535@shao2-debian
    Link: http://lkml.kernel.org/r/b33e0fa7-cd28-4788-9e54-5927846329ef@suse.cz

    Signed-off-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Acked-by: Kees Cook
    Acked-by: Roman Gushchin
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Matthew Garrett
    Cc: Jann Horn
    Cc: Vijayanand Jitta
    Cc: Vinayak Menon
    Link: http://lkml.kernel.org/r/afeda7ac-748b-33d8-a905-56b708148ad5@suse.cz
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
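
    A sketch close to what the check ends up looking like in mm/slab.h: skip
    the lookup unless hardening or consistency checks are active, WARN() on
    every mismatch, and dump SLUB's tracking info for the offending object.

      static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
      {
              struct kmem_cache *cachep;

              if (!IS_ENABLED(CONFIG_SLAB_FREELIST_HARDENED) &&
                  !kmem_cache_debug_flags(s, SLAB_CONSISTENCY_CHECKS))
                      return s;

              cachep = virt_to_cache(x);
              if (WARN(cachep && cachep != s,
                       "%s: Wrong slab cache. %s but object is from %s\n",
                       __func__, s->name, cachep->name))
                      print_tracking(cachep, x);      /* alloc/free traces, if enabled */
              return cachep;
      }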
     
  • The function cache_from_obj() was added by commit b9ce5ef49f00 ("sl[au]b:
    always get the cache from its page in kmem_cache_free()") to support
    kmemcg, where per-memcg cache can be different from the root one, so we
    can't use the kmem_cache pointer given to kmem_cache_free().

    Prior to that commit, SLUB already had debugging check+warning that could
    be enabled to compare the given kmem_cache pointer to one referenced by
    the slab page where the object-to-be-freed resides. This check was moved
    to cache_from_obj(). Later the check was also enabled for
    SLAB_FREELIST_HARDENED configs by commit 598a0717a816 ("mm/slab: validate
    cache membership under freelist hardening").

    These checks and warnings can be especially useful for debugging, so let's
    improve them. Commit 598a0717a816 changed the pr_err() with WARN_ON_ONCE()
    to WARN_ONCE(), so only the first hit is now reported and the others are
    silent. This patch changes it to WARN() so that all errors are reported.

    It's also useful to print SLUB allocation/free tracking info for the
    offending object, if tracking is enabled. We could export the SLUB
    print_tracking() function and provide an empty one for SLAB, or realize
    that both the debugging and hardening cases in cache_from_obj() are only
    supported by SLUB anyway. So this patch moves cache_from_obj() from
    slab.h to separate instances in slab.c and slub.c, where the SLAB version
    only does the kmemcg lookup and even could be completely removed once the
    kmemcg rework [1] is merged. The SLUB version can thus easily use the
    print_tracking() function. It can also use the kmem_cache_debug_flags()
    static key check for improved performance in kernels without the hardening
    and with debugging not enabled on boot.

    [1] https://lore.kernel.org/r/20200608230654.828134-18-guro@fb.com

    Signed-off-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Cc: Christoph Lameter
    Cc: Jann Horn
    Cc: Kees Cook
    Cc: Vijayanand Jitta
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Link: http://lkml.kernel.org/r/20200610163135.17364-10-vbabka@suse.cz
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • kmalloc cannot allocate memory from HIGHMEM. Allocating large amounts of
    memory currently bypasses the check and will simply leak the memory when
    page_address() returns NULL. To fix this, factor the GFP_SLAB_BUG_MASK
    check out of slab & slub, and call it from kmalloc_order() as well. In
    order to make the code clear, the warning message is put in one place
    (sketched below).

    Signed-off-by: Long Li
    Signed-off-by: Andrew Morton
    Reviewed-by: Matthew Wilcox (Oracle)
    Reviewed-by: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20200704035027.GA62481@lilong
    Signed-off-by: Linus Torvalds

    Long Li
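
    A sketch of the factored-out check, roughly the shape of the helper that
    ends up shared by slab, slub and kmalloc_order(); treat the exact name and
    warning message as approximate.

      static inline gfp_t kmalloc_fix_flags(gfp_t flags)
      {
              gfp_t invalid_mask = flags & GFP_SLAB_BUG_MASK;

              flags &= ~GFP_SLAB_BUG_MASK;    /* e.g. strip __GFP_HIGHMEM */
              pr_warn("Unexpected gfp: %#x (%pGg). Fixing up to gfp: %#x (%pGg). Fix your code!\n",
                      invalid_mask, &invalid_mask, flags, &flags);
              dump_stack();

              return flags;
      }

      /* Callers (slab, slub, kmalloc_order()) then do:
       *     if (unlikely(flags & GFP_SLAB_BUG_MASK))
       *             flags = kmalloc_fix_flags(flags);
       */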
     

26 Jun, 2020

1 commit

  • It was found that running the LTP test on a PowerPC system could produce
    erroneous values in /proc/meminfo, like:

    MemTotal: 531915072 kB
    MemFree: 507962176 kB
    MemAvailable: 1100020596352 kB

    Using bisection, the problem is tracked down to commit 9c315e4d7d8c ("mm:
    memcg/slab: cache page number in memcg_(un)charge_slab()").

    In memcg_uncharge_slab(), which takes an "int order" argument:

    unsigned int nr_pages = 1 << order;
    :
    mod_lruvec_state(lruvec, cache_vmstat_idx(s), -nr_pages);

    The mod_lruvec_state() function will eventually call __mod_zone_page_state(),
    which accepts a long argument. Depending on the compiler and how inlining
    is done, "-nr_pages" may be treated as a negative number or a very large
    positive number (demonstrated after this entry). Apparently, it was
    treated as a large positive number on that PowerPC system, leading to
    incorrect stat counts. This problem hasn't been seen on x86-64 yet;
    perhaps the gcc compiler there has some slight difference in behavior.

    It is fixed by making nr_pages a signed value. For consistency, a similar
    change is applied to memcg_charge_slab() as well.

    Link: http://lkml.kernel.org/r/20200620184719.10994-1-longman@redhat.com
    Fixes: 9c315e4d7d8c ("mm: memcg/slab: cache page number in memcg_(un)charge_slab()").
    Signed-off-by: Waiman Long
    Acked-by: Roman Gushchin
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Waiman Long
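
    A userspace-only C demonstration of the type promotion involved (not the
    kernel code path): negating an unsigned int yields a huge unsigned value,
    which stays huge when converted to a 64-bit long, whereas a signed
    nr_pages negates as expected. mod_state() is just a stand-in.

      #include <stdio.h>

      static void mod_state(long delta)       /* stand-in for a long-taking stat helper */
      {
              printf("delta = %ld\n", delta);
      }

      int main(void)
      {
              int order = 3;
              unsigned int unr_pages = 1u << order;   /* as in the buggy code */
              int snr_pages = 1 << order;             /* as in the fix */

              mod_state(-unr_pages);  /* prints 4294967288 on LP64: huge positive */
              mod_state(-snr_pages);  /* prints -8, as intended */
              return 0;
      }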
     

03 Apr, 2020

5 commits

  • Drop the _memcg suffix from (__)memcg_kmem_(un)charge functions. It's
    shorter and more obvious.

    These are the most basic functions, which just (un)charge the given
    cgroup with the given number of pages.

    Also fix up the corresponding comments.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Link: http://lkml.kernel.org/r/20200109202659.752357-7-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • There are many places in memcg_charge_slab() and memcg_uncharge_slab()
    which are calculating the number of pages to charge, css references to
    grab etc depending on the order of the slab page.

    Let's simplify the code by calculating it once and caching it in a local
    variable.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Link: http://lkml.kernel.org/r/20200109202659.752357-6-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • These functions charge the given number of kernel pages to the given
    memory cgroup. The number doesn't have to be a power of two. Let's make
    them take an unsigned int nr_pages argument instead of the page
    order.

    It makes them look consistent with the corresponding uncharge functions
    and functions like: mem_cgroup_charge_skmem(memcg, nr_pages).

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Link: http://lkml.kernel.org/r/20200109202659.752357-5-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Drop the unused page argument and put the memcg pointer in the first
    place. This makes the function consistent with its peers:
    __memcg_kmem_uncharge_memcg(), memcg_kmem_charge_memcg(), etc.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Link: http://lkml.kernel.org/r/20200109202659.752357-3-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Patch series "mm: memcg: kmem API cleanup", v2.

    This patchset aims to clean up the kernel memory charging API. It doesn't
    bring any functional changes, just removes unused arguments, renames some
    functions and fixes some comments.

    Currently it's not obvious which functions are most basic
    (memcg_kmem_(un)charge_memcg()) and which are based on them
    (memcg_kmem_(un)charge()). The patchset renames these functions and
    removes unused arguments:

    TL;DR:
    was:
    memcg_kmem_charge_memcg(page, gfp, order, memcg)
    memcg_kmem_uncharge_memcg(memcg, nr_pages)
    memcg_kmem_charge(page, gfp, order)
    memcg_kmem_uncharge(page, order)

    now:
    memcg_kmem_charge(memcg, gfp, nr_pages)
    memcg_kmem_uncharge(memcg, nr_pages)
    memcg_kmem_charge_page(page, gfp, order)
    memcg_kmem_uncharge_page(page, order)

    This patch (of 6):

    The first argument of memcg_kmem_charge_memcg() and
    __memcg_kmem_charge_memcg() is the page pointer and it's not used. Let's
    drop it.

    The memcg pointer is passed as the last argument; move it to the first
    place for consistency with other memcg functions, e.g.
    __memcg_kmem_uncharge_memcg() or try_charge().

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Link: http://lkml.kernel.org/r/20200109202659.752357-2-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

02 Dec, 2019

1 commit

  • There is a per-memcg lruvec and a NUMA node lruvec. Which one is being
    used is somewhat confusing right now, and it's easy to make mistakes -
    especially when it comes to global reclaim.

    How it works: when memory cgroups are enabled, we always use the
    root_mem_cgroup's per-node lruvecs. When memory cgroups are not compiled
    in or disabled at runtime, we use pgdat->lruvec.

    Document that in a comment.

    Due to the way the reclaim code is generalized, all lookups use the
    mem_cgroup_lruvec() helper function, and nobody should have to find the
    right lruvec manually right now. But to avoid future mistakes, rename the
    pgdat->lruvec member to pgdat->__lruvec and delete the convenience wrapper
    that suggests it's a commonly accessed member.

    While in this area, swap the mem_cgroup_lruvec() argument order. The name
    suggests a memcg operation, yet it takes a pgdat first and a memcg second;
    I have to do a double take every time I call it. Fix that (sketched below).

    Link: http://lkml.kernel.org/r/20191022144803.302233-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Reviewed-by: Shakeel Butt
    Cc: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
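
    A simplified sketch of the reworked lookup (memcg first, pgdat second); the
    real inline helper also keeps the lruvec->pgdat back-pointer consistent,
    which is omitted here.

      static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
                                                     struct pglist_data *pgdat)
      {
              if (mem_cgroup_disabled())
                      return &pgdat->__lruvec;        /* cgroups off: node lruvec */

              if (!memcg)
                      memcg = root_mem_cgroup;        /* global reclaim uses root's lruvecs */

              return &memcg->nodeinfo[pgdat->node_id]->lruvec;
      }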
     

01 Dec, 2019

1 commit

  • Patch series "mm, slab: Make kmalloc_info[] contain all types of names", v6.

    There are three types of kmalloc, KMALLOC_NORMAL, KMALLOC_RECLAIM
    and KMALLOC_DMA.

    The name of KMALLOC_NORMAL is contained in kmalloc_info[].name,
    but the names of KMALLOC_RECLAIM and KMALLOC_DMA are dynamically
    generated by kmalloc_cache_name().

    Patch 1 predefines the names of all types of kmalloc to save
    the time spent dynamically generating names.

    These changes make sense: the time spent by new_kmalloc_cache()
    is reduced by approximately 36.3%.

    Time spent by new_kmalloc_cache() (CPU cycles):
      5.3-rc7         66264
      5.3-rc7+patch   42188

    This patch (of 3):

    There are three types of kmalloc, KMALLOC_NORMAL, KMALLOC_RECLAIM and
    KMALLOC_DMA.

    The name of KMALLOC_NORMAL is contained in kmalloc_info[].name, but the
    names of KMALLOC_RECLAIM and KMALLOC_DMA are dynamically generated by
    kmalloc_cache_name().

    This patch predefines the names of all types of kmalloc to save the time
    spent dynamically generating names.

    Besides, remove the kmalloc_cache_name() that is no longer used.

    Link: http://lkml.kernel.org/r/1569241648-26908-2-git-send-email-lpf.vector@gmail.com
    Signed-off-by: Pengfei Li
    Acked-by: Vlastimil Babka
    Acked-by: Roman Gushchin
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pengfei Li
     

07 Nov, 2019

1 commit

  • page_cgroup_ino() doesn't return a valid memcg pointer for non-compound
    slab pages, because it depends on PgHead AND PgSlab flags to be set to
    determine the memory cgroup from the kmem_cache. It's correct for
    compound pages, but not for generic small pages. Those don't have PgHead
    set, so it ends up returning zero.

    Fix this by changing the condition to PageSlab() && !PageTail() (sketched below).

    Before this patch:
    [root@localhost ~]# ./page-types -c /sys/fs/cgroup/user.slice/user-0.slice/user@0.service/ | grep slab
    0x0000000000000080 38 0 _______S___________________________________ slab

    After this patch:
    [root@localhost ~]# ./page-types -c /sys/fs/cgroup/user.slice/user-0.slice/user@0.service/ | grep slab
    0x0000000000000080 147 0 _______S___________________________________ slab

    Also, hwpoison_filter_task() uses the output of page_cgroup_ino() in order
    to filter error-injection events based on memcg. So if
    page_cgroup_ino() fails to return the memcg pointer, we simply fail to
    inject the memory error. Considering that the hwpoison filter is for
    testing, affected users are limited and the impact should be marginal.

    [n-horiguchi@ah.jp.nec.com: changelog additions]
    Link: http://lkml.kernel.org/r/20191031012151.2722280-1-guro@fb.com
    Fixes: 4d96ba353075 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
    Signed-off-by: Roman Gushchin
    Reviewed-by: Shakeel Butt
    Acked-by: David Rientjes
    Cc: Vladimir Davydov
    Cc: Daniel Jordan
    Cc: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
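
    A sketch of the fixed lookup; the real code is open-coded inside
    page_cgroup_ino(), so the wrapper name here is only for illustration.
    Non-compound slab pages have PageSlab() set but not PageHead(), so keying
    off PageSlab() and excluding tail pages covers both cases.

      static struct mem_cgroup *slab_aware_page_memcg(struct page *page)
      {
              if (PageSlab(page) && !PageTail(page))
                      return memcg_from_slab_page(page);      /* via page->slab_cache */
              return READ_ONCE(page->mem_cgroup);
      }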
     

25 Sep, 2019

2 commits

  • The memcg_cache_params structure is only embedded into the kmem_cache of
    slab and slub allocators as defined in slab_def.h and slub_def.h and used
    internally by mm code. There is no need to expose it in a public
    header, so move it from include/linux/slab.h to mm/slab.h. It is just a
    refactoring patch with no code change.

    In fact both the slub_def.h and slab_def.h should be moved into the mm
    directory as well, but that will probably cause many merge conflicts.

    Link: http://lkml.kernel.org/r/20190718180827.18758-1-longman@redhat.com
    Signed-off-by: Waiman Long
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc: Roman Gushchin
    Cc: Johannes Weiner
    Cc: Shakeel Butt
    Cc: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Waiman Long
     
  • Currently, a value of '1' is written to a cache's
    /sys/kernel/slab/<cache>/shrink file to shrink the slab by flushing out all
    the per-cpu slabs and free slabs in partial lists. This can be useful to
    squeeze out a bit more memory under extreme conditions, as well as to make
    the active object counts in /proc/slabinfo more accurate.

    This usually applies only to the root caches, as the SLUB_MEMCG_SYSFS_ON
    option is usually not enabled and "slub_memcg_sysfs=1" not set. Even if
    memcg sysfs is turned on, it is too cumbersome and impractical to manage
    all those per-memcg sysfs files in a real production system.

    So there is no practical way to shrink memcg caches. Fix this by enabling
    a proper write to the shrink sysfs file of the root cache to scan all the
    available memcg caches and shrink them as well. For a non-root memcg
    cache (when SLUB_MEMCG_SYSFS_ON or slub_memcg_sysfs is on), only that
    cache will be shrunk when written.

    On a 2-socket 64-core 256-thread arm64 system with 64k pages, the amount
    of memory occupied by slabs after a parallel kernel build and before
    shrinking them was:

    # grep task_struct /proc/slabinfo
    task_struct 53137 53192 4288 61 4 : tunables 0 0 0 : slabdata 872 872 0
    # grep "^S[lRU]" /proc/meminfo
    Slab: 3936832 kB
    SReclaimable: 399104 kB
    SUnreclaim: 3537728 kB

    After shrinking slabs (by echoing "1" to all shrink files):

    # grep "^S[lRU]" /proc/meminfo
    Slab: 1356288 kB
    SReclaimable: 263296 kB
    SUnreclaim: 1092992 kB
    # grep task_struct /proc/slabinfo
    task_struct 2764 6832 4288 61 4 : tunables 0 0 0 : slabdata 112 112 0

    Link: http://lkml.kernel.org/r/20190723151445.7385-1-longman@redhat.com
    Signed-off-by: Waiman Long
    Acked-by: Roman Gushchin
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shakeel Butt
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Waiman Long
     

13 Jul, 2019

10 commits

  • Patch series "add init_on_alloc/init_on_free boot options", v10.

    Provide init_on_alloc and init_on_free boot options.

    These are aimed at preventing possible information leaks and making the
    control-flow bugs that depend on uninitialized values more deterministic.

    Enabling either of the options guarantees that the memory returned by the
    page allocator and SL[AU]B is initialized with zeroes. SLOB allocator
    isn't supported at the moment, as its emulation of kmem caches complicates
    handling of SLAB_TYPESAFE_BY_RCU caches correctly.

    Enabling init_on_free also guarantees that pages and heap objects are
    initialized right after they're freed, so it won't be possible to access
    stale data by using a dangling pointer.

    As suggested by Michal Hocko, right now we don't let heap users disable
    initialization for certain allocations. There's not enough evidence that
    doing so can speed up real-life cases, and introducing ways to opt out may
    result in things going out of control.

    This patch (of 2):

    The new options are needed to prevent possible information leaks and make
    control-flow bugs that depend on uninitialized values more deterministic.

    This is expected to be on-by-default on Android and Chrome OS. And it
    gives the opportunity for anyone else to use it under distros too via the
    boot args. (The init_on_free feature is regularly requested by folks
    where memory forensics is included in their threat models.)

    init_on_alloc=1 makes the kernel initialize newly allocated pages and heap
    objects with zeroes. Initialization is done at allocation time at the
    places where checks for __GFP_ZERO are performed (sketched after this
    entry).

    init_on_free=1 makes the kernel initialize freed pages and heap objects
    with zeroes upon their deletion. This helps to ensure sensitive data
    doesn't leak via use-after-free accesses.

    Both init_on_alloc=1 and init_on_free=1 guarantee that the allocator
    returns zeroed memory. The two exceptions are slab caches with
    constructors and SLAB_TYPESAFE_BY_RCU flag. Those are never
    zero-initialized to preserve their semantics.

    Both init_on_alloc and init_on_free default to zero, but those defaults
    can be overridden with CONFIG_INIT_ON_ALLOC_DEFAULT_ON and
    CONFIG_INIT_ON_FREE_DEFAULT_ON.

    If either SLUB poisoning or page poisoning is enabled, those options take
    precedence over init_on_alloc and init_on_free: initialization is only
    applied to unpoisoned allocations.

    Slowdown for the new features compared to init_on_free=0, init_on_alloc=0:

    hackbench, init_on_free=1: +7.62% sys time (st.err 0.74%)
    hackbench, init_on_alloc=1: +7.75% sys time (st.err 2.14%)

    Linux build with -j12, init_on_free=1: +8.38% wall time (st.err 0.39%)
    Linux build with -j12, init_on_free=1: +24.42% sys time (st.err 0.52%)
    Linux build with -j12, init_on_alloc=1: -0.13% wall time (st.err 0.42%)
    Linux build with -j12, init_on_alloc=1: +0.57% sys time (st.err 0.40%)

    The slowdown for init_on_free=0, init_on_alloc=0 compared to the baseline
    is within the standard error.

    The new features are also going to pave the way for hardware memory
    tagging (e.g. arm64's MTE), which will require both on_alloc and on_free
    hooks to set the tags for heap objects. With MTE, tagging will have the
    same cost as memory initialization.

    Although init_on_free is rather costly, there are paranoid use-cases where
    in-memory data lifetime is desired to be minimized. There are various
    arguments for/against the realism of the associated threat models, but
    given that we'll need the infrastructure for MTE anyway, and there are
    people who want wipe-on-free behavior no matter what the performance cost,
    it seems reasonable to include it in this series.

    [glider@google.com: v8]
    Link: http://lkml.kernel.org/r/20190626121943.131390-2-glider@google.com
    [glider@google.com: v9]
    Link: http://lkml.kernel.org/r/20190627130316.254309-2-glider@google.com
    [glider@google.com: v10]
    Link: http://lkml.kernel.org/r/20190628093131.199499-2-glider@google.com
    Link: http://lkml.kernel.org/r/20190617151050.92663-2-glider@google.com
    Signed-off-by: Alexander Potapenko
    Acked-by: Kees Cook
    Acked-by: Michal Hocko [page and dmapool parts]
    Acked-by: James Morris
    Cc: Christoph Lameter
    Cc: Masahiro Yamada
    Cc: "Serge E. Hallyn"
    Cc: Nick Desaulniers
    Cc: Kostya Serebryany
    Cc: Dmitry Vyukov
    Cc: Sandeep Patil
    Cc: Laura Abbott
    Cc: Randy Dunlap
    Cc: Jann Horn
    Cc: Mark Rutland
    Cc: Marco Elver
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
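
    A sketch of the decision helpers, close to what the patch adds to
    include/linux/mm.h: static keys (whose defaults come from the two Kconfig
    options) combined with the pre-existing __GFP_ZERO check.

      static inline bool want_init_on_alloc(gfp_t flags)
      {
              if (static_branch_unlikely(&init_on_alloc))     /* init_on_alloc=1 */
                      return true;
              return flags & __GFP_ZERO;                      /* existing behaviour */
      }

      static inline bool want_init_on_free(void)
      {
              return static_branch_unlikely(&init_on_free);   /* init_on_free=1 */
      }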
     
  • Let's reparent non-root kmem_caches on memcg offlining. This allows us to
    release the memory cgroup without waiting for the last outstanding kernel
    object (e.g. dentry used by another application).

    Since the parent cgroup is already charged, all we need to do is to splice
    the list of kmem_caches onto the parent's kmem_caches list, swap the memcg
    pointer, drop the css refcount for each kmem_cache and adjust the parent's
    css refcount (sketched below).

    Please note that kmem_cache->memcg_params.memcg isn't a stable pointer
    anymore. It's safe to read it under rcu_read_lock(), with cgroup_mutex
    held, or in any other way that protects the memory cgroup from being
    released.

    We can race with the slab allocation and deallocation paths. It's not a
    big problem: parent's charge and slab global stats are always correct, and
    we don't care anymore about the child usage and global stats. The child
    cgroup is already offline, so we don't use or show it anywhere.

    Local slab stats (NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE) aren't
    used anywhere except count_shadow_nodes(). But even there it won't break
    anything: after reparenting "nodes" will be 0 on child level (because
    we're already reparenting shrinker lists), and on parent level page stats
    always were 0, and this patch won't change anything.

    [guro@fb.com: properly handle kmem_caches reparented to root_mem_cgroup]
    Link: http://lkml.kernel.org/r/20190620213427.1691847-1-guro@fb.com
    Link: http://lkml.kernel.org/r/20190611231813.3148843-11-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Vladimir Davydov
    Reviewed-by: Shakeel Butt
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Waiman Long
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: Andrei Vagin
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
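
    A hedged sketch of the reparenting step described above; field and list
    names follow the then-existing memcg_params layout only approximately, and
    the function name is illustrative.

      /* Sketch: move all of the dying memcg's kmem_caches to its parent. */
      static void reparent_memcg_kmem_caches(struct mem_cgroup *memcg,
                                             struct mem_cgroup *parent)
      {
              struct kmem_cache *s;

              mutex_lock(&slab_mutex);
              list_for_each_entry(s, &memcg->kmem_caches, memcg_params.kmem_caches_node) {
                      /* Parent is already charged; just retarget the pointer ... */
                      WRITE_ONCE(s->memcg_params.memcg, parent);
                      /* ... and transfer the css reference held by the kmem_cache. */
                      css_get(&parent->css);
                      css_put(&memcg->css);
              }
              list_splice_init(&memcg->kmem_caches, &parent->kmem_caches);
              mutex_unlock(&slab_mutex);
      }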
     
  • Every slab page charged to a non-root memory cgroup has a pointer to the
    memory cgroup and holds a reference to it, which protects a non-empty
    memory cgroup from being released. At the same time the page has a
    pointer to the corresponding kmem_cache, and also holds a reference to that
    kmem_cache. And the kmem_cache by itself holds a reference to the cgroup.

    So there is clearly some redundancy, which makes it possible to stop setting
    the page->mem_cgroup pointer and to rely on getting the memcg pointer
    indirectly via the kmem_cache. It will also make it easier to change this
    pointer later, without a need to go over all charged pages.

    So let's stop setting page->mem_cgroup pointer for slab pages, and stop
    using the css refcounter directly for protecting the memory cgroup from
    going away. Instead rely on kmem_cache as an intermediate object.

    Make sure that vmstats and shrinker lists are working as previously, as
    well as /proc/kpagecgroup interface.

    Link: http://lkml.kernel.org/r/20190611231813.3148843-10-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Vladimir Davydov
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Waiman Long
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: Andrei Vagin
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Currently each charged slab page holds a reference to the cgroup to which
    it's charged. Kmem_caches are held by the memcg and are released all
    together with the memory cgroup. It means that none of the kmem_caches are
    released as long as at least one reference to the memcg exists, which is
    very far from optimal.

    Let's rework it in a way that allows releasing individual kmem_caches as
    soon as the cgroup is offline, the kmem_cache is empty and there are no
    pending allocations.

    To make it possible, let's introduce a new percpu refcounter for non-root
    kmem caches. The counter is initialized to the percpu mode, and is
    switched to the atomic mode during kmem_cache deactivation. The counter
    is bumped for every charged page and also for every running allocation.
    So the kmem_cache can't be released unless all allocations complete.

    To shutdown non-active empty kmem_caches, let's reuse the work queue,
    previously used for the kmem_cache deactivation. Once the reference
    counter reaches 0, let's schedule an asynchronous kmem_cache release.

    * I used the following simple approach to test the performance
    (stolen from another patchset by T. Harding):

    time find / -name fname-no-exist
    echo 2 > /proc/sys/vm/drop_caches
    repeat 10 times

    Results:

    orig patched

    real 0m1.455s real 0m1.355s
    user 0m0.206s user 0m0.219s
    sys 0m0.855s sys 0m0.807s

    real 0m1.487s real 0m1.699s
    user 0m0.221s user 0m0.256s
    sys 0m0.806s sys 0m0.948s

    real 0m1.515s real 0m1.505s
    user 0m0.183s user 0m0.215s
    sys 0m0.876s sys 0m0.858s

    real 0m1.291s real 0m1.380s
    user 0m0.193s user 0m0.198s
    sys 0m0.843s sys 0m0.786s

    real 0m1.364s real 0m1.374s
    user 0m0.180s user 0m0.182s
    sys 0m0.868s sys 0m0.806s

    real 0m1.352s real 0m1.312s
    user 0m0.201s user 0m0.212s
    sys 0m0.820s sys 0m0.761s

    real 0m1.302s real 0m1.349s
    user 0m0.205s user 0m0.203s
    sys 0m0.803s sys 0m0.792s

    real 0m1.334s real 0m1.301s
    user 0m0.194s user 0m0.201s
    sys 0m0.806s sys 0m0.779s

    real 0m1.426s real 0m1.434s
    user 0m0.216s user 0m0.181s
    sys 0m0.824s sys 0m0.864s

    real 0m1.350s real 0m1.295s
    user 0m0.200s user 0m0.190s
    sys 0m0.842s sys 0m0.811s

    So it looks like the difference is not noticeable in this test.

    [cai@lca.pw: fix an use-after-free in kmemcg_workfn()]
    Link: http://lkml.kernel.org/r/1560977573-10715-1-git-send-email-cai@lca.pw
    Link: http://lkml.kernel.org/r/20190611231813.3148843-9-guro@fb.com
    Signed-off-by: Roman Gushchin
    Signed-off-by: Qian Cai
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Shakeel Butt
    Cc: Waiman Long
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: Andrei Vagin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Currently the page accounting code is duplicated in SLAB and SLUB
    internals. Let's move it into new (un)charge_slab_page helpers in the
    slab_common.c file. These helpers will be responsible for statistics
    (global and memcg-aware) and memcg charging. So they are replacing direct
    memcg_(un)charge_slab() calls.

    Link: http://lkml.kernel.org/r/20190611231813.3148843-6-guro@fb.com
    Signed-off-by: Roman Gushchin
    Reviewed-by: Shakeel Butt
    Acked-by: Christoph Lameter
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Waiman Long
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: Andrei Vagin
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Currently SLUB uses a work scheduled after an RCU grace period to
    deactivate a non-root kmem_cache. This mechanism can be reused for
    kmem_caches release, but requires generalization for SLAB case.

    Introduce kmemcg_cache_deactivate() function, which calls
    allocator-specific __kmem_cache_deactivate() and schedules execution of
    __kmem_cache_deactivate_after_rcu() with all necessary locks in a worker
    context after an rcu grace period.

    Here is the new calling scheme:
      kmemcg_cache_deactivate()
      __kmemcg_cache_deactivate()                SLAB/SLUB-specific
      kmemcg_rcufn()                             rcu
      kmemcg_workfn()                            work
      __kmemcg_cache_deactivate_after_rcu()      SLAB/SLUB-specific

    instead of:
      __kmemcg_cache_deactivate()                SLAB/SLUB-specific
      slab_deactivate_memcg_cache_rcu_sched()    SLUB-only
      kmemcg_rcufn()                             rcu
      kmemcg_workfn()                            work
      kmemcg_cache_deact_after_rcu()             SLUB-only

    For consistency, all allocator-specific functions start with "__".

    Link: http://lkml.kernel.org/r/20190611231813.3148843-4-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Vladimir Davydov
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Waiman Long
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: Andrei Vagin
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • The delayed work/rcu deactivation infrastructure of non-root kmem_caches
    can be also used for asynchronous release of these objects. Let's get rid
    of the word "deactivation" in corresponding names to make the code look
    better after generalization.

    It's easier to make the renaming first, so that the generalized code will
    look consistent from scratch.

    Let's rename struct memcg_cache_params fields:
    deact_fn -> work_fn
    deact_rcu_head -> rcu_head
    deact_work -> work

    And RCU/delayed work callbacks in slab common code:
    kmemcg_deactivate_rcufn -> kmemcg_rcufn
    kmemcg_deactivate_workfn -> kmemcg_workfn

    This patch contains no functional changes, only renamings.

    Link: http://lkml.kernel.org/r/20190611231813.3148843-3-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Vladimir Davydov
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Waiman Long
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: Andrei Vagin
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Patch series "mm: reparent slab memory on cgroup removal", v7.

    # Why do we need this?

    We've noticed that the number of dying cgroups is steadily growing on most
    of our hosts in production. The following investigation revealed an issue
    in the userspace memory reclaim code [1], accounting of kernel stacks [2],
    and also the main reason: slab objects.

    The underlying problem is quite simple: any page charged to a cgroup holds
    a reference to it, so the cgroup can't be reclaimed unless all charged
    pages are gone. If a slab object is actively used by other cgroups, it
    won't be reclaimed, and will prevent the origin cgroup from being
    reclaimed.

    Slab objects, and first of all the vfs cache, are shared between cgroups
    that use the same underlying fs and, what's even more important, between
    multiple generations of the same workload. So if something runs
    periodically, each time in a new cgroup (as systemd does), we accumulate
    multiple dying cgroups.

    Strictly speaking pagecache isn't different here, but there is a key
    difference: we disable protection and apply some extra pressure on LRUs of
    dying cgroups, and these LRUs contain all charged pages. My experiments
    show that with the disabled kernel memory accounting the number of dying
    cgroups stabilizes at a relatively small number (~100, depends on memory
    pressure and cgroup creation rate), and with kernel memory accounting it
    grows pretty steadily up to several thousands.

    Memory cgroups are quite complex and big objects (mostly due to percpu
    stats), so it leads to noticeable memory losses. Memory occupied by dying
    cgroups is measured in hundreds of megabytes. I've even seen a host with
    more than 100Gb of memory wasted for dying cgroups. It leads to a
    degradation of performance with the uptime, and generally limits the usage
    of cgroups.

    My previous attempt [3] to fix the problem by applying extra pressure on
    slab shrinker lists caused a regressions with xfs and ext4, and has been
    reverted [4]. The following attempts to find the right balance [5, 6]
    were not successful.

    So instead of trying to find a possibly non-existent balance, let's
    reparent accounted slab caches to the parent cgroup on cgroup removal.

    # Implementation approach

    There is however a significant problem with reparenting of slab memory:
    there is no list of charged pages. Some of them are in shrinker lists,
    but not all. Introducing a new list is really not an option.

    But fortunately there is a way forward: every slab page has a stable
    pointer to the corresponding kmem_cache. So the idea is to reparent
    kmem_caches instead of slab pages.

    It's actually simpler and cheaper, but requires some underlying changes:
    1) Make kmem_caches to hold a single reference to the memory cgroup,
    instead of a separate reference per every slab page.
    2) Stop setting page->mem_cgroup pointer for memcg slab pages and use
    page->kmem_cache->memcg indirection instead. It's used only on
    slab page release, so performance overhead shouldn't be a big issue.
    3) Introduce a refcounter for non-root slab caches. It's required to
    be able to destroy kmem_caches when they become empty and release
    the associated memory cgroup.

    There is a bonus: currently we release all memcg kmem_caches all together
    with the memory cgroup itself. This patchset allows individual
    kmem_caches to be released as soon as they become inactive and free.

    Some additional implementation details are provided in corresponding
    commit messages.

    # Results

    Below is the average number of dying cgroups on two groups of our
    production hosts. They run some sort of web frontend workload; the
    memory pressure is moderate. As we can see, with the kernel memory
    reparenting the number stabilizes in 60s range; however with the original
    version it grows almost linearly and doesn't show any signs of plateauing.
    The difference in slab and percpu usage between patched and unpatched
    versions also grows linearly. In 7 days it exceeded 200Mb.

    day             0    1    2    3     4     5     6     7
    original       56  362  628  752  1070  1250  1490  1560
    patched        23   46   51   55    60    57    67    69
    mem diff(Mb)   22   74  123  152   164   182   214   241

    # Links

    [1]: commit 68600f623d69 ("mm: don't miss the last page because of round-off error")
    [2]: commit 9b6f7e163cd0 ("mm: rework memcg kernel stack accounting")
    [3]: commit 172b06c32b94 ("mm: slowly shrink slabs with a relatively small number of objects")
    [4]: commit a9a238e83fbb ("Revert "mm: slowly shrink slabs with a relatively small number of objects")
    [5]: https://lkml.org/lkml/2019/1/28/1865
    [6]: https://marc.info/?l=linux-mm&m=155064763626437&w=2

    This patch (of 10):

    Initialize kmem_cache->memcg_params.memcg pointer in memcg_link_cache()
    rather than in init_memcg_params().

    Once the kmem_cache holds a reference to the memory cgroup, it will
    simplify the refcounting.

    For non-root kmem_caches memcg_link_cache() is always called before the
    kmem_cache becomes visible to a user, so it's safe.

    Link: http://lkml.kernel.org/r/20190611231813.3148843-2-guro@fb.com
    Signed-off-by: Roman Gushchin
    Reviewed-by: Shakeel Butt
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Waiman Long
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrei Vagin
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • This avoids any possible type confusion when looking up an object. For
    example, if a non-slab page were to be passed to kfree(), the invalid
    slab_cache pointer (i.e. one overlapping with some other value from the
    struct page union) would be used for subsequent slab manipulations, which
    could lead to further memory corruption.

    Since the page is already in cache, adding the PageSlab() check will
    have nearly zero cost, so add a check and WARN() to virt_to_cache()
    (sketched below). Additionally, replace an open-coded virt_to_cache().
    To support the failure mode, this also updates all callers of
    virt_to_cache() and cache_from_obj() to handle a NULL cache pointer
    return value (though note that several already handle this case
    gracefully).

    [dan.carpenter@oracle.com: restore IRQs in kfree()]
    Link: http://lkml.kernel.org/r/20190613065637.GE16334@mwanda
    Link: http://lkml.kernel.org/r/20190530045017.15252-3-keescook@chromium.org
    Signed-off-by: Kees Cook
    Signed-off-by: Dan Carpenter
    Cc: Alexander Popov
    Cc: Alexander Potapenko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Greg Kroah-Hartman
    Cc: Joonsoo Kim
    Cc: Matthew Wilcox
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
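
    A sketch close to the resulting mm/slab.h helper: the head page must be a
    slab page, otherwise warn once and return NULL so the caller can bail out.

      static inline struct kmem_cache *virt_to_cache(const void *obj)
      {
              struct page *page;

              page = virt_to_head_page(obj);
              if (WARN_ONCE(!PageSlab(page), "%s: Object is not a Slab page!\n",
                            __func__))
                      return NULL;            /* callers now handle a NULL cache */
              return page->slab_cache;
      }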
     
  • Patch series "mm/slab: Improved sanity checking".

    This adds defenses against slab cache confusion (as seen in real-world
    exploits [1]) and gracefully handles type confusion when trying to look up
    slab caches from an arbitrary page. (Patch 3 also adds new LKDTM tests for
    these defenses as well as for the existing double-free detection.)

    This patch (of 3):

    When building under CONFIG_SLAB_FREELIST_HARDENING, it makes sense to
    perform sanity-checking on the assumed slab cache during
    kmem_cache_free() to make sure the kernel doesn't mix freelists across
    slab caches and corrupt memory (as seen in the exploitation of flaws
    like CVE-2018-9568[1]). Note that the prior code might WARN() but still
    corrupt memory (i.e. return the assumed cache instead of the owned
    cache).

    There is no noticeable performance impact (changes are within noise).
    Measuring parallel kernel builds, I saw the following with
    CONFIG_SLAB_FREELIST_HARDENED, before and after this patch:

    before:

    Run times: 288.85 286.53 287.09 287.07 287.21
    Min: 286.53 Max: 288.85 Mean: 287.35 Std Dev: 0.79

    after:

    Run times: 289.58 287.40 286.97 287.20 287.01
    Min: 286.97 Max: 289.58 Mean: 287.63 Std Dev: 0.99

    Delta: 0.1% which is well below the standard deviation

    [1] https://github.com/ThomasKing2014/slides/raw/master/Building%20universal%20Android%20rooting%20with%20a%20type%20confusion%20vulnerability.pdf

    Link: http://lkml.kernel.org/r/20190530045017.15252-2-keescook@chromium.org
    Signed-off-by: Kees Cook
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Matthew Wilcox
    Cc: Alexander Popov
    Cc: Alexander Potapenko
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook