17 Oct, 2020

1 commit

  • Correct the function name "get_partials" to "get_partial", and update the
    old struct name list3 to kmem_cache_node.

    Signed-off-by: Chen Tao
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Rapoport
    Link: https://lkml.kernel.org/r/Message-ID:
    Signed-off-by: Linus Torvalds

    Chen Tao
     

14 Oct, 2020

2 commits

  • Object cgroup charging is done for all the objects during allocation, but
    during freeing, uncharging ends up happening for only one object in the
    case of bulk allocation/freeing.

    Fix this by having a separate call to uncharge all the objects from
    kmem_cache_free_bulk() and by modifying memcg_slab_free_hook() to take
    care of bulk uncharging.

    Fixes: 964d4bd370d5 ("mm: memcg/slab: save obj_cgroup for non-root slab objects")
    Signed-off-by: Bharata B Rao
    Signed-off-by: Andrew Morton
    Acked-by: Roman Gushchin
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vlastimil Babka
    Cc: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc:
    Link: https://lkml.kernel.org/r/20201009060423.390479-1-bharata@linux.ibm.com
    Signed-off-by: Linus Torvalds

    Bharata B Rao
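
    A minimal sketch of the shape of this fix. memcg_slab_free_hook() and
    kmem_cache_free_bulk() are named in the text above; the hook's exact
    signature and the per-object helper below are illustrative assumptions,
    not the merged code:

        /* Uncharge every object of a bulk free, not just the first one. */
        static inline void memcg_slab_free_hook(struct kmem_cache *s,
                                                void **p, int objects)
        {
                int i;

                for (i = 0; i < objects; i++)
                        if (p[i])
                                memcg_uncharge_slab_obj(s, p[i]); /* hypothetical helper */
        }

    kmem_cache_free_bulk() would then invoke this hook once with the whole
    object array before releasing the objects.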
     
  • The removed code was unnecessary and changed nothing in the flow: when
    'kmem_cache_alloc_node' returns NULL, returning 'freelist' from the
    function in question is the same as returning NULL.

    Signed-off-by: Mateusz Nosek
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Link: https://lkml.kernel.org/r/20200915230329.13002-1-mateusznosek0@gmail.com
    Signed-off-by: Linus Torvalds

    Mateusz Nosek
     

27 Sep, 2020

1 commit

  • With commit 10befea91b61 ("mm: memcg/slab: use a single set of
    kmem_caches for all allocations"), it becomes possible to call kfree()
    from slabs_destroy().

    The functions cache_flusharray() and do_drain() call slabs_destroy() on
    the array_cache of the local CPU without updating the size of the
    array_cache. This allows the kfree() call from slabs_destroy() to
    recursively call cache_flusharray(), which can then call free_block() on
    the same elements of the local CPU's array_cache, causing a double free
    and memory corruption.

    To fix the issue, simply update the local CPU's array_cache before
    calling slabs_destroy().

    Fixes: 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches for all allocations")
    Signed-off-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Tested-by: Ming Lei
    Reported-by: kernel test robot
    Cc: Andrew Morton
    Cc: Ted Ts'o
    Signed-off-by: Linus Torvalds

    Shakeel Butt
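
    A sketch of the fixed ordering in do_drain(), assuming the local
    array_cache is emptied before the detached slabs are destroyed; the
    helpers are the existing mm/slab.c ones, but the exact body is
    illustrative:

        static void do_drain(void *arg)
        {
                struct kmem_cache *cachep = arg;
                struct array_cache *ac = cpu_cache_get(cachep);
                int node = numa_mem_id();
                struct kmem_cache_node *n = get_node(cachep, node);
                LIST_HEAD(list);

                spin_lock(&n->list_lock);
                free_block(cachep, ac->entry, ac->avail, node, &list);
                spin_unlock(&n->list_lock);

                /*
                 * Empty the local cache *before* slabs_destroy(): it may
                 * recurse into kfree() -> cache_flusharray() on this CPU.
                 */
                ac->avail = 0;
                slabs_destroy(cachep, &list);
        }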
     

08 Aug, 2020

12 commits

  • charge_slab_page() and uncharge_slab_page() are not related anymore to
    memcg charging and uncharging. In order to make their names less
    confusing, let's rename them to account_slab_page() and
    unaccount_slab_page() respectively.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Link: http://lkml.kernel.org/r/20200707173612.124425-2-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • charge_slab_page() is not using the gfp argument anymore,
    remove it.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200707173612.124425-1-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Instead of having two sets of kmem_caches: one for system-wide and
    non-accounted allocations and the second one shared by all accounted
    allocations, we can use just one.

    The idea is simple: space for obj_cgroup metadata can be allocated on
    demand and filled only for accounted allocations.

    This allows removing a bunch of code that is required to handle kmem_cache
    clones for accounted allocations. There is no more need to create them,
    accumulate statistics, propagate attributes, etc. It's quite a significant
    simplification.

    Also, because the total number of slab_caches is almost halved (not all
    kmem_caches have a memcg clone), some additional memory savings are
    expected. On my devvm it additionally saves about 3.5% of slab memory.

    [guro@fb.com: fix build on MIPS]
    Link: http://lkml.kernel.org/r/20200717214810.3733082-1-guro@fb.com

    Suggested-by: Johannes Weiner
    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Naresh Kamboju
    Link: http://lkml.kernel.org/r/20200623174037.3951353-18-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Currently there are two lists of kmem_caches:
    1) slab_caches, which contains all kmem_caches,
    2) slab_root_caches, which contains only root kmem_caches.

    And there is some preprocessor magic to have a single list if
    CONFIG_MEMCG_KMEM isn't enabled.

    It was required earlier because the number of non-root kmem_caches was
    proportional to the number of memory cgroups and could reach really big
    values. Now that it cannot exceed the number of root kmem_caches, there
    is really no reason to maintain two lists.

    We never iterate over the slab_root_caches list on any hot paths, so it's
    perfectly fine to iterate over slab_caches and filter out non-root
    kmem_caches.

    This allows removing a lot of config-dependent code and two pointers from
    the kmem_cache structure.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-16-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
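
    For illustration, a walk over the single remaining list then looks
    roughly like this (the loop body is a placeholder; slab_mutex,
    slab_caches and the existing is_root_cache() helper are assumed):

        struct kmem_cache *s;

        mutex_lock(&slab_mutex);
        list_for_each_entry(s, &slab_caches, list) {
                if (!is_root_cache(s))
                        continue;       /* skip memcg clones */
                pr_info("root cache: %s\n", s->name);
        }
        mutex_unlock(&slab_mutex);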
     
  • This is a fairly big but mostly red patch, which makes all accounted slab
    allocations use a single set of kmem_caches instead of creating a separate
    set for each memory cgroup.

    Because the number of non-root kmem_caches is now capped by the number of
    root kmem_caches, there is no need to shrink or destroy them prematurely.
    They can perfectly well be destroyed together with their root counterparts.
    This allows dramatically simplifying the management of non-root
    kmem_caches and deleting a ton of code.

    This patch performs the following changes:
    1) introduces a memcg_params.memcg_cache pointer to represent the
       kmem_cache which will be used for all non-root allocations
    2) reuses the existing memcg kmem_cache creation mechanism to create
       the memcg kmem_cache on the first allocation attempt
    3) memcg kmem_caches are named after their root cache with a "-memcg"
       suffix, e.g. dentry-memcg
    4) simplifies memcg_kmem_get_cache() to just return the memcg kmem_cache
       or schedule its creation and return the root cache
    5) removes almost all non-root kmem_cache management code
       (separate refcounter, reparenting, shrinking, etc.)
    6) makes slab debugfs display the root_mem_cgroup css id and never show
       the :dead and :deact flags in the memcg_slabinfo attribute.

    Following patches in the series will simplify the kmem_cache creation.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-13-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Store the obj_cgroup pointer in the corresponding place of
    page->obj_cgroups for each allocated non-root slab object. Make sure that
    each allocated object holds a reference to obj_cgroup.

    The objcg pointer is obtained by dereferencing memcg->objcg in
    memcg_kmem_get_cache() and passed from pre_alloc_hook to post_alloc_hook.
    Then, in case of successful allocation(s), it is stored in the
    page->obj_cgroups vector.

    The objcg-obtaining part looks a bit bulky now, but it will be simplified
    by the next commits in the series.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-9-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Provide the necessary KCSAN checks to assist with debugging racy
    use-after-frees. While KASAN is more reliable at generally catching such
    use-after-frees (due to its use of a quarantine), it can be difficult to
    debug racy use-after-frees. If a reliable reproducer exists, KCSAN can
    assist in debugging such issues.

    Note: ASSERT_EXCLUSIVE_ACCESS is a convenience wrapper that applies when
    the size is simply sizeof(var); here we instead use __kcsan_check_access()
    explicitly to pass the correct size.

    Signed-off-by: Marco Elver
    Signed-off-by: Andrew Morton
    Cc: Alexander Potapenko
    Cc: Andrey Konovalov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20200623072653.114563-1-elver@google.com
    Signed-off-by: Linus Torvalds

    Marco Elver
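
    Concretely, the free paths gain a check along these lines (a sketch; x is
    the object being freed, s the cache, and the KCSAN_ACCESS_* flags are the
    standard KCSAN access types):

        /*
         * Assert that no other thread is concurrently accessing the object
         * being freed, using the cache's real object size rather than
         * sizeof(var).
         */
        __kcsan_check_access(x, s->object_size,
                             KCSAN_ACCESS_WRITE | KCSAN_ACCESS_ASSERT);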
     
  • cache_from_obj() was added by commit b9ce5ef49f00 ("sl[au]b: always get
    the cache from its page in kmem_cache_free()") to support kmemcg, where
    per-memcg cache can be different from the root one, so we can't use the
    kmem_cache pointer given to kmem_cache_free().

    Prior to that commit, SLUB already had debugging check+warning that could
    be enabled to compare the given kmem_cache pointer to one referenced by
    the slab page where the object-to-be-freed resides. This check was moved
    to cache_from_obj(). Later the check was also enabled for
    SLAB_FREELIST_HARDENED configs by commit 598a0717a816 ("mm/slab: validate
    cache membership under freelist hardening").

    These checks and warnings can be useful, especially for debugging, and
    can be improved. Commit 598a0717a816 changed the pr_err() with
    WARN_ON_ONCE() to WARN_ONCE(), so only the first hit is now reported and
    the others are silent. This patch changes it to WARN() so that all errors
    are reported.

    It's also useful to print SLUB allocation/free tracking info for the
    offending object, if tracking is enabled. Thus, export the SLUB
    print_tracking() function and provide an empty one for SLAB.

    For SLUB we can also benefit from the static key check in
    kmem_cache_debug_flags(), but we need to move this function to slab.h and
    declare the static key there.

    [1] https://lore.kernel.org/r/20200608230654.828134-18-guro@fb.com

    [vbabka@suse.cz: avoid bogus WARN()]
    Link: https://lore.kernel.org/r/20200623090213.GW5535@shao2-debian
    Link: http://lkml.kernel.org/r/b33e0fa7-cd28-4788-9e54-5927846329ef@suse.cz

    Signed-off-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Acked-by: Kees Cook
    Acked-by: Roman Gushchin
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Matthew Garrett
    Cc: Jann Horn
    Cc: Vijayanand Jitta
    Cc: Vinayak Menon
    Link: http://lkml.kernel.org/r/afeda7ac-748b-33d8-a905-56b708148ad5@suse.cz
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
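
    The resulting check is approximately the following (a sketch based on the
    description above; the exact gating and message wording may differ):

        static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
        {
                struct kmem_cache *cachep;

                if (!IS_ENABLED(CONFIG_SLAB_FREELIST_HARDENED) &&
                    !kmem_cache_debug_flags(s, SLAB_CONSISTENCY_CHECKS))
                        return s;

                cachep = virt_to_cache(x);
                if (WARN(cachep && cachep != s,
                         "%s: Wrong slab cache. %s but object is from %s\n",
                         __func__, s->name, cachep->name))
                        print_tracking(cachep, x);      /* SLUB alloc/free traces */
                return cachep;
        }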
     
  • The function cache_from_obj() was added by commit b9ce5ef49f00 ("sl[au]b:
    always get the cache from its page in kmem_cache_free()") to support
    kmemcg, where per-memcg cache can be different from the root one, so we
    can't use the kmem_cache pointer given to kmem_cache_free().

    Prior to that commit, SLUB already had debugging check+warning that could
    be enabled to compare the given kmem_cache pointer to one referenced by
    the slab page where the object-to-be-freed resides. This check was moved
    to cache_from_obj(). Later the check was also enabled for
    SLAB_FREELIST_HARDENED configs by commit 598a0717a816 ("mm/slab: validate
    cache membership under freelist hardening").

    These checks and warnings can be useful, especially for debugging, and
    can be improved. Commit 598a0717a816 changed the pr_err() with
    WARN_ON_ONCE() to WARN_ONCE(), so only the first hit is now reported and
    the others are silent. This patch changes it to WARN() so that all errors
    are reported.

    It's also useful to print SLUB allocation/free tracking info for the
    offending object, if tracking is enabled. We could export the SLUB
    print_tracking() function and provide an empty one for SLAB, or realize
    that both the debugging and hardening cases in cache_from_obj() are only
    supported by SLUB anyway. So this patch moves cache_from_obj() from
    slab.h to separate instances in slab.c and slub.c, where the SLAB version
    only does the kmemcg lookup and even could be completely removed once the
    kmemcg rework [1] is merged. The SLUB version can thus easily use the
    print_tracking() function. It can also use the kmem_cache_debug_flags()
    static key check for improved performance in kernels without the hardening
    and with debugging not enabled on boot.

    [1] https://lore.kernel.org/r/20200608230654.828134-18-guro@fb.com

    Signed-off-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Cc: Christoph Lameter
    Cc: Jann Horn
    Cc: Kees Cook
    Cc: Vijayanand Jitta
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Link: http://lkml.kernel.org/r/20200610163135.17364-10-vbabka@suse.cz
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • kmem_list3 was renamed to kmem_cache_node a long time ago, so update the
    stale reference.

    References:
    6744f087ba2a ("slab: Common name for the per node structures")
    ce8eb6c424c7 ("slab: Rename list3/l3 to node")

    Signed-off-by: Xiao Yang
    Signed-off-by: Andrew Morton
    Reviewed-by: Pekka Enberg
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20200722033355.26908-1-yangx.jy@cn.fujitsu.com
    Signed-off-by: Linus Torvalds

    Xiao Yang
     
  • kmalloc cannot allocate memory from HIGHMEM. Allocating large amounts of
    memory currently bypasses the check and will simply leak the memory when
    page_address() returns NULL. To fix this, factor the GFP_SLAB_BUG_MASK
    check out of slab & slub, and call it from kmalloc_order() as well. In
    order to make the code clear, the warning message is put in one place.

    Signed-off-by: Long Li
    Signed-off-by: Andrew Morton
    Reviewed-by: Matthew Wilcox (Oracle)
    Reviewed-by: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20200704035027.GA62481@lilong
    Signed-off-by: Linus Torvalds

    Long Li
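
    A sketch of the factored-out check; the helper name kmalloc_fix_flags()
    is an assumption (it is not named in the text above), but the idea is
    that the slab allocators and kmalloc_order() funnel through one place
    that strips the invalid bits and warns:

        gfp_t kmalloc_fix_flags(gfp_t flags)    /* hypothetical name */
        {
                gfp_t invalid_mask = flags & GFP_SLAB_BUG_MASK;

                flags &= ~GFP_SLAB_BUG_MASK;
                pr_warn("Unexpected gfp: %#x (%pGg). Fixing up to gfp: %#x (%pGg). Fix your code!\n",
                        invalid_mask, &invalid_mask, flags, &flags);
                dump_stack();
                return flags;
        }

    kmalloc_order() would then perform the same fixup as slab and slub do:

        if (unlikely(flags & GFP_SLAB_BUG_MASK))
                flags = kmalloc_fix_flags(flags);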
     
  • Similar to commit ce6fa91b9363 ("mm/slub.c: add a naive detection of
    double free or corruption"), add a very cheap double-free check for SLAB
    under CONFIG_SLAB_FREELIST_HARDENED. With this added, the
    "SLAB_FREE_DOUBLE" LKDTM test passes under SLAB:

    lkdtm: Performing direct entry SLAB_FREE_DOUBLE
    lkdtm: Attempting double slab free ...
    ------------[ cut here ]------------
    WARNING: CPU: 2 PID: 2193 at mm/slab.c:757 ___cache_free+0x325/0x390

    [keescook@chromium.org: fix misplaced __free_one()]
    Link: http://lkml.kernel.org/r/202006261306.0D82A2B@keescook
    Link: https://lore.kernel.org/lkml/7ff248c7-d447-340c-a8e2-8c02972aca70@infradead.org

    Signed-off-by: Kees Cook
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Acked-by: Randy Dunlap [build tested]
    Cc: Roman Gushchin
    Cc: Christoph Lameter
    Cc: Alexander Popov
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vinayak Menon
    Cc: Matthew Garrett
    Cc: Jann Horn
    Cc: Vijayanand Jitta
    Link: http://lkml.kernel.org/r/20200625215548.389774-3-keescook@chromium.org
    Signed-off-by: Linus Torvalds

    Kees Cook
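
    The check itself is tiny. A sketch of the __free_one() helper mentioned
    in the fixup note above (illustrative; ac is the per-CPU array_cache):

        /*
         * Cheap double-free detection: the most common pattern is freeing
         * the same object twice in a row, so compare against the entry
         * that was pushed onto the per-CPU array most recently.
         */
        static __always_inline void __free_one(struct array_cache *ac, void *objp)
        {
                if (IS_ENABLED(CONFIG_SLAB_FREELIST_HARDENED) &&
                    WARN_ON_ONCE(ac->avail > 0 && ac->entry[ac->avail - 1] == objp))
                        return;
                ac->entry[ac->avail++] = objp;
        }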
     

04 Jun, 2020

1 commit

  • classzone_idx is just a different name for high_zoneidx now. So, integrate
    them and add a comment to struct alloc_context in order to reduce future
    confusion about the meaning of this variable.

    The accessor ac_classzone_idx() is also removed since it isn't needed
    after the integration.

    In addition to the integration, this patch also renames high_zoneidx to
    highest_zoneidx since that name conveys its meaning more precisely.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Reviewed-by: Baoquan He
    Acked-by: Vlastimil Babka
    Acked-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Ye Xiaolong
    Link: http://lkml.kernel.org/r/1587095923-7515-3-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

14 Jan, 2020

1 commit

  • Commit 96a2b03f281d ("mm, debug_pagelloc: use static keys to enable
    debugging") has introduced a static key to reduce overhead when
    debug_pagealloc is compiled in but not enabled. It relied on the
    assumption that jump_label_init() is called before parse_early_param()
    as in start_kernel(), so when the "debug_pagealloc=on" option is parsed,
    it is safe to enable the static key.

    However, it turns out multiple architectures call parse_early_param()
    earlier, from their setup_arch(). x86 also calls jump_label_init() even
    earlier, so no issue was found while testing the commit, but the same is
    not true for e.g. ppc64 and s390, where the kernel would not boot with
    debug_pagealloc=on, as found by our QA.

    To fix this without tricky changes to init code of multiple
    architectures, this patch partially reverts the static key conversion
    from 96a2b03f281d. Init-time and non-fastpath calls (such as in arch
    code) of debug_pagealloc_enabled() will again test a simple bool
    variable. Fastpath mm code is converted to a new
    debug_pagealloc_enabled_static() variant that relies on the static key,
    which is enabled in a well-defined point in mm_init() where it's
    guaranteed that jump_label_init() has been called, regardless of
    architecture.

    [sfr@canb.auug.org.au: export _debug_pagealloc_enabled_early]
    Link: http://lkml.kernel.org/r/20200106164944.063ac07b@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20191219130612.23171-1-vbabka@suse.cz
    Fixes: 96a2b03f281d ("mm, debug_pagelloc: use static keys to enable debugging")
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Stephen Rothwell
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Qian Cai
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
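
    The two variants end up looking roughly like this (a sketch;
    _debug_pagealloc_enabled_early is the exported bool mentioned above, and
    the static key name is an assumption):

        extern bool _debug_pagealloc_enabled_early;
        DECLARE_STATIC_KEY_FALSE(_debug_pagealloc_enabled);

        /* Safe at any time, including early arch code: a plain bool test. */
        static inline bool debug_pagealloc_enabled(void)
        {
                return IS_ENABLED(CONFIG_DEBUG_PAGEALLOC) &&
                       _debug_pagealloc_enabled_early;
        }

        /*
         * Fast-path variant: relies on the static key, which is only valid
         * once jump_label_init() has run (guaranteed by mm_init()).
         */
        static inline bool debug_pagealloc_enabled_static(void)
        {
                if (!IS_ENABLED(CONFIG_DEBUG_PAGEALLOC))
                        return false;
                return static_branch_unlikely(&_debug_pagealloc_enabled);
        }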
     

01 Dec, 2019

2 commits

  • The sizes of the kmalloc caches can be obtained from kmalloc_info[], so
    remove kmalloc_size(), which will not be used anymore.

    Link: http://lkml.kernel.org/r/1569241648-26908-3-git-send-email-lpf.vector@gmail.com
    Signed-off-by: Pengfei Li
    Acked-by: Vlastimil Babka
    Acked-by: Roman Gushchin
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pengfei Li
     
  • Patch series "mm, slab: Make kmalloc_info[] contain all types of names", v6.

    There are three types of kmalloc, KMALLOC_NORMAL, KMALLOC_RECLAIM
    and KMALLOC_DMA.

    The name of KMALLOC_NORMAL is contained in kmalloc_info[].name,
    but the names of KMALLOC_RECLAIM and KMALLOC_DMA are dynamically
    generated by kmalloc_cache_name().

    Patch1 predefines the names of all types of kmalloc to save
    the time spent dynamically generating names.

    These changes make sense, and the time spent by new_kmalloc_cache()
    has been reduced by approximately 36.3%.

    Time spent by new_kmalloc_cache() (CPU cycles):
      5.3-rc7        66264
      5.3-rc7+patch  42188

    This patch (of 3):

    There are three types of kmalloc, KMALLOC_NORMAL, KMALLOC_RECLAIM and
    KMALLOC_DMA.

    The name of KMALLOC_NORMAL is contained in kmalloc_info[].name, but the
    names of KMALLOC_RECLAIM and KMALLOC_DMA are dynamically generated by
    kmalloc_cache_name().

    This patch predefines the names of all types of kmalloc to save the time
    spent dynamically generating names.

    Besides, remove the kmalloc_cache_name() that is no longer used.

    Link: http://lkml.kernel.org/r/1569241648-26908-2-git-send-email-lpf.vector@gmail.com
    Signed-off-by: Pengfei Li
    Acked-by: Vlastimil Babka
    Acked-by: Roman Gushchin
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pengfei Li
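
    A sketch of how kmalloc_info[] can carry all the names at build time.
    The INIT_KMALLOC_INFO macro name and the array excerpt are illustrative
    assumptions (in practice the KMALLOC_DMA entry would be conditional on
    CONFIG_ZONE_DMA):

        #define INIT_KMALLOC_INFO(__size, __short_size)                 \
        {                                                               \
                .name[KMALLOC_NORMAL]  = "kmalloc-" #__short_size,     \
                .name[KMALLOC_RECLAIM] = "kmalloc-rcl-" #__short_size, \
                .name[KMALLOC_DMA]     = "dma-kmalloc-" #__short_size, \
                .size = __size,                                         \
        }

        const struct kmalloc_info_struct kmalloc_info[] = {
                INIT_KMALLOC_INFO(8, 8),
                INIT_KMALLOC_INFO(16, 16),
                INIT_KMALLOC_INFO(32, 32),
                /* ... one entry per supported kmalloc size ... */
        };

    With every name a string literal, new_kmalloc_cache() no longer has to
    format names at boot time.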
     

15 Oct, 2019

1 commit

  • Fix kernel-doc warning in mm/slab.c:

    mm/slab.c:4215: warning: Function parameter or member 'objp' not described in '__ksize'

    Also add Return: documentation section for this function.

    Link: http://lkml.kernel.org/r/68c9fd7d-f09e-d376-e292-c7b2bdf1774d@infradead.org
    Fixes: 10d1f8cb3965 ("mm/slab: refactor common ksize KASAN logic into slab_common.c")
    Signed-off-by: Randy Dunlap
    Acked-by: Marco Elver
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

13 Jul, 2019

6 commits

  • Patch series "add init_on_alloc/init_on_free boot options", v10.

    Provide init_on_alloc and init_on_free boot options.

    These are aimed at preventing possible information leaks and making the
    control-flow bugs that depend on uninitialized values more deterministic.

    Enabling either of the options guarantees that the memory returned by the
    page allocator and SL[AU]B is initialized with zeroes. SLOB allocator
    isn't supported at the moment, as its emulation of kmem caches complicates
    handling of SLAB_TYPESAFE_BY_RCU caches correctly.

    Enabling init_on_free also guarantees that pages and heap objects are
    initialized right after they're freed, so it won't be possible to access
    stale data by using a dangling pointer.

    As suggested by Michal Hocko, right now we don't let heap users disable
    initialization for certain allocations. There's not enough evidence that
    doing so can speed up real-life cases, and introducing ways to opt out
    may result in things going out of control.

    This patch (of 2):

    The new options are needed to prevent possible information leaks and make
    control-flow bugs that depend on uninitialized values more deterministic.

    This is expected to be on-by-default on Android and Chrome OS. And it
    gives the opportunity for anyone else to use it under distros too via the
    boot args. (The init_on_free feature is regularly requested by folks
    where memory forensics is included in their threat models.)

    init_on_alloc=1 makes the kernel initialize newly allocated pages and heap
    objects with zeroes. Initialization is done at allocation time at the
    places where checks for __GFP_ZERO are performed.

    init_on_free=1 makes the kernel initialize freed pages and heap objects
    with zeroes upon their deletion. This helps to ensure sensitive data
    doesn't leak via use-after-free accesses.

    Both init_on_alloc=1 and init_on_free=1 guarantee that the allocator
    returns zeroed memory. The two exceptions are slab caches with
    constructors and SLAB_TYPESAFE_BY_RCU flag. Those are never
    zero-initialized to preserve their semantics.

    Both init_on_alloc and init_on_free default to zero, but those defaults
    can be overridden with CONFIG_INIT_ON_ALLOC_DEFAULT_ON and
    CONFIG_INIT_ON_FREE_DEFAULT_ON.

    If either SLUB poisoning or page poisoning is enabled, those options take
    precedence over init_on_alloc and init_on_free: initialization is only
    applied to unpoisoned allocations.

    Slowdown for the new features compared to init_on_free=0, init_on_alloc=0:

    hackbench, init_on_free=1: +7.62% sys time (st.err 0.74%)
    hackbench, init_on_alloc=1: +7.75% sys time (st.err 2.14%)

    Linux build with -j12, init_on_free=1: +8.38% wall time (st.err 0.39%)
    Linux build with -j12, init_on_free=1: +24.42% sys time (st.err 0.52%)
    Linux build with -j12, init_on_alloc=1: -0.13% wall time (st.err 0.42%)
    Linux build with -j12, init_on_alloc=1: +0.57% sys time (st.err 0.40%)

    The slowdown for init_on_free=0, init_on_alloc=0 compared to the baseline
    is within the standard error.

    The new features are also going to pave the way for hardware memory
    tagging (e.g. arm64's MTE), which will require both on_alloc and on_free
    hooks to set the tags for heap objects. With MTE, tagging will have the
    same cost as memory initialization.

    Although init_on_free is rather costly, there are paranoid use-cases where
    in-memory data lifetime is desired to be minimized. There are various
    arguments for/against the realism of the associated threat models, but
    given that we'll need the infrastructure for MTE anyway, and there are
    people who want wipe-on-free behavior no matter what the performance cost,
    it seems reasonable to include it in this series.

    [glider@google.com: v8]
    Link: http://lkml.kernel.org/r/20190626121943.131390-2-glider@google.com
    [glider@google.com: v9]
    Link: http://lkml.kernel.org/r/20190627130316.254309-2-glider@google.com
    [glider@google.com: v10]
    Link: http://lkml.kernel.org/r/20190628093131.199499-2-glider@google.com
    Link: http://lkml.kernel.org/r/20190617151050.92663-2-glider@google.com
    Signed-off-by: Alexander Potapenko
    Acked-by: Kees Cook
    Acked-by: Michal Hocko [page and dmapool parts]
    Acked-by: James Morris
    Cc: Christoph Lameter
    Cc: Masahiro Yamada
    Cc: "Serge E. Hallyn"
    Cc: Nick Desaulniers
    Cc: Kostya Serebryany
    Cc: Dmitry Vyukov
    Cc: Sandeep Patil
    Cc: Laura Abbott
    Cc: Randy Dunlap
    Cc: Jann Horn
    Cc: Mark Rutland
    Cc: Marco Elver
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
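
    The allocator-facing side reduces to two small predicates, roughly as
    below (a simplified sketch; the want_init_on_alloc()/want_init_on_free()
    names are assumptions, and the CONFIG_INIT_ON_*_DEFAULT_ON defaults and
    the "poisoning takes precedence" rule are omitted here):

        DECLARE_STATIC_KEY_FALSE(init_on_alloc);
        DECLARE_STATIC_KEY_FALSE(init_on_free);

        /*
         * Zero at allocation time if init_on_alloc=1, or if the caller
         * passed __GFP_ZERO anyway.
         */
        static inline bool want_init_on_alloc(gfp_t flags)
        {
                if (static_branch_unlikely(&init_on_alloc))
                        return true;
                return flags & __GFP_ZERO;
        }

        /* Zero at free time when init_on_free=1. */
        static inline bool want_init_on_free(void)
        {
                return static_branch_unlikely(&init_on_free);
        }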
     
  • Currently the page accounting code is duplicated in SLAB and SLUB
    internals. Let's move it into new (un)charge_slab_page helpers in the
    slab_common.c file. These helpers will be responsible for statistics
    (global and memcg-aware) and memcg charging. So they are replacing direct
    memcg_(un)charge_slab() calls.

    Link: http://lkml.kernel.org/r/20190611231813.3148843-6-guro@fb.com
    Signed-off-by: Roman Gushchin
    Reviewed-by: Shakeel Butt
    Acked-by: Christoph Lameter
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Waiman Long
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: Andrei Vagin
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Currently SLUB uses a work scheduled after an RCU grace period to
    deactivate a non-root kmem_cache. This mechanism can be reused for
    kmem_caches release, but requires generalization for SLAB case.

    Introduce kmemcg_cache_deactivate() function, which calls
    allocator-specific __kmem_cache_deactivate() and schedules execution of
    __kmem_cache_deactivate_after_rcu() with all necessary locks in a worker
    context after an rcu grace period.

    Here is the new calling scheme:
      kmemcg_cache_deactivate()
        __kmemcg_cache_deactivate()                 SLAB/SLUB-specific
        kmemcg_rcufn()                              rcu
          kmemcg_workfn()                           work
            __kmemcg_cache_deactivate_after_rcu()   SLAB/SLUB-specific

    instead of:
      __kmemcg_cache_deactivate()                   SLAB/SLUB-specific
      slab_deactivate_memcg_cache_rcu_sched()       SLUB-only
        kmemcg_rcufn()                              rcu
          kmemcg_workfn()                           work
            kmemcg_cache_deact_after_rcu()          SLUB-only

    For consistency, all allocator-specific functions start with "__".

    Link: http://lkml.kernel.org/r/20190611231813.3148843-4-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Vladimir Davydov
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Waiman Long
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: Andrei Vagin
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Patch series "mm: reparent slab memory on cgroup removal", v7.

    # Why do we need this?

    We've noticed that the number of dying cgroups is steadily growing on most
    of our hosts in production. The following investigation revealed an issue
    in the userspace memory reclaim code [1], accounting of kernel stacks [2],
    and also the main reason: slab objects.

    The underlying problem is quite simple: any page charged to a cgroup holds
    a reference to it, so the cgroup can't be reclaimed unless all charged
    pages are gone. If a slab object is actively used by other cgroups, it
    won't be reclaimed, and will prevent the origin cgroup from being
    reclaimed.

    Slab objects, and first of all the vfs cache, are shared between cgroups
    that use the same underlying fs, and, what's even more important, they
    are shared between multiple generations of the same workload. So if
    something runs periodically, each time in a new cgroup (like how systemd
    works), we do accumulate multiple dying cgroups.

    Strictly speaking pagecache isn't different here, but there is a key
    difference: we disable protection and apply some extra pressure on LRUs of
    dying cgroups, and these LRUs contain all charged pages. My experiments
    show that with the disabled kernel memory accounting the number of dying
    cgroups stabilizes at a relatively small number (~100, depends on memory
    pressure and cgroup creation rate), and with kernel memory accounting it
    grows pretty steadily up to several thousands.

    Memory cgroups are quite complex and big objects (mostly due to percpu
    stats), so it leads to noticeable memory losses. Memory occupied by dying
    cgroups is measured in hundreds of megabytes. I've even seen a host with
    more than 100Gb of memory wasted for dying cgroups. It leads to a
    degradation of performance with the uptime, and generally limits the usage
    of cgroups.

    My previous attempt [3] to fix the problem by applying extra pressure on
    slab shrinker lists caused regressions with xfs and ext4, and has been
    reverted [4]. The following attempts to find the right balance [5, 6]
    were not successful.

    So instead of trying to find a maybe non-existent balance, let's reparent
    accounted slab caches to the parent cgroup on cgroup removal.

    # Implementation approach

    There is however a significant problem with reparenting of slab memory:
    there is no list of charged pages. Some of them are in shrinker lists,
    but not all. Introducing a new list is really not an option.

    But fortunately there is a way forward: every slab page has a stable
    pointer to the corresponding kmem_cache. So the idea is to reparent
    kmem_caches instead of slab pages.

    It's actually simpler and cheaper, but requires some underlying changes:
    1) Make kmem_caches to hold a single reference to the memory cgroup,
    instead of a separate reference per every slab page.
    2) Stop setting page->mem_cgroup pointer for memcg slab pages and use
    page->kmem_cache->memcg indirection instead. It's used only on
    slab page release, so performance overhead shouldn't be a big issue.
    3) Introduce a refcounter for non-root slab caches. It's required to
    be able to destroy kmem_caches when they become empty and release
    the associated memory cgroup.

    There is a bonus: currently we release all memcg kmem_caches all together
    with the memory cgroup itself. This patchset allows individual
    kmem_caches to be released as soon as they become inactive and free.

    Some additional implementation details are provided in corresponding
    commit messages.

    # Results

    Below is the average number of dying cgroups on two groups of our
    production hosts. They run some sort of web frontend workload, and the
    memory pressure is moderate. As we can see, with kernel memory
    reparenting the number stabilizes in the 60s range; however, with the
    original version it grows almost linearly and doesn't show any signs of
    plateauing. The difference in slab and percpu usage between patched and
    unpatched versions also grows linearly. In 7 days it exceeded 200Mb.

    day               0     1     2     3     4     5     6     7
    original         56   362   628   752  1070  1250  1490  1560
    patched          23    46    51    55    60    57    67    69
    mem diff(Mb)     22    74   123   152   164   182   214   241

    # Links

    [1]: commit 68600f623d69 ("mm: don't miss the last page because of round-off error")
    [2]: commit 9b6f7e163cd0 ("mm: rework memcg kernel stack accounting")
    [3]: commit 172b06c32b94 ("mm: slowly shrink slabs with a relatively small number of objects")
    [4]: commit a9a238e83fbb ("Revert "mm: slowly shrink slabs with a relatively small number of objects")
    [5]: https://lkml.org/lkml/2019/1/28/1865
    [6]: https://marc.info/?l=linux-mm&m=155064763626437&w=2

    This patch (of 10):

    Initialize kmem_cache->memcg_params.memcg pointer in memcg_link_cache()
    rather than in init_memcg_params().

    Once kmem_cache holds a reference to the memory cgroup, this will
    simplify the refcounting.

    For non-root kmem_caches memcg_link_cache() is always called before the
    kmem_cache becomes visible to a user, so it's safe.

    Link: http://lkml.kernel.org/r/20190611231813.3148843-2-guro@fb.com
    Signed-off-by: Roman Gushchin
    Reviewed-by: Shakeel Butt
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Waiman Long
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrei Vagin
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • This refactors common code of ksize() between the various allocators into
    slab_common.c: __ksize() is the allocator-specific implementation without
    instrumentation, whereas ksize() includes the required KASAN logic.

    Link: http://lkml.kernel.org/r/20190626142014.141844-5-elver@google.com
    Signed-off-by: Marco Elver
    Acked-by: Christoph Lameter
    Reviewed-by: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Alexander Potapenko
    Cc: Andrey Konovalov
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Mark Rutland
    Cc: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marco Elver
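
    The split leaves ksize() itself as a thin wrapper, roughly (a sketch;
    kasan_unpoison_shadow() was the unpoisoning primitive at the time):

        size_t ksize(const void *objp)
        {
                size_t size = __ksize(objp);    /* allocator-specific, uninstrumented */

                /*
                 * Callers may legitimately use the whole allocated area,
                 * so unpoison it for KASAN.
                 */
                kasan_unpoison_shadow(objp, size);
                return size;
        }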
     
  • This avoids any possible type confusion when looking up an object. For
    example, if a non-slab were to be passed to kfree(), the invalid
    slab_cache pointer (i.e. overlapped with some other value from the
    struct page union) would be used for subsequent slab manipulations that
    could lead to further memory corruption.

    Since the page is already in cache, adding the PageSlab() check will
    have nearly zero cost, so add a check and WARN() to virt_to_cache().
    Additionally replaces an open-coded virt_to_cache(). To support the
    failure mode this also updates all callers of virt_to_cache() and
    cache_from_obj() to handle a NULL cache pointer return value (though
    note that several already handle this case gracefully).

    [dan.carpenter@oracle.com: restore IRQs in kfree()]
    Link: http://lkml.kernel.org/r/20190613065637.GE16334@mwanda
    Link: http://lkml.kernel.org/r/20190530045017.15252-3-keescook@chromium.org
    Signed-off-by: Kees Cook
    Signed-off-by: Dan Carpenter
    Cc: Alexander Popov
    Cc: Alexander Potapenko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Greg Kroah-Hartman
    Cc: Joonsoo Kim
    Cc: Matthew Wilcox
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
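
    The hardened lookup is roughly the following (a sketch matching the
    description above; virt_to_head_page() and PageSlab() are the existing
    page helpers):

        static inline struct kmem_cache *virt_to_cache(const void *obj)
        {
                struct page *page;

                page = virt_to_head_page(obj);
                if (WARN_ONCE(!PageSlab(page),
                              "%s: Object is not a Slab page!\n", __func__))
                        return NULL;    /* callers must tolerate NULL */
                return page->slab_cache;
        }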
     

17 May, 2019

1 commit

  • It turned out that DEBUG_SLAB_LEAK is still broken even after recent
    rescue efforts: when there is a large number of objects, like
    kmemleak_object, which is normal on a debug kernel,

    # grep kmemleak /proc/slabinfo
    kmemleak_object 2243606 3436210 ...

    reading /proc/slab_allocators could easily loop forever while processing
    the kmemleak_object cache, and any additional freeing or allocating of
    objects will trigger a reprocessing. To make the situation worse,
    soft-lockups could easily happen in this situation, which will call
    printk() to allocate more kmemleak objects, guaranteeing an infinite
    loop.

    Also, since it seems no one had noticed when it was totally broken more
    than 2 years ago - see commit fcf88917dd43 ("slab: fix a crash by
    reading /proc/slab_allocators") - probably nobody cares about it anymore
    due to the decline of SLAB. Just remove it entirely.

    Suggested-by: Vlastimil Babka
    Suggested-by: Linus Torvalds
    Signed-off-by: Qian Cai
    Signed-off-by: Linus Torvalds

    Qian Cai
     

15 May, 2019

3 commits

  • "cat /proc/slab_allocators" could hang forever on SMP machines with
    kmemleak or object debugging enabled due to other CPUs running do_drain()
    will keep making kmemleak_object or debug_objects_cache dirty and unable
    to escape the first loop in leaks_show(),

    do {
    set_store_user_clean(cachep);
    drain_cpu_caches(cachep);
    ...

    } while (!is_store_user_clean(cachep));

    For example,

    do_drain
      slabs_destroy
        slab_destroy
          kmem_cache_free
            __cache_free
              ___cache_free
                kmemleak_free_recursive
                  delete_object_full
                    __delete_object
                      put_object
                        free_object_rcu
                          kmem_cache_free
                            cache_free_debugcheck   --> dirty kmemleak_object

    One approach is to check cachep->name and skip both kmemleak_object and
    debug_objects_cache in leaks_show(). The other is to set store_user_clean
    after drain_cpu_caches(), which leaves a small window between
    drain_cpu_caches() and set_store_user_clean() where the per-CPU caches
    could become dirty again, leading to slightly stale information being
    stored, but it also speeds things up significantly, which sounds like a
    good compromise. For example,

    # cat /proc/slab_allocators
    0m42.778s # 1st approach
    0m0.737s # 2nd approach

    [akpm@linux-foundation.org: tweak comment]
    Link: http://lkml.kernel.org/r/20190411032635.10325-1-cai@lca.pw
    Fixes: d31676dfde25 ("mm/slab: alternative implementation for DEBUG_SLAB_LEAK")
    Signed-off-by: Qian Cai
    Reviewed-by: Andrew Morton
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • nc is a member of a percpu allocation and therefore cannot be NULL.

    Link: http://lkml.kernel.org/r/1553159353-5056-1-git-send-email-lirongqing@baidu.com
    Signed-off-by: Li RongQing
    Reviewed-by: Andrew Morton
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li RongQing
     
  • Currently we use the page->lru list for maintaining lists of slabs. We
    have a list in the page structure (slab_list) that can be used for this
    purpose. Doing so makes the code cleaner since we are not overloading the
    lru list.

    Use the slab_list instead of the lru list for maintaining lists of slabs.

    Link: http://lkml.kernel.org/r/20190402230545.2929-7-tobin@kernel.org
    Signed-off-by: Tobin C. Harding
    Acked-by: Christoph Lameter
    Reviewed-by: Roman Gushchin
    Acked-by: Vlastimil Babka
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Matthew Wilcox
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tobin C. Harding
     

07 May, 2019

1 commit

  • Pull x86 irq updates from Ingo Molnar:
    "Here are the main changes in this tree:

    - Introduce x86-64 IRQ/exception/debug stack guard pages to detect
    stack overflows immediately and deterministically.

    - Clean up over a decade worth of cruft accumulated.

    The outcome of this should be more clear-cut faults/crashes when any
    of the low level x86 CPU stacks overflow, instead of silent memory
    corruption and sporadic failures much later on"

    * 'x86-irq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
    x86/irq: Fix outdated comments
    x86/irq/64: Remove stack overflow debug code
    x86/irq/64: Remap the IRQ stack with guard pages
    x86/irq/64: Split the IRQ stack into its own pages
    x86/irq/64: Init hardirq_stack_ptr during CPU hotplug
    x86/irq/32: Handle irq stack allocation failure proper
    x86/irq/32: Invoke irq_ctx_init() from init_IRQ()
    x86/irq/64: Rename irq_stack_ptr to hardirq_stack_ptr
    x86/irq/32: Rename hard/softirq_stack to hard/softirq_stack_ptr
    x86/irq/32: Make irq stack a character array
    x86/irq/32: Define IRQ_STACK_SIZE
    x86/dumpstack/64: Speedup in_exception_stack()
    x86/exceptions: Split debug IST stack
    x86/exceptions: Enable IST guard pages
    x86/exceptions: Disconnect IST index and stack order
    x86/cpu: Remove orig_ist array
    x86/cpu: Prepare TSS.IST setup for guard pages
    x86/dumpstack/64: Use cpu_entry_area instead of orig_ist
    x86/irq/64: Use cpu entry area instead of orig_ist
    x86/traps: Use cpu_entry_area instead of orig_ist
    ...

    Linus Torvalds
     

20 Apr, 2019

1 commit

  • Commit 51dedad06b5f ("kasan, slab: make freelist stored without tags")
    calls kasan_reset_tag() for the off-slab slab management object, leading
    to the freelist being stored non-tagged.

    However, cache_grow_begin() calls alloc_slabmgmt(), which calls
    kmem_cache_alloc_node(), which assigns a tag for the address and stores
    it in the shadow address. As a result, it causes the endless errors
    below during boot due to drain_freelist() -> slab_destroy() ->
    kasan_slab_free(), which compares the already-untagged freelist against
    the tag stored in the shadow address.

    Since the off-slab slab management object freelist is such a special
    case, just store it tagged. The non-off-slab management object freelist
    is still stored untagged; it has not been assigned a tag and should not
    cause any other trouble from this inconsistency.

    BUG: KASAN: double-free or invalid-free in slab_destroy+0x84/0x88
    Pointer tag: [ff], memory tag: [99]

    CPU: 0 PID: 1376 Comm: kworker/0:4 Tainted: G W 5.1.0-rc3+ #8
    Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.0.6 07/10/2018
    Workqueue: cgroup_destroy css_killed_work_fn
    Call trace:
    print_address_description+0x74/0x2a4
    kasan_report_invalid_free+0x80/0xc0
    __kasan_slab_free+0x204/0x208
    kasan_slab_free+0xc/0x18
    kmem_cache_free+0xe4/0x254
    slab_destroy+0x84/0x88
    drain_freelist+0xd0/0x104
    __kmem_cache_shrink+0x1ac/0x224
    __kmemcg_cache_deactivate+0x1c/0x28
    memcg_deactivate_kmem_caches+0xa0/0xe8
    memcg_offline_kmem+0x8c/0x3d4
    mem_cgroup_css_offline+0x24c/0x290
    css_killed_work_fn+0x154/0x618
    process_one_work+0x9cc/0x183c
    worker_thread+0x9b0/0xe38
    kthread+0x374/0x390
    ret_from_fork+0x10/0x18

    Allocated by task 1625:
    __kasan_kmalloc+0x168/0x240
    kasan_slab_alloc+0x18/0x20
    kmem_cache_alloc_node+0x1f8/0x3a0
    cache_grow_begin+0x4fc/0xa24
    cache_alloc_refill+0x2f8/0x3e8
    kmem_cache_alloc+0x1bc/0x3bc
    sock_alloc_inode+0x58/0x334
    alloc_inode+0xb8/0x164
    new_inode_pseudo+0x20/0xec
    sock_alloc+0x74/0x284
    __sock_create+0xb0/0x58c
    sock_create+0x98/0xb8
    __sys_socket+0x60/0x138
    __arm64_sys_socket+0xa4/0x110
    el0_svc_handler+0x2c0/0x47c
    el0_svc+0x8/0xc

    Freed by task 1625:
    __kasan_slab_free+0x114/0x208
    kasan_slab_free+0xc/0x18
    kfree+0x1a8/0x1e0
    single_release+0x7c/0x9c
    close_pdeo+0x13c/0x43c
    proc_reg_release+0xec/0x108
    __fput+0x2f8/0x784
    ____fput+0x1c/0x28
    task_work_run+0xc0/0x1b0
    do_notify_resume+0xb44/0x1278
    work_pending+0x8/0x10

    The buggy address belongs to the object at ffff809681b89e00
    which belongs to the cache kmalloc-128 of size 128
    The buggy address is located 0 bytes inside of
    128-byte region [ffff809681b89e00, ffff809681b89e80)
    The buggy address belongs to the page:
    page:ffff7fe025a06e00 count:1 mapcount:0 mapping:01ff80082000fb00
    index:0xffff809681b8fe04
    flags: 0x17ffffffc000200(slab)
    raw: 017ffffffc000200 ffff7fe025a06d08 ffff7fe022ef7b88 01ff80082000fb00
    raw: ffff809681b8fe04 ffff809681b80000 00000001000000e0 0000000000000000
    page dumped because: kasan: bad access detected
    page allocated via order 0, migratetype Unmovable, gfp_mask
    0x2420c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE)
    prep_new_page+0x4e0/0x5e0
    get_page_from_freelist+0x4ce8/0x50d4
    __alloc_pages_nodemask+0x738/0x38b8
    cache_grow_begin+0xd8/0xa24
    ____cache_alloc_node+0x14c/0x268
    __kmalloc+0x1c8/0x3fc
    ftrace_free_mem+0x408/0x1284
    ftrace_free_init_mem+0x20/0x28
    kernel_init+0x24/0x548
    ret_from_fork+0x10/0x18

    Memory state around the buggy address:
    ffff809681b89c00: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
    ffff809681b89d00: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
    >ffff809681b89e00: 99 99 99 99 99 99 99 99 fe fe fe fe fe fe fe fe
    ^
    ffff809681b89f00: 43 43 43 43 43 fe fe fe fe fe fe fe fe fe fe fe
    ffff809681b8a000: 6d fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe

    Link: http://lkml.kernel.org/r/20190403022858.97584-1-cai@lca.pw
    Fixes: 51dedad06b5f ("kasan, slab: make freelist stored without tags")
    Signed-off-by: Qian Cai
    Reviewed-by: Andrey Konovalov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     

17 Apr, 2019

1 commit

  • store_stackinfo() does not seem to be used in actual SLAB debugging.
    Potentially, it could be added to check_poison_obj() to provide more
    information, but this seems like overkill given the declining popularity
    of SLAB, so just remove it instead.

    Signed-off-by: Qian Cai
    Signed-off-by: Borislav Petkov
    Acked-by: Thomas Gleixner
    Acked-by: Vlastimil Babka
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Josh Poimboeuf
    Cc: linux-mm
    Cc: Pekka Enberg
    Cc: rientjes@google.com
    Cc: sean.j.christopherson@intel.com
    Link: https://lkml.kernel.org/r/20190416142258.18694-1-cai@lca.pw

    Qian Cai
     

08 Apr, 2019

1 commit

  • Commit 510ded33e075 ("slab: implement slab_root_caches list") changed
    the name of the list node within "struct kmem_cache" from "list" to
    "root_caches_node", but leaks_show() still uses "list", which causes a
    crash when reading /proc/slab_allocators.

    You need to have CONFIG_SLAB=y and CONFIG_MEMCG=y to see the problem,
    because without MEMCG all slab caches are root caches, and the "list"
    node happens to be the right one.

    Fixes: 510ded33e075 ("slab: implement slab_root_caches list")
    Signed-off-by: Qian Cai
    Reviewed-by: Tobin C. Harding
    Cc: Tejun Heo
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     

30 Mar, 2019

1 commit

  • Patch series "iommu/io-pgtable-arm-v7s: Use DMA32 zone for page tables",
    v6.

    This is a followup to the discussion in [1], [2].

    IOMMUs using ARMv7 short-descriptor format require page tables (level 1
    and 2) to be allocated within the first 4GB of RAM, even on 64-bit
    systems.

    For L1 tables that are bigger than a page, we can just use
    __get_free_pages with GFP_DMA32 (on arm64 systems only, arm would still
    use GFP_DMA).

    For L2 tables that only take 1KB, it would be a waste to allocate a full
    page, so we considered 3 approaches:
    1. This series, adding support for GFP_DMA32 slab caches.
    2. genalloc, which requires pre-allocating the maximum number of L2 page
    tables (4096, so 4MB of memory).
    3. page_frag, which is not very memory-efficient as it is unable to reuse
    freed fragments until the whole page is freed. [3]

    This series is the most memory-efficient approach.

    stable@ note:
    We confirmed that this is a regression, and IOMMU errors happen on 4.19
    and linux-next/master on MT8173 (elm, Acer Chromebook R13). The issue
    most likely starts from commit ad67f5a6545f ("arm64: replace ZONE_DMA
    with ZONE_DMA32"), i.e. 4.15, and presumably breaks a number of Mediatek
    platforms (and maybe others?).

    [1] https://lists.linuxfoundation.org/pipermail/iommu/2018-November/030876.html
    [2] https://lists.linuxfoundation.org/pipermail/iommu/2018-December/031696.html
    [3] https://patchwork.codeaurora.org/patch/671639/

    This patch (of 3):

    IOMMUs using ARMv7 short-descriptor format require page tables to be
    allocated within the first 4GB of RAM, even on 64-bit systems. On arm64,
    this is done by passing GFP_DMA32 flag to memory allocation functions.

    For IOMMU L2 tables that only take 1KB, it would be a waste to allocate
    a full page using get_free_pages, so we considered 3 approaches:
    1. This patch, adding support for GFP_DMA32 slab caches.
    2. genalloc, which requires pre-allocating the maximum number of L2
    page tables (4096, so 4MB of memory).
    3. page_frag, which is not very memory-efficient as it is unable
    to reuse freed fragments until the whole page is freed.

    This change makes it possible to create a custom cache in DMA32 zone using
    kmem_cache_create, then allocate memory using kmem_cache_alloc.

    We do not create a DMA32 kmalloc cache array, as there are currently no
    users of kmalloc(..., GFP_DMA32). These calls will continue to trigger a
    warning, as we keep GFP_DMA32 in GFP_SLAB_BUG_MASK.

    This implies that calls to kmem_cache_*alloc on a SLAB_CACHE_DMA32
    kmem_cache must _not_ use GFP_DMA32 (it is anyway redundant and
    unnecessary).

    Link: http://lkml.kernel.org/r/20181210011504.122604-2-drinkcat@chromium.org
    Signed-off-by: Nicolas Boichat
    Acked-by: Vlastimil Babka
    Acked-by: Will Deacon
    Cc: Robin Murphy
    Cc: Joerg Roedel
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Sasha Levin
    Cc: Huaisheng Ye
    Cc: Mike Rapoport
    Cc: Yong Wu
    Cc: Matthias Brugger
    Cc: Tomasz Figa
    Cc: Yingjoe Chen
    Cc: Christoph Hellwig
    Cc: Matthew Wilcox
    Cc: Hsin-Yi Wang
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicolas Boichat
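
    Usage then looks like any other dedicated cache, just with the new flag.
    A hypothetical example (the cache name, size and caller are made up, not
    taken from the IOMMU driver):

        #include <linux/slab.h>

        static struct kmem_cache *l2_table_cache;

        static int example_init(void)
        {
                /* 1 KiB ARMv7s L2 tables, guaranteed to sit below 4 GiB. */
                l2_table_cache = kmem_cache_create("example_l2_tables",
                                                   1024, 1024,
                                                   SLAB_CACHE_DMA32, NULL);
                if (!l2_table_cache)
                        return -ENOMEM;
                return 0;
        }

    Allocations from the cache must not pass GFP_DMA32 themselves, e.g.
    kmem_cache_alloc(l2_table_cache, GFP_KERNEL).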
     

06 Mar, 2019

3 commits

  • Many kernel-doc comments in mm/ have the return value descriptions either
    misformatted or omitted altogether, which makes the kernel-doc script
    unhappy:

    $ make V=1 htmldocs
    ...
    ./mm/util.c:36: info: Scanning doc for kstrdup
    ./mm/util.c:41: warning: No description found for return value of 'kstrdup'
    ./mm/util.c:57: info: Scanning doc for kstrdup_const
    ./mm/util.c:66: warning: No description found for return value of 'kstrdup_const'
    ./mm/util.c:75: info: Scanning doc for kstrndup
    ./mm/util.c:83: warning: No description found for return value of 'kstrndup'
    ...

    Fixing the formatting and adding the missing return value descriptions
    eliminates ~100 such warnings.

    Link: http://lkml.kernel.org/r/1549549644-4903-4-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
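
    For reference, the expected kernel-doc shape with a Return: section, for
    kstrdup(), which the warnings above point at (the wording of the comment
    is illustrative):

        /**
         * kstrdup - allocate space for and copy an existing string
         * @s: the string to duplicate
         * @gfp: the GFP mask used in the kmalloc() call when allocating memory
         *
         * Return: newly allocated copy of @s or %NULL in case of error
         */
        char *kstrdup(const char *s, gfp_t gfp);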
     
  • Number of NUMA nodes can't be negative.

    This saves a few bytes on x86_64:

    add/remove: 0/0 grow/shrink: 4/21 up/down: 27/-265 (-238)
    Function                     old   new  delta
    hv_synic_alloc.cold           88   110    +22
    prealloc_shrinker            260   262     +2
    bootstrap                    249   251     +2
    sched_init_numa             1566  1567     +1
    show_slab_objects            778   777     -1
    s_show                      1201  1200     -1
    kmem_cache_init              346   345     -1
    __alloc_workqueue_key       1146  1145     -1
    mem_cgroup_css_alloc        1614  1612     -2
    __do_sys_swapon             4702  4699     -3
    __list_lru_init              655   651     -4
    nic_probe                   2379  2374     -5
    store_user_store             118   111     -7
    red_zone_store               106    99     -7
    poison_store                 106    99     -7
    wq_numa_init                 348   338    -10
    __kmem_cache_empty            75    65    -10
    task_numa_free               186   173    -13
    merge_across_nodes_store     351   336    -15
    irq_create_affinity_masks   1261  1246    -15
    do_numa_crng_init            343   321    -22
    task_numa_fault             4760  4737    -23
    swapfile_init                179   156    -23
    hv_synic_alloc               536   492    -44
    apply_wqattrs_prepare        746   695    -51

    Link: http://lkml.kernel.org/r/20190201223029.GA15820@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Kmemleak throws endless warnings during boot due to the following code in
    __alloc_alien_cache():

    alc = kmalloc_node(memsize, gfp, node);
    init_arraycache(&alc->ac, entries, batch);
    kmemleak_no_scan(ac);

    Kmemleak does not track the array cache (alc->ac) but the alien cache
    (alc) instead, so let it track the latter by lifting kmemleak_no_scan()
    out of init_arraycache().

    There is another place that calls init_arraycache(), but
    alloc_kmem_cache_cpus() uses a percpu allocation, which will never be
    considered a leak.

    kmemleak: Found object by alias at 0xffff8007b9aa7e38
    CPU: 190 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc2+ #2
    Call trace:
    dump_backtrace+0x0/0x168
    show_stack+0x24/0x30
    dump_stack+0x88/0xb0
    lookup_object+0x84/0xac
    find_and_get_object+0x84/0xe4
    kmemleak_no_scan+0x74/0xf4
    setup_kmem_cache_node+0x2b4/0x35c
    __do_tune_cpucache+0x250/0x2d4
    do_tune_cpucache+0x4c/0xe4
    enable_cpucache+0xc8/0x110
    setup_cpu_cache+0x40/0x1b8
    __kmem_cache_create+0x240/0x358
    create_cache+0xc0/0x198
    kmem_cache_create_usercopy+0x158/0x20c
    kmem_cache_create+0x50/0x64
    fsnotify_init+0x58/0x6c
    do_one_initcall+0x194/0x388
    kernel_init_freeable+0x668/0x688
    kernel_init+0x18/0x124
    ret_from_fork+0x10/0x18
    kmemleak: Object 0xffff8007b9aa7e00 (size 256):
    kmemleak: comm "swapper/0", pid 1, jiffies 4294697137
    kmemleak: min_count = 1
    kmemleak: count = 0
    kmemleak: flags = 0x1
    kmemleak: checksum = 0
    kmemleak: backtrace:
    kmemleak_alloc+0x84/0xb8
    kmem_cache_alloc_node_trace+0x31c/0x3a0
    __kmalloc_node+0x58/0x78
    setup_kmem_cache_node+0x26c/0x35c
    __do_tune_cpucache+0x250/0x2d4
    do_tune_cpucache+0x4c/0xe4
    enable_cpucache+0xc8/0x110
    setup_cpu_cache+0x40/0x1b8
    __kmem_cache_create+0x240/0x358
    create_cache+0xc0/0x198
    kmem_cache_create_usercopy+0x158/0x20c
    kmem_cache_create+0x50/0x64
    fsnotify_init+0x58/0x6c
    do_one_initcall+0x194/0x388
    kernel_init_freeable+0x668/0x688
    kernel_init+0x18/0x124
    kmemleak: Not scanning unknown object at 0xffff8007b9aa7e38
    CPU: 190 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc2+ #2
    Call trace:
    dump_backtrace+0x0/0x168
    show_stack+0x24/0x30
    dump_stack+0x88/0xb0
    kmemleak_no_scan+0x90/0xf4
    setup_kmem_cache_node+0x2b4/0x35c
    __do_tune_cpucache+0x250/0x2d4
    do_tune_cpucache+0x4c/0xe4
    enable_cpucache+0xc8/0x110
    setup_cpu_cache+0x40/0x1b8
    __kmem_cache_create+0x240/0x358
    create_cache+0xc0/0x198
    kmem_cache_create_usercopy+0x158/0x20c
    kmem_cache_create+0x50/0x64
    fsnotify_init+0x58/0x6c
    do_one_initcall+0x194/0x388
    kernel_init_freeable+0x668/0x688
    kernel_init+0x18/0x124
    ret_from_fork+0x10/0x18

    Link: http://lkml.kernel.org/r/20190129184518.39808-1-cai@lca.pw
    Fixes: 1fe00d50a9e8 ("slab: factor out initialization of array cache")
    Signed-off-by: Qian Cai
    Reviewed-by: Andrew Morton
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     

22 Feb, 2019

1 commit

  • kasan_slab_alloc() calls in kmem_cache_alloc() and kmem_cache_alloc_node()
    are redundant as they are already called via slab_alloc/slab_alloc_node()->
    slab_post_alloc_hook()->kasan_slab_alloc(). Remove them.

    Link: http://lkml.kernel.org/r/4ca1655cdcfc4379c49c50f7bf80f81c4ad01485.1550602886.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Tested-by: Qian Cai
    Cc: Alexander Potapenko
    Cc: Andrey Ryabinin
    Cc: Catalin Marinas
    Cc: Dmitry Vyukov
    Cc: Evgeniy Stepanov
    Cc: Kostya Serebryany
    Cc: Vincenzo Frascino
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov