15 Nov, 2020

1 commit

  • While doing memory hot-unplug operation on a PowerPC VM running 1024 CPUs
    with 11TB of ram, I hit the following panic:

    BUG: Kernel NULL pointer dereference on read at 0x00000007
    Faulting instruction address: 0xc000000000456048
    Oops: Kernel access of bad area, sig: 11 [#2]
    LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS= 2048 NUMA pSeries
    Modules linked in: rpadlpar_io rpaphp
    CPU: 160 PID: 1 Comm: systemd Tainted: G D 5.9.0 #1
    NIP: c000000000456048 LR: c000000000455fd4 CTR: c00000000047b350
    REGS: c00006028d1b77a0 TRAP: 0300 Tainted: G D (5.9.0)
    MSR: 8000000000009033 CR: 24004228 XER: 00000000
    CFAR: c00000000000f1b0 DAR: 0000000000000007 DSISR: 40000000 IRQMASK: 0
    GPR00: c000000000455fd4 c00006028d1b7a30 c000000001bec800 0000000000000000
    GPR04: 0000000000000dc0 0000000000000000 00000000000374ef c00007c53df99320
    GPR08: 000007c53c980000 0000000000000000 000007c53c980000 0000000000000000
    GPR12: 0000000000004400 c00000001e8e4400 0000000000000000 0000000000000f6a
    GPR16: 0000000000000000 c000000001c25930 c000000001d62528 00000000000000c1
    GPR20: c000000001d62538 c00006be469e9000 0000000fffffffe0 c0000000003c0ff8
    GPR24: 0000000000000018 0000000000000000 0000000000000dc0 0000000000000000
    GPR28: c00007c513755700 c000000001c236a4 c00007bc4001f800 0000000000000001
    NIP [c000000000456048] __kmalloc_node+0x108/0x790
    LR [c000000000455fd4] __kmalloc_node+0x94/0x790
    Call Trace:
    kvmalloc_node+0x58/0x110
    mem_cgroup_css_online+0x10c/0x270
    online_css+0x48/0xd0
    cgroup_apply_control_enable+0x2c4/0x470
    cgroup_mkdir+0x408/0x5f0
    kernfs_iop_mkdir+0x90/0x100
    vfs_mkdir+0x138/0x250
    do_mkdirat+0x154/0x1c0
    system_call_exception+0xf8/0x200
    system_call_common+0xf0/0x27c
    Instruction dump:
    e93e0000 e90d0030 39290008 7cc9402a e94d0030 e93e0000 7ce95214 7f89502a
    2fbc0000 419e0018 41920230 e9270010 7f994800 419e0220 7ee6bb78

    This points to the following code:

    mm/slub.c:2851
    if (unlikely(!object || !node_match(page, node))) {
    c000000000456038: 00 00 bc 2f cmpdi cr7,r28,0
    c00000000045603c: 18 00 9e 41 beq cr7,c000000000456054
    node_match():
    mm/slub.c:2491
    if (node != NUMA_NO_NODE && page_to_nid(page) != node)
    c000000000456040: 30 02 92 41 beq cr4,c000000000456270
    page_to_nid():
    include/linux/mm.h:1294
    c000000000456044: 10 00 27 e9 ld r9,16(r7)
    c000000000456048: 07 00 29 89 lbz r9,7(r9) <<<< r9 = NULL
    node_match():
    mm/slub.c:2491
    c00000000045604c: 00 48 99 7f cmpw cr7,r25,r9
    c000000000456050: 20 02 9e 41 beq cr7,c000000000456270

    The panic occurred in slab_alloc_node() when checking for the page's node:

    object = c->freelist;
    page = c->page;
    if (unlikely(!object || !node_match(page, node))) {
            object = __slab_alloc(s, gfpflags, node, addr, c);
            stat(s, ALLOC_SLOWPATH);

    The issue is that object is not NULL while page is NULL, which is odd but
    may happen if the cache flush occurred after loading object but before
    loading page. Thus the page pointer needs to be checked as well.
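
    A minimal sketch of what the fixed fast-path check could look like, with
    the extra test on the page pointer (a hedged illustration, not necessarily
    the literal upstream hunk):

    object = c->freelist;
    page = c->page;
    /*
     * If the cpu slab was flushed between the two loads above, page can be
     * NULL even though object is not: take the slow path in that case too.
     */
    if (unlikely(!object || !page || !node_match(page, node))) {
            object = __slab_alloc(s, gfpflags, node, addr, c);
            stat(s, ALLOC_SLOWPATH);
    }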

    The cache flush is done through an inter-processor interrupt when a piece
    of memory is offlined. That interrupt is triggered when a memory
    hot-unplug operation is initiated and offline_pages() calls SLUB's
    MEM_GOING_OFFLINE callback slab_mem_going_offline_callback(), which calls
    flush_cpu_slab(). If that interrupt is caught between the reading of
    c->freelist and the reading of c->page, this could lead to such a
    situation. That situation is expected, and the later call to
    this_cpu_cmpxchg_double() will detect the change to c->freelist and redo
    the whole operation.

    Commit 6159d0f5c03e ("mm/slub.c: page is always non-NULL in
    node_match()") removed the check on the page pointer, assuming that page
    is always valid when node_match() is called. That assumption does not
    hold in this particular case, so check page before calling node_match()
    here.

    Fixes: 6159d0f5c03e ("mm/slub.c: page is always non-NULL in node_match()")
    Signed-off-by: Laurent Dufour
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Acked-by: Christoph Lameter
    Cc: Wei Yang
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Nathan Lynch
    Cc: Scott Cheloha
    Cc: Michal Hocko
    Link: https://lkml.kernel.org/r/20201027190406.33283-1-ldufour@linux.ibm.com
    Signed-off-by: Linus Torvalds

    Laurent Dufour
     

17 Oct, 2020

1 commit

    Correct the function name "get_partials" to "get_partial", and update the
    old struct name list3 to kmem_cache_node.

    Signed-off-by: Chen Tao
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Rapoport
    Link: https://lkml.kernel.org/r/Message-ID:
    Signed-off-by: Linus Torvalds

    Chen Tao
     

14 Oct, 2020

4 commits

  • Object cgroup charging is done for all the objects during allocation, but
    during freeing, uncharging ends up happening for only one object in the
    case of bulk allocation/freeing.

    Fix this by having a separate call to uncharge all the objects from
    kmem_cache_free_bulk() and by modifying memcg_slab_free_hook() to take
    care of bulk uncharging.
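
    A rough sketch of the bulk variant of the hook; memcg_slab_uncharge_object()
    below is a hypothetical helper standing in for the real obj_cgroup lookup
    and uncharge:

    static inline void memcg_slab_free_hook(struct kmem_cache *s,
                                            void **p, int objects)
    {
            int i;

            /* uncharge every object of the bulk free, not just the first one */
            for (i = 0; i < objects; i++)
                    memcg_slab_uncharge_object(s, p[i]);
    }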

    Fixes: 964d4bd370d5 ("mm: memcg/slab: save obj_cgroup for non-root slab objects")
    Signed-off-by: Bharata B Rao
    Signed-off-by: Andrew Morton
    Acked-by: Roman Gushchin
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vlastimil Babka
    Cc: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: https://lkml.kernel.org/r/20201009060423.390479-1-bharata@linux.ibm.com
    Signed-off-by: Linus Torvalds

    Bharata B Rao
     
    Commit a4d3f8916c65 ("slub: remove useless kmem_cache_debug() before
    remove_full()") is incomplete, as it didn't handle the add_full() part.

    This patch checks for SLAB_STORE_USER instead of kmem_cache_debug(), since
    that should be the only context in which we need the list_lock for
    add_full().
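
    A hedged sketch of the kind of change described, in the deactivation path
    (not necessarily the exact upstream hunk):

    /* take the list_lock for add_full() only when SLAB_STORE_USER is set */
    if (kmem_cache_debug_flags(s, SLAB_STORE_USER) && !lock) {
            lock = 1;
            spin_lock(&n->list_lock);
    }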

    Signed-off-by: Abel Wu
    Signed-off-by: Andrew Morton
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Liu Xiang
    Link: https://lkml.kernel.org/r/20200811020240.1231-1-wuyun.wu@huawei.com
    Signed-off-by: Linus Torvalds

    Abel Wu
     
    The ALLOC_SLOWPATH statistic is currently missing for bulk allocation. Fix
    it by recording the statistic in the allocation slow path.

    Signed-off-by: Abel Wu
    Signed-off-by: Andrew Morton
    Reviewed-by: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Hewenliang
    Cc: Hu Shiyuan
    Link: http://lkml.kernel.org/r/20200811022427.1363-1-wuyun.wu@huawei.com
    Signed-off-by: Linus Torvalds

    Abel Wu
     
    The two conditions are mutually exclusive, and the gcc compiler will
    optimise this into an if-else-like pattern. Given that the majority of
    slow-path frees hit the free_frozen case, let's give the compiler a hint,
    as sketched below.
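
    A hedged sketch of the annotated branch in __slab_free() (illustrative,
    not the exact upstream hunk):

    if (likely(was_frozen)) {
            /* the common case: the object went back to a frozen cpu slab */
            stat(s, FREE_FROZEN);
    } else if (new.frozen) {
            /* the slab was just frozen, park it on the cpu partial list */
            put_cpu_partial(s, page, 1);
            stat(s, CPU_PARTIAL_FREE);
    }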

    Tests (perf bench sched messaging -g 20 -l 400000, executed 10x after
    reboot) were run; the summarized results:

                un-patched    patched
    max.        192.316       189.851
    min.        187.267       186.252
    avg.        189.154       188.086
    stdev.        1.37          0.99

    Signed-off-by: Abel Wu
    Signed-off-by: Andrew Morton
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Hewenliang
    Cc: Hu Shiyuan
    Link: http://lkml.kernel.org/r/20200813101812.1617-1-wuyun.wu@huawei.com
    Signed-off-by: Linus Torvalds

    Abel Wu
     

04 Oct, 2020

1 commit

    The routine that applies debug flags to the kmem_cache slabs
    inadvertently prevents non-debug flags from being applied to those
    same objects. That is, if slub_debug=<flags>,<slab> is specified,
    non-debugged slabs will end up having flags of zero, and the slabs
    may be unusable.

    Fix this by including the input flags for non-matching slabs with the
    contents of slub_debug, so that the caches are created as expected
    alongside any debugging options that may be requested. With this, we
    can remove the check for a NULL slub_debug_string, since it's covered
    by the loop itself.

    Fixes: e17f1dfba37b ("mm, slub: extend slub_debug syntax for multiple blocks")
    Signed-off-by: Eric Farman
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Kees Cook
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Link: https://lkml.kernel.org/r/20200930161931.28575-1-farman@linux.ibm.com
    Signed-off-by: Linus Torvalds

    Eric Farman
     

06 Sep, 2020

1 commit

  • Commit 52f23478081ae0 ("mm/slub.c: fix corrupted freechain in
    deactivate_slab()") suffered an update when picked up from LKML [1].

    Specifically, relocating 'freelist = NULL' into 'freelist_corrupted()'
    turned it into a no-op statement. Fix it by sticking to the behavior
    intended in the original patch [1]. In addition, make freelist_corrupted()
    immune to being passed NULL instead of &freelist.
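
    A sketch of roughly what the repaired helper looks like, assuming the
    &freelist-taking variant described above (hedged, not a verbatim copy):

    static inline bool freelist_corrupted(struct kmem_cache *s, struct page *page,
                                          void **freelist, void *nextfree)
    {
            if ((s->flags & SLAB_CONSISTENCY_CHECKS) &&
                !check_valid_pointer(s, page, nextfree) && freelist) {
                    object_err(s, page, *freelist, "Freechain corrupt");
                    *freelist = NULL;       /* actually isolates the chain now */
                    slab_fix(s, "Isolate corrupted freechain");
                    return true;
            }

            return false;
    }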

    The issue has been spotted via static analysis and code review.

    [1] https://lore.kernel.org/linux-mm/20200331031450.12182-1-dongli.zhang@oracle.com/

    Fixes: 52f23478081ae0 ("mm/slub.c: fix corrupted freechain in deactivate_slab()")
    Signed-off-by: Eugeniu Rosca
    Signed-off-by: Andrew Morton
    Cc: Dongli Zhang
    Cc: Joe Jin
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Link: https://lkml.kernel.org/r/20200824130643.10291-1-erosca@de.adit-jv.com
    Signed-off-by: Linus Torvalds

    Eugeniu Rosca
     

08 Aug, 2020

21 commits

  • charge_slab_page() and uncharge_slab_page() are not related anymore to
    memcg charging and uncharging. In order to make their names less
    confusing, let's rename them to account_slab_page() and
    unaccount_slab_page() respectively.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Link: http://lkml.kernel.org/r/20200707173612.124425-2-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
    charge_slab_page() no longer uses the gfp argument, so remove it.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200707173612.124425-1-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Instead of having two sets of kmem_caches: one for system-wide and
    non-accounted allocations and the second one shared by all accounted
    allocations, we can use just one.

    The idea is simple: space for obj_cgroup metadata can be allocated on
    demand and filled only for accounted allocations.

    It allows removing a bunch of code that is required to handle kmem_cache
    clones for accounted allocations. There is no longer a need to create
    them, accumulate statistics, propagate attributes, etc. It's quite a
    significant simplification.

    Also, because the total number of slab_caches is reduced almost by half
    (not all kmem_caches have a memcg clone), some additional memory savings
    are expected. On my devvm it additionally saves about 3.5% of slab memory.

    [guro@fb.com: fix build on MIPS]
    Link: http://lkml.kernel.org/r/20200717214810.3733082-1-guro@fb.com

    Suggested-by: Johannes Weiner
    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Naresh Kamboju
    Link: http://lkml.kernel.org/r/20200623174037.3951353-18-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Currently there are two lists of kmem_caches:
    1) slab_caches, which contains all kmem_caches,
    2) slab_root_caches, which contains only root kmem_caches.

    And there is some preprocessor magic to have a single list if
    CONFIG_MEMCG_KMEM isn't enabled.

    It was required earlier because the number of non-root kmem_caches was
    proportional to the number of memory cgroups and could reach really big
    values. Now that it cannot exceed the number of root kmem_caches, there is
    really no reason to maintain two lists.

    We never iterate over the slab_root_caches list on any hot paths, so it's
    perfectly fine to iterate over slab_caches and filter out non-root
    kmem_caches.

    This allows removing a lot of config-dependent code and two pointers from
    the kmem_cache structure.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-16-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
    This is a fairly big but mostly red patch, which makes all accounted slab
    allocations use a single set of kmem_caches instead of creating a separate
    set for each memory cgroup.

    Because the number of non-root kmem_caches is now capped by the number of
    root kmem_caches, there is no need to shrink or destroy them prematurely.
    They can be perfectly destroyed together with their root counterparts.
    This allows us to dramatically simplify the management of non-root
    kmem_caches and delete a ton of code.

    This patch performs the following changes:
    1) introduces a memcg_params.memcg_cache pointer to represent the
       kmem_cache which will be used for all non-root allocations
    2) reuses the existing memcg kmem_cache creation mechanism
       to create the memcg kmem_cache on the first allocation attempt
    3) memcg kmem_caches are named <kmem_cache name>-memcg,
       e.g. dentry-memcg
    4) simplifies memcg_kmem_get_cache() to just return the memcg kmem_cache
       or schedule its creation and return the root cache
    5) removes almost all non-root kmem_cache management code
       (separate refcounter, reparenting, shrinking, etc)
    6) makes slab debugfs display the root_mem_cgroup css id and never
       show the :dead and :deact flags in the memcg_slabinfo attribute.

    Following patches in the series will simplify the kmem_cache creation.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-13-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Store the obj_cgroup pointer in the corresponding place of
    page->obj_cgroups for each allocated non-root slab object. Make sure that
    each allocated object holds a reference to obj_cgroup.

    The objcg pointer is obtained by dereferencing memcg->objcg in
    memcg_kmem_get_cache() and passed from the pre_alloc_hook to the
    post_alloc_hook. Then, in case of successful allocation(s), it is stored
    in the page->obj_cgroups vector.

    The objcg-obtaining part looks a bit bulky now, but it will be simplified
    by later commits in the series.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-9-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • This commit implements SLUB version of the obj_to_index() function, which
    will be required to calculate the offset of obj_cgroup in the obj_cgroups
    vector to store/obtain the objcg ownership data.

    To make it faster, let's repeat the SLAB's trick introduced by commit
    6a2d7a955d8d ("SLAB: use a multiply instead of a divide in
    obj_to_index()") and avoid an expensive division.

    Vlastimil Babka noticed that SLUB already has a similar function called
    slab_index(), which is defined only if SLUB_DEBUG is enabled. The
    function does similar math, but with a division, and it also takes a
    page address instead of a page pointer.

    Let's remove slab_index() and replace it with the new helper
    __obj_to_index(), which takes a page address. obj_to_index() will be a
    simple wrapper taking a page pointer and passing page_address(page) into
    __obj_to_index().
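
    A hedged sketch of the resulting helpers, ignoring KASAN pointer-tag
    handling; reciprocal_size would be a precomputed struct reciprocal_value
    derived from the object size when the cache geometry is calculated:

    static inline unsigned int __obj_to_index(const struct kmem_cache *cache,
                                              void *addr, void *obj)
    {
            /* multiply by a precomputed reciprocal instead of dividing by size */
            return reciprocal_divide(obj - addr, cache->reciprocal_size);
    }

    static inline unsigned int obj_to_index(const struct kmem_cache *cache,
                                            const struct page *page, void *obj)
    {
            return __obj_to_index(cache, page_address(page), obj);
    }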

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Christoph Lameter
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-5-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • In order to prepare for per-object slab memory accounting, convert
    NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat items to bytes.

    To make it obvious, rename them to NR_SLAB_RECLAIMABLE_B and
    NR_SLAB_UNRECLAIMABLE_B (similar to NR_KERNEL_STACK_KB).

    Internally, global and per-node counters are stored in pages, while memcg
    and lruvec counters are stored in bytes. This scheme may look odd, but
    only for now. As soon as slab pages are shared between multiple cgroups,
    global and node counters will reflect the total number of slab pages.
    However, memcg and lruvec counters will be used for per-memcg slab memory
    tracking, which has to account individual kernel objects. Keeping global
    and node counters in pages helps to avoid additional overhead.

    The size of slab memory shouldn't exceed 4GB on 32-bit machines, so it
    will fit into the atomic_long_t we use for vmstats.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-4-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Provide the necessary KCSAN checks to assist with debugging racy
    use-after-frees. While KASAN is more reliable at generally catching such
    use-after-frees (due to its use of a quarantine), it can be difficult to
    debug racy use-after-frees. If a reliable reproducer exists, KCSAN can
    assist in debugging such issues.

    Note: ASSERT_EXCLUSIVE_ACCESS is a convenience wrapper if the size is
    simply sizeof(var). Instead, here we just use __kcsan_check_access()
    explicitly to pass the correct size.
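
    A hedged sketch of the kind of annotation added on the free path, using
    the cache's object size instead of sizeof(var):

    /* assert that nobody else is concurrently touching the object being freed */
    __kcsan_check_access(x, s->object_size,
                         KCSAN_ACCESS_WRITE | KCSAN_ACCESS_ASSERT);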

    Signed-off-by: Marco Elver
    Signed-off-by: Andrew Morton
    Cc: Alexander Potapenko
    Cc: Andrey Konovalov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20200623072653.114563-1-elver@google.com
    Signed-off-by: Linus Torvalds

    Marco Elver
     
    There is no point in using lockdep_assert_held() on a lock that is about
    to be unlocked. It works only with lockdep, and lockdep will complain if
    spin_unlock() is used on a lock that has not been locked.

    Remove superfluous lockdep_assert_held().

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Andrew Morton
    Cc: Yu Zhao
    Cc: Christopher Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20200618201234.795692-2-bigeasy@linutronix.de
    Signed-off-by: Linus Torvalds

    Sebastian Andrzej Siewior
     
  • cache_from_obj() was added by commit b9ce5ef49f00 ("sl[au]b: always get
    the cache from its page in kmem_cache_free()") to support kmemcg, where
    per-memcg cache can be different from the root one, so we can't use the
    kmem_cache pointer given to kmem_cache_free().

    Prior to that commit, SLUB already had debugging check+warning that could
    be enabled to compare the given kmem_cache pointer to one referenced by
    the slab page where the object-to-be-freed resides. This check was moved
    to cache_from_obj(). Later the check was also enabled for
    SLAB_FREELIST_HARDENED configs by commit 598a0717a816 ("mm/slab: validate
    cache membership under freelist hardening").

    These checks and warnings can be useful, especially for debugging, and
    they can be improved. Commit 598a0717a816 changed the pr_err() with
    WARN_ON_ONCE() to WARN_ONCE(), so only the first hit is reported and
    later ones are silent. This patch changes it to WARN() so that all errors
    are reported.

    It's also useful to print SLUB allocation/free tracking info for the
    offending object, if tracking is enabled. Thus, export the SLUB
    print_tracking() function and provide an empty one for SLAB.

    For SLUB we can also benefit from the static key check in
    kmem_cache_debug_flags(), but we need to move this function to slab.h and
    declare the static key there.
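
    A hedged sketch of the resulting helper (the kmemcg-related handling of
    that time is omitted for brevity):

    static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
    {
            struct kmem_cache *cachep;

            if (!IS_ENABLED(CONFIG_SLAB_FREELIST_HARDENED) &&
                !kmem_cache_debug_flags(s, SLAB_CONSISTENCY_CHECKS))
                    return s;

            cachep = virt_to_cache(x);
            if (WARN(cachep && cachep != s,
                     "%s: Wrong slab cache. %s but object is from %s\n",
                     __func__, s->name, cachep->name))
                    print_tracking(cachep, x);      /* report alloc/free traces */
            return cachep;
    }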

    [1] https://lore.kernel.org/r/20200608230654.828134-18-guro@fb.com

    [vbabka@suse.cz: avoid bogus WARN()]
    Link: https://lore.kernel.org/r/20200623090213.GW5535@shao2-debian
    Link: http://lkml.kernel.org/r/b33e0fa7-cd28-4788-9e54-5927846329ef@suse.cz

    Signed-off-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Acked-by: Kees Cook
    Acked-by: Roman Gushchin
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Matthew Garrett
    Cc: Jann Horn
    Cc: Vijayanand Jitta
    Cc: Vinayak Menon
    Link: http://lkml.kernel.org/r/afeda7ac-748b-33d8-a905-56b708148ad5@suse.cz
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The function cache_from_obj() was added by commit b9ce5ef49f00 ("sl[au]b:
    always get the cache from its page in kmem_cache_free()") to support
    kmemcg, where per-memcg cache can be different from the root one, so we
    can't use the kmem_cache pointer given to kmem_cache_free().

    Prior to that commit, SLUB already had debugging check+warning that could
    be enabled to compare the given kmem_cache pointer to one referenced by
    the slab page where the object-to-be-freed resides. This check was moved
    to cache_from_obj(). Later the check was also enabled for
    SLAB_FREELIST_HARDENED configs by commit 598a0717a816 ("mm/slab: validate
    cache membership under freelist hardening").

    These checks and warnings can be useful, especially for debugging, and
    they can be improved. Commit 598a0717a816 changed the pr_err() with
    WARN_ON_ONCE() to WARN_ONCE(), so only the first hit is reported and
    later ones are silent. This patch changes it to WARN() so that all errors
    are reported.

    It's also useful to print SLUB allocation/free tracking info for the
    offending object, if tracking is enabled. We could export the SLUB
    print_tracking() function and provide an empty one for SLAB, or realize
    that both the debugging and hardening cases in cache_from_obj() are only
    supported by SLUB anyway. So this patch moves cache_from_obj() from
    slab.h to separate instances in slab.c and slub.c, where the SLAB version
    only does the kmemcg lookup and could even be completely removed once the
    kmemcg rework [1] is merged. The SLUB version can thus easily use the
    print_tracking() function. It can also use the kmem_cache_debug_flags()
    static key check for improved performance in kernels without the hardening
    and with debugging not enabled on boot.

    [1] https://lore.kernel.org/r/20200608230654.828134-18-guro@fb.com

    Signed-off-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Cc: Christoph Lameter
    Cc: Jann Horn
    Cc: Kees Cook
    Cc: Vijayanand Jitta
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Link: http://lkml.kernel.org/r/20200610163135.17364-10-vbabka@suse.cz
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
    There are a few more places in SLUB that could benefit from the reduced
    overhead of the static key introduced by a previous patch:

    - setup_object_debug() called on each object in newly allocated slab page
    - setup_page_debug() called on newly allocated slab page
    - __free_slab() called on freed slab page

    Signed-off-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Acked-by: Roman Gushchin
    Acked-by: Christoph Lameter
    Cc: Jann Horn
    Cc: Kees Cook
    Cc: Vijayanand Jitta
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Link: http://lkml.kernel.org/r/20200610163135.17364-9-vbabka@suse.cz
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
    There are a few places that call kmem_cache_debug(s) (which tests if any
    debug flags are enabled for a cache) immediately followed by a test for a
    specific flag. The compiler can probably eliminate the extra check, but
    we can make the code nicer by introducing kmem_cache_debug_flags(), which
    works like kmem_cache_debug() (including the static key check) but tests
    for specific flag(s). The next patches will add more users.
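
    A sketch of the helper as described (hedged; close to, but not guaranteed
    to be, the exact upstream version):

    static inline bool kmem_cache_debug_flags(struct kmem_cache *s,
                                              slab_flags_t flags)
    {
    #ifdef CONFIG_SLUB_DEBUG
            VM_WARN_ON_ONCE(!(flags & SLAB_DEBUG_FLAGS));
            /* the static key keeps this nearly free when slub_debug is off */
            if (static_branch_unlikely(&slub_debug_enabled))
                    return s->flags & flags;
    #endif
            return false;
    }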

    [vbabka@suse.cz: change return from int to bool, per Kees. Add VM_WARN_ON_ONCE() for invalid flags, per Roman]
    Link: http://lkml.kernel.org/r/949b90ed-e0f0-07d7-4d21-e30ec0958a7c@suse.cz

    Signed-off-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Acked-by: Roman Gushchin
    Acked-by: Christoph Lameter
    Acked-by: Kees Cook
    Cc: Jann Horn
    Cc: Vijayanand Jitta
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Link: http://lkml.kernel.org/r/20200610163135.17364-8-vbabka@suse.cz
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • One advantage of CONFIG_SLUB_DEBUG is that a generic distro kernel can be
    built with the option enabled, but it's inactive until simply enabled on
    boot, without rebuilding the kernel. With a static key, we can further
    eliminate the overhead of checking whether a cache has a particular debug
    flag enabled if we know that there are no such caches (slub_debug was not
    enabled during boot). We use the same mechanism also for e.g.
    page_owner, debug_pagealloc or kmemcg functionality.

    This patch introduces the static key and makes the general check for
    per-cache debug flags kmem_cache_debug() use it. This benefits several
    call sites, including (slow path but still rather frequent) __slab_free().
    The next patches will add more uses.

    Signed-off-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Reviewed-by: Kees Cook
    Acked-by: Roman Gushchin
    Acked-by: Christoph Lameter
    Cc: Jann Horn
    Cc: Vijayanand Jitta
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Link: http://lkml.kernel.org/r/20200610163135.17364-7-vbabka@suse.cz
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The attribute reflects the SLAB_RECLAIM_ACCOUNT cache flag. It's not
    clear why this attribute was writable in the first place, as it's tied to
    how the cache is used by its creator; it's not a user tunable.
    Furthermore:

    - it affects slab merging, but that's not being checked while toggled
    - it affects whether the __GFP_RECLAIMABLE flag is used to allocate a page,
      but the runtime toggle doesn't update allocflags
    - it affects cache_vmstat_idx(), so runtime toggling might lead to
      inconsistency of NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE

    Thus make it read-only.

    Signed-off-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Reviewed-by: Kees Cook
    Acked-by: Roman Gushchin
    Cc: Christoph Lameter
    Cc: Jann Horn
    Cc: Vijayanand Jitta
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Link: http://lkml.kernel.org/r/20200610163135.17364-6-vbabka@suse.cz
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
    SLUB_DEBUG creates several files under /sys/kernel/slab/<cache>/ that can
    be read to check if the respective debugging options are enabled for a
    given cache. Some options, namely sanity_checks, trace, and failslab, can
    also be enabled and disabled at runtime by writing into the files.

    The runtime toggling is racy. Some options disable __CMPXCHG_DOUBLE when
    enabled, which means that in case of concurrent allocations, some can
    still use __CMPXCHG_DOUBLE and some not, leading to potential corruption.
    The s->flags field is also not updated or checked atomically. The
    simplest solution is to remove the runtime toggling. The extended
    slub_debug boot parameter syntax introduced by an earlier patch should
    allow fine-tuning the debugging configuration during boot with the same
    granularity.

    Signed-off-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Reviewed-by: Kees Cook
    Acked-by: Roman Gushchin
    Cc: Christoph Lameter
    Cc: Jann Horn
    Cc: Vijayanand Jitta
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Link: http://lkml.kernel.org/r/20200610163135.17364-5-vbabka@suse.cz
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
    SLUB allows runtime changing of the page allocation order by writing into
    the /sys/kernel/slab/<cache>/order file. Jann has reported [1] that this
    interface allows the order to be set too small, leading to crashes.

    While it's possible to fix the immediate issue, closer inspection reveals
    potential races. Storing the new order calls calculate_sizes() which
    non-atomically updates a lot of kmem_cache fields while the cache is still
    in use. Unexpected behavior might occur even if the fields are set to the
    same value as they were.

    This could be fixed by splitting out the part of calculate_sizes() that
    depends on forced_order, so that we only update kmem_cache.oo field. This
    could still race with init_cache_random_seq(), shuffle_freelist(),
    allocate_slab(). Perhaps it's possible to audit these and e.g. add some
    READ_ONCE/WRITE_ONCE accesses, but it might be easier just to remove the
    runtime order changes, which is what this patch does. If there are valid
    use cases for per-cache order setting, we could e.g. extend the boot
    parameters to do that.

    [1] https://lore.kernel.org/r/CAG48ez31PP--h6_FzVyfJ4H86QYczAFPdxtJHUEEan+7VJETAQ@mail.gmail.com

    Reported-by: Jann Horn
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Reviewed-by: Kees Cook
    Acked-by: Christoph Lameter
    Acked-by: Roman Gushchin
    Cc: Vijayanand Jitta
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Link: http://lkml.kernel.org/r/20200610163135.17364-4-vbabka@suse.cz
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
    SLUB_DEBUG creates several files under /sys/kernel/slab/<cache>/ that can
    be read to check if the respective debugging options are enabled for a
    given cache. The options can also be toggled at runtime by writing into
    the files. Some of those, namely red_zone, poison, and store_user, can be
    toggled only when no objects yet exist in the cache.

    Vijayanand reports [1] that there is a problem with freelist randomization
    if changing the debugging option's state results in a different number of
    objects per page, and the random sequence cache thus needs to be
    recomputed.

    However, another problem is that the check for "no objects yet exist in
    the cache" is racy, as noted by Jann [2] and fixing that would add
    overhead or otherwise complicate the allocation/freeing paths. Thus it
    would be much simpler just to remove the runtime toggling support. The
    documentation describes it as "In case you forgot to enable debugging on
    the kernel command line", but the necessity of having no objects limits
    its usefulness for many caches anyway.

    Vijayanand describes a use case [3] where debugging is enabled for all
    but the zram caches for memory overhead reasons, and using the runtime
    toggles was the only way to achieve such a configuration. After the
    previous patch
    it's now possible to do that directly from the kernel boot option, so we
    can remove the dangerous runtime toggles by making the /sys attribute
    files read-only.

    While updating it, also improve the documentation of the debugging /sys files.

    [1] https://lkml.kernel.org/r/1580379523-32272-1-git-send-email-vjitta@codeaurora.org
    [2] https://lore.kernel.org/r/CAG48ez31PP--h6_FzVyfJ4H86QYczAFPdxtJHUEEan+7VJETAQ@mail.gmail.com
    [3] https://lore.kernel.org/r/1383cd32-1ddc-4dac-b5f8-9c42282fa81c@codeaurora.org

    Reported-by: Vijayanand Jitta
    Reported-by: Jann Horn
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Reviewed-by: Kees Cook
    Acked-by: Roman Gushchin
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Link: http://lkml.kernel.org/r/20200610163135.17364-3-vbabka@suse.cz
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Patch series "slub_debug fixes and improvements".

    The slub_debug kernel boot parameter can either apply a single set of
    options to all caches or a list of caches. There is a use case where
    debugging is applied for all caches and then disabled at runtime for
    specific caches, for performance and memory consumption reasons [1]. As
    runtime changes are dangerous, extend the boot parameter syntax so that
    multiple blocks of either global or slab-specific options can be
    specified, with blocks delimited by ';'. This will also support the use
    case of [1] without runtime changes.

    For details see the updated Documentation/vm/slub.rst
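
    For instance, a hypothetical command line using the extended syntax (see
    the documentation above for the authoritative description) could be:

    slub_debug=FZ;-,zs_handle,zspool

    i.e. one global block enabling sanity checks (F) and red zoning (Z) for
    all caches, followed by a ';'-delimited block disabling debugging for the
    listed caches.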

    [1] https://lore.kernel.org/r/1383cd32-1ddc-4dac-b5f8-9c42282fa81c@codeaurora.org

    [weiyongjun1@huawei.com: make parse_slub_debug_flags() static]
    Link: http://lkml.kernel.org/r/20200702150522.4940-1-weiyongjun1@huawei.com

    Signed-off-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Reviewed-by: Kees Cook
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Jann Horn
    Cc: Roman Gushchin
    Cc: Vijayanand Jitta
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20200610163135.17364-2-vbabka@suse.cz
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • kmalloc cannot allocate memory from HIGHMEM. Allocating large amounts of
    memory currently bypasses the check and will simply leak the memory when
    page_address() returns NULL. To fix this, factor the GFP_SLAB_BUG_MASK
    check out of slab & slub, and call it from kmalloc_order() as well. In
    order to make the code clear, the warning message is put in one place.
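
    A hedged sketch of the shared helper for the GFP_SLAB_BUG_MASK check
    (close to, but not guaranteed to be, the exact upstream code):

    gfp_t kmalloc_fix_flags(gfp_t flags)
    {
            gfp_t invalid_mask = flags & GFP_SLAB_BUG_MASK;

            flags &= ~GFP_SLAB_BUG_MASK;
            pr_warn("Unexpected gfp: %#x (%pGg). Fixing up to gfp: %#x (%pGg). Fix your code!\n",
                    invalid_mask, &invalid_mask, flags, &flags);
            dump_stack();

            return flags;
    }

    Callers such as kmalloc_order() would then check
    "if (unlikely(flags & GFP_SLAB_BUG_MASK)) flags = kmalloc_fix_flags(flags);"
    before allocating pages.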

    Signed-off-by: Long Li
    Signed-off-by: Andrew Morton
    Reviewed-by: Matthew Wilcox (Oracle)
    Reviewed-by: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20200704035027.GA62481@lilong
    Signed-off-by: Linus Torvalds

    Long Li
     

17 Jul, 2020

1 commit

  • Using uninitialized_var() is dangerous as it papers over real bugs[1]
    (or can in the future), and suppresses unrelated compiler warnings
    (e.g. "unused variable"). If the compiler thinks it is uninitialized,
    either simply initialize the variable or make compiler changes.

    In preparation for removing[2] the[3] macro[4], remove all remaining
    needless uses with the following script:

    git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
    xargs perl -pi -e \
    's/\buninitialized_var\(([^\)]+)\)/\1/g;
    s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'

    drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
    pathological white-space.

    No outstanding warnings were found building allmodconfig with GCC 9.3.0
    for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
    alpha, and m68k.

    [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
    [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
    [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
    [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

    Reviewed-by: Leon Romanovsky # drivers/infiniband and mlx4/mlx5
    Acked-by: Jason Gunthorpe # IB
    Acked-by: Kalle Valo # wireless drivers
    Reviewed-by: Chao Yu # erofs
    Signed-off-by: Kees Cook

    Kees Cook
     

26 Jun, 2020

1 commit

    According to Christopher Lameter, two fixes have been merged for the same
    problem. As far as I can tell, the code does not acquire the list_lock
    and invoke kmalloc(). list_slab_objects() misses an unlock (the
    counterpart to get_map()) and the memory allocated in free_partial()
    isn't used.

    Revert the mentioned commit.

    Link: http://lkml.kernel.org/r/20200618201234.795692-1-bigeasy@linutronix.de
    Fixes: aa456c7aebb14 ("slub: remove kmalloc under list_lock from list_slab_objects() V2")
    Link: https://lkml.kernel.org/r/alpine.DEB.2.22.394.2006181501480.12014@www.lameter.com
    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Thomas Gleixner
    Cc: Yu Zhao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sebastian Andrzej Siewior
     

18 Jun, 2020

1 commit


05 Jun, 2020

1 commit


04 Jun, 2020

2 commits

    classzone_idx is just a different name for high_zoneidx now. So, integrate
    them, and add a comment to struct alloc_context in order to reduce future
    confusion about the meaning of this variable.

    The accessor ac_classzone_idx() is also removed since it isn't needed
    after the integration.

    In addition to the integration, this patch also renames high_zoneidx to
    highest_zoneidx since that name conveys its meaning more precisely.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Reviewed-by: Baoquan He
    Acked-by: Vlastimil Babka
    Acked-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Ye Xiaolong
    Link: http://lkml.kernel.org/r/1587095923-7515-3-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
    syzkaller reports a memory leak when kobject_init_and_add() returns an
    error in sysfs_slab_add() [1].

    When this happens, kobject_put() is not called for the corresponding
    kobject, which potentially leads to a memory leak.

    This patch fixes the issue by calling kobject_put() even if
    kobject_init_and_add() fails.
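
    A hedged sketch of the fixed error path in sysfs_slab_add() (illustrative,
    not necessarily the literal upstream hunk):

    err = kobject_init_and_add(&s->kobj, &slab_ktype, NULL, "%s", name);
    if (err) {
            /* drop the reference so the name allocated above is freed */
            kobject_put(&s->kobj);
            goto out;
    }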

    [1]
    BUG: memory leak
    unreferenced object 0xffff8880a6d4be88 (size 8):
    comm "syz-executor.3", pid 946, jiffies 4295772514 (age 18.396s)
    hex dump (first 8 bytes):
    70 69 64 5f 33 00 ff ff pid_3...
    backtrace:
    kstrdup+0x35/0x70 mm/util.c:60
    kstrdup_const+0x3d/0x50 mm/util.c:82
    kvasprintf_const+0x112/0x170 lib/kasprintf.c:48
    kobject_set_name_vargs+0x55/0x130 lib/kobject.c:289
    kobject_add_varg lib/kobject.c:384 [inline]
    kobject_init_and_add+0xd8/0x170 lib/kobject.c:473
    sysfs_slab_add+0x1d8/0x290 mm/slub.c:5811
    __kmem_cache_create+0x50a/0x570 mm/slub.c:4384
    create_cache+0x113/0x1e0 mm/slab_common.c:407
    kmem_cache_create_usercopy+0x1a1/0x260 mm/slab_common.c:505
    kmem_cache_create+0xd/0x10 mm/slab_common.c:564
    create_pid_cachep kernel/pid_namespace.c:54 [inline]
    create_pid_namespace kernel/pid_namespace.c:96 [inline]
    copy_pid_ns+0x77c/0x8f0 kernel/pid_namespace.c:148
    create_new_namespaces+0x26b/0xa30 kernel/nsproxy.c:95
    unshare_nsproxy_namespaces+0xa7/0x1e0 kernel/nsproxy.c:229
    ksys_unshare+0x3d2/0x770 kernel/fork.c:2969
    __do_sys_unshare kernel/fork.c:3037 [inline]
    __se_sys_unshare kernel/fork.c:3035 [inline]
    __x64_sys_unshare+0x2d/0x40 kernel/fork.c:3035
    do_syscall_64+0xa1/0x530 arch/x86/entry/common.c:295

    Fixes: 80da026a8e5d ("mm/slub: fix slab double-free in case of duplicate sysfs filename")
    Reported-by: Hulk Robot
    Signed-off-by: Wang Hai
    Signed-off-by: Andrew Morton
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20200602115033.1054-1-wanghai38@huawei.com
    Signed-off-by: Linus Torvalds

    Wang Hai
     

03 Jun, 2020

5 commits

  • Pull drm updates from Dave Airlie:
    "Highlights:

    - Core DRM had a lot of refactoring around managed drm resources to
    make drivers simpler.

    - Intel Tigerlake support is on by default

    - amdgpu now support p2p PCI buffer sharing and encrypted GPU memory

    Details:

    core:
    - uapi: error out EBUSY when existing master
    - uapi: rework SET/DROP MASTER permission handling
    - remove drm_pci.h
    - drm_pci* are now legacy
    - introduced managed DRM resources
    - subclassing support for drm_framebuffer
    - simple encoder helper
    - edid improvements
    - vblank + writeback documentation improved
    - drm/mm - optimise tree searches
    - port drivers to use devm_drm_dev_alloc

    dma-buf:
    - add flag for p2p buffer support

    mst:
    - ACT timeout improvements
    - remove drm_dp_mst_has_audio
    - don't use 2nd TX slot - spec recommends against it

    bridge:
    - dw-hdmi various improvements
    - chrontel ch7033 support
    - fix stack issues with old gcc

    hdmi:
    - add unpack function for drm infoframe

    fbdev:
    - misc fbdev driver fixes

    i915:
    - uapi: global sseu pinning
    - uapi: OA buffer polling
    - uapi: remove generated perf code
    - uapi: per-engine default property values in sysfs
    - Tigerlake GEN12 enabled.
    - Lots of gem refactoring
    - Tigerlake enablement patches
    - move to drm_device logging
    - Icelake gamma HW readout
    - push MST link retrain to hotplug work
    - bandwidth atomic helpers
    - ICL fixes
    - RPS/GT refactoring
    - Cherryview full-ppgtt support
    - i915 locking guidelines documented
    - require linear fb stride to be 512 multiple on gen9
    - Tigerlake SAGV support

    amdgpu:
    - uapi: encrypted GPU memory handling
    - uapi: add MEM_SYNC IB flag
    - p2p dma-buf support
    - export VRAM dma-bufs
    - FRU chip access support
    - RAS/SR-IOV updates
    - Powerplay locking fixes
    - VCN DPG (powergating) enablement
    - GFX10 clockgating fixes
    - DC fixes
    - GPU reset fixes
    - navi SDMA fix
    - expose FP16 for modesetting
    - DP 1.4 compliance fixes
    - gfx10 soft recovery
    - Improved Critical Thermal Faults handling
    - resizable BAR on gmc10

    amdkfd:
    - uapi: GWS resource management
    - track GPU memory per process
    - report PCI domain in topology

    radeon:
    - safe reg list generator fixes

    nouveau:
    - HD audio fixes on recent systems
    - vGPU detection (fail probe if we're on one, for now)
    - Interlaced mode fixes (mostly avoidance on Turing, which doesn't support it)
    - SVM improvements/fixes
    - NVIDIA format modifier support
    - Misc other fixes.

    adv7511:
    - HDMI SPDIF support

    ast:
    - allocate crtc state size
    - fix double assignment
    - fix suspend

    bochs:
    - drop connector register

    cirrus:
    - move to tiny drivers.

    exynos:
    - fix imported dma-buf mapping
    - enable runtime PM
    - fixes and cleanups

    mediatek:
    - DPI pin mode swap
    - config mipi_tx current/impedance

    lima:
    - devfreq + cooling device support
    - task handling improvements
    - runtime PM support

    pl111:
    - vexpress init improvements
    - fix module auto-load

    rcar-du:
    - DT bindings conversion to YAML
    - Planes zpos sanity check and fix
    - MAINTAINERS entry for LVDS panel driver

    mcde:
    - fix return value

    mgag200:
    - use managed config init

    stm:
    - read endpoints from DT

    vboxvideo:
    - use PCI managed functions
    - drop WC mtrr

    vkms:
    - enable cursor by default

    rockchip:
    - afbc support

    virtio:
    - various cleanups

    qxl:
    - fix cursor notify port

    hisilicon:
    - 128-byte stride alignment fix

    sun4i:
    - improved format handling"

    * tag 'drm-next-2020-06-02' of git://anongit.freedesktop.org/drm/drm: (1401 commits)
    drm/amd/display: Fix potential integer wraparound resulting in a hang
    drm/amd/display: drop cursor position check in atomic test
    drm/amdgpu: fix device attribute node create failed with multi gpu
    drm/nouveau: use correct conflicting framebuffer API
    drm/vblank: Fix -Wformat compile warnings on some arches
    drm/amdgpu: Sync with VM root BO when switching VM to CPU update mode
    drm/amd/display: Handle GPU reset for DC block
    drm/amdgpu: add apu flags (v2)
    drm/amd/powerpay: Disable gfxoff when setting manual mode on picasso and raven
    drm/amdgpu: fix pm sysfs node handling (v2)
    drm/amdgpu: move gpu_info parsing after common early init
    drm/amdgpu: move discovery gfx config fetching
    drm/nouveau/dispnv50: fix runtime pm imbalance on error
    drm/nouveau: fix runtime pm imbalance on error
    drm/nouveau: fix runtime pm imbalance on error
    drm/nouveau/debugfs: fix runtime pm imbalance on error
    drm/nouveau/nouveau/hmm: fix migrate zero page to GPU
    drm/nouveau/nouveau/hmm: fix nouveau_dmem_chunk allocations
    drm/nouveau/kms/nv50-: Share DP SST mode_valid() handling with MST
    drm/nouveau/kms/nv50-: Move 8BPC limit for MST into nv50_mstc_get_modes()
    ...

    Linus Torvalds
     
  • There is no need to copy SLUB_STATS items from root memcg cache to new
    memcg cache copies. Doing so could result in stack overruns because the
    store function only accepts 0 to clear the stat and returns an error for
    everything else while the show method would print out the whole stat.

    Then, the mismatch of the lengths returns from show and store methods
    happens in memcg_propagate_slab_attrs():

    else if (root_cache->max_attr_size < ARRAY_SIZE(mbuf))
            buf = mbuf;

    max_attr_size is only 2 from slab_attr_store(); then mbuf[64] is used in
    show_stat() later, where a bunch of sprintf() calls would overrun the
    stack variable. Fix it by always allocating a page-sized buffer to be
    used in show_stat() if SLUB_STATS=y, which should only be used for
    debugging purposes anyway.

    # echo 1 > /sys/kernel/slab/fs_cache/shrink
    BUG: KASAN: stack-out-of-bounds in number+0x421/0x6e0
    Write of size 1 at addr ffffc900256cfde0 by task kworker/76:0/53251

    Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
    Workqueue: memcg_kmem_cache memcg_kmem_cache_create_func
    Call Trace:
    number+0x421/0x6e0
    vsnprintf+0x451/0x8e0
    sprintf+0x9e/0xd0
    show_stat+0x124/0x1d0
    alloc_slowpath_show+0x13/0x20
    __kmem_cache_create+0x47a/0x6b0

    addr ffffc900256cfde0 is located in stack of task kworker/76:0/53251 at offset 0 in frame:
    process_one_work+0x0/0xb90

    this frame has 1 object:
    [32, 72) 'lockdep_map'

    Memory state around the buggy address:
    ffffc900256cfc80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    ffffc900256cfd00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    >ffffc900256cfd80: 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1
    ^
    ffffc900256cfe00: 00 00 00 00 00 f2 f2 f2 00 00 00 00 00 00 00 00
    ffffc900256cfe80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    ==================================================================
    Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: __kmem_cache_create+0x6ac/0x6b0
    Workqueue: memcg_kmem_cache memcg_kmem_cache_create_func
    Call Trace:
    __kmem_cache_create+0x6ac/0x6b0

    Fixes: 107dab5c92d5 ("slub: slub-specific propagation changes")
    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Cc: Glauber Costa
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20200429222356.4322-1-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Qian Cai
     
    list_slab_objects() is called when a slab is destroyed and there are
    objects still left, in order to list those objects in the syslog. This is
    a pretty rare event.

    And there it seems we take the list_lock and call kmalloc() while holding
    that lock.

    Perform the allocation in free_partial() before the list_lock is taken.

    Fixes: bbd7d57bfe852d9788bae5fb171c7edb4021d8ac ("slub: Potential stack overflow")
    Signed-off-by: Christopher Lameter
    Signed-off-by: Andrew Morton
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Tetsuo Handa
    Cc: Yu Zhao
    Link: http://lkml.kernel.org/r/alpine.DEB.2.21.2002031721250.1668@www.lameter.com
    Signed-off-by: Linus Torvalds

    Christopher Lameter
     
    I came across some unnecessary uevents once again, which reminded me of
    this. The patch seems to have been lost in the leaves of the original
    discussion [1], so I am resending it.

    [1] https://lore.kernel.org/r/alpine.DEB.2.21.2001281813130.745@www.lameter.com

    Kmem caches are internal kernel structures so it is strange that
    userspace notifiers would be needed. And I am not aware of any use of
    these notifiers. These notifiers may just exist because in the initial
    slub release the sysfs code was copied from another subsystem.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Acked-by: Michal Koutný
    Acked-by: David Rientjes
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20200423115721.19821-1-mkoutny@suse.com
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
    slub_debug is able to fix a corrupted slab freelist/page. However,
    alloc_debug_processing() only checks the validity of the current and next
    freepointers during the allocation path. As a result, once some objects
    have their freepointers corrupted, deactivate_slab() may lead to a page
    fault.

    Below is from a test kernel module booted with 'slub_debug=PUF,kmalloc-128
    slub_nomerge'. The test kernel corrupts the freepointer of one free
    object on purpose. Unfortunately, deactivate_slab() does not detect it
    when iterating the freechain.

    BUG: unable to handle page fault for address: 00000000123456f8
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD 0 P4D 0
    Oops: 0000 [#1] SMP PTI
    ... ...
    RIP: 0010:deactivate_slab.isra.92+0xed/0x490
    ... ...
    Call Trace:
    ___slab_alloc+0x536/0x570
    __slab_alloc+0x17/0x30
    __kmalloc+0x1d9/0x200
    ext4_htree_store_dirent+0x30/0xf0
    htree_dirblock_to_tree+0xcb/0x1c0
    ext4_htree_fill_tree+0x1bc/0x2d0
    ext4_readdir+0x54f/0x920
    iterate_dir+0x88/0x190
    __x64_sys_getdents+0xa6/0x140
    do_syscall_64+0x49/0x170
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Therefore, this patch adds an extra consistency check in deactivate_slab().
    Once an object's freepointer is corrupted, all following objects
    starting at this object are isolated.

    [akpm@linux-foundation.org: fix build with CONFIG_SLAB_DEBUG=n]
    Signed-off-by: Dongli Zhang
    Signed-off-by: Andrew Morton
    Cc: Joe Jin
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20200331031450.12182-1-dongli.zhang@oracle.com
    Signed-off-by: Linus Torvalds

    Dongli Zhang