01 Oct, 2020

1 commit

  • [ Upstream commit cbfc35a48609ceac978791e3ab9dde0c01f8cb20 ]

    In a couple of places in the slub memory allocator, the code uses
    "s->offset" as a check to see if the free pointer is put right after the
    object. That check is no longer true with commit 3202fa62fb43 ("slub:
    relocate freelist pointer to middle of object").

    As a result, echoing "1" into the validate sysfs file, e.g. of dentry,
    may cause a bunch of "Freepointer corrupt" error reports like the
    following to appear with the system in panic afterwards.

    =============================================================================
    BUG dentry(666:pmcd.service) (Tainted: G B): Freepointer corrupt
    -----------------------------------------------------------------------------

    To fix it, use the check "s->offset == s->inuse" in the new helper
    function freeptr_outside_object() instead. Also add another helper
    function get_info_end() to return the end of info block (inuse + free
    pointer if not overlapping with object).
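
    A minimal sketch of the two helpers, based only on the description above
    (the exact upstream code may differ in detail):

        static inline bool freeptr_outside_object(struct kmem_cache *s)
        {
                /* true when the free pointer sits right after the object */
                return s->offset == s->inuse;
        }

        static inline unsigned int get_info_end(struct kmem_cache *s)
        {
                /* end of info block: inuse plus the free pointer when it
                 * does not overlap with the object */
                if (freeptr_outside_object(s))
                        return s->inuse + sizeof(void *);
                return s->inuse;
        }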

    Fixes: 3202fa62fb43 ("slub: relocate freelist pointer to middle of object")
    Signed-off-by: Waiman Long
    Signed-off-by: Andrew Morton
    Reviewed-by: Matthew Wilcox (Oracle)
    Reviewed-by: Kees Cook
    Acked-by: Rafael Aquini
    Cc: Christoph Lameter
    Cc: Vitaly Nikolenko
    Cc: Silvio Cesare
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Markus Elfring
    Cc: Changbin Du
    Link: http://lkml.kernel.org/r/20200429135328.26976-1-longman@redhat.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Waiman Long
     

10 Sep, 2020

1 commit

  • commit dc07a728d49cf025f5da2c31add438d839d076c0 upstream.

    Commit 52f23478081ae0 ("mm/slub.c: fix corrupted freechain in
    deactivate_slab()") suffered an update when picked up from LKML [1].

    Specifically, relocating 'freelist = NULL' into 'freelist_corrupted()'
    created a no-op statement. Fix it by sticking to the behavior intended
    in the original patch [1]. In addition, make freelist_corrupted()
    immune to passing NULL instead of &freelist.

    The issue has been spotted via static analysis and code review.

    [1] https://lore.kernel.org/linux-mm/20200331031450.12182-1-dongli.zhang@oracle.com/
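
    As a plain-C illustration of why the relocated assignment became a no-op
    (this is not the kernel code, just the pointer-passing pitfall the fix
    addresses):

        #include <stdbool.h>
        #include <stdio.h>

        /* assigning to a pointer parameter only changes the callee's copy */
        static bool corrupted_byval(void *freelist)
        {
                freelist = NULL;        /* no-op for the caller */
                return true;
        }

        /* the fix passes &freelist and also tolerates a NULL argument */
        static bool corrupted_byref(void **freelist)
        {
                if (freelist)
                        *freelist = NULL;
                return true;
        }

        int main(void)
        {
                void *fl = (void *)0x1234;

                corrupted_byval(fl);
                printf("by value:     %p\n", fl);   /* unchanged */
                corrupted_byref(&fl);
                printf("by reference: %p\n", fl);   /* now NULL */
                return 0;
        }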

    Fixes: 52f23478081ae0 ("mm/slub.c: fix corrupted freechain in deactivate_slab()")
    Signed-off-by: Eugeniu Rosca
    Signed-off-by: Andrew Morton
    Cc: Dongli Zhang
    Cc: Joe Jin
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc:
    Link: https://lkml.kernel.org/r/20200824130643.10291-1-erosca@de.adit-jv.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Eugeniu Rosca
     

09 Jul, 2020

2 commits

  • [ Upstream commit a68ee0573991e90af2f1785db309206408bad3e5 ]

    There is no need to copy SLUB_STATS items from root memcg cache to new
    memcg cache copies. Doing so could result in stack overruns because the
    store function only accepts 0 to clear the stat and returns an error for
    everything else while the show method would print out the whole stat.

    Then, the mismatch between the lengths returned by the show and store
    methods shows up in memcg_propagate_slab_attrs():

    else if (root_cache->max_attr_size < ARRAY_SIZE(mbuf))
    buf = mbuf;

    max_attr_size is only 2 from slab_attr_store(), so the on-stack mbuf[64]
    is chosen, and show_stat() later issues a bunch of sprintf() calls that
    overrun the stack variable. Fix it by always allocating a page-sized
    buffer for show_stat() when SLUB_STATS=y, which should only be used for
    debugging anyway (see the sketch after the report below).

    # echo 1 > /sys/kernel/slab/fs_cache/shrink
    BUG: KASAN: stack-out-of-bounds in number+0x421/0x6e0
    Write of size 1 at addr ffffc900256cfde0 by task kworker/76:0/53251

    Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
    Workqueue: memcg_kmem_cache memcg_kmem_cache_create_func
    Call Trace:
    number+0x421/0x6e0
    vsnprintf+0x451/0x8e0
    sprintf+0x9e/0xd0
    show_stat+0x124/0x1d0
    alloc_slowpath_show+0x13/0x20
    __kmem_cache_create+0x47a/0x6b0

    addr ffffc900256cfde0 is located in stack of task kworker/76:0/53251 at offset 0 in frame:
    process_one_work+0x0/0xb90

    this frame has 1 object:
    [32, 72) 'lockdep_map'

    Memory state around the buggy address:
    ffffc900256cfc80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    ffffc900256cfd00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    >ffffc900256cfd80: 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1
    ^
    ffffc900256cfe00: 00 00 00 00 00 f2 f2 f2 00 00 00 00 00 00 00 00
    ffffc900256cfe80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    ==================================================================
    Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: __kmem_cache_create+0x6ac/0x6b0
    Workqueue: memcg_kmem_cache memcg_kmem_cache_create_func
    Call Trace:
    __kmem_cache_create+0x6ac/0x6b0
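
    A rough sketch of the buffer selection the fix describes: when
    SLUB_STATS=y, never fall back to the small on-stack mbuf and use a
    page-sized allocation for show_stat() instead. Identifiers follow the
    snippet quoted above; the exact shape of the patch may differ:

        /* inside the attribute loop of memcg_propagate_slab_attrs() */
        if (buffer)
                buf = buffer;
        else if (root_cache->max_attr_size < ARRAY_SIZE(mbuf) &&
                 !IS_ENABLED(CONFIG_SLUB_STATS))
                buf = mbuf;             /* small on-stack buffer */
        else {
                buffer = (char *)get_zeroed_page(GFP_KERNEL);
                if (WARN_ON(!buffer))
                        continue;
                buf = buffer;           /* page-sized, safe for show_stat() */
        }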

    Fixes: 107dab5c92d5 ("slub: slub-specific propagation changes")
    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Cc: Glauber Costa
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20200429222356.4322-1-cai@lca.pw
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Qian Cai
     
  • [ Upstream commit 52f23478081ae0dcdb95d1650ea1e7d52d586829 ]

    slub_debug is able to fix a corrupted slab freelist/page. However,
    alloc_debug_processing() only checks the validity of the current and next
    freepointers during the allocation path. As a result, once some objects
    have their freepointers corrupted, deactivate_slab() may lead to a page
    fault.

    Below is from a test kernel module when 'slub_debug=PUF,kmalloc-128
    slub_nomerge'. The test kernel corrupts the freepointer of one free
    object on purpose. Unfortunately, deactivate_slab() does not detect it
    when iterating the freechain.

    BUG: unable to handle page fault for address: 00000000123456f8
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD 0 P4D 0
    Oops: 0000 [#1] SMP PTI
    ... ...
    RIP: 0010:deactivate_slab.isra.92+0xed/0x490
    ... ...
    Call Trace:
    ___slab_alloc+0x536/0x570
    __slab_alloc+0x17/0x30
    __kmalloc+0x1d9/0x200
    ext4_htree_store_dirent+0x30/0xf0
    htree_dirblock_to_tree+0xcb/0x1c0
    ext4_htree_fill_tree+0x1bc/0x2d0
    ext4_readdir+0x54f/0x920
    iterate_dir+0x88/0x190
    __x64_sys_getdents+0xa6/0x140
    do_syscall_64+0x49/0x170
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Therefore, this patch adds an extra consistency check in
    deactivate_slab(). Once an object's freepointer is corrupted, all
    following objects, starting at that object, are isolated.
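
    A stand-alone model of that check: walk the freelist and cut the chain at
    the first pointer that no longer lands inside the slab, so the corrupted
    tail is never followed (an illustration, not the kernel code):

        #include <stdint.h>
        #include <stdio.h>

        #define NOBJ 4

        struct obj { struct obj *next; };

        /* a freepointer is plausible iff it is NULL or inside the slab */
        static int in_slab(struct obj *slab, struct obj *p)
        {
                uintptr_t lo = (uintptr_t)slab, hi = (uintptr_t)(slab + NOBJ);

                return !p || ((uintptr_t)p >= lo && (uintptr_t)p < hi);
        }

        int main(void)
        {
                struct obj slab[NOBJ];

                for (int i = 0; i < NOBJ - 1; i++)
                        slab[i].next = &slab[i + 1];
                slab[NOBJ - 1].next = NULL;
                slab[2].next = (struct obj *)0x123456f8; /* corrupted */

                for (struct obj *o = &slab[0]; o; o = o->next) {
                        if (!in_slab(slab, o->next)) {
                                printf("corrupt freepointer at object %ld\n",
                                       (long)(o - slab));
                                o->next = NULL;  /* isolate the rest */
                        }
                }
                return 0;
        }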

    [akpm@linux-foundation.org: fix build with CONFIG_SLAB_DEBUG=n]
    Signed-off-by: Dongli Zhang
    Signed-off-by: Andrew Morton
    Cc: Joe Jin
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20200331031450.12182-1-dongli.zhang@oracle.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Dongli Zhang
     

17 Jun, 2020

1 commit

  • commit dde3c6b72a16c2db826f54b2d49bdea26c3534a2 upstream.

    syzkaller reports a memory leak when kobject_init_and_add() returns an
    error in sysfs_slab_add() [1].

    When this happens, kobject_put() is not called for the corresponding
    kobject, which potentially leads to a memory leak.

    This patch fixes the issue by calling kobject_put() even if
    kobject_init_and_add() fails.
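
    The pattern being fixed, roughly (kobject_init_and_add() takes a
    reference that must be dropped with kobject_put() even when it fails;
    the argument names follow sysfs_slab_add() from memory and may differ):

        err = kobject_init_and_add(&s->kobj, &slab_ktype, NULL, "%s", name);
        if (err) {
                kobject_put(&s->kobj);  /* drop the reference on failure too */
                goto out;
        }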

    [1]
    BUG: memory leak
    unreferenced object 0xffff8880a6d4be88 (size 8):
    comm "syz-executor.3", pid 946, jiffies 4295772514 (age 18.396s)
    hex dump (first 8 bytes):
    70 69 64 5f 33 00 ff ff pid_3...
    backtrace:
    kstrdup+0x35/0x70 mm/util.c:60
    kstrdup_const+0x3d/0x50 mm/util.c:82
    kvasprintf_const+0x112/0x170 lib/kasprintf.c:48
    kobject_set_name_vargs+0x55/0x130 lib/kobject.c:289
    kobject_add_varg lib/kobject.c:384 [inline]
    kobject_init_and_add+0xd8/0x170 lib/kobject.c:473
    sysfs_slab_add+0x1d8/0x290 mm/slub.c:5811
    __kmem_cache_create+0x50a/0x570 mm/slub.c:4384
    create_cache+0x113/0x1e0 mm/slab_common.c:407
    kmem_cache_create_usercopy+0x1a1/0x260 mm/slab_common.c:505
    kmem_cache_create+0xd/0x10 mm/slab_common.c:564
    create_pid_cachep kernel/pid_namespace.c:54 [inline]
    create_pid_namespace kernel/pid_namespace.c:96 [inline]
    copy_pid_ns+0x77c/0x8f0 kernel/pid_namespace.c:148
    create_new_namespaces+0x26b/0xa30 kernel/nsproxy.c:95
    unshare_nsproxy_namespaces+0xa7/0x1e0 kernel/nsproxy.c:229
    ksys_unshare+0x3d2/0x770 kernel/fork.c:2969
    __do_sys_unshare kernel/fork.c:3037 [inline]
    __se_sys_unshare kernel/fork.c:3035 [inline]
    __x64_sys_unshare+0x2d/0x40 kernel/fork.c:3035
    do_syscall_64+0xa1/0x530 arch/x86/entry/common.c:295

    Fixes: 80da026a8e5d ("mm/slub: fix slab double-free in case of duplicate sysfs filename")
    Reported-by: Hulk Robot
    Signed-off-by: Wang Hai
    Signed-off-by: Andrew Morton
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20200602115033.1054-1-wanghai38@huawei.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Wang Hai
     

13 Apr, 2020

1 commit

  • commit 1ad53d9fa3f6168ebcf48a50e08b170432da2257 upstream.

    Under CONFIG_SLAB_FREELIST_HARDENED=y, the obfuscation was relatively weak
    in that the ptr and ptr address were usually so close that the first XOR
    would result in an almost entirely 0-byte value[1], leaving most of the
    "secret" number ultimately being stored after the third XOR. A single
    blind memory content exposure of the freelist was generally sufficient to
    learn the secret.

    Add a swab() call to mix bits a little more. This is a cheap way (1
    cycle) to make attacks need more than a single exposure to learn the
    secret (or to know _where_ the exposure is in memory).
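
    A stand-alone model of the encoding, with the values taken from the
    "after" walk shown below and swab() modelled as a 64-bit byte swap:

        #include <stdint.h>
        #include <stdio.h>

        /* stored = ptr ^ secret ^ swab(ptr_addr) */
        int main(void)
        {
                uint64_t secret   = 0x86528eb656b3b59dULL; /* s->random    */
                uint64_t ptr      = 0xffff9eed6e019020ULL; /* next object  */
                uint64_t ptr_addr = 0xffff9eed6e019000ULL; /* storage slot */

                uint64_t before = ptr ^ secret ^ ptr_addr;
                uint64_t after  = ptr ^ secret ^ __builtin_bswap64(ptr_addr);

                /* before: one bit flip away from the secret;
                 * after: 793d1135d52cda42, matching the first row below */
                printf("%016llx %016llx\n",
                       (unsigned long long)before,
                       (unsigned long long)after);
                return 0;
        }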

    kmalloc-32 freelist walk, before:

    ptr ptr_addr stored value secret
    ffff90c22e019020@ffff90c22e019000 is 86528eb656b3b5bd (86528eb656b3b59d)
    ffff90c22e019040@ffff90c22e019020 is 86528eb656b3b5fd (86528eb656b3b59d)
    ffff90c22e019060@ffff90c22e019040 is 86528eb656b3b5bd (86528eb656b3b59d)
    ffff90c22e019080@ffff90c22e019060 is 86528eb656b3b57d (86528eb656b3b59d)
    ffff90c22e0190a0@ffff90c22e019080 is 86528eb656b3b5bd (86528eb656b3b59d)
    ...

    after:

    ptr ptr_addr stored value secret
    ffff9eed6e019020@ffff9eed6e019000 is 793d1135d52cda42 (86528eb656b3b59d)
    ffff9eed6e019040@ffff9eed6e019020 is 593d1135d52cda22 (86528eb656b3b59d)
    ffff9eed6e019060@ffff9eed6e019040 is 393d1135d52cda02 (86528eb656b3b59d)
    ffff9eed6e019080@ffff9eed6e019060 is 193d1135d52cdae2 (86528eb656b3b59d)
    ffff9eed6e0190a0@ffff9eed6e019080 is f93d1135d52cdac2 (86528eb656b3b59d)

    [1] https://blog.infosectcbr.com.au/2020/03/weaknesses-in-linux-kernel-heap.html

    Fixes: 2482ddec670f ("mm: add SLUB free list pointer obfuscation")
    Reported-by: Silvio Cesare
    Signed-off-by: Kees Cook
    Signed-off-by: Andrew Morton
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc:
    Link: http://lkml.kernel.org/r/202003051623.AF4F8CB@keescook
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     

25 Mar, 2020

2 commits

  • commit 0715e6c516f106ed553828a671d30ad9a3431536 upstream.

    Sachin reports [1] a crash in SLUB __slab_alloc():

    BUG: Kernel NULL pointer dereference on read at 0x000073b0
    Faulting instruction address: 0xc0000000003d55f4
    Oops: Kernel access of bad area, sig: 11 [#1]
    LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
    Modules linked in:
    CPU: 19 PID: 1 Comm: systemd Not tainted 5.6.0-rc2-next-20200218-autotest #1
    NIP: c0000000003d55f4 LR: c0000000003d5b94 CTR: 0000000000000000
    REGS: c0000008b37836d0 TRAP: 0300 Not tainted (5.6.0-rc2-next-20200218-autotest)
    MSR: 8000000000009033 CR: 24004844 XER: 00000000
    CFAR: c00000000000dec4 DAR: 00000000000073b0 DSISR: 40000000 IRQMASK: 1
    GPR00: c0000000003d5b94 c0000008b3783960 c00000000155d400 c0000008b301f500
    GPR04: 0000000000000dc0 0000000000000002 c0000000003443d8 c0000008bb398620
    GPR08: 00000008ba2f0000 0000000000000001 0000000000000000 0000000000000000
    GPR12: 0000000024004844 c00000001ec52a00 0000000000000000 0000000000000000
    GPR16: c0000008a1b20048 c000000001595898 c000000001750c18 0000000000000002
    GPR20: c000000001750c28 c000000001624470 0000000fffffffe0 5deadbeef0000122
    GPR24: 0000000000000001 0000000000000dc0 0000000000000002 c0000000003443d8
    GPR28: c0000008b301f500 c0000008bb398620 0000000000000000 c00c000002287180
    NIP ___slab_alloc+0x1f4/0x760
    LR __slab_alloc+0x34/0x60
    Call Trace:
    ___slab_alloc+0x334/0x760 (unreliable)
    __slab_alloc+0x34/0x60
    __kmalloc_node+0x110/0x490
    kvmalloc_node+0x58/0x110
    mem_cgroup_css_online+0x108/0x270
    online_css+0x48/0xd0
    cgroup_apply_control_enable+0x2ec/0x4d0
    cgroup_mkdir+0x228/0x5f0
    kernfs_iop_mkdir+0x90/0xf0
    vfs_mkdir+0x110/0x230
    do_mkdirat+0xb0/0x1a0
    system_call+0x5c/0x68

    This is a PowerPC platform with following NUMA topology:

    available: 2 nodes (0-1)
    node 0 cpus:
    node 0 size: 0 MB
    node 0 free: 0 MB
    node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
    node 1 size: 35247 MB
    node 1 free: 30907 MB
    node distances:
    node 0 1
    0: 10 40
    1: 40 10

    possible numa nodes: 0-31

    This only happens with a mmotm patch "mm/memcontrol.c: allocate
    shrinker_map on appropriate NUMA node" [2] which effectively calls
    kmalloc_node for each possible node. SLUB however only allocates
    kmem_cache_node on online N_NORMAL_MEMORY nodes, and relies on
    node_to_mem_node to return such valid node for other nodes since commit
    a561ce00b09e ("slub: fall back to node_to_mem_node() node if allocating
    on memoryless node"). This is however not true in this configuration
    where the _node_numa_mem_ array is not initialized for nodes 0 and 2-31,
    thus it contains zeroes and get_partial() ends up accessing
    non-allocated kmem_cache_node.

    A related issue was reported by Bharata (originally by Ramachandran) [3],
    where a similar PowerPC configuration, but with a mainline kernel without
    patch [2], ends up allocating large amounts of pages for kmalloc-1k and
    kmalloc-512. This seems to have the same underlying issue with
    node_to_mem_node() not behaving as expected, and might also lead to an
    infinite loop with CONFIG_SLUB_CPU_PARTIAL [4].

    This patch should fix both issues by not relying on node_to_mem_node()
    anymore and instead simply falling back to NUMA_NO_NODE, when
    kmalloc_node(node) is attempted for a node that's not online, or has no
    usable memory. The "usable memory" condition is also changed from
    node_present_pages() to N_NORMAL_MEMORY node state, as that is exactly
    the condition that SLUB uses to allocate kmem_cache_node structures.
    The check in get_partial() is removed completely, as the checks in
    ___slab_alloc() are now sufficient to prevent get_partial() being
    reached with an invalid node.
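
    The described fallback boils down to a check of roughly this shape early
    in the slow path (a sketch, not the exact diff):

        if (node != NUMA_NO_NODE && !node_state(node, N_NORMAL_MEMORY))
                node = NUMA_NO_NODE;    /* offline or memoryless node */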

    [1] https://lore.kernel.org/linux-next/3381CD91-AB3D-4773-BA04-E7A072A63968@linux.vnet.ibm.com/
    [2] https://lore.kernel.org/linux-mm/fff0e636-4c36-ed10-281c-8cdb0687c839@virtuozzo.com/
    [3] https://lore.kernel.org/linux-mm/20200317092624.GB22538@in.ibm.com/
    [4] https://lore.kernel.org/linux-mm/088b5996-faae-8a56-ef9c-5b567125ae54@suse.cz/

    Fixes: a561ce00b09e ("slub: fall back to node_to_mem_node() node if allocating on memoryless node")
    Reported-by: Sachin Sant
    Reported-by: PUVICHAKRAVARTHY RAMACHANDRAN
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Tested-by: Sachin Sant
    Tested-by: Bharata B Rao
    Reviewed-by: Srikar Dronamraju
    Cc: Mel Gorman
    Cc: Michael Ellerman
    Cc: Michal Hocko
    Cc: Christopher Lameter
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Kirill Tkhai
    Cc: Vlastimil Babka
    Cc: Nathan Lynch
    Cc:
    Link: http://lkml.kernel.org/r/20200320115533.9604-1-vbabka@suse.cz
    Debugged-by: Srikar Dronamraju
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     
  • commit 5076190daded2197f62fe92cf69674488be44175 upstream.

    This is just a cleanup addition to Jann's fix to properly update the
    transaction ID for the slub slowpath in commit fd4d9c7d0c71 ("mm: slub:
    add missing TID bump..").

    The transaction ID is what protects us against any concurrent accesses,
    but we should really also make sure to make the 'freelist' comparison
    itself always use the same freelist value that we then used as the new
    next free pointer.

    Jann points out that if we do all of this carefully, we could skip the
    transaction ID update for all the paths that only remove entries from
    the lists, and only update the TID when adding entries (to avoid the ABA
    issue with cmpxchg and list handling re-adding a previously seen value).

    But this patch just does the "make sure to cmpxchg the same value we
    used" part rather than trying to be clever.
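
    The general shape of "compare against the same snapshot you linked in",
    shown here with C11 atomics rather than SLUB's cmpxchg_double (purely
    illustrative):

        #include <stdatomic.h>

        struct node { struct node *next; };

        /* Take one snapshot of the head, link the new node to that snapshot,
         * and use the very same snapshot as the expected value of the CAS.
         * Re-reading the head for the comparison is the kind of mismatch
         * this cleanup removes. */
        static void push(_Atomic(struct node *) *head, struct node *n)
        {
                struct node *snapshot = atomic_load(head);

                do {
                        n->next = snapshot;
                } while (!atomic_compare_exchange_weak(head, &snapshot, n));
        }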

    Acked-by: Jann Horn
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     

21 Mar, 2020

1 commit

  • commit fd4d9c7d0c71866ec0c2825189ebd2ce35bd95b8 upstream.

    When kmem_cache_alloc_bulk() attempts to allocate N objects from a percpu
    freelist of length M, and N > M > 0, it will first remove the M elements
    from the percpu freelist, then call ___slab_alloc() to allocate the next
    element and repopulate the percpu freelist. ___slab_alloc() can re-enable
    IRQs via allocate_slab(), so the TID must be bumped before ___slab_alloc()
    to properly commit the freelist head change.

    Fix it by unconditionally bumping c->tid when entering the slowpath.
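
    In kmem_cache_alloc_bulk()'s slowpath branch the bump looks roughly like
    this (a sketch based on the description above):

        /* objects may already have been taken off c->freelist without a TID
         * bump; ___slab_alloc() can re-enable IRQs, so commit the change now */
        c->tid = next_tid(c->tid);
        p[i] = ___slab_alloc(s, flags, NUMA_NO_NODE, _RET_IP_, c);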

    Cc: stable@vger.kernel.org
    Fixes: ebe909e0fdb3 ("slub: improve bulk alloc strategy")
    Signed-off-by: Jann Horn
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     

23 Jan, 2020

1 commit

  • commit 8e57f8acbbd121ecfb0c9dc13b8b030f86c6bd3b upstream.

    Commit 96a2b03f281d ("mm, debug_pagelloc: use static keys to enable
    debugging") has introduced a static key to reduce overhead when
    debug_pagealloc is compiled in but not enabled. It relied on the
    assumption that jump_label_init() is called before parse_early_param()
    as in start_kernel(), so when the "debug_pagealloc=on" option is parsed,
    it is safe to enable the static key.

    However, it turns out multiple architectures call parse_early_param()
    earlier from their setup_arch(). x86 also calls jump_label_init() even
    earlier, so no issue was found while testing the commit, but same is not
    true for e.g. ppc64 and s390 where the kernel would not boot with
    debug_pagealloc=on as found by our QA.

    To fix this without tricky changes to init code of multiple
    architectures, this patch partially reverts the static key conversion
    from 96a2b03f281d. Init-time and non-fastpath calls (such as in arch
    code) of debug_pagealloc_enabled() will again test a simple bool
    variable. Fastpath mm code is converted to a new
    debug_pagealloc_enabled_static() variant that relies on the static key,
    which is enabled in a well-defined point in mm_init() where it's
    guaranteed that jump_label_init() has been called, regardless of
    architecture.
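
    The resulting pair of accessors looks roughly like this (reconstructed
    from the description; the early flag is the one exported in the note
    below):

        extern bool _debug_pagealloc_enabled_early;
        DECLARE_STATIC_KEY_FALSE(_debug_pagealloc_enabled);

        /* safe for init-time and non-fastpath callers: a plain bool */
        static inline bool debug_pagealloc_enabled(void)
        {
                return IS_ENABLED(CONFIG_DEBUG_PAGEALLOC) &&
                       _debug_pagealloc_enabled_early;
        }

        /* fastpath variant: static key, valid once mm_init() has enabled it */
        static inline bool debug_pagealloc_enabled_static(void)
        {
                if (!IS_ENABLED(CONFIG_DEBUG_PAGEALLOC))
                        return false;
                return static_branch_unlikely(&_debug_pagealloc_enabled);
        }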

    [sfr@canb.auug.org.au: export _debug_pagealloc_enabled_early]
    Link: http://lkml.kernel.org/r/20200106164944.063ac07b@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20191219130612.23171-1-vbabka@suse.cz
    Fixes: 96a2b03f281d ("mm, debug_pagelloc: use static keys to enable debugging")
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Stephen Rothwell
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Qian Cai
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     

16 Nov, 2019

1 commit

  • Commit 1b7e816fc80e ("mm: slub: Fix slab walking for init_on_free")
    fixed one problem with the slab walking but missed a key detail: When
    walking the list, the head and tail pointers need to be updated since we
    end up reversing the list as a result. Without doing this, bulk free is
    broken.

    One way this is exposed is a NULL pointer dereference with slub_debug=F:

    =============================================================================
    BUG skbuff_head_cache (Tainted: G T): Object already free
    -----------------------------------------------------------------------------

    INFO: Slab 0x000000000d2d2f8f objects=16 used=3 fp=0x0000000064309071 flags=0x3fff00000000201
    BUG: kernel NULL pointer dereference, address: 0000000000000000
    Oops: 0000 [#1] PREEMPT SMP PTI
    RIP: 0010:print_trailer+0x70/0x1d5
    Call Trace:

    free_debug_processing.cold.37+0xc9/0x149
    __slab_free+0x22a/0x3d0
    kmem_cache_free_bulk+0x415/0x420
    __kfree_skb_flush+0x30/0x40
    net_rx_action+0x2dd/0x480
    __do_softirq+0xf0/0x246
    irq_exit+0x93/0xb0
    do_IRQ+0xa0/0x110
    common_interrupt+0xf/0xf

    Given we're now almost identical to the existing debugging code which
    correctly walks the list, combine with that.

    Link: https://lkml.kernel.org/r/20191104170303.GA50361@gandi.net
    Link: http://lkml.kernel.org/r/20191106222208.26815-1-labbott@redhat.com
    Fixes: 1b7e816fc80e ("mm: slub: Fix slab walking for init_on_free")
    Signed-off-by: Laura Abbott
    Reported-by: Thibaut Sautereau
    Acked-by: David Rientjes
    Tested-by: Alexander Potapenko
    Acked-by: Alexander Potapenko
    Cc: Kees Cook
    Cc: "David S. Miller"
    Cc: Vlastimil Babka
    Cc:
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laura Abbott
     

15 Oct, 2019

2 commits

  • slab_alloc_node() already zeroed out the freelist pointer if
    init_on_free was on. Thibaut Sautereau noticed that the same needs to
    be done for kmem_cache_alloc_bulk(), which performs the allocations
    separately.

    kmem_cache_alloc_bulk() is currently used in two places in the kernel,
    so this change is unlikely to have a major performance impact.

    SLAB doesn't require a similar change, as auto-initialization makes the
    allocator store the freelist pointers off-slab.

    Link: http://lkml.kernel.org/r/20191007091605.30530-1-glider@google.com
    Fixes: 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options")
    Signed-off-by: Alexander Potapenko
    Reported-by: Thibaut Sautereau
    Reported-by: Kees Cook
    Cc: Christoph Lameter
    Cc: Laura Abbott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • A long time ago we fixed a similar deadlock in show_slab_objects() [1].
    However, apparently due to commits like 01fb58bcba63 ("slab: remove
    synchronous synchronize_sched() from memcg cache deactivation path") and
    03afc0e25f7f ("slab: get_online_mems for
    kmem_cache_{create,destroy,shrink}"), this kind of deadlock is back: just
    reading files in /sys/kernel/slab will generate the lockdep splat below.

    Since the "mem_hotplug_lock" here is only taken to obtain a stable online
    node mask while racing with NUMA node hotplug, in the worst case the
    results may be miscalculated during NUMA node hotplug, but they shall be
    corrected by later reads of the same files.

    WARNING: possible circular locking dependency detected
    ------------------------------------------------------
    cat/5224 is trying to acquire lock:
    ffff900012ac3120 (mem_hotplug_lock.rw_sem){++++}, at:
    show_slab_objects+0x94/0x3a8

    but task is already holding lock:
    b8ff009693eee398 (kn->count#45){++++}, at: kernfs_seq_start+0x44/0xf0

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #2 (kn->count#45){++++}:
    lock_acquire+0x31c/0x360
    __kernfs_remove+0x290/0x490
    kernfs_remove+0x30/0x44
    sysfs_remove_dir+0x70/0x88
    kobject_del+0x50/0xb0
    sysfs_slab_unlink+0x2c/0x38
    shutdown_cache+0xa0/0xf0
    kmemcg_cache_shutdown_fn+0x1c/0x34
    kmemcg_workfn+0x44/0x64
    process_one_work+0x4f4/0x950
    worker_thread+0x390/0x4bc
    kthread+0x1cc/0x1e8
    ret_from_fork+0x10/0x18

    -> #1 (slab_mutex){+.+.}:
    lock_acquire+0x31c/0x360
    __mutex_lock_common+0x16c/0xf78
    mutex_lock_nested+0x40/0x50
    memcg_create_kmem_cache+0x38/0x16c
    memcg_kmem_cache_create_func+0x3c/0x70
    process_one_work+0x4f4/0x950
    worker_thread+0x390/0x4bc
    kthread+0x1cc/0x1e8
    ret_from_fork+0x10/0x18

    -> #0 (mem_hotplug_lock.rw_sem){++++}:
    validate_chain+0xd10/0x2bcc
    __lock_acquire+0x7f4/0xb8c
    lock_acquire+0x31c/0x360
    get_online_mems+0x54/0x150
    show_slab_objects+0x94/0x3a8
    total_objects_show+0x28/0x34
    slab_attr_show+0x38/0x54
    sysfs_kf_seq_show+0x198/0x2d4
    kernfs_seq_show+0xa4/0xcc
    seq_read+0x30c/0x8a8
    kernfs_fop_read+0xa8/0x314
    __vfs_read+0x88/0x20c
    vfs_read+0xd8/0x10c
    ksys_read+0xb0/0x120
    __arm64_sys_read+0x54/0x88
    el0_svc_handler+0x170/0x240
    el0_svc+0x8/0xc

    other info that might help us debug this:

    Chain exists of:
    mem_hotplug_lock.rw_sem --> slab_mutex --> kn->count#45

    Possible unsafe locking scenario:

    CPU0 CPU1
    ---- ----
    lock(kn->count#45);
    lock(slab_mutex);
    lock(kn->count#45);
    lock(mem_hotplug_lock.rw_sem);

    *** DEADLOCK ***

    3 locks held by cat/5224:
    #0: 9eff00095b14b2a0 (&p->lock){+.+.}, at: seq_read+0x4c/0x8a8
    #1: 0eff008997041480 (&of->mutex){+.+.}, at: kernfs_seq_start+0x34/0xf0
    #2: b8ff009693eee398 (kn->count#45){++++}, at:
    kernfs_seq_start+0x44/0xf0

    stack backtrace:
    Call trace:
    dump_backtrace+0x0/0x248
    show_stack+0x20/0x2c
    dump_stack+0xd0/0x140
    print_circular_bug+0x368/0x380
    check_noncircular+0x248/0x250
    validate_chain+0xd10/0x2bcc
    __lock_acquire+0x7f4/0xb8c
    lock_acquire+0x31c/0x360
    get_online_mems+0x54/0x150
    show_slab_objects+0x94/0x3a8
    total_objects_show+0x28/0x34
    slab_attr_show+0x38/0x54
    sysfs_kf_seq_show+0x198/0x2d4
    kernfs_seq_show+0xa4/0xcc
    seq_read+0x30c/0x8a8
    kernfs_fop_read+0xa8/0x314
    __vfs_read+0x88/0x20c
    vfs_read+0xd8/0x10c
    ksys_read+0xb0/0x120
    __arm64_sys_read+0x54/0x88
    el0_svc_handler+0x170/0x240
    el0_svc+0x8/0xc

    I think it is important to mention that this doesn't expose
    show_slab_objects() to a use-after-free. There is only a single path that
    might really race here and that is the slab hotplug notifier callback
    __kmem_cache_shrink (via slab_mem_going_offline_callback) but that path
    doesn't really destroy kmem_cache_node data structures.

    [1] http://lkml.iu.edu/hypermail/linux/kernel/1101.0/02850.html

    [akpm@linux-foundation.org: add comment explaining why we don't need mem_hotplug_lock]
    Link: http://lkml.kernel.org/r/1570192309-10132-1-git-send-email-cai@lca.pw
    Fixes: 01fb58bcba63 ("slab: remove synchronous synchronize_sched() from memcg cache deactivation path")
    Fixes: 03afc0e25f7f ("slab: get_online_mems for kmem_cache_{create,destroy,shrink}")
    Signed-off-by: Qian Cai
    Acked-by: Michal Hocko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Cc: Roman Gushchin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     

08 Oct, 2019

1 commit

  • Patch series "guarantee natural alignment for kmalloc()", v2.

    This patch (of 2):

    SLOB currently doesn't account its pages at all, so in /proc/meminfo the
    Slab field shows zero. Modifying a counter on page allocation and
    freeing should be acceptable even for the small system scenarios SLOB is
    intended for. Since reclaimable caches are not separated in SLOB,
    account everything as unreclaimable.

    SLUB currently doesn't account kmalloc() and kmalloc_node() allocations
    larger than order-1 page, that are passed directly to the page
    allocator. As they also don't appear in /proc/slabinfo, it might look
    like a memory leak. For consistency, account them as well. (SLAB
    doesn't actually use page allocator directly, so no change there).

    Ideally SLOB and SLUB would be handled in separate patches, but due to
    the shared kmalloc_order() function and different kfree()
    implementations, it's easier to patch both at once to prevent
    inconsistencies.

    Link: http://lkml.kernel.org/r/20190826111627.7505-2-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Ming Lei
    Cc: Dave Chinner
    Cc: Matthew Wilcox
    Cc: "Darrick J . Wong"
    Cc: Christoph Hellwig
    Cc: James Bottomley
    Cc: Vlastimil Babka
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

25 Sep, 2019

3 commits

  • Patch series "Make working with compound pages easier", v2.

    These three patches add three helpers and convert the appropriate
    places to use them.

    This patch (of 3):

    It's unnecessarily hard to find out the size of a potentially huge page.
    Replace 'PAGE_SIZE << compound_order(page)' with page_size(page).
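
    The helper is essentially a one-liner (a sketch matching the
    description):

        /* total size in bytes of a (possibly compound/huge) page */
        static inline unsigned long page_size(struct page *page)
        {
                return PAGE_SIZE << compound_order(page);
        }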

    Link: http://lkml.kernel.org/r/20190721104612.19120-2-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Reviewed-by: Ira Weiny
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • tid_to_cpu() and tid_to_event() are only used in note_cmpxchg_failure()
    when SLUB_DEBUG_CMPXCHG=y, so with the default SLUB_DEBUG_CMPXCHG=n Clang
    will complain that those functions are unused.

    Link: http://lkml.kernel.org/r/1568752232-5094-1-git-send-email-cai@lca.pw
    Signed-off-by: Qian Cai
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • Currently, a value of '1' is written to the /sys/kernel/slab//shrink
    file to shrink the slab by flushing out all the per-cpu slabs and free
    slabs in partial lists. This can be useful to squeeze out a bit more
    memory under extreme conditions as well as to make the active object
    counts in /proc/slabinfo more accurate.

    This usually applies only to the root caches, as the SLUB_MEMCG_SYSFS_ON
    option is usually not enabled and "slub_memcg_sysfs=1" not set. Even if
    memcg sysfs is turned on, it is too cumbersome and impractical to manage
    all those per-memcg sysfs files in a real production system.

    So there is no practical way to shrink memcg caches. Fix this by enabling
    a proper write to the shrink sysfs file of the root cache to scan all the
    available memcg caches and shrink them as well. For a non-root memcg
    cache (when SLUB_MEMCG_SYSFS_ON or slub_memcg_sysfs is on), only that
    cache will be shrunk when written.

    On a 2-socket, 64-core, 256-thread arm64 system with 64k pages, the
    amount of memory occupied by slabs after a parallel kernel build and
    before shrinking them was:

    # grep task_struct /proc/slabinfo
    task_struct 53137 53192 4288 61 4 : tunables 0 0 0 : slabdata 872 872 0
    # grep "^S[lRU]" /proc/meminfo
    Slab: 3936832 kB
    SReclaimable: 399104 kB
    SUnreclaim: 3537728 kB

    After shrinking slabs (by echoing "1" to all shrink files):

    # grep "^S[lRU]" /proc/meminfo
    Slab: 1356288 kB
    SReclaimable: 263296 kB
    SUnreclaim: 1092992 kB
    # grep task_struct /proc/slabinfo
    task_struct 2764 6832 4288 61 4 : tunables 0 0 0 : slabdata 112 112 0

    Link: http://lkml.kernel.org/r/20190723151445.7385-1-longman@redhat.com
    Signed-off-by: Waiman Long
    Acked-by: Roman Gushchin
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shakeel Butt
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Waiman Long
     

01 Aug, 2019

1 commit

  • To properly clear the slab on free with slab_want_init_on_free, we walk
    the list of free objects using get_freepointer/set_freepointer.

    The value we get from get_freepointer may not be valid. This isn't an
    issue since an actual value will get written later but this means
    there's a chance of triggering a bug if we use this value with
    set_freepointer:

    kernel BUG at mm/slub.c:306!
    invalid opcode: 0000 [#1] PREEMPT PTI
    CPU: 0 PID: 0 Comm: swapper Not tainted 5.2.0-05754-g6471384a #4
    RIP: 0010:kfree+0x58a/0x5c0
    Code: 48 83 05 78 37 51 02 01 0f 0b 48 83 05 7e 37 51 02 01 48 83 05 7e 37 51 02 01 48 83 05 7e 37 51 02 01 48 83 05 d6 37 51 02 01 0b 48 83 05 d4 37 51 02 01 48 83 05 d4 37 51 02 01 48 83 05 d4
    RSP: 0000:ffffffff82603d90 EFLAGS: 00010002
    RAX: ffff8c3976c04320 RBX: ffff8c3976c04300 RCX: 0000000000000000
    RDX: ffff8c3976c04300 RSI: 0000000000000000 RDI: ffff8c3976c04320
    RBP: ffffffff82603db8 R08: 0000000000000000 R09: 0000000000000000
    R10: ffff8c3976c04320 R11: ffffffff8289e1e0 R12: ffffd52cc8db0100
    R13: ffff8c3976c01a00 R14: ffffffff810f10d4 R15: ffff8c3976c04300
    FS: 0000000000000000(0000) GS:ffffffff8266b000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffff8c397ffff000 CR3: 0000000125020000 CR4: 00000000000406b0
    Call Trace:
    apply_wqattrs_prepare+0x154/0x280
    apply_workqueue_attrs_locked+0x4e/0xe0
    apply_workqueue_attrs+0x36/0x60
    alloc_workqueue+0x25a/0x6d0
    workqueue_init_early+0x246/0x348
    start_kernel+0x3c7/0x7ec
    x86_64_start_reservations+0x40/0x49
    x86_64_start_kernel+0xda/0xe4
    secondary_startup_64+0xb6/0xc0
    Modules linked in:
    ---[ end trace f67eb9af4d8d492b ]---

    Fix this by ensuring the value we set with set_freepointer is either NULL
    or another value in the chain.
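
    A stand-alone illustration of the invariant the fix enforces: when
    relinking freed objects, only ever write NULL or an object already placed
    in the chain, never a possibly-stale value read from the freed memory
    (not the kernel code):

        #include <stddef.h>
        #include <stdio.h>

        struct obj { struct obj *next; };

        /* relink head-first: every next pointer written is NULL or a node
         * that is already part of the new chain */
        static struct obj *relink(struct obj **objs, size_t n)
        {
                struct obj *head = NULL;

                for (size_t i = 0; i < n; i++) {
                        objs[i]->next = head;
                        head = objs[i];
                }
                return head;
        }

        int main(void)
        {
                struct obj a, b, c;
                struct obj *freed[] = { &a, &b, &c };

                for (struct obj *o = relink(freed, 3); o; o = o->next)
                        printf("%p\n", (void *)o);
                return 0;
        }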

    Reported-by: kernel test robot
    Signed-off-by: Laura Abbott
    Fixes: 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options")
    Reviewed-by: Kees Cook
    Signed-off-by: Linus Torvalds

    Laura Abbott
     

13 Jul, 2019

7 commits

  • Patch series "add init_on_alloc/init_on_free boot options", v10.

    Provide init_on_alloc and init_on_free boot options.

    These are aimed at preventing possible information leaks and making the
    control-flow bugs that depend on uninitialized values more deterministic.

    Enabling either of the options guarantees that the memory returned by the
    page allocator and SL[AU]B is initialized with zeroes. SLOB allocator
    isn't supported at the moment, as its emulation of kmem caches complicates
    handling of SLAB_TYPESAFE_BY_RCU caches correctly.

    Enabling init_on_free also guarantees that pages and heap objects are
    initialized right after they're freed, so it won't be possible to access
    stale data by using a dangling pointer.

    As suggested by Michal Hocko, right now we don't let heap users disable
    initialization for certain allocations. There's not enough evidence that
    doing so can speed up real-life cases, and introducing ways to opt out
    may result in things going out of control.

    This patch (of 2):

    The new options are needed to prevent possible information leaks and make
    control-flow bugs that depend on uninitialized values more deterministic.

    This is expected to be on-by-default on Android and Chrome OS. And it
    gives the opportunity for anyone else to use it under distros too via the
    boot args. (The init_on_free feature is regularly requested by folks
    where memory forensics is included in their threat models.)

    init_on_alloc=1 makes the kernel initialize newly allocated pages and heap
    objects with zeroes. Initialization is done at allocation time at the
    places where checks for __GFP_ZERO are performed.

    init_on_free=1 makes the kernel initialize freed pages and heap objects
    with zeroes upon their deletion. This helps to ensure sensitive data
    doesn't leak via use-after-free accesses.

    Both init_on_alloc=1 and init_on_free=1 guarantee that the allocator
    returns zeroed memory. The two exceptions are slab caches with
    constructors and SLAB_TYPESAFE_BY_RCU flag. Those are never
    zero-initialized to preserve their semantics.

    Both init_on_alloc and init_on_free default to zero, but those defaults
    can be overridden with CONFIG_INIT_ON_ALLOC_DEFAULT_ON and
    CONFIG_INIT_ON_FREE_DEFAULT_ON.

    If either SLUB poisoning or page poisoning is enabled, those options take
    precedence over init_on_alloc and init_on_free: initialization is only
    applied to unpoisoned allocations.

    Slowdown for the new features compared to init_on_free=0, init_on_alloc=0:

    hackbench, init_on_free=1: +7.62% sys time (st.err 0.74%)
    hackbench, init_on_alloc=1: +7.75% sys time (st.err 2.14%)

    Linux build with -j12, init_on_free=1: +8.38% wall time (st.err 0.39%)
    Linux build with -j12, init_on_free=1: +24.42% sys time (st.err 0.52%)
    Linux build with -j12, init_on_alloc=1: -0.13% wall time (st.err 0.42%)
    Linux build with -j12, init_on_alloc=1: +0.57% sys time (st.err 0.40%)

    The slowdown for init_on_free=0, init_on_alloc=0 compared to the baseline
    is within the standard error.

    The new features are also going to pave the way for hardware memory
    tagging (e.g. arm64's MTE), which will require both on_alloc and on_free
    hooks to set the tags for heap objects. With MTE, tagging will have the
    same cost as memory initialization.

    Although init_on_free is rather costly, there are paranoid use-cases where
    in-memory data lifetime is desired to be minimized. There are various
    arguments for/against the realism of the associated threat models, but
    given that we'll need the infrastructure for MTE anyway, and there are
    people who want wipe-on-free behavior no matter what the performance cost,
    it seems reasonable to include it in this series.

    [glider@google.com: v8]
    Link: http://lkml.kernel.org/r/20190626121943.131390-2-glider@google.com
    [glider@google.com: v9]
    Link: http://lkml.kernel.org/r/20190627130316.254309-2-glider@google.com
    [glider@google.com: v10]
    Link: http://lkml.kernel.org/r/20190628093131.199499-2-glider@google.com
    Link: http://lkml.kernel.org/r/20190617151050.92663-2-glider@google.com
    Signed-off-by: Alexander Potapenko
    Acked-by: Kees Cook
    Acked-by: Michal Hocko [page and dmapool parts]
    Acked-by: James Morris
    Cc: Christoph Lameter
    Cc: Masahiro Yamada
    Cc: "Serge E. Hallyn"
    Cc: Nick Desaulniers
    Cc: Kostya Serebryany
    Cc: Dmitry Vyukov
    Cc: Sandeep Patil
    Cc: Laura Abbott
    Cc: Randy Dunlap
    Cc: Jann Horn
    Cc: Mark Rutland
    Cc: Marco Elver
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • Currently the page accounting code is duplicated in SLAB and SLUB
    internals. Let's move it into new (un)charge_slab_page helpers in the
    slab_common.c file. These helpers will be responsible for statistics
    (global and memcg-aware) and memcg charging. So they are replacing direct
    memcg_(un)charge_slab() calls.

    Link: http://lkml.kernel.org/r/20190611231813.3148843-6-guro@fb.com
    Signed-off-by: Roman Gushchin
    Reviewed-by: Shakeel Butt
    Acked-by: Christoph Lameter
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Waiman Long
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: Andrei Vagin
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Currently SLUB uses a work scheduled after an RCU grace period to
    deactivate a non-root kmem_cache. This mechanism can be reused for
    kmem_caches release, but requires generalization for SLAB case.

    Introduce kmemcg_cache_deactivate() function, which calls
    allocator-specific __kmem_cache_deactivate() and schedules execution of
    __kmem_cache_deactivate_after_rcu() with all necessary locks in a worker
    context after an rcu grace period.

    Here is the new calling scheme:
      kmemcg_cache_deactivate()
        __kmemcg_cache_deactivate()                 SLAB/SLUB-specific
        kmemcg_rcufn()                              rcu
          kmemcg_workfn()                           work
            __kmemcg_cache_deactivate_after_rcu()   SLAB/SLUB-specific

    instead of:
      __kmemcg_cache_deactivate()                   SLAB/SLUB-specific
        slab_deactivate_memcg_cache_rcu_sched()     SLUB-only
          kmemcg_rcufn()                            rcu
            kmemcg_workfn()                         work
              kmemcg_cache_deact_after_rcu()        SLUB-only

    For consistency, all allocator-specific functions start with "__".

    Link: http://lkml.kernel.org/r/20190611231813.3148843-4-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Vladimir Davydov
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Waiman Long
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: Andrei Vagin
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Patch series "mm: reparent slab memory on cgroup removal", v7.

    # Why do we need this?

    We've noticed that the number of dying cgroups is steadily growing on most
    of our hosts in production. The following investigation revealed an issue
    in the userspace memory reclaim code [1], accounting of kernel stacks [2],
    and also the main reason: slab objects.

    The underlying problem is quite simple: any page charged to a cgroup holds
    a reference to it, so the cgroup can't be reclaimed unless all charged
    pages are gone. If a slab object is actively used by other cgroups, it
    won't be reclaimed, and will prevent the origin cgroup from being
    reclaimed.

    Slab objects, and first of all the vfs cache, are shared between cgroups
    that use the same underlying fs and, what's even more important, between
    multiple generations of the same workload. So if something runs
    periodically, each time in a new cgroup (like systemd does), we do
    accumulate multiple dying cgroups.

    Strictly speaking pagecache isn't different here, but there is a key
    difference: we disable protection and apply some extra pressure on LRUs of
    dying cgroups, and these LRUs contain all charged pages. My experiments
    show that with the disabled kernel memory accounting the number of dying
    cgroups stabilizes at a relatively small number (~100, depends on memory
    pressure and cgroup creation rate), and with kernel memory accounting it
    grows pretty steadily up to several thousands.

    Memory cgroups are quite complex and big objects (mostly due to percpu
    stats), so it leads to noticeable memory losses. Memory occupied by dying
    cgroups is measured in hundreds of megabytes. I've even seen a host with
    more than 100Gb of memory wasted for dying cgroups. It leads to a
    degradation of performance with the uptime, and generally limits the usage
    of cgroups.

    My previous attempt [3] to fix the problem by applying extra pressure on
    slab shrinker lists caused a regressions with xfs and ext4, and has been
    reverted [4]. The following attempts to find the right balance [5, 6]
    were not successful.

    So instead of trying to find a maybe non-existing balance, let's do
    reparent accounted slab caches to the parent cgroup on cgroup removal.

    # Implementation approach

    There is however a significant problem with reparenting of slab memory:
    there is no list of charged pages. Some of them are in shrinker lists,
    but not all. Introducing of a new list is really not an option.

    But fortunately there is a way forward: every slab page has a stable
    pointer to the corresponding kmem_cache. So the idea is to reparent
    kmem_caches instead of slab pages.

    It's actually simpler and cheaper, but requires some underlying changes:
    1) Make kmem_caches hold a single reference to the memory cgroup,
    instead of a separate reference per every slab page.
    2) Stop setting page->mem_cgroup pointer for memcg slab pages and use
    page->kmem_cache->memcg indirection instead. It's used only on
    slab page release, so performance overhead shouldn't be a big issue.
    3) Introduce a refcounter for non-root slab caches. It's required to
    be able to destroy kmem_caches when they become empty and release
    the associated memory cgroup.

    There is a bonus: currently we release all memcg kmem_caches all together
    with the memory cgroup itself. This patchset allows individual
    kmem_caches to be released as soon as they become inactive and free.

    Some additional implementation details are provided in corresponding
    commit messages.

    # Results

    Below is the average number of dying cgroups on two groups of our
    production hosts. They do run some sort of web frontend workload, the
    memory pressure is moderate. As we can see, with the kernel memory
    reparenting the number stabilizes in 60s range; however with the original
    version it grows almost linearly and doesn't show any signs of plateauing.
    The difference in slab and percpu usage between patched and unpatched
    versions also grows linearly. In 7 days it exceeded 200Mb.

    day 0 1 2 3 4 5 6 7
    original 56 362 628 752 1070 1250 1490 1560
    patched 23 46 51 55 60 57 67 69
    mem diff(Mb) 22 74 123 152 164 182 214 241

    # Links

    [1]: commit 68600f623d69 ("mm: don't miss the last page because of round-off error")
    [2]: commit 9b6f7e163cd0 ("mm: rework memcg kernel stack accounting")
    [3]: commit 172b06c32b94 ("mm: slowly shrink slabs with a relatively small number of objects")
    [4]: commit a9a238e83fbb ("Revert "mm: slowly shrink slabs with a relatively small number of objects")
    [5]: https://lkml.org/lkml/2019/1/28/1865
    [6]: https://marc.info/?l=linux-mm&m=155064763626437&w=2

    This patch (of 10):

    Initialize kmem_cache->memcg_params.memcg pointer in memcg_link_cache()
    rather than in init_memcg_params().

    Once kmem_cache will hold a reference to the memory cgroup, it will
    simplify the refcounting.

    For non-root kmem_caches memcg_link_cache() is always called before the
    kmem_cache becomes visible to a user, so it's safe.

    Link: http://lkml.kernel.org/r/20190611231813.3148843-2-guro@fb.com
    Signed-off-by: Roman Gushchin
    Reviewed-by: Shakeel Butt
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Waiman Long
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrei Vagin
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • This refactors common code of ksize() between the various allocators into
    slab_common.c: __ksize() is the allocator-specific implementation without
    instrumentation, whereas ksize() includes the required KASAN logic.
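
    The shared ksize() then looks roughly like this (the KASAN helper name
    here is from memory and may differ in the actual patch):

        size_t ksize(const void *objp)
        {
                size_t size = __ksize(objp);    /* allocator-specific */

                /* callers may use the whole reported area, so unpoison it */
                kasan_unpoison_shadow(objp, size);
                return size;
        }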

    Link: http://lkml.kernel.org/r/20190626142014.141844-5-elver@google.com
    Signed-off-by: Marco Elver
    Acked-by: Christoph Lameter
    Reviewed-by: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Alexander Potapenko
    Cc: Andrey Konovalov
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Mark Rutland
    Cc: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marco Elver
     
  • Currently for CONFIG_SLUB, if a memcg kmem cache creation is failed and
    the corresponding root kmem cache has SLAB_PANIC flag, the kernel will
    be crashed. This is unnecessary as the kernel can handle the creation
    failures of memcg kmem caches. Additionally CONFIG_SLAB does not
    implement this behavior. So, to keep the behavior consistent between
    SLAB and SLUB, removing the panic for memcg kmem cache creation
    failures. The root kmem cache creation failure for SLAB_PANIC correctly
    panics for both SLAB and SLUB.

    Link: http://lkml.kernel.org/r/20190619232514.58994-1-shakeelb@google.com
    Reported-by: Dave Hansen
    Signed-off-by: Shakeel Butt
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Alexey Dobriyan
    Cc: Johannes Weiner
    Cc: Christoph Lameter
    Cc: Roman Gushchin
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     
  • If ',' is not found, kmem_cache_flags() calls strlen() to find the end of
    line. We can do it in a single pass using strchrnul().
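
    The same single-pass idiom in ordinary C (strchrnul() is a GNU extension
    in userspace; the kernel has its own implementation):

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <string.h>

        int main(void)
        {
                const char *opts = "PUF,kmalloc-128";
                /* points at the ',' or, if none, at the terminating NUL */
                const char *end = strchrnul(opts, ',');

                printf("first field is %zu chars long\n", (size_t)(end - opts));
                return 0;
        }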

    Link: http://lkml.kernel.org/r/20190501053111.7950-1-ynorov@marvell.com
    Signed-off-by: Yury Norov
    Acked-by: Aaron Tomlin
    Reviewed-by: Andrew Morton
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yury Norov
     

15 May, 2019

4 commits

  • Now a frozen slab can only be on the per-cpu partial list.

    Link: http://lkml.kernel.org/r/1554022325-11305-1-git-send-email-liu.xiang6@zte.com.cn
    Signed-off-by: Liu Xiang
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Liu Xiang
     
  • When CONFIG_SLUB_DEBUG is not enabled, remove_full() is empty; when it is
    enabled, remove_full() can check s->flags by itself. So the
    kmem_cache_debug() check is useless and can be removed.

    Link: http://lkml.kernel.org/r/1552577313-2830-1-git-send-email-liu.xiang6@zte.com.cn
    Signed-off-by: Liu Xiang
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Liu Xiang
     
  • Currently we use the page->lru list for maintaining lists of slabs. We
    have a list in the page structure (slab_list) that can be used for this
    purpose. Doing so makes the code cleaner since we are not overloading the
    lru list.

    Use the slab_list instead of the lru list for maintaining lists of slabs.

    Link: http://lkml.kernel.org/r/20190402230545.2929-6-tobin@kernel.org
    Signed-off-by: Tobin C. Harding
    Acked-by: Christoph Lameter
    Reviewed-by: Roman Gushchin
    Acked-by: Vlastimil Babka
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Matthew Wilcox
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tobin C. Harding
     
  • SLUB allocator makes heavy use of ifdef/endif pre-processor macros. The
    pairing of these statements is at times hard to follow e.g. if the pair
    are further than a screen apart or if there are nested pairs. We can
    reduce cognitive load by adding a comment to the endif statement of form

    #ifdef CONFIG_FOO
    ...
    #endif /* CONFIG_FOO */

    Add comments to endif pre-processor macros if ifdef/endif pair is not
    immediately apparent.

    Link: http://lkml.kernel.org/r/20190402230545.2929-5-tobin@kernel.org
    Signed-off-by: Tobin C. Harding
    Acked-by: Christoph Lameter
    Reviewed-by: Roman Gushchin
    Acked-by: Vlastimil Babka
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Matthew Wilcox
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tobin C. Harding
     

29 Apr, 2019

1 commit

  • Replace the indirection through struct stack_trace with an invocation of
    the storage array based interface.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Josh Poimboeuf
    Acked-by: Christoph Lameter
    Cc: Andy Lutomirski
    Cc: Andrew Morton
    Cc: Pekka Enberg
    Cc: linux-mm@kvack.org
    Cc: David Rientjes
    Cc: Steven Rostedt
    Cc: Alexander Potapenko
    Cc: Alexey Dobriyan
    Cc: Catalin Marinas
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: kasan-dev@googlegroups.com
    Cc: Mike Rapoport
    Cc: Akinobu Mita
    Cc: Christoph Hellwig
    Cc: iommu@lists.linux-foundation.org
    Cc: Robin Murphy
    Cc: Marek Szyprowski
    Cc: Johannes Thumshirn
    Cc: David Sterba
    Cc: Chris Mason
    Cc: Josef Bacik
    Cc: linux-btrfs@vger.kernel.org
    Cc: dm-devel@redhat.com
    Cc: Mike Snitzer
    Cc: Alasdair Kergon
    Cc: Daniel Vetter
    Cc: intel-gfx@lists.freedesktop.org
    Cc: Joonas Lahtinen
    Cc: Maarten Lankhorst
    Cc: dri-devel@lists.freedesktop.org
    Cc: David Airlie
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Tom Zanussi
    Cc: Miroslav Benes
    Cc: linux-arch@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190425094801.771410441@linutronix.de

    Thomas Gleixner
     

15 Apr, 2019

1 commit

  • No architecture terminates the stack trace with ULONG_MAX anymore. Remove
    the cruft.

    While at it remove the pointless loop of clearing the stack array
    completely. It's sufficient to clear the last entry as the consumers break
    out on the first zeroed entry anyway.

    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra (Intel)
    Cc: Josh Poimboeuf
    Cc: Andy Lutomirski
    Cc: Steven Rostedt
    Cc: Alexander Potapenko
    Cc: Andrew Morton
    Cc: Pekka Enberg
    Cc: linux-mm@kvack.org
    Cc: David Rientjes
    Cc: Christoph Lameter
    Link: https://lkml.kernel.org/r/20190410103644.574058244@linutronix.de

    Thomas Gleixner
     

30 Mar, 2019

1 commit

  • Patch series "iommu/io-pgtable-arm-v7s: Use DMA32 zone for page tables",
    v6.

    This is a followup to the discussion in [1], [2].

    IOMMUs using ARMv7 short-descriptor format require page tables (level 1
    and 2) to be allocated within the first 4GB of RAM, even on 64-bit
    systems.

    For L1 tables that are bigger than a page, we can just use
    __get_free_pages with GFP_DMA32 (on arm64 systems only, arm would still
    use GFP_DMA).

    For L2 tables that only take 1KB, it would be a waste to allocate a full
    page, so we considered 3 approaches:
    1. This series, adding support for GFP_DMA32 slab caches.
    2. genalloc, which requires pre-allocating the maximum number of L2 page
    tables (4096, so 4MB of memory).
    3. page_frag, which is not very memory-efficient as it is unable to reuse
    freed fragments until the whole page is freed. [3]

    This series is the most memory-efficient approach.

    stable@ note:
    We confirmed that this is a regression, and IOMMU errors happen on 4.19
    and linux-next/master on MT8173 (elm, Acer Chromebook R13). The issue
    most likely starts from commit ad67f5a6545f ("arm64: replace ZONE_DMA
    with ZONE_DMA32"), i.e. 4.15, and presumably breaks a number of Mediatek
    platforms (and maybe others?).

    [1] https://lists.linuxfoundation.org/pipermail/iommu/2018-November/030876.html
    [2] https://lists.linuxfoundation.org/pipermail/iommu/2018-December/031696.html
    [3] https://patchwork.codeaurora.org/patch/671639/

    This patch (of 3):

    IOMMUs using ARMv7 short-descriptor format require page tables to be
    allocated within the first 4GB of RAM, even on 64-bit systems. On arm64,
    this is done by passing GFP_DMA32 flag to memory allocation functions.

    For IOMMU L2 tables that only take 1KB, it would be a waste to allocate
    a full page using get_free_pages, so we considered 3 approaches:
    1. This patch, adding support for GFP_DMA32 slab caches.
    2. genalloc, which requires pre-allocating the maximum number of L2
    page tables (4096, so 4MB of memory).
    3. page_frag, which is not very memory-efficient as it is unable
    to reuse freed fragments until the whole page is freed.

    This change makes it possible to create a custom cache in the DMA32 zone
    using kmem_cache_create(), then allocate memory from it using
    kmem_cache_alloc().

    We do not create a DMA32 kmalloc cache array, as there are currently no
    users of kmalloc(..., GFP_DMA32). These calls will continue to trigger a
    warning, as we keep GFP_DMA32 in GFP_SLAB_BUG_MASK.

    This implies that calls to kmem_cache_*alloc on a SLAB_CACHE_DMA32
    kmem_cache must _not_ pass GFP_DMA32 (it would be redundant and
    unnecessary anyway).
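
    A minimal usage sketch under the constraint above (the cache name, object
    size and wrappers are made up for illustration):

        #include <linux/errno.h>
        #include <linux/sizes.h>
        #include <linux/slab.h>

        /* Cache whose objects must sit below 4GB. */
        static struct kmem_cache *l2_table_cache;

        static int l2_table_cache_init(void)
        {
                l2_table_cache = kmem_cache_create("iommu-l2-tables",
                                                   SZ_1K, SZ_1K,
                                                   SLAB_CACHE_DMA32, NULL);
                return l2_table_cache ? 0 : -ENOMEM;
        }

        static void *l2_table_alloc(gfp_t gfp)
        {
                /* The DMA32 placement comes from the cache flag; callers
                 * must not pass GFP_DMA32 here. */
                return kmem_cache_alloc(l2_table_cache, gfp);
        }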

    Link: http://lkml.kernel.org/r/20181210011504.122604-2-drinkcat@chromium.org
    Signed-off-by: Nicolas Boichat
    Acked-by: Vlastimil Babka
    Acked-by: Will Deacon
    Cc: Robin Murphy
    Cc: Joerg Roedel
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Sasha Levin
    Cc: Huaisheng Ye
    Cc: Mike Rapoport
    Cc: Yong Wu
    Cc: Matthias Brugger
    Cc: Tomasz Figa
    Cc: Yingjoe Chen
    Cc: Christoph Hellwig
    Cc: Matthew Wilcox
    Cc: Hsin-Yi Wang
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicolas Boichat
     

06 Mar, 2019

5 commits

The number of NUMA nodes can't be negative.

    This saves a few bytes on x86_64 (a sketch of the kind of change follows
    the size diff below):

    add/remove: 0/0 grow/shrink: 4/21 up/down: 27/-265 (-238)
    Function old new delta
    hv_synic_alloc.cold 88 110 +22
    prealloc_shrinker 260 262 +2
    bootstrap 249 251 +2
    sched_init_numa 1566 1567 +1
    show_slab_objects 778 777 -1
    s_show 1201 1200 -1
    kmem_cache_init 346 345 -1
    __alloc_workqueue_key 1146 1145 -1
    mem_cgroup_css_alloc 1614 1612 -2
    __do_sys_swapon 4702 4699 -3
    __list_lru_init 655 651 -4
    nic_probe 2379 2374 -5
    store_user_store 118 111 -7
    red_zone_store 106 99 -7
    poison_store 106 99 -7
    wq_numa_init 348 338 -10
    __kmem_cache_empty 75 65 -10
    task_numa_free 186 173 -13
    merge_across_nodes_store 351 336 -15
    irq_create_affinity_masks 1261 1246 -15
    do_numa_crng_init 343 321 -22
    task_numa_fault 4760 4737 -23
    swapfile_init 179 156 -23
    hv_synic_alloc 536 492 -44
    apply_wqattrs_prepare 746 695 -51
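
    A sketch of the kind of change behind these numbers (the variable name is
    inferred from the affected functions, not quoted from the patch):

        /* A node count is never negative; making it unsigned also lets the
         * compiler drop sign extension where the value feeds 64-bit math. */
        extern unsigned int nr_node_ids;        /* was: extern int nr_node_ids; */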

    Link: http://lkml.kernel.org/r/20190201223029.GA15820@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • No functional change.

    Link: http://lkml.kernel.org/r/20190118235123.27843-1-richard.weiyang@gmail.com
    Signed-off-by: Wei Yang
    Reviewed-by: Pekka Enberg
    Acked-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • There are two cases when put_cpu_partial() is invoked.

    * __slab_free
    * get_partial_node

    This patch just makes it cover these two cases.

    Link: http://lkml.kernel.org/r/20181025094437.18951-3-richard.weiyang@gmail.com
    Signed-off-by: Wei Yang
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • "addr" function argument is not used in alloc_consistency_checks() at
    all, so remove it.
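
    A sketch of the shape of the change (the signature below is approximated
    from context, not copied from mm/slub.c):

        struct kmem_cache;
        struct page;

        /*
         * The unused "addr" parameter, previously the last argument, is
         * simply dropped and the caller stops passing it.
         */
        static int alloc_consistency_checks(struct kmem_cache *s,
                                            struct page *page, void *object);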

    Link: http://lkml.kernel.org/r/20190211123214.35592-1-cai@lca.pw
    Fixes: becfda68abca ("slub: convert SLAB_DEBUG_FREE to SLAB_CONSISTENCY_CHECKS")
    Signed-off-by: Qian Cai
    Reviewed-by: Andrew Morton
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • new_slab_objects() will return immediately if freelist is not NULL.

        if (freelist)
                return freelist;

    One more assignment operation could be avoided.

    Link: http://lkml.kernel.org/r/20181229062512.30469-1-rocking@whu.edu.cn
    Signed-off-by: Peng Wang
    Reviewed-by: Pekka Enberg
    Reviewed-by: Andrew Morton
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peng Wang
     

22 Feb, 2019

3 commits

In process_slab(), "p = get_freepointer()" could return a tagged
    pointer, but "addr = page_address()" always returns a native (untagged)
    pointer. As a result, slab_index() is computed incorrectly here:

    return (p - addr) / s->size;

    All other callers of slab_index() are in the same situation, where "addr"
    comes from page_address(), so we just need to untag "p" (a minimal sketch
    of the idea follows the oops log below).

    # cat /sys/kernel/slab/hugetlbfs_inode_cache/alloc_calls

    Unable to handle kernel paging request at virtual address 2bff808aa4856d48
    Mem abort info:
    ESR = 0x96000007
    Exception class = DABT (current EL), IL = 32 bits
    SET = 0, FnV = 0
    EA = 0, S1PTW = 0
    Data abort info:
    ISV = 0, ISS = 0x00000007
    CM = 0, WnR = 0
    swapper pgtable: 64k pages, 48-bit VAs, pgdp = 0000000002498338
    [2bff808aa4856d48] pgd=00000097fcfd0003, pud=00000097fcfd0003, pmd=00000097fca30003, pte=00e8008b24850712
    Internal error: Oops: 96000007 [#1] SMP
    CPU: 3 PID: 79210 Comm: read_all Tainted: G L 5.0.0-rc7+ #84
    Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.0.6 07/10/2018
    pstate: 00400089 (nzcv daIf +PAN -UAO)
    pc : get_map+0x78/0xec
    lr : get_map+0xa0/0xec
    sp : aeff808989e3f8e0
    x29: aeff808989e3f940 x28: ffff800826200000
    x27: ffff100012d47000 x26: 9700000000002500
    x25: 0000000000000001 x24: 52ff8008200131f8
    x23: 52ff8008200130a0 x22: 52ff800820013098
    x21: ffff800826200000 x20: ffff100013172ba0
    x19: 2bff808a8971bc00 x18: ffff1000148f5538
    x17: 000000000000001b x16: 00000000000000ff
    x15: ffff1000148f5000 x14: 00000000000000d2
    x13: 0000000000000001 x12: 0000000000000000
    x11: 0000000020000002 x10: 2bff808aa4856d48
    x9 : 0000020000000000 x8 : 68ff80082620ebb0
    x7 : 0000000000000000 x6 : ffff1000105da1dc
    x5 : 0000000000000000 x4 : 0000000000000000
    x3 : 0000000000000010 x2 : 2bff808a8971bc00
    x1 : ffff7fe002098800 x0 : ffff80082620ceb0
    Process read_all (pid: 79210, stack limit = 0x00000000f65b9361)
    Call trace:
    get_map+0x78/0xec
    process_slab+0x7c/0x47c
    list_locations+0xb0/0x3c8
    alloc_calls_show+0x34/0x40
    slab_attr_show+0x34/0x48
    sysfs_kf_seq_show+0x2e4/0x570
    kernfs_seq_show+0x12c/0x1a0
    seq_read+0x48c/0xf84
    kernfs_fop_read+0xd4/0x448
    __vfs_read+0x94/0x5d4
    vfs_read+0xcc/0x194
    ksys_read+0x6c/0xe8
    __arm64_sys_read+0x68/0xb0
    el0_svc_handler+0x230/0x3bc
    el0_svc+0x8/0xc
    Code: d3467d2a 9ac92329 8b0a0e6a f9800151 (c85f7d4b)
    ---[ end trace a383a9a44ff13176 ]---
    Kernel panic - not syncing: Fatal exception
    SMP: stopping secondary CPUs
    SMP: failed to stop secondary CPUs 1-7,32,40,127
    Kernel Offset: disabled
    CPU features: 0x002,20000c18
    Memory Limit: none
    ---[ end Kernel panic - not syncing: Fatal exception ]---
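
    A minimal sketch of the untagging idea (the helper below is illustrative,
    not the actual slab_index() from mm/slub.c):

        #include <linux/kasan.h>

        /*
         * "p" may carry a KASAN software tag in its top byte while "addr"
         * from page_address() does not, so strip the tag before the
         * pointer arithmetic that derives the object index.
         */
        static unsigned int object_index(void *p, void *addr, unsigned int size)
        {
                return ((char *)kasan_reset_tag(p) - (char *)addr) / size;
        }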

    Link: http://lkml.kernel.org/r/20190220020251.82039-1-cai@lca.pw
    Signed-off-by: Qian Cai
    Reviewed-by: Andrey Konovalov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     
Enabling SLUB_DEBUG's SLAB_CONSISTENCY_CHECKS together with KASAN_SW_TAGS
    triggers endless false positives during boot (log below), because
    check_valid_pointer() checks tagged pointers, whose addresses are never
    valid within slab pages (a minimal sketch of the fix idea follows the
    log):

    BUG radix_tree_node (Tainted: G B ): Freelist Pointer check fails
    -----------------------------------------------------------------------------

    INFO: Slab objects=69 used=69 fp=0x (null) flags=0x7ffffffc000200
    INFO: Object @offset=15060037153926966016 fp=0x

    Redzone: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 18 6b 06 00 08 80 ff d0 .........k......
    Object : 18 6b 06 00 08 80 ff d0 00 00 00 00 00 00 00 00 .k..............
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Redzone: bb bb bb bb bb bb bb bb ........
    Padding: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
    CPU: 0 PID: 0 Comm: swapper/0 Tainted: G B 5.0.0-rc5+ #18
    Call trace:
    dump_backtrace+0x0/0x450
    show_stack+0x20/0x2c
    __dump_stack+0x20/0x28
    dump_stack+0xa0/0xfc
    print_trailer+0x1bc/0x1d0
    object_err+0x40/0x50
    alloc_debug_processing+0xf0/0x19c
    ___slab_alloc+0x554/0x704
    kmem_cache_alloc+0x2f8/0x440
    radix_tree_node_alloc+0x90/0x2fc
    idr_get_free+0x1e8/0x6d0
    idr_alloc_u32+0x11c/0x2a4
    idr_alloc+0x74/0xe0
    worker_pool_assign_id+0x5c/0xbc
    workqueue_init_early+0x49c/0xd50
    start_kernel+0x52c/0xac4
    FIX radix_tree_node: Marking all objects used
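
    A minimal sketch of the fix idea, assuming the check reduces to a range
    and alignment test (the helper below is illustrative, not the actual
    check_valid_pointer()):

        #include <linux/kasan.h>
        #include <linux/types.h>

        /*
         * Reset the KASAN tag before comparing the object pointer against
         * the slab's address range, otherwise every software-tagged pointer
         * fails the bounds check and is reported as an invalid freelist
         * pointer.
         */
        static bool pointer_within_slab(void *base, unsigned int nr_objects,
                                        unsigned int obj_size, void *object)
        {
                char *p = kasan_reset_tag(object);
                char *start = base;

                return p >= start &&
                       p < start + (size_t)nr_objects * obj_size &&
                       (p - start) % obj_size == 0;
        }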

    Link: http://lkml.kernel.org/r/20190209044128.3290-1-cai@lca.pw
    Signed-off-by: Qian Cai
    Reviewed-by: Andrey Konovalov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • When CONFIG_KASAN_SW_TAGS is enabled, ptr_addr might be tagged. Normally,
    this doesn't cause any issues, as both set_freepointer() and
    get_freepointer() are called with a pointer with the same tag. However,
    there are some issues with the CONFIG_SLUB_DEBUG code. For example, when
    __free_slab() iterates over the objects in a cache, it passes untagged
    pointers to check_object(). check_object() in turn calls
    get_freepointer() with an untagged pointer, which causes the freepointer
    to be restored incorrectly.

    Add kasan_reset_tag to freelist_ptr(). Also add a detailed comment.
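
    A sketch of what freelist_ptr() ends up looking like with the tag reset
    folded into the XOR obfuscation (reconstructed from the description
    above, not quoted from the patch):

        #include <linux/kasan.h>
        #include <linux/slab.h>

        #ifdef CONFIG_SLAB_FREELIST_HARDENED
        /*
         * Use the untagged storage address in the XOR so that encoding and
         * decoding agree even when one caller passes a tagged pointer and
         * another an untagged one.
         */
        static inline void *freelist_ptr(const struct kmem_cache *s, void *ptr,
                                         unsigned long ptr_addr)
        {
                return (void *)((unsigned long)ptr ^ s->random ^
                                (unsigned long)kasan_reset_tag((void *)ptr_addr));
        }
        #endif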

    Link: http://lkml.kernel.org/r/bf858f26ef32eb7bd24c665755b3aee4bc58d0e4.1550103861.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Reported-by: Qian Cai
    Tested-by: Qian Cai
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov