23 Jan, 2020

1 commit

  • commit 8e57f8acbbd121ecfb0c9dc13b8b030f86c6bd3b upstream.

    Commit 96a2b03f281d ("mm, debug_pagelloc: use static keys to enable
    debugging") has introduced a static key to reduce overhead when
    debug_pagealloc is compiled in but not enabled. It relied on the
    assumption that jump_label_init() is called before parse_early_param()
    as in start_kernel(), so when the "debug_pagealloc=on" option is parsed,
    it is safe to enable the static key.

    However, it turns out multiple architectures call parse_early_param()
    earlier from their setup_arch(). x86 also calls jump_label_init() even
    earlier, so no issue was found while testing the commit, but the same
    is not true for e.g. ppc64 and s390, where the kernel would not boot
    with debug_pagealloc=on, as found by our QA.

    To fix this without tricky changes to init code of multiple
    architectures, this patch partially reverts the static key conversion
    from 96a2b03f281d. Init-time and non-fastpath calls (such as in arch
    code) of debug_pagealloc_enabled() will again test a simple bool
    variable. Fastpath mm code is converted to a new
    debug_pagealloc_enabled_static() variant that relies on the static key,
    which is enabled in a well-defined point in mm_init() where it's
    guaranteed that jump_label_init() has been called, regardless of
    architecture.
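
    A minimal sketch of the split described above (declarations simplified;
    details assumed from the commit text):

    extern bool _debug_pagealloc_enabled_early;
    DECLARE_STATIC_KEY_FALSE(_debug_pagealloc_enabled);

    /* Init-time / arch code: a plain bool, safe before jump_label_init(). */
    static inline bool debug_pagealloc_enabled(void)
    {
            return IS_ENABLED(CONFIG_DEBUG_PAGEALLOC) &&
                   _debug_pagealloc_enabled_early;
    }

    /* Fastpath mm code: the static key, enabled from mm_init(). */
    static inline bool debug_pagealloc_enabled_static(void)
    {
            if (!IS_ENABLED(CONFIG_DEBUG_PAGEALLOC))
                    return false;
            return static_branch_unlikely(&_debug_pagealloc_enabled);
    }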

    [sfr@canb.auug.org.au: export _debug_pagealloc_enabled_early]
    Link: http://lkml.kernel.org/r/20200106164944.063ac07b@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20191219130612.23171-1-vbabka@suse.cz
    Fixes: 96a2b03f281d ("mm, debug_pagelloc: use static keys to enable debugging")
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Stephen Rothwell
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     

16 Nov, 2019

1 commit

  • Commit 1b7e816fc80e ("mm: slub: Fix slab walking for init_on_free")
    fixed one problem with the slab walking but missed a key detail: When
    walking the list, the head and tail pointers need to be updated since we
    end up reversing the list as a result. Without doing this, bulk free is
    broken.

    One way this is exposed is a NULL pointer with slub_debug=F:

    =============================================================================
    BUG skbuff_head_cache (Tainted: G T): Object already free
    -----------------------------------------------------------------------------

    INFO: Slab 0x000000000d2d2f8f objects=16 used=3 fp=0x0000000064309071 flags=0x3fff00000000201
    BUG: kernel NULL pointer dereference, address: 0000000000000000
    Oops: 0000 [#1] PREEMPT SMP PTI
    RIP: 0010:print_trailer+0x70/0x1d5
    Call Trace:

    free_debug_processing.cold.37+0xc9/0x149
    __slab_free+0x22a/0x3d0
    kmem_cache_free_bulk+0x415/0x420
    __kfree_skb_flush+0x30/0x40
    net_rx_action+0x2dd/0x480
    __do_softirq+0xf0/0x246
    irq_exit+0x93/0xb0
    do_IRQ+0xa0/0x110
    common_interrupt+0xf/0xf

    Given we're now almost identical to the existing debugging code which
    correctly walks the list, combine with that.
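
    A hypothetical helper (not the upstream code) illustrating the detail
    above: re-linking objects one by one while wiping them reverses the
    detached freelist, so the caller's head and tail must be refreshed or
    the subsequent bulk free follows stale links.

    #include <linux/string.h>   /* memset() */

    static void *wipe_detached_freelist(void *head, void **tail,
                                        size_t obj_size, size_t fp_offset)
    {
            void *object = head, *new_head = NULL;
            void *new_tail = head;                  /* old head ends up last */

            while (object) {
                    /* open-coded get_freepointer() */
                    void *next = *(void **)((char *)object + fp_offset);

                    memset(object, 0, obj_size);    /* init_on_free wipe */
                    /* open-coded set_freepointer(): push onto the new list */
                    *(void **)((char *)object + fp_offset) = new_head;
                    new_head = object;              /* order is now reversed */
                    object = next;
            }
            *tail = new_tail;   /* refresh the caller's tail... */
            return new_head;    /* ...and hand back the new head */
    }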

    Link: https://lkml.kernel.org/r/20191104170303.GA50361@gandi.net
    Link: http://lkml.kernel.org/r/20191106222208.26815-1-labbott@redhat.com
    Fixes: 1b7e816fc80e ("mm: slub: Fix slab walking for init_on_free")
    Signed-off-by: Laura Abbott
    Reported-by: Thibaut Sautereau
    Acked-by: David Rientjes
    Tested-by: Alexander Potapenko
    Acked-by: Alexander Potapenko
    Cc: Kees Cook
    Cc: "David S. Miller"
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laura Abbott
     

15 Oct, 2019

2 commits

  • slab_alloc_node() already zeroed out the freelist pointer if
    init_on_free was on. Thibaut Sautereau noticed that the same needs to
    be done for kmem_cache_alloc_bulk(), which performs the allocations
    separately.

    kmem_cache_alloc_bulk() is currently used in two places in the kernel,
    so this change is unlikely to have a major performance impact.

    SLAB doesn't require a similar change, as auto-initialization makes the
    allocator store the freelist pointers off-slab.
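
    A sketch of the kind of helper this implies, modeled on the description
    above (simplified; assumes SLUB's freelist pointer lives at s->offset
    inside the object):

    static void maybe_wipe_obj_freeptr(struct kmem_cache *s, void *obj)
    {
            /* The object was zeroed on free, but the bulk allocation path
             * wrote a freelist pointer back into it; clear that slot again
             * before handing the object out. */
            if (unlikely(slab_want_init_on_free(s)) && obj)
                    memset((char *)obj + s->offset, 0, sizeof(void *));
    }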

    Link: http://lkml.kernel.org/r/20191007091605.30530-1-glider@google.com
    Fixes: 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options")
    Signed-off-by: Alexander Potapenko
    Reported-by: Thibaut Sautereau
    Reported-by: Kees Cook
    Cc: Christoph Lameter
    Cc: Laura Abbott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • A long time ago we fixed a similar deadlock in show_slab_objects() [1].
    However, apparently due to commits like 01fb58bcba63 ("slab: remove
    synchronous synchronize_sched() from memcg cache deactivation path")
    and 03afc0e25f7f ("slab: get_online_mems for
    kmem_cache_{create,destroy,shrink}"), this kind of deadlock is back:
    just reading files in /sys/kernel/slab will generate the lockdep splat
    below.

    Since the "mem_hotplug_lock" here is only to obtain a stable online node
    mask while racing with NUMA node hotplug, in the worst case, the results
    may me miscalculated while doing NUMA node hotplug, but they shall be
    corrected by later reads of the same files.

    WARNING: possible circular locking dependency detected
    ------------------------------------------------------
    cat/5224 is trying to acquire lock:
    ffff900012ac3120 (mem_hotplug_lock.rw_sem){++++}, at:
    show_slab_objects+0x94/0x3a8

    but task is already holding lock:
    b8ff009693eee398 (kn->count#45){++++}, at: kernfs_seq_start+0x44/0xf0

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #2 (kn->count#45){++++}:
    lock_acquire+0x31c/0x360
    __kernfs_remove+0x290/0x490
    kernfs_remove+0x30/0x44
    sysfs_remove_dir+0x70/0x88
    kobject_del+0x50/0xb0
    sysfs_slab_unlink+0x2c/0x38
    shutdown_cache+0xa0/0xf0
    kmemcg_cache_shutdown_fn+0x1c/0x34
    kmemcg_workfn+0x44/0x64
    process_one_work+0x4f4/0x950
    worker_thread+0x390/0x4bc
    kthread+0x1cc/0x1e8
    ret_from_fork+0x10/0x18

    -> #1 (slab_mutex){+.+.}:
    lock_acquire+0x31c/0x360
    __mutex_lock_common+0x16c/0xf78
    mutex_lock_nested+0x40/0x50
    memcg_create_kmem_cache+0x38/0x16c
    memcg_kmem_cache_create_func+0x3c/0x70
    process_one_work+0x4f4/0x950
    worker_thread+0x390/0x4bc
    kthread+0x1cc/0x1e8
    ret_from_fork+0x10/0x18

    -> #0 (mem_hotplug_lock.rw_sem){++++}:
    validate_chain+0xd10/0x2bcc
    __lock_acquire+0x7f4/0xb8c
    lock_acquire+0x31c/0x360
    get_online_mems+0x54/0x150
    show_slab_objects+0x94/0x3a8
    total_objects_show+0x28/0x34
    slab_attr_show+0x38/0x54
    sysfs_kf_seq_show+0x198/0x2d4
    kernfs_seq_show+0xa4/0xcc
    seq_read+0x30c/0x8a8
    kernfs_fop_read+0xa8/0x314
    __vfs_read+0x88/0x20c
    vfs_read+0xd8/0x10c
    ksys_read+0xb0/0x120
    __arm64_sys_read+0x54/0x88
    el0_svc_handler+0x170/0x240
    el0_svc+0x8/0xc

    other info that might help us debug this:

    Chain exists of:
    mem_hotplug_lock.rw_sem --> slab_mutex --> kn->count#45

    Possible unsafe locking scenario:

    CPU0                              CPU1
    ----                              ----
    lock(kn->count#45);
                                      lock(slab_mutex);
                                      lock(kn->count#45);
    lock(mem_hotplug_lock.rw_sem);

    *** DEADLOCK ***

    3 locks held by cat/5224:
    #0: 9eff00095b14b2a0 (&p->lock){+.+.}, at: seq_read+0x4c/0x8a8
    #1: 0eff008997041480 (&of->mutex){+.+.}, at: kernfs_seq_start+0x34/0xf0
    #2: b8ff009693eee398 (kn->count#45){++++}, at:
    kernfs_seq_start+0x44/0xf0

    stack backtrace:
    Call trace:
    dump_backtrace+0x0/0x248
    show_stack+0x20/0x2c
    dump_stack+0xd0/0x140
    print_circular_bug+0x368/0x380
    check_noncircular+0x248/0x250
    validate_chain+0xd10/0x2bcc
    __lock_acquire+0x7f4/0xb8c
    lock_acquire+0x31c/0x360
    get_online_mems+0x54/0x150
    show_slab_objects+0x94/0x3a8
    total_objects_show+0x28/0x34
    slab_attr_show+0x38/0x54
    sysfs_kf_seq_show+0x198/0x2d4
    kernfs_seq_show+0xa4/0xcc
    seq_read+0x30c/0x8a8
    kernfs_fop_read+0xa8/0x314
    __vfs_read+0x88/0x20c
    vfs_read+0xd8/0x10c
    ksys_read+0xb0/0x120
    __arm64_sys_read+0x54/0x88
    el0_svc_handler+0x170/0x240
    el0_svc+0x8/0xc

    I think it is important to mention that this doesn't expose
    show_slab_objects() to a use-after-free. There is only a single path
    that might really race here, and that is the slab hotplug notifier
    callback __kmem_cache_shrink() (via slab_mem_going_offline_callback()),
    but that path doesn't really destroy kmem_cache_node data structures.
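
    A sketch of the approach described above (simplified, not the exact
    upstream hunk): drop get_online_mems()/put_online_mems() and just walk
    the online nodes, accepting a transiently skewed count during hotplug.

    /* show_slab_objects(), simplified: no mem_hotplug_lock taken. */
    unsigned long total = 0;
    int node;

    for_each_online_node(node) {
            struct kmem_cache_node *n = get_node(s, node);

            if (n)
                    total += atomic_long_read(&n->total_objects);
    }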

    [1] http://lkml.iu.edu/hypermail/linux/kernel/1101.0/02850.html

    [akpm@linux-foundation.org: add comment explaining why we don't need mem_hotplug_lock]
    Link: http://lkml.kernel.org/r/1570192309-10132-1-git-send-email-cai@lca.pw
    Fixes: 01fb58bcba63 ("slab: remove synchronous synchronize_sched() from memcg cache deactivation path")
    Fixes: 03afc0e25f7f ("slab: get_online_mems for kmem_cache_{create,destroy,shrink}")
    Signed-off-by: Qian Cai
    Acked-by: Michal Hocko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Cc: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     

08 Oct, 2019

1 commit

  • Patch series "guarantee natural alignment for kmalloc()", v2.

    This patch (of 2):

    SLOB currently doesn't account its pages at all, so in /proc/meminfo the
    Slab field shows zero. Modifying a counter on page allocation and
    freeing should be acceptable even for the small system scenarios SLOB is
    intended for. Since reclaimable caches are not separated in SLOB,
    account everything as unreclaimable.

    SLUB currently doesn't account kmalloc() and kmalloc_node() allocations
    larger than an order-1 page, which are passed directly to the page
    allocator. As they also don't appear in /proc/slabinfo, it might look
    like a memory leak. For consistency, account them as well. (SLAB
    doesn't actually use the page allocator directly, so no change there.)

    Ideally SLOB and SLUB would be handled in separate patches, but due to
    the shared kmalloc_order() function and different kfree()
    implementations, it's easier to patch both at once to prevent
    inconsistencies.
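
    A sketch of the SLUB side of this, based on the description above (the
    vmstat counter name varies between kernel versions):

    void *kmalloc_order(size_t size, gfp_t flags, unsigned int order)
    {
            void *ret = NULL;
            struct page *page;

            flags |= __GFP_COMP;
            page = alloc_pages(flags, order);
            if (likely(page)) {
                    ret = page_address(page);
                    /* make large kmalloc pages show up in meminfo's Slab */
                    mod_node_page_state(page_pgdat(page),
                                        NR_SLAB_UNRECLAIMABLE, 1 << order);
            }
            ret = kasan_kmalloc_large(ret, size, flags);
            return ret;
    }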

    Link: http://lkml.kernel.org/r/20190826111627.7505-2-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Ming Lei
    Cc: Dave Chinner
    Cc: Matthew Wilcox
    Cc: "Darrick J . Wong"
    Cc: Christoph Hellwig
    Cc: James Bottomley
    Cc: Vlastimil Babka
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

25 Sep, 2019

3 commits

  • Patch series "Make working with compound pages easier", v2.

    These three patches add three helpers and convert the appropriate
    places to use them.

    This patch (of 3):

    It's unnecessarily hard to find out the size of a potentially huge page.
    Replace 'PAGE_SIZE << compound_order(page)' with page_size(page).
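
    The helper itself is a one-liner; a sketch matching the description:

    /* Return the number of bytes in this (possibly compound) page. */
    static inline unsigned long page_size(struct page *page)
    {
            return PAGE_SIZE << compound_order(page);
    }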

    Link: http://lkml.kernel.org/r/20190721104612.19120-2-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Reviewed-by: Ira Weiny
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
    tid_to_cpu() and tid_to_event() are only used in note_cmpxchg_failure()
    when SLUB_DEBUG_CMPXCHG=y, so with the default SLUB_DEBUG_CMPXCHG=n,
    Clang will complain about those unused functions.
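
    One way to silence such warnings (not necessarily the exact fix applied
    here) is to compile the helpers only when their sole user is built:

    #ifdef SLUB_DEBUG_CMPXCHG
    static inline unsigned int tid_to_cpu(unsigned long tid)
    {
            return tid % TID_STEP;
    }

    static inline unsigned long tid_to_event(unsigned long tid)
    {
            return tid / TID_STEP;
    }
    #endif /* SLUB_DEBUG_CMPXCHG */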

    Link: http://lkml.kernel.org/r/1568752232-5094-1-git-send-email-cai@lca.pw
    Signed-off-by: Qian Cai
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     
    Currently, a value of '1' is written to the /sys/kernel/slab/<cache>/shrink
    file to shrink the slab by flushing out all the per-cpu slabs and free
    slabs in partial lists. This can be useful to squeeze out a bit more
    memory under extreme conditions, as well as to make the active object
    counts in /proc/slabinfo more accurate.

    This usually applies only to the root caches, as the SLUB_MEMCG_SYSFS_ON
    option is usually not enabled and "slub_memcg_sysfs=1" not set. Even if
    memcg sysfs is turned on, it is too cumbersome and impractical to manage
    all those per-memcg sysfs files in a real production system.

    So there is no practical way to shrink memcg caches. Fix this by enabling
    a proper write to the shrink sysfs file of the root cache to scan all the
    available memcg caches and shrink them as well. For a non-root memcg
    cache (when SLUB_MEMCG_SYSFS_ON or slub_memcg_sysfs is on), only that
    cache will be shrunk when written.
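
    A hypothetical sketch of the resulting behaviour (helper name invented,
    locking omitted; not the upstream implementation): writing 1 to a root
    cache's shrink file now also shrinks its per-memcg children.

    static void shrink_cache_and_children(struct kmem_cache *root)
    {
            struct kmem_cache *c;

            kmem_cache_shrink(root);
            /* walk the root cache's memcg children and shrink each one */
            for_each_memcg_cache(c, root)
                    kmem_cache_shrink(c);
    }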

    On a 2-socket, 64-core, 256-thread arm64 system with 64k pages, after a
    parallel kernel build the amount of memory occupied by slabs before
    shrinking was:

    # grep task_struct /proc/slabinfo
    task_struct 53137 53192 4288 61 4 : tunables 0 0 0 : slabdata 872 872 0
    # grep "^S[lRU]" /proc/meminfo
    Slab: 3936832 kB
    SReclaimable: 399104 kB
    SUnreclaim: 3537728 kB

    After shrinking slabs (by echoing "1" to all shrink files):

    # grep "^S[lRU]" /proc/meminfo
    Slab: 1356288 kB
    SReclaimable: 263296 kB
    SUnreclaim: 1092992 kB
    # grep task_struct /proc/slabinfo
    task_struct 2764 6832 4288 61 4 : tunables 0 0 0 : slabdata 112 112 0

    Link: http://lkml.kernel.org/r/20190723151445.7385-1-longman@redhat.com
    Signed-off-by: Waiman Long
    Acked-by: Roman Gushchin
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shakeel Butt
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Waiman Long
     

01 Aug, 2019

1 commit

  • To properly clear the slab on free with slab_want_init_on_free, we walk
    the list of free objects using get_freepointer/set_freepointer.

    The value we get from get_freepointer may not be valid. This isn't an
    issue since an actual value will get written later but this means
    there's a chance of triggering a bug if we use this value with
    set_freepointer:

    kernel BUG at mm/slub.c:306!
    invalid opcode: 0000 [#1] PREEMPT PTI
    CPU: 0 PID: 0 Comm: swapper Not tainted 5.2.0-05754-g6471384a #4
    RIP: 0010:kfree+0x58a/0x5c0
    Code: 48 83 05 78 37 51 02 01 0f 0b 48 83 05 7e 37 51 02 01 48 83 05 7e 37 51 02 01 48 83 05 7e 37 51 02 01 48 83 05 d6 37 51 02 01 0b 48 83 05 d4 37 51 02 01 48 83 05 d4 37 51 02 01 48 83 05 d4
    RSP: 0000:ffffffff82603d90 EFLAGS: 00010002
    RAX: ffff8c3976c04320 RBX: ffff8c3976c04300 RCX: 0000000000000000
    RDX: ffff8c3976c04300 RSI: 0000000000000000 RDI: ffff8c3976c04320
    RBP: ffffffff82603db8 R08: 0000000000000000 R09: 0000000000000000
    R10: ffff8c3976c04320 R11: ffffffff8289e1e0 R12: ffffd52cc8db0100
    R13: ffff8c3976c01a00 R14: ffffffff810f10d4 R15: ffff8c3976c04300
    FS: 0000000000000000(0000) GS:ffffffff8266b000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffff8c397ffff000 CR3: 0000000125020000 CR4: 00000000000406b0
    Call Trace:
    apply_wqattrs_prepare+0x154/0x280
    apply_workqueue_attrs_locked+0x4e/0xe0
    apply_workqueue_attrs+0x36/0x60
    alloc_workqueue+0x25a/0x6d0
    workqueue_init_early+0x246/0x348
    start_kernel+0x3c7/0x7ec
    x86_64_start_reservations+0x40/0x49
    x86_64_start_kernel+0xda/0xe4
    secondary_startup_64+0xb6/0xc0
    Modules linked in:
    ---[ end trace f67eb9af4d8d492b ]---

    Fix this by ensuring the value we set with set_freepointer is either NULL
    or another value in the chain.

    Reported-by: kernel test robot
    Signed-off-by: Laura Abbott
    Fixes: 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options")
    Reviewed-by: Kees Cook
    Signed-off-by: Linus Torvalds

    Laura Abbott
     

13 Jul, 2019

7 commits

  • Patch series "add init_on_alloc/init_on_free boot options", v10.

    Provide init_on_alloc and init_on_free boot options.

    These are aimed at preventing possible information leaks and making the
    control-flow bugs that depend on uninitialized values more deterministic.

    Enabling either of the options guarantees that the memory returned by the
    page allocator and SL[AU]B is initialized with zeroes. SLOB allocator
    isn't supported at the moment, as its emulation of kmem caches complicates
    handling of SLAB_TYPESAFE_BY_RCU caches correctly.

    Enabling init_on_free also guarantees that pages and heap objects are
    initialized right after they're freed, so it won't be possible to access
    stale data by using a dangling pointer.

    As suggested by Michal Hocko, right now we don't let heap users disable
    initialization for certain allocations. There's not enough evidence
    that doing so can speed up real-life cases, and introducing ways to opt
    out may result in things going out of control.

    This patch (of 2):

    The new options are needed to prevent possible information leaks and make
    control-flow bugs that depend on uninitialized values more deterministic.

    This is expected to be on-by-default on Android and Chrome OS. And it
    gives the opportunity for anyone else to use it under distros too via the
    boot args. (The init_on_free feature is regularly requested by folks
    where memory forensics is included in their threat models.)

    init_on_alloc=1 makes the kernel initialize newly allocated pages and heap
    objects with zeroes. Initialization is done at allocation time at the
    places where checks for __GFP_ZERO are performed.

    init_on_free=1 makes the kernel initialize freed pages and heap objects
    with zeroes upon their deletion. This helps to ensure sensitive data
    doesn't leak via use-after-free accesses.

    Both init_on_alloc=1 and init_on_free=1 guarantee that the allocator
    returns zeroed memory. The two exceptions are slab caches with
    constructors and SLAB_TYPESAFE_BY_RCU flag. Those are never
    zero-initialized to preserve their semantics.

    Both init_on_alloc and init_on_free default to zero, but those defaults
    can be overridden with CONFIG_INIT_ON_ALLOC_DEFAULT_ON and
    CONFIG_INIT_ON_FREE_DEFAULT_ON.

    If either SLUB poisoning or page poisoning is enabled, those options take
    precedence over init_on_alloc and init_on_free: initialization is only
    applied to unpoisoned allocations.
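
    A minimal sketch of the helpers such options imply (static-key based,
    per the series description; the interaction with poisoning is omitted):

    DECLARE_STATIC_KEY_FALSE(init_on_alloc);
    DECLARE_STATIC_KEY_FALSE(init_on_free);

    static inline bool want_init_on_alloc(gfp_t flags)
    {
            if (static_branch_unlikely(&init_on_alloc))
                    return true;
            /* callers that asked for zeroed memory still get it */
            return flags & __GFP_ZERO;
    }

    static inline bool want_init_on_free(void)
    {
            return static_branch_unlikely(&init_on_free);
    }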

    Slowdown for the new features compared to init_on_free=0, init_on_alloc=0:

    hackbench, init_on_free=1: +7.62% sys time (st.err 0.74%)
    hackbench, init_on_alloc=1: +7.75% sys time (st.err 2.14%)

    Linux build with -j12, init_on_free=1: +8.38% wall time (st.err 0.39%)
    Linux build with -j12, init_on_free=1: +24.42% sys time (st.err 0.52%)
    Linux build with -j12, init_on_alloc=1: -0.13% wall time (st.err 0.42%)
    Linux build with -j12, init_on_alloc=1: +0.57% sys time (st.err 0.40%)

    The slowdown for init_on_free=0, init_on_alloc=0 compared to the baseline
    is within the standard error.

    The new features are also going to pave the way for hardware memory
    tagging (e.g. arm64's MTE), which will require both on_alloc and on_free
    hooks to set the tags for heap objects. With MTE, tagging will have the
    same cost as memory initialization.

    Although init_on_free is rather costly, there are paranoid use-cases where
    in-memory data lifetime is desired to be minimized. There are various
    arguments for/against the realism of the associated threat models, but
    given that we'll need the infrastructure for MTE anyway, and there are
    people who want wipe-on-free behavior no matter what the performance cost,
    it seems reasonable to include it in this series.

    [glider@google.com: v8]
    Link: http://lkml.kernel.org/r/20190626121943.131390-2-glider@google.com
    [glider@google.com: v9]
    Link: http://lkml.kernel.org/r/20190627130316.254309-2-glider@google.com
    [glider@google.com: v10]
    Link: http://lkml.kernel.org/r/20190628093131.199499-2-glider@google.com
    Link: http://lkml.kernel.org/r/20190617151050.92663-2-glider@google.com
    Signed-off-by: Alexander Potapenko
    Acked-by: Kees Cook
    Acked-by: Michal Hocko [page and dmapool parts]
    Acked-by: James Morris
    Cc: Christoph Lameter
    Cc: Masahiro Yamada
    Cc: "Serge E. Hallyn"
    Cc: Nick Desaulniers
    Cc: Kostya Serebryany
    Cc: Dmitry Vyukov
    Cc: Sandeep Patil
    Cc: Laura Abbott
    Cc: Randy Dunlap
    Cc: Jann Horn
    Cc: Mark Rutland
    Cc: Marco Elver
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • Currently the page accounting code is duplicated in SLAB and SLUB
    internals. Let's move it into new (un)charge_slab_page helpers in the
    slab_common.c file. These helpers will be responsible for statistics
    (global and memcg-aware) and memcg charging. So they are replacing direct
    memcg_(un)charge_slab() calls.
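
    A sketch of what such a helper consolidates (simplified; the exact
    statistics call is assumed):

    static __always_inline int charge_slab_page(struct page *page, gfp_t gfp,
                                                int order, struct kmem_cache *s)
    {
            int ret = memcg_charge_slab(page, gfp, order, s);

            if (!ret)   /* account the pages for /proc and vmstat */
                    mod_lruvec_page_state(page, cache_vmstat_idx(s),
                                          1 << order);
            return ret;
    }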

    Link: http://lkml.kernel.org/r/20190611231813.3148843-6-guro@fb.com
    Signed-off-by: Roman Gushchin
    Reviewed-by: Shakeel Butt
    Acked-by: Christoph Lameter
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Waiman Long
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: Andrei Vagin
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Currently SLUB uses a work scheduled after an RCU grace period to
    deactivate a non-root kmem_cache. This mechanism can be reused for
    kmem_caches release, but requires generalization for SLAB case.

    Introduce kmemcg_cache_deactivate() function, which calls
    allocator-specific __kmem_cache_deactivate() and schedules execution of
    __kmem_cache_deactivate_after_rcu() with all necessary locks in a worker
    context after an rcu grace period.

    Here is the new calling scheme:
    kmemcg_cache_deactivate()
      __kmemcg_cache_deactivate()                 SLAB/SLUB-specific
      kmemcg_rcufn()                              rcu
        kmemcg_workfn()                           work
          __kmemcg_cache_deactivate_after_rcu()   SLAB/SLUB-specific

    instead of:

    __kmemcg_cache_deactivate()                   SLAB/SLUB-specific
    slab_deactivate_memcg_cache_rcu_sched()       SLUB-only
      kmemcg_rcufn()                              rcu
        kmemcg_workfn()                           work
          kmemcg_cache_deact_after_rcu()          SLUB-only

    For consistency, all allocator-specific functions start with "__".

    Link: http://lkml.kernel.org/r/20190611231813.3148843-4-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Vladimir Davydov
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Waiman Long
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: Andrei Vagin
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Patch series "mm: reparent slab memory on cgroup removal", v7.

    # Why do we need this?

    We've noticed that the number of dying cgroups is steadily growing on most
    of our hosts in production. The following investigation revealed an issue
    in the userspace memory reclaim code [1], accounting of kernel stacks [2],
    and also the main reason: slab objects.

    The underlying problem is quite simple: any page charged to a cgroup holds
    a reference to it, so the cgroup can't be reclaimed unless all charged
    pages are gone. If a slab object is actively used by other cgroups, it
    won't be reclaimed, and will prevent the origin cgroup from being
    reclaimed.

    Slab objects, and first of all the vfs cache, are shared between cgroups
    that use the same underlying fs, and, what's even more important, they
    are shared between multiple generations of the same workload. So if
    something runs periodically in a new cgroup each time (like systemd
    does), we accumulate multiple dying cgroups.

    Strictly speaking pagecache isn't different here, but there is a key
    difference: we disable protection and apply some extra pressure on LRUs of
    dying cgroups, and these LRUs contain all charged pages. My experiments
    show that with the disabled kernel memory accounting the number of dying
    cgroups stabilizes at a relatively small number (~100, depends on memory
    pressure and cgroup creation rate), and with kernel memory accounting it
    grows pretty steadily up to several thousands.

    Memory cgroups are quite complex and big objects (mostly due to percpu
    stats), so it leads to noticeable memory losses. Memory occupied by dying
    cgroups is measured in hundreds of megabytes. I've even seen a host with
    more than 100Gb of memory wasted for dying cgroups. It leads to a
    degradation of performance with the uptime, and generally limits the usage
    of cgroups.

    My previous attempt [3] to fix the problem by applying extra pressure on
    slab shrinker lists caused regressions with xfs and ext4, and has been
    reverted [4]. The following attempts to find the right balance [5, 6]
    were not successful.

    So instead of trying to find a maybe non-existing balance, let's
    reparent accounted slab caches to the parent cgroup on cgroup removal.

    # Implementation approach

    There is however a significant problem with reparenting of slab memory:
    there is no list of charged pages. Some of them are in shrinker lists,
    but not all. Introducing a new list is really not an option.

    But fortunately there is a way forward: every slab page has a stable
    pointer to the corresponding kmem_cache. So the idea is to reparent
    kmem_caches instead of slab pages.

    It's actually simpler and cheaper, but requires some underlying changes:
    1) Make kmem_caches hold a single reference to the memory cgroup,
    instead of a separate reference per every slab page.
    2) Stop setting page->mem_cgroup pointer for memcg slab pages and use
    page->kmem_cache->memcg indirection instead. It's used only on
    slab page release, so performance overhead shouldn't be a big issue.
    3) Introduce a refcounter for non-root slab caches. It's required to
    be able to destroy kmem_caches when they become empty and release
    the associated memory cgroup.

    There is a bonus: currently we release all memcg kmem_caches all together
    with the memory cgroup itself. This patchset allows individual
    kmem_caches to be released as soon as they become inactive and free.

    Some additional implementation details are provided in corresponding
    commit messages.

    # Results

    Below is the average number of dying cgroups on two groups of our
    production hosts. They run some sort of web frontend workload, and the
    memory pressure is moderate. As we can see, with kernel memory
    reparenting the number stabilizes in the 60s range; however, with the
    original version it grows almost linearly and doesn't show any signs of
    plateauing.
    The difference in slab and percpu usage between patched and unpatched
    versions also grows linearly. In 7 days it exceeded 200Mb.

    day              0    1    2    3    4    5    6    7
    original        56  362  628  752 1070 1250 1490 1560
    patched         23   46   51   55   60   57   67   69
    mem diff(Mb)    22   74  123  152  164  182  214  241

    # Links

    [1]: commit 68600f623d69 ("mm: don't miss the last page because of round-off error")
    [2]: commit 9b6f7e163cd0 ("mm: rework memcg kernel stack accounting")
    [3]: commit 172b06c32b94 ("mm: slowly shrink slabs with a relatively small number of objects")
    [4]: commit a9a238e83fbb ("Revert "mm: slowly shrink slabs with a relatively small number of objects")
    [5]: https://lkml.org/lkml/2019/1/28/1865
    [6]: https://marc.info/?l=linux-mm&m=155064763626437&w=2

    This patch (of 10):

    Initialize kmem_cache->memcg_params.memcg pointer in memcg_link_cache()
    rather than in init_memcg_params().

    Once kmem_cache holds a reference to the memory cgroup, this will
    simplify the refcounting.

    For non-root kmem_caches memcg_link_cache() is always called before the
    kmem_cache becomes visible to a user, so it's safe.

    Link: http://lkml.kernel.org/r/20190611231813.3148843-2-guro@fb.com
    Signed-off-by: Roman Gushchin
    Reviewed-by: Shakeel Butt
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Waiman Long
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrei Vagin
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • This refactors common code of ksize() between the various allocators into
    slab_common.c: __ksize() is the allocator-specific implementation without
    instrumentation, whereas ksize() includes the required KASAN logic.
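
    A sketch of the resulting common ksize() (simplified; the KASAN call is
    the one from that era's API):

    size_t ksize(const void *objp)
    {
            size_t size;

            if (WARN_ON_ONCE(!objp))
                    return 0;

            size = __ksize(objp);          /* allocator-specific part */
            /* callers may legitimately use the whole allocated area */
            kasan_unpoison_shadow(objp, size);
            return size;
    }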

    Link: http://lkml.kernel.org/r/20190626142014.141844-5-elver@google.com
    Signed-off-by: Marco Elver
    Acked-by: Christoph Lameter
    Reviewed-by: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Alexander Potapenko
    Cc: Andrey Konovalov
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Mark Rutland
    Cc: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marco Elver
     
    Currently for CONFIG_SLUB, if a memcg kmem cache creation fails and the
    corresponding root kmem cache has the SLAB_PANIC flag, the kernel will
    crash. This is unnecessary, as the kernel can handle the creation
    failures of memcg kmem caches. Additionally, CONFIG_SLAB does not
    implement this behavior. So, to keep the behavior consistent between
    SLAB and SLUB, remove the panic for memcg kmem cache creation failures.
    A root kmem cache creation failure with SLAB_PANIC still correctly
    panics for both SLAB and SLUB.

    Link: http://lkml.kernel.org/r/20190619232514.58994-1-shakeelb@google.com
    Reported-by: Dave Hansen
    Signed-off-by: Shakeel Butt
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Alexey Dobriyan
    Cc: Johannes Weiner
    Cc: Christoph Lameter
    Cc: Roman Gushchin
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     
    If ',' is not found, kmem_cache_flags() calls strlen() to find the end
    of the line. We can do it in a single pass using strchrnul().
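
    A small illustration of the single-pass form:

    /* strchrnul() returns a pointer to the first ',' or, if there is none,
     * to the terminating NUL, so one call replaces strchr() + strlen(). */
    char *end = strchrnul(str, ',');
    size_t block_len = end - str;   /* length of this option block */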

    Link: http://lkml.kernel.org/r/20190501053111.7950-1-ynorov@marvell.com
    Signed-off-by: Yury Norov
    Acked-by: Aaron Tomlin
    Reviewed-by: Andrew Morton
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yury Norov
     

15 May, 2019

4 commits

    Now a frozen slab can only be on the per-cpu partial list.

    Link: http://lkml.kernel.org/r/1554022325-11305-1-git-send-email-liu.xiang6@zte.com.cn
    Signed-off-by: Liu Xiang
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Liu Xiang
     
    When CONFIG_SLUB_DEBUG is not enabled, remove_full() is empty. When
    CONFIG_SLUB_DEBUG is enabled, remove_full() can check s->flags by
    itself, so the kmem_cache_debug() check is redundant and can be removed.

    Link: http://lkml.kernel.org/r/1552577313-2830-1-git-send-email-liu.xiang6@zte.com.cn
    Signed-off-by: Liu Xiang
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Liu Xiang
     
  • Currently we use the page->lru list for maintaining lists of slabs. We
    have a list in the page structure (slab_list) that can be used for this
    purpose. Doing so makes the code cleaner since we are not overloading the
    lru list.

    Use the slab_list instead of the lru list for maintaining lists of slabs.
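
    A one-line illustration of the conversion:

    /* use the dedicated field instead of overloading page->lru */
    list_add(&page->slab_list, &n->partial);    /* was: &page->lru */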

    Link: http://lkml.kernel.org/r/20190402230545.2929-6-tobin@kernel.org
    Signed-off-by: Tobin C. Harding
    Acked-by: Christoph Lameter
    Reviewed-by: Roman Gushchin
    Acked-by: Vlastimil Babka
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Matthew Wilcox
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tobin C. Harding
     
  • SLUB allocator makes heavy use of ifdef/endif pre-processor macros. The
    pairing of these statements is at times hard to follow e.g. if the pair
    are further than a screen apart or if there are nested pairs. We can
    reduce cognitive load by adding a comment to the endif statement of form

    #ifdef CONFIG_FOO
    ...
    #endif /* CONFIG_FOO */

    Add comments to endif pre-processor macros if ifdef/endif pair is not
    immediately apparent.

    Link: http://lkml.kernel.org/r/20190402230545.2929-5-tobin@kernel.org
    Signed-off-by: Tobin C. Harding
    Acked-by: Christoph Lameter
    Reviewed-by: Roman Gushchin
    Acked-by: Vlastimil Babka
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Matthew Wilcox
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tobin C. Harding
     

29 Apr, 2019

1 commit

  • Replace the indirection through struct stack_trace with an invocation of
    the storage array based interface.
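
    A sketch of what the conversion looks like in SLUB's set_track()
    (simplified; the skip count is assumed):

    unsigned int nr_entries;

    /* store the trace directly into the track's address array instead of
     * filling a struct stack_trace and calling save_stack_trace() */
    nr_entries = stack_trace_save(p->addrs, TRACK_ADDRS_COUNT, 3);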

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Josh Poimboeuf
    Acked-by: Christoph Lameter
    Cc: Andy Lutomirski
    Cc: Andrew Morton
    Cc: Pekka Enberg
    Cc: linux-mm@kvack.org
    Cc: David Rientjes
    Cc: Steven Rostedt
    Cc: Alexander Potapenko
    Cc: Alexey Dobriyan
    Cc: Catalin Marinas
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: kasan-dev@googlegroups.com
    Cc: Mike Rapoport
    Cc: Akinobu Mita
    Cc: Christoph Hellwig
    Cc: iommu@lists.linux-foundation.org
    Cc: Robin Murphy
    Cc: Marek Szyprowski
    Cc: Johannes Thumshirn
    Cc: David Sterba
    Cc: Chris Mason
    Cc: Josef Bacik
    Cc: linux-btrfs@vger.kernel.org
    Cc: dm-devel@redhat.com
    Cc: Mike Snitzer
    Cc: Alasdair Kergon
    Cc: Daniel Vetter
    Cc: intel-gfx@lists.freedesktop.org
    Cc: Joonas Lahtinen
    Cc: Maarten Lankhorst
    Cc: dri-devel@lists.freedesktop.org
    Cc: David Airlie
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Tom Zanussi
    Cc: Miroslav Benes
    Cc: linux-arch@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190425094801.771410441@linutronix.de

    Thomas Gleixner
     

15 Apr, 2019

1 commit

  • No architecture terminates the stack trace with ULONG_MAX anymore. Remove
    the cruft.

    While at it, remove the pointless loop that clears the stack array
    completely. It's sufficient to clear the last entry, as the consumers
    break out on the first zeroed entry anyway.
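
    A sketch of the simplification in SLUB's set_track() (simplified;
    nr_entries is the count returned by the stack trace save):

    /* no more ULONG_MAX terminator and no loop wiping the whole array;
     * just zero the entry after the last saved one */
    if (nr_entries < TRACK_ADDRS_COUNT)
            p->addrs[nr_entries] = 0;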

    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra (Intel)
    Cc: Josh Poimboeuf
    Cc: Andy Lutomirski
    Cc: Steven Rostedt
    Cc: Alexander Potapenko
    Cc: Andrew Morton
    Cc: Pekka Enberg
    Cc: linux-mm@kvack.org
    Cc: David Rientjes
    Cc: Christoph Lameter
    Link: https://lkml.kernel.org/r/20190410103644.574058244@linutronix.de

    Thomas Gleixner
     

30 Mar, 2019

1 commit

  • Patch series "iommu/io-pgtable-arm-v7s: Use DMA32 zone for page tables",
    v6.

    This is a followup to the discussion in [1], [2].

    IOMMUs using ARMv7 short-descriptor format require page tables (level 1
    and 2) to be allocated within the first 4GB of RAM, even on 64-bit
    systems.

    For L1 tables that are bigger than a page, we can just use
    __get_free_pages with GFP_DMA32 (on arm64 systems only, arm would still
    use GFP_DMA).

    For L2 tables that only take 1KB, it would be a waste to allocate a full
    page, so we considered 3 approaches:
    1. This series, adding support for GFP_DMA32 slab caches.
    2. genalloc, which requires pre-allocating the maximum number of L2 page
    tables (4096, so 4MB of memory).
    3. page_frag, which is not very memory-efficient as it is unable to reuse
    freed fragments until the whole page is freed. [3]

    This series is the most memory-efficient approach.

    stable@ note:
    We confirmed that this is a regression, and IOMMU errors happen on 4.19
    and linux-next/master on MT8173 (elm, Acer Chromebook R13). The issue
    most likely starts from commit ad67f5a6545f ("arm64: replace ZONE_DMA
    with ZONE_DMA32"), i.e. 4.15, and presumably breaks a number of Mediatek
    platforms (and maybe others?).

    [1] https://lists.linuxfoundation.org/pipermail/iommu/2018-November/030876.html
    [2] https://lists.linuxfoundation.org/pipermail/iommu/2018-December/031696.html
    [3] https://patchwork.codeaurora.org/patch/671639/

    This patch (of 3):

    IOMMUs using ARMv7 short-descriptor format require page tables to be
    allocated within the first 4GB of RAM, even on 64-bit systems. On arm64,
    this is done by passing GFP_DMA32 flag to memory allocation functions.

    For IOMMU L2 tables that only take 1KB, it would be a waste to allocate
    a full page using get_free_pages, so we considered 3 approaches:
    1. This patch, adding support for GFP_DMA32 slab caches.
    2. genalloc, which requires pre-allocating the maximum number of L2
    page tables (4096, so 4MB of memory).
    3. page_frag, which is not very memory-efficient as it is unable
    to reuse freed fragments until the whole page is freed.

    This change makes it possible to create a custom cache in DMA32 zone using
    kmem_cache_create, then allocate memory using kmem_cache_alloc.

    We do not create a DMA32 kmalloc cache array, as there are currently no
    users of kmalloc(..., GFP_DMA32). These calls will continue to trigger a
    warning, as we keep GFP_DMA32 in GFP_SLAB_BUG_MASK.

    This implies that calls to kmem_cache_*alloc on a SLAB_CACHE_DMA32
    kmem_cache must _not_ use GFP_DMA32 (it is anyway redundant and
    unnecessary).
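
    A usage sketch of what this enables (cache name and sizes are
    illustrative, not taken from the driver):

    struct kmem_cache *l2_tables;
    void *table;

    /* slabs for this cache come from ZONE_DMA32... */
    l2_tables = kmem_cache_create("io-pgtable-l2", 1024, 1024,
                                  SLAB_CACHE_DMA32, NULL);
    /* ...while the allocation itself uses a normal GFP mask; passing
     * GFP_DMA32 here would still trigger the GFP_SLAB_BUG_MASK warning */
    table = kmem_cache_alloc(l2_tables, GFP_KERNEL);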

    Link: http://lkml.kernel.org/r/20181210011504.122604-2-drinkcat@chromium.org
    Signed-off-by: Nicolas Boichat
    Acked-by: Vlastimil Babka
    Acked-by: Will Deacon
    Cc: Robin Murphy
    Cc: Joerg Roedel
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Sasha Levin
    Cc: Huaisheng Ye
    Cc: Mike Rapoport
    Cc: Yong Wu
    Cc: Matthias Brugger
    Cc: Tomasz Figa
    Cc: Yingjoe Chen
    Cc: Christoph Hellwig
    Cc: Matthew Wilcox
    Cc: Hsin-Yi Wang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicolas Boichat
     

06 Mar, 2019

5 commits

  • Number of NUMA nodes can't be negative.

    This saves a few bytes on x86_64:

    add/remove: 0/0 grow/shrink: 4/21 up/down: 27/-265 (-238)
    Function                       old    new   delta
    hv_synic_alloc.cold             88    110     +22
    prealloc_shrinker              260    262      +2
    bootstrap                      249    251      +2
    sched_init_numa               1566   1567      +1
    show_slab_objects              778    777      -1
    s_show                        1201   1200      -1
    kmem_cache_init                346    345      -1
    __alloc_workqueue_key         1146   1145      -1
    mem_cgroup_css_alloc          1614   1612      -2
    __do_sys_swapon               4702   4699      -3
    __list_lru_init                655    651      -4
    nic_probe                     2379   2374      -5
    store_user_store               118    111      -7
    red_zone_store                 106     99      -7
    poison_store                   106     99      -7
    wq_numa_init                   348    338     -10
    __kmem_cache_empty              75     65     -10
    task_numa_free                 186    173     -13
    merge_across_nodes_store       351    336     -15
    irq_create_affinity_masks     1261   1246     -15
    do_numa_crng_init              343    321     -22
    task_numa_fault               4760   4737     -23
    swapfile_init                  179    156     -23
    hv_synic_alloc                 536    492     -44
    apply_wqattrs_prepare          746    695     -51

    Link: http://lkml.kernel.org/r/20190201223029.GA15820@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • No functional change.

    Link: http://lkml.kernel.org/r/20190118235123.27843-1-richard.weiyang@gmail.com
    Signed-off-by: Wei Yang
    Reviewed-by: Pekka Enberg
    Acked-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • There are two cases when put_cpu_partial() is invoked.

    * __slab_free
    * get_partial_node

    This patch just makes it cover these two cases.

    Link: http://lkml.kernel.org/r/20181025094437.18951-3-richard.weiyang@gmail.com
    Signed-off-by: Wei Yang
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • "addr" function argument is not used in alloc_consistency_checks() at
    all, so remove it.

    Link: http://lkml.kernel.org/r/20190211123214.35592-1-cai@lca.pw
    Fixes: becfda68abca ("slub: convert SLAB_DEBUG_FREE to SLAB_CONSISTENCY_CHECKS")
    Signed-off-by: Qian Cai
    Reviewed-by: Andrew Morton
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • new_slab_objects() will return immediately if freelist is not NULL.

    if (freelist)
    return freelist;

    One more assignment operation could be avoided.

    Link: http://lkml.kernel.org/r/20181229062512.30469-1-rocking@whu.edu.cn
    Signed-off-by: Peng Wang
    Reviewed-by: Pekka Enberg
    Reviewed-by: Andrew Morton
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peng Wang
     

22 Feb, 2019

7 commits

    In process_slab(), "p = get_freepointer()" could return a tagged
    pointer, but "addr = page_address()" always returns a native pointer. As
    a result, slab_index() is messed up here:

    return (p - addr) / s->size;

    All other callers of slab_index() are in the same situation, where
    "addr" comes from page_address(), so we just need to untag "p".

    # cat /sys/kernel/slab/hugetlbfs_inode_cache/alloc_calls

    Unable to handle kernel paging request at virtual address 2bff808aa4856d48
    Mem abort info:
    ESR = 0x96000007
    Exception class = DABT (current EL), IL = 32 bits
    SET = 0, FnV = 0
    EA = 0, S1PTW = 0
    Data abort info:
    ISV = 0, ISS = 0x00000007
    CM = 0, WnR = 0
    swapper pgtable: 64k pages, 48-bit VAs, pgdp = 0000000002498338
    [2bff808aa4856d48] pgd=00000097fcfd0003, pud=00000097fcfd0003, pmd=00000097fca30003, pte=00e8008b24850712
    Internal error: Oops: 96000007 [#1] SMP
    CPU: 3 PID: 79210 Comm: read_all Tainted: G L 5.0.0-rc7+ #84
    Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.0.6 07/10/2018
    pstate: 00400089 (nzcv daIf +PAN -UAO)
    pc : get_map+0x78/0xec
    lr : get_map+0xa0/0xec
    sp : aeff808989e3f8e0
    x29: aeff808989e3f940 x28: ffff800826200000
    x27: ffff100012d47000 x26: 9700000000002500
    x25: 0000000000000001 x24: 52ff8008200131f8
    x23: 52ff8008200130a0 x22: 52ff800820013098
    x21: ffff800826200000 x20: ffff100013172ba0
    x19: 2bff808a8971bc00 x18: ffff1000148f5538
    x17: 000000000000001b x16: 00000000000000ff
    x15: ffff1000148f5000 x14: 00000000000000d2
    x13: 0000000000000001 x12: 0000000000000000
    x11: 0000000020000002 x10: 2bff808aa4856d48
    x9 : 0000020000000000 x8 : 68ff80082620ebb0
    x7 : 0000000000000000 x6 : ffff1000105da1dc
    x5 : 0000000000000000 x4 : 0000000000000000
    x3 : 0000000000000010 x2 : 2bff808a8971bc00
    x1 : ffff7fe002098800 x0 : ffff80082620ceb0
    Process read_all (pid: 79210, stack limit = 0x00000000f65b9361)
    Call trace:
    get_map+0x78/0xec
    process_slab+0x7c/0x47c
    list_locations+0xb0/0x3c8
    alloc_calls_show+0x34/0x40
    slab_attr_show+0x34/0x48
    sysfs_kf_seq_show+0x2e4/0x570
    kernfs_seq_show+0x12c/0x1a0
    seq_read+0x48c/0xf84
    kernfs_fop_read+0xd4/0x448
    __vfs_read+0x94/0x5d4
    vfs_read+0xcc/0x194
    ksys_read+0x6c/0xe8
    __arm64_sys_read+0x68/0xb0
    el0_svc_handler+0x230/0x3bc
    el0_svc+0x8/0xc
    Code: d3467d2a 9ac92329 8b0a0e6a f9800151 (c85f7d4b)
    ---[ end trace a383a9a44ff13176 ]---
    Kernel panic - not syncing: Fatal exception
    SMP: stopping secondary CPUs
    SMP: failed to stop secondary CPUs 1-7,32,40,127
    Kernel Offset: disabled
    CPU features: 0x002,20000c18
    Memory Limit: none
    ---[ end Kernel panic - not syncing: Fatal exception ]---
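
    A sketch of the fix as described above (simplified): strip the KASAN tag
    from "p" before the arithmetic against the untagged page address.

    static inline unsigned int slab_index(void *p, struct kmem_cache *s,
                                          void *addr)
    {
            return (kasan_reset_tag(p) - addr) / s->size;
    }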

    Link: http://lkml.kernel.org/r/20190220020251.82039-1-cai@lca.pw
    Signed-off-by: Qian Cai
    Reviewed-by: Andrey Konovalov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     
    Enabling SLUB_DEBUG's SLAB_CONSISTENCY_CHECKS with KASAN_SW_TAGS
    triggers endless false positives during boot (below), because
    check_valid_pointer() checks tagged pointers, which have no addresses
    that are valid within slab pages:

    BUG radix_tree_node (Tainted: G B ): Freelist Pointer check fails
    -----------------------------------------------------------------------------

    INFO: Slab objects=69 used=69 fp=0x (null) flags=0x7ffffffc000200
    INFO: Object @offset=15060037153926966016 fp=0x

    Redzone: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 18 6b 06 00 08 80 ff d0 .........k......
    Object : 18 6b 06 00 08 80 ff d0 00 00 00 00 00 00 00 00 .k..............
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Redzone: bb bb bb bb bb bb bb bb ........
    Padding: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
    CPU: 0 PID: 0 Comm: swapper/0 Tainted: G B 5.0.0-rc5+ #18
    Call trace:
    dump_backtrace+0x0/0x450
    show_stack+0x20/0x2c
    __dump_stack+0x20/0x28
    dump_stack+0xa0/0xfc
    print_trailer+0x1bc/0x1d0
    object_err+0x40/0x50
    alloc_debug_processing+0xf0/0x19c
    ___slab_alloc+0x554/0x704
    kmem_cache_alloc+0x2f8/0x440
    radix_tree_node_alloc+0x90/0x2fc
    idr_get_free+0x1e8/0x6d0
    idr_alloc_u32+0x11c/0x2a4
    idr_alloc+0x74/0xe0
    worker_pool_assign_id+0x5c/0xbc
    workqueue_init_early+0x49c/0xd50
    start_kernel+0x52c/0xac4
    FIX radix_tree_node: Marking all objects used

    Link: http://lkml.kernel.org/r/20190209044128.3290-1-cai@lca.pw
    Signed-off-by: Qian Cai
    Reviewed-by: Andrey Konovalov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • When CONFIG_KASAN_SW_TAGS is enabled, ptr_addr might be tagged. Normally,
    this doesn't cause any issues, as both set_freepointer() and
    get_freepointer() are called with a pointer with the same tag. However,
    there are some issues with CONFIG_SLUB_DEBUG code. For example, when
    __free_slub() iterates over objects in a cache, it passes untagged
    pointers to check_object(). check_object() in turns calls
    get_freepointer() with an untagged pointer, which causes the freepointer
    to be restored incorrectly.

    Add kasan_reset_tag to freelist_ptr(). Also add a detailed comment.
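
    A sketch of freelist_ptr() with the change described above (simplified):

    static inline void *freelist_ptr(const struct kmem_cache *s, void *ptr,
                                     unsigned long ptr_addr)
    {
    #ifdef CONFIG_SLAB_FREELIST_HARDENED
            /* hash against the untagged storage address, so tagged and
             * untagged callers decode the same freelist pointer */
            return (void *)((unsigned long)ptr ^ s->random ^
                            (unsigned long)kasan_reset_tag((void *)ptr_addr));
    #else
            return ptr;
    #endif
    }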

    Link: http://lkml.kernel.org/r/bf858f26ef32eb7bd24c665755b3aee4bc58d0e4.1550103861.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Reported-by: Qian Cai
    Tested-by: Qian Cai
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     
  • CONFIG_SLAB_FREELIST_HARDENED hashes freelist pointer with the address of
    the object where the pointer gets stored. With tag based KASAN we don't
    account for that when building freelist, as we call set_freepointer() with
    the first argument untagged. This patch changes the code to properly
    propagate tags throughout the loop.

    Link: http://lkml.kernel.org/r/3df171559c52201376f246bf7ce3184fe21c1dc7.1549921721.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Reported-by: Qian Cai
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: Catalin Marinas
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vincenzo Frascino
    Cc: Kostya Serebryany
    Cc: Evgeniy Stepanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     
  • With tag based KASAN page_address() looks at the page flags to see whether
    the resulting pointer needs to have a tag set. Since we don't want to set
    a tag when page_address() is called on SLAB pages, we call
    page_kasan_tag_reset() in kasan_poison_slab(). However in allocate_slab()
    page_address() is called before kasan_poison_slab(). Fix it by changing
    the order.
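
    A sketch of the reordering in allocate_slab() (simplified):

    /* poison the slab (and reset the page's KASAN tag) first... */
    kasan_poison_slab(page);
    /* ...so that page_address() now yields an untagged address */
    start = page_address(page);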

    [andreyknvl@google.com: fix compilation error when CONFIG_SLUB_DEBUG=n]
    Link: http://lkml.kernel.org/r/ac27cc0bbaeb414ed77bcd6671a877cf3546d56e.1550066133.git.andreyknvl@google.com
    Link: http://lkml.kernel.org/r/cd895d627465a3f1c712647072d17f10883be2a1.1549921721.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Cc: Alexander Potapenko
    Cc: Andrey Ryabinin
    Cc: Catalin Marinas
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Dmitry Vyukov
    Cc: Evgeniy Stepanov
    Cc: Joonsoo Kim
    Cc: Kostya Serebryany
    Cc: Pekka Enberg
    Cc: Qian Cai
    Cc: Vincenzo Frascino
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     
  • kmemleak keeps two global variables, min_addr and max_addr, which store
    the range of valid (encountered by kmemleak) pointer values, which it
    later uses to speed up pointer lookup when scanning blocks.

    With tagged pointers this range will get bigger than it needs to be. This
    patch makes kmemleak untag pointers before saving them to min_addr and
    max_addr and when performing a lookup.
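
    A sketch of the idea (simplified; the variable names follow the
    description above):

    /* when recording an object, track untagged bounds only */
    untagged_ptr = (unsigned long)kasan_reset_tag((void *)ptr);
    min_addr = min(min_addr, untagged_ptr);
    max_addr = max(max_addr, untagged_ptr + size);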

    Link: http://lkml.kernel.org/r/16e887d442986ab87fe87a755815ad92fa431a5f.1550066133.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Tested-by: Qian Cai
    Acked-by: Catalin Marinas
    Cc: Alexander Potapenko
    Cc: Andrey Ryabinin
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Dmitry Vyukov
    Cc: Evgeniy Stepanov
    Cc: Joonsoo Kim
    Cc: Kostya Serebryany
    Cc: Pekka Enberg
    Cc: Vincenzo Frascino
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     
  • Right now we call kmemleak hooks before assigning tags to pointers in
    KASAN hooks. As a result, when an object gets allocated, kmemleak sees a
    differently tagged pointer, compared to the one it sees when the object
    gets freed. Fix it by calling KASAN hooks before kmemleak's ones.
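
    A sketch of the new ordering in the slab post-alloc path (simplified):

    /* assign the KASAN tag first, then report the tagged pointer, so
     * kmemleak sees the same pointer value at alloc and free time */
    object = kasan_slab_alloc(s, object, flags);
    kmemleak_alloc_recursive(object, s->object_size, 1, s->flags, flags);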

    Link: http://lkml.kernel.org/r/cd825aa4897b0fc37d3316838993881daccbe9f5.1549921721.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Reported-by: Qian Cai
    Cc: Alexander Potapenko
    Cc: Andrey Ryabinin
    Cc: Catalin Marinas
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Dmitry Vyukov
    Cc: Evgeniy Stepanov
    Cc: Joonsoo Kim
    Cc: Kostya Serebryany
    Cc: Pekka Enberg
    Cc: Vincenzo Frascino
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     

09 Jan, 2019

1 commit

    With CONFIG_HARDENED_USERCOPY enabled, __check_heap_object() compares,
    and then subtracts, a potentially tagged pointer against the untagged
    address of the page that this pointer belongs to, which leads to
    unexpected behavior.

    Untag the pointer in __check_heap_object() before doing any of these
    operations.
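
    A sketch of the fix (simplified):

    /* __check_heap_object(): untag before comparing against and
     * subtracting the page's untagged linear address */
    ptr = kasan_reset_tag(ptr);
    if (ptr < page_address(page))
            usercopy_abort("SLUB object not in SLUB page?!", NULL,
                           to_user, 0, n);
    offset = (ptr - page_address(page)) % s->size;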

    Link: http://lkml.kernel.org/r/7e756a298d514c4482f52aea6151db34818d395d.1546540962.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Cc: Andrey Ryabinin
    Cc: Christoph Lameter
    Cc: Dmitry Vyukov
    Cc: Mark Rutland
    Cc: Vincenzo Frascino
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     

29 Dec, 2018

4 commits

    If __cmpxchg_double_slab() fails and (l != m), the current code records
    the transition state of the slub action.

    Update the action after __cmpxchg_double_slab() succeeds, so that the
    final state is recorded.

    [akpm@linux-foundation.org: more whitespace cleanup]
    Link: http://lkml.kernel.org/r/20181107013119.3816-1-richard.weiyang@gmail.com
    Signed-off-by: Wei Yang
    Reviewed-by: Andrew Morton
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • node_match() is a static function and is only invoked in slub.c.

    In all three places, `page' is ensured to be valid.

    Link: http://lkml.kernel.org/r/20181106150245.1668-1-richard.weiyang@gmail.com
    Signed-off-by: Wei Yang
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
    cpu_slab is a per-cpu variable that is allocated for all CPUs or none.
    If a cpu_slab fails to be allocated, the slub is not usable.

    So we can use cpu_slab without validation in __flush_cpu_slab().

    Link: http://lkml.kernel.org/r/20181103141218.22844-1-richard.weiyang@gmail.com
    Signed-off-by: Wei Yang
    Reviewed-by: Andrew Morton
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
    An object constructor can initialize pointers within the object based on
    the address of the object. Since the object address might be tagged, we
    need to assign a tag before calling the constructor.

    The implemented approach is to assign tags to objects with constructors
    when a slab is allocated, and to call constructors once as usual. The
    downside is that such an object will always have the same tag when it is
    reallocated, so we won't catch use-after-frees on it.

    Also preassign tags for objects from SLAB_TYPESAFE_BY_RCU caches, since
    they can be validly accessed after having been freed.
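
    A sketch of the approach, based on the description above (simplified):

    /* assign the object its tag once, at slab allocation time, before the
     * constructor runs, so ctor-initialized self-pointers carry that tag */
    static void *setup_object(struct kmem_cache *s, struct page *page,
                              void *object)
    {
            setup_object_debug(s, page, object);
            object = kasan_init_slab_obj(s, object);    /* preassign tag */
            if (unlikely(s->ctor)) {
                    kasan_unpoison_object_data(s, object);
                    s->ctor(object);
                    kasan_poison_object_data(s, object);
            }
            return object;
    }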

    Link: http://lkml.kernel.org/r/f158a8a74a031d66f0a9398a5b0ed453c37ba09a.1544099024.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Reviewed-by: Andrey Ryabinin
    Reviewed-by: Dmitry Vyukov
    Cc: Christoph Lameter
    Cc: Mark Rutland
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov