29 Jul, 2020

1 commit

  • commit d38a2b7a9c939e6d7329ab92b96559ccebf7b135 upstream.

    If the kmem_cache refcount is greater than one, we should not mark the
    root kmem_cache as dying. If we incorrectly mark the root kmem_cache
    as dying, the non-root kmem_cache can never be destroyed, which
    results in a memory leak when the memcg is destroyed. The issue can be
    reproduced with the following steps (sketched in code after the list).

    1) Use kmem_cache_create() to create a new kmem_cache named A.
    2) Coincidentally, the kmem_cache A is an alias for kmem_cache B,
    so only B's refcount is increased.
    3) Use kmem_cache_destroy() to destroy the kmem_cache A; this merely
    decreases B's refcount, yet it marks B as dying.
    4) Create a new memory cgroup and allocate memory from kmem_cache B.
    This creates a non-root kmem_cache for the allocation.
    5) When the memory cgroup created in step 4) is destroyed, the
    non-root kmem_cache can never be destroyed.

    If steps 4) and 5) are repeated, a lot of memory is leaked. So mark
    the root kmem_cache as dying only when its refcount reaches zero.
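
    For illustration, the reproduction can be approximated with a minimal
    kernel-module sketch. The cache name and size are arbitrary, and the
    aliasing in step 2) depends on slab merging picking an existing
    compatible cache B, so it is an assumption here rather than something
    the module can force:

    #include <linux/module.h>
    #include <linux/slab.h>

    static struct kmem_cache *cache_a;

    static int __init repro_init(void)
    {
        /* Step 1: may transparently return an alias of an existing cache B
         * (assumed here), in which case only B's refcount is bumped. */
        cache_a = kmem_cache_create("repro_cache", 192, 0, SLAB_ACCOUNT, NULL);
        if (!cache_a)
            return -ENOMEM;

        /* Step 3: with the bug, this decreases B's refcount but still
         * marks the shared root cache B as dying. */
        kmem_cache_destroy(cache_a);

        /* Steps 4) and 5) are driven from userspace: create a memcg,
         * allocate from the aliased cache with __GFP_ACCOUNT, then remove
         * the memcg; the per-memcg child cache is never destroyed. */
        return 0;
    }
    module_init(repro_init);
    MODULE_LICENSE("GPL");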

    Fixes: 92ee383f6daa ("mm: fix race between kmem_cache destroy, create and deactivate")
    Signed-off-by: Muchun Song
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Roman Gushchin
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Shakeel Butt
    Cc:
    Link: http://lkml.kernel.org/r/20200716165103.83462-1-songmuchun@bytedance.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Muchun Song
     

01 Jul, 2020

1 commit

  • commit 8982ae527fbef170ef298650c15d55a9ccd33973 upstream.

    The kzfree() function is normally used to clear some sensitive
    information, like encryption keys, in a buffer before freeing it back
    to the pool. memset() is currently used for the clearing. However
    unlikely, there is still a non-zero probability that the compiler may
    choose to optimize away the memory clearing, especially if LTO is used
    in the future.

    To make sure that this optimization can never happen,
    memzero_explicit(), which was introduced in v3.18, is now used in
    kzfree() to future-proof it.
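
    The resulting kzfree() is essentially the following (a sketch of the
    mm/slab_common.c version after this change; details may differ
    slightly):

    void kzfree(const void *p)
    {
        size_t ks;
        void *mem = (void *)p;

        if (unlikely(ZERO_OR_NULL_PTR(mem)))
            return;
        ks = ksize(mem);
        /* memzero_explicit() cannot be optimized away, unlike memset(). */
        memzero_explicit(mem, ks);
        kfree(mem);
    }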

    Link: http://lkml.kernel.org/r/20200616154311.12314-2-longman@redhat.com
    Fixes: 3ef0e5ba4673 ("slab: introduce kzfree()")
    Signed-off-by: Waiman Long
    Acked-by: Michal Hocko
    Cc: David Howells
    Cc: Jarkko Sakkinen
    Cc: James Morris
    Cc: "Serge E. Hallyn"
    Cc: Joe Perches
    Cc: Matthew Wilcox
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Dan Carpenter
    Cc: "Jason A . Donenfeld"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Waiman Long
     

23 Jan, 2020

1 commit

  • commit 2fe20210fc5f5e62644678b8f927c49f2c6f42a7 upstream.

    When booting with amd_iommu=off, the following WARNING message
    appears:

    AMD-Vi: AMD IOMMU disabled on kernel command-line
    ------------[ cut here ]------------
    WARNING: CPU: 0 PID: 0 at kernel/workqueue.c:2772 flush_workqueue+0x42e/0x450
    Modules linked in:
    CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.5.0-rc3-amd-iommu #6
    Hardware name: Lenovo ThinkSystem SR655-2S/7D2WRCZ000, BIOS D8E101L-1.00 12/05/2019
    RIP: 0010:flush_workqueue+0x42e/0x450
    Code: ff 0f 0b e9 7a fd ff ff 4d 89 ef e9 33 fe ff ff 0f 0b e9 7f fd ff ff 0f 0b e9 bc fd ff ff 0f 0b e9 a8 fd ff ff e8 52 2c fe ff 0b 31 d2 48 c7 c6 e0 88 c5 95 48 c7 c7 d8 ad f0 95 e8 19 f5 04
    Call Trace:
    kmem_cache_destroy+0x69/0x260
    iommu_go_to_state+0x40c/0x5ab
    amd_iommu_prepare+0x16/0x2a
    irq_remapping_prepare+0x36/0x5f
    enable_IR_x2apic+0x21/0x172
    default_setup_apic_routing+0x12/0x6f
    apic_intr_mode_init+0x1a1/0x1f1
    x86_late_time_init+0x17/0x1c
    start_kernel+0x480/0x53f
    secondary_startup_64+0xb6/0xc0
    ---[ end trace 30894107c3749449 ]---
    x2apic: IRQ remapping doesn't support X2APIC mode
    x2apic disabled

    The warning is caused by the call to 'kmem_cache_destroy()'
    in free_iommu_resources(). Here is the call path:

    free_iommu_resources
      kmem_cache_destroy
        flush_memcg_workqueue
          flush_workqueue

    The root cause is that the IOMMU subsystem runs before the workqueue
    subsystem, at which point the variable 'wq_online' is still 'false'.
    This makes the check 'if (WARN_ON(!wq_online))' in flush_workqueue()
    trigger.

    Since the variable 'memcg_kmem_cache_wq' has not been allocated at
    that point, it is unnecessary to call flush_memcg_workqueue().
    Skipping the call prevents the WARNING triggered by flush_workqueue().
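
    A hedged sketch of the fix: flush_memcg_workqueue() bails out early
    while the workqueue has not been created yet (exact placement and the
    surrounding code are omitted):

    static void flush_memcg_workqueue(struct kmem_cache *s)
    {
        /* The IOMMU init path runs before workqueue_init(); nothing can
         * have been queued on memcg_kmem_cache_wq yet, so skip the flush. */
        if (!memcg_kmem_cache_wq)
            return;

        /* ... existing flush logic ... */
    }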

    Link: http://lkml.kernel.org/r/20200103085503.1665-1-ahuang12@lenovo.com
    Fixes: 92ee383f6daab ("mm: fix race between kmem_cache destroy, create and deactivate")
    Signed-off-by: Adrian Huang
    Reported-by: Xiaochun Lee
    Reviewed-by: Shakeel Butt
    Cc: Joerg Roedel
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Adrian Huang
     

18 Dec, 2019

1 commit

  • commit a264df74df38855096393447f1b8f386069a94b9 upstream.

    Christian reported the following warning, obtained while running some
    KVM-related tests on s390:

    WARNING: CPU: 8 PID: 208 at lib/percpu-refcount.c:108 percpu_ref_exit+0x50/0x58
    Modules linked in: kvm(-) xt_CHECKSUM xt_MASQUERADE bonding xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_na>
    CPU: 8 PID: 208 Comm: kworker/8:1 Not tainted 5.2.0+ #66
    Hardware name: IBM 2964 NC9 712 (LPAR)
    Workqueue: events sysfs_slab_remove_workfn
    Krnl PSW : 0704e00180000000 0000001529746850 (percpu_ref_exit+0x50/0x58)
    R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
    Krnl GPRS: 00000000ffff8808 0000001529746740 000003f4e30e8e18 0036008100000000
    0000001f00000000 0035008100000000 0000001fb3573ab8 0000000000000000
    0000001fbdb6de00 0000000000000000 0000001529f01328 0000001fb3573b00
    0000001fbb27e000 0000001fbdb69300 000003e009263d00 000003e009263cd0
    Krnl Code: 0000001529746842: f0a0000407fe srp 4(11,%r0),2046,0
    0000001529746848: 47000700 bc 0,1792
    #000000152974684c: a7f40001 brc 15,152974684e
    >0000001529746850: a7f4fff2 brc 15,1529746834
    0000001529746854: 0707 bcr 0,%r7
    0000001529746856: 0707 bcr 0,%r7
    0000001529746858: eb8ff0580024 stmg %r8,%r15,88(%r15)
    000000152974685e: a738ffff lhi %r3,-1
    Call Trace:
    ([] 0x3e009263d00)
    [] slab_kmem_cache_release+0x3a/0x70
    [] kobject_put+0xaa/0xe8
    [] process_one_work+0x1e8/0x428
    [] worker_thread+0x48/0x460
    [] kthread+0x126/0x160
    [] ret_from_fork+0x28/0x30
    [] kernel_thread_starter+0x0/0x10
    Last Breaking-Event-Address:
    [] percpu_ref_exit+0x4c/0x58
    ---[ end trace b035e7da5788eb09 ]---

    The problem occurs because kmem_cache_destroy() is called immediately
    after deleting a memcg, so it races with the memcg kmem_cache
    deactivation.

    flush_memcg_workqueue() at the beginning of kmem_cache_destroy() is
    supposed to guarantee that all deactivation processes are finished, but
    failed to do so. It waits for an rcu grace period, after which all
    children kmem_caches should be deactivated. During the deactivation
    percpu_ref_kill() is called for non-root kmem_cache refcounters, but it
    requires yet another rcu grace period to finish the transition to the
    atomic (dead) state.

    So in a rare case when not all children kmem_caches are destroyed at the
    moment when the root kmem_cache is about to be gone, we need to wait
    another rcu grace period before destroying the root kmem_cache.
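
    A hedged sketch of that extra wait in flush_memcg_workqueue() (the
    children list field is taken from the surrounding code as I recall it,
    so treat the exact names as an assumption):

    /* Racing with children kmem_cache deactivation: percpu_ref_kill() has
     * already been called, but switching the refcounters to the atomic
     * (dead) state needs one more RCU grace period to complete. */
    if (!list_empty(&s->memcg_params.children))
        rcu_barrier();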

    This issue can be triggered only with dynamically created kmem_caches
    which are used with memcg accounting. In this case per-memcg child
    kmem_caches are created. They are deactivated from the cgroup removing
    path. If the destruction of the root kmem_cache is racing with the
    removal of the cgroup (both are quite complicated multi-stage
    processes), the described issue can occur. The only known way to
    trigger it in the real life, is to unload some kernel module which
    creates a dedicated kmem_cache, used from different memory cgroups with
    GFP_ACCOUNT flag. If the unloading happens immediately after calling
    rmdir on the corresponding cgroup, there is some chance to trigger the
    issue.

    Link: http://lkml.kernel.org/r/20191129025011.3076017-1-guro@fb.com
    Fixes: f0a3a24b532d ("mm: memcg/slab: rework non-root kmem_cache lifecycle management")
    Signed-off-by: Roman Gushchin
    Reported-by: Christian Borntraeger
    Tested-by: Christian Borntraeger
    Reviewed-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Roman Gushchin
     

19 Oct, 2019

1 commit

  • Karsten reported the following panic in __free_slab() happening on a s390x
    machine:

    Unable to handle kernel pointer dereference in virtual kernel address space
    Failing address: 0000000000000000 TEID: 0000000000000483
    Fault in home space mode while using kernel ASCE.
    AS:00000000017d4007 R3:000000007fbd0007 S:000000007fbff000 P:000000000000003d
    Oops: 0004 ilc:3 [#1] PREEMPT SMP
    Modules linked in: tcp_diag inet_diag xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_at nf_nat
    CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-05872-g6133e3e4bada-dirty #14
    Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0)
    Krnl PSW : 0704d00180000000 00000000003cadb6 (__free_slab+0x686/0x6b0)
    R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
    Krnl GPRS: 00000000f3a32928 0000000000000000 000000007fbf5d00 000000000117c4b8
    0000000000000000 000000009e3291c1 0000000000000000 0000000000000000
    0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
    0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
    000000000117ba00 000003e000057db0 00000000003cabcc 000003e000057c78
    Krnl Code: 00000000003cada6: e310a1400004 lg %r1,320(%r10)
    00000000003cadac: c0e50046c286 brasl %r14,ca32b8
    #00000000003cadb2: a7f4fe36 brc 15,3caa1e
    >00000000003cadb6: e32060800024 stg %r2,128(%r6)
    00000000003cadbc: a7f4fd9e brc 15,3ca8f8
    00000000003cadc0: c0e50046790c brasl %r14,c99fd8
    00000000003cadc6: a7f4fe2c brc 15,3caa1e
    00000000003cadca: ecb1ffff00d9 aghik %r11,%r1,-1
    Call Trace:
    ( __free_slab+0x49c/0x6b0)
    rcu_core+0x5a6/0x7e0
    __do_softirq+0xf2/0x5c0
    irq_exit+0x104/0x130
    do_IRQ+0x9a/0xf0
    ext_int_handler+0x130/0x134
    enabled_wait+0x58/0x128
    ( enabled_wait+0x44/0x128)
    arch_cpu_idle+0x40/0x58
    default_idle_call+0x3c/0x68
    do_idle+0xec/0x1c0
    cpu_startup_entry+0x36/0x40
    arch_call_rest_init+0x5c/0x88
    0x0
    INFO: lockdep is turned off.
    Last Breaking-Event-Address:
    __free_slab+0x1c4/0x6b0
    Kernel panic - not syncing: Fatal exception in interrupt

    The kernel panics on an attempt to dereference the NULL memcg pointer.
    When shutdown_cache() is called from the kmem_cache_destroy() context, a
    memcg kmem_cache might have empty slab pages in a partial list, which are
    still charged to the memory cgroup.

    These pages are released by free_partial() at the beginning of
    shutdown_cache(): either directly or by scheduling a RCU-delayed work
    (if the kmem_cache has the SLAB_TYPESAFE_BY_RCU flag). The latter case
    is when the reported panic can happen: memcg_unlink_cache() is called
    immediately after shrinking partial lists, without waiting for scheduled
    RCU works. It sets the kmem_cache->memcg_params.memcg pointer to NULL,
    and the following attempt to dereference it by __free_slab() from the
    RCU work context causes the panic.

    To fix the issue, let's postpone the release of the memcg pointer to
    destroy_memcg_params(). It's called from a separate work context by
    slab_caches_to_rcu_destroy_workfn(), which contains a full RCU barrier.
    This guarantees that all scheduled page release RCU works will complete
    before the memcg pointer will be zeroed.
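
    Illustratively, the release moves into destroy_memcg_params(), roughly
    as sketched below (simplified; the root-cache branch and any locking
    are omitted, and the exact field handling may differ):

    static void destroy_memcg_params(struct kmem_cache *s)
    {
        if (!is_root_cache(s)) {
            /* Runs from slab_caches_to_rcu_destroy_workfn(), i.e. after a
             * full RCU barrier, so pending __free_slab() RCU works still
             * see a valid memcg pointer. */
            mem_cgroup_put(s->memcg_params.memcg);
            WRITE_ONCE(s->memcg_params.memcg, NULL);
        }
    }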

    Big thanks to Karsten for the perfect report containing all the
    necessary information, for his help with the analysis of the problem,
    and for testing the fix.

    Link: http://lkml.kernel.org/r/20191010160549.1584316-1-guro@fb.com
    Fixes: fb2f2b0adb98 ("mm: memcg/slab: reparent memcg kmem_caches on cgroup removal")
    Signed-off-by: Roman Gushchin
    Reported-by: Karsten Graul
    Tested-by: Karsten Graul
    Acked-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Karsten Graul
    Cc: Vladimir Davydov
    Cc: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

08 Oct, 2019

2 commits

  • In most configurations, kmalloc() happens to return naturally aligned
    (i.e. aligned to the block size itself) blocks for power of two sizes.

    That means some kmalloc() users might unknowingly rely on that
    alignment, until stuff breaks when the kernel is built with e.g.
    CONFIG_SLUB_DEBUG or CONFIG_SLOB and blocks stop being aligned. Then
    developers have to devise workarounds such as their own kmem caches
    with specified alignment [1], which is not always practical, as
    recently evidenced in [2].

    The topic has been discussed at LSF/MM 2019 [3]. Adding a
    'kmalloc_aligned()' variant would not help with code unknowingly relying
    on the implicit alignment. For slab implementations it would either
    require creating more kmalloc caches, or allocate a larger size and only
    give back part of it. That would be wasteful, especially with a generic
    alignment parameter (in contrast with a fixed alignment to size).

    Ideally we should provide mm users with what they need without
    difficult workarounds or own reimplementations, so let's make the
    kmalloc() alignment to size explicitly guaranteed for power-of-two
    sizes under all configurations (a short illustration follows the
    per-allocator notes below). What does this mean for the three
    available allocators?

    * SLAB object layout happens to be mostly unchanged by the patch. The
    implicitly provided alignment could be compromised with
    CONFIG_DEBUG_SLAB due to redzoning, however SLAB disables redzoning for
    caches with alignment larger than unsigned long long. Practically on at
    least x86 this includes kmalloc caches as they use cache line alignment,
    which is larger than that. Still, this patch ensures alignment on all
    arches and cache sizes.

    * SLUB layout is also unchanged unless redzoning is enabled through
    CONFIG_SLUB_DEBUG and boot parameter for the particular kmalloc cache.
    With this patch, explicit alignment is guaranteed with redzoning as
    well. This will result in more memory being wasted, but that should be
    acceptable in a debugging scenario.

    * SLOB has no implicit alignment so this patch adds it explicitly for
    kmalloc(). The potential downside is increased fragmentation. While
    pathological allocation scenarios are certainly possible, in my testing,
    after booting an x86_64 kernel+userspace with virtme, around 16MB of
    memory was consumed by slab pages both before and after the patch, with
    the difference in the noise.
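
    In other words, after this change a caller may rely on the following
    holding for any allocator and debug configuration (illustrative check
    only):

    void *p = kmalloc(512, GFP_KERNEL);    /* power-of-two size */

    /* Guaranteed by this patch for SLAB, SLUB and SLOB alike. */
    WARN_ON(!IS_ALIGNED((unsigned long)p, 512));
    kfree(p);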

    [1] https://lore.kernel.org/linux-btrfs/c3157c8e8e0e7588312b40c853f65c02fe6c957a.1566399731.git.christophe.leroy@c-s.fr/
    [2] https://lore.kernel.org/linux-fsdevel/20190225040904.5557-1-ming.lei@redhat.com/
    [3] https://lwn.net/Articles/787740/

    [akpm@linux-foundation.org: documentation fixlet, per Matthew]
    Link: http://lkml.kernel.org/r/20190826111627.7505-3-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Matthew Wilcox (Oracle)
    Acked-by: Michal Hocko
    Acked-by: Kirill A. Shutemov
    Acked-by: Christoph Hellwig
    Cc: David Sterba
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Ming Lei
    Cc: Dave Chinner
    Cc: "Darrick J . Wong"
    Cc: Christoph Hellwig
    Cc: James Bottomley
    Cc: Vlastimil Babka
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Patch series "guarantee natural alignment for kmalloc()", v2.

    This patch (of 2):

    SLOB currently doesn't account its pages at all, so in /proc/meminfo the
    Slab field shows zero. Modifying a counter on page allocation and
    freeing should be acceptable even for the small system scenarios SLOB is
    intended for. Since reclaimable caches are not separated in SLOB,
    account everything as unreclaimable.

    SLUB currently doesn't account kmalloc() and kmalloc_node() allocations
    larger than an order-1 page, which are passed directly to the page
    allocator. As they also don't appear in /proc/slabinfo, it might look
    like a memory leak. For consistency, account them as well. (SLAB
    doesn't actually use the page allocator directly, so no change there.)

    Ideally SLOB and SLUB would be handled in separate patches, but due to
    the shared kmalloc_order() function and different kfree()
    implementations, it's easier to patch both at once to prevent
    inconsistencies.
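
    A hedged sketch of the accounting added on the large-kmalloc path; the
    counter name and page-based granularity follow the node-stat scheme of
    that era, and the exact diff may differ:

    /* In kmalloc_order(): charge the pages obtained from the page
     * allocator to the unreclaimable slab counter... */
    mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE,
                        1 << order);

    /* ...and in the corresponding kfree() path, undo the charge: */
    mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE,
                        -(1 << order));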

    Link: http://lkml.kernel.org/r/20190826111627.7505-2-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Ming Lei
    Cc: Dave Chinner
    Cc: Matthew Wilcox
    Cc: "Darrick J . Wong"
    Cc: Christoph Hellwig
    Cc: James Bottomley
    Cc: Vlastimil Babka
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

25 Sep, 2019

1 commit

    Currently, a value of '1' is written to the /sys/kernel/slab/<cache>/shrink
    file to shrink the slab by flushing out all the per-cpu slabs and free
    slabs in partial lists. This can be useful to squeeze out a bit more
    memory under extreme conditions, as well as to make the active object
    counts in /proc/slabinfo more accurate.

    This usually applies only to the root caches, as the SLUB_MEMCG_SYSFS_ON
    option is usually not enabled and "slub_memcg_sysfs=1" not set. Even if
    memcg sysfs is turned on, it is too cumbersome and impractical to manage
    all those per-memcg sysfs files in a real production system.

    So there is no practical way to shrink memcg caches. Fix this by enabling
    a proper write to the shrink sysfs file of the root cache to scan all the
    available memcg caches and shrink them as well. For a non-root memcg
    cache (when SLUB_MEMCG_SYSFS_ON or slub_memcg_sysfs is on), only that
    cache will be shrunk when written.
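
    For example, after this patch a single write to the root cache's file
    shrinks the root cache and all of its memcg children (the task_struct
    cache from the measurements below is used for illustration):

    # echo 1 > /sys/kernel/slab/task_struct/shrink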

    On a 2-socket 64-core 256-thread arm64 system with 64k pages, after
    a parallel kernel build, the amount of memory occupied by slabs
    before shrinking was:

    # grep task_struct /proc/slabinfo
    task_struct 53137 53192 4288 61 4 : tunables 0 0 0 : slabdata 872 872 0
    # grep "^S[lRU]" /proc/meminfo
    Slab: 3936832 kB
    SReclaimable: 399104 kB
    SUnreclaim: 3537728 kB

    After shrinking slabs (by echoing "1" to all shrink files):

    # grep "^S[lRU]" /proc/meminfo
    Slab: 1356288 kB
    SReclaimable: 263296 kB
    SUnreclaim: 1092992 kB
    # grep task_struct /proc/slabinfo
    task_struct 2764 6832 4288 61 4 : tunables 0 0 0 : slabdata 112 112 0

    Link: http://lkml.kernel.org/r/20190723151445.7385-1-longman@redhat.com
    Signed-off-by: Waiman Long
    Acked-by: Roman Gushchin
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shakeel Butt
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Waiman Long
     

17 Jul, 2019

1 commit

  • Clang gets rather confused about two variables in the same special
    section when one of them is not initialized, leading to an assembler
    warning later:

    /tmp/slab_common-18f869.s: Assembler messages:
    /tmp/slab_common-18f869.s:7526: Warning: ignoring changed section attributes for .data..ro_after_init

    Adding an initialization to kmalloc_caches is rather silly here
    but does avoid the issue.
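
    The workaround is roughly the empty-brace initializer on the
    kmalloc_caches definition in mm/slab_common.c (sketch; exact line
    wrapping may differ):

    struct kmem_cache *
    kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1] __ro_after_init =
    { /* initialization for https://bugs.llvm.org/show_bug.cgi?id=42570 */ };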

    Link: https://bugs.llvm.org/show_bug.cgi?id=42570
    Link: http://lkml.kernel.org/r/20190712090455.266021-1-arnd@arndb.de
    Signed-off-by: Arnd Bergmann
    Acked-by: David Rientjes
    Reviewed-by: Andrew Morton
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc: Stephen Rothwell
    Cc: Roman Gushchin
    Cc: Shakeel Butt
    Cc: Vladimir Davydov
    Cc: Andrey Konovalov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     

13 Jul, 2019

10 commits

    There are concerns about memory leaks from extensive use of memory cgroups
    as each memory cgroup creates its own set of kmem caches. There is a
    possibility that the memcg kmem caches may remain even after the memory
    cgroups have been offlined. Therefore, it will be useful to show the
    status of each memcg kmem cache.

    This patch introduces a new <debugfs>/memcg_slabinfo file which is
    somewhat similar to /proc/slabinfo in format, but lists only information
    about kmem caches that have child memcg kmem caches. Information
    available in /proc/slabinfo is not repeated in memcg_slabinfo.

    A portion of a sample output of the file was:

    #
    rpc_inode_cache root 13 51 1 1
    rpc_inode_cache 48 0 0 0 0
    fat_inode_cache root 1 45 1 1
    fat_inode_cache 41 2 45 1 1
    xfs_inode root 770 816 24 24
    xfs_inode 92 22 34 1 1
    xfs_inode 88:dead 1 34 1 1
    xfs_inode 89:dead 23 34 1 1
    xfs_inode 85 4 34 1 1
    xfs_inode 84 9 34 1 1

    The css id of the memcg is also listed. If a memcg is not online,
    the tag ":dead" will be attached as shown above.
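
    The file lives in debugfs, so assuming the usual mount point it can be
    read with:

    # cat /sys/kernel/debug/memcg_slabinfo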

    [longman@redhat.com: memcg: add ":deact" tag for reparented kmem caches in memcg_slabinfo]
    Link: http://lkml.kernel.org/r/20190621173005.31514-1-longman@redhat.com
    [longman@redhat.com: set the flag in the common code as suggested by Roman]
    Link: http://lkml.kernel.org/r/20190627184324.5875-1-longman@redhat.com
    Link: http://lkml.kernel.org/r/20190619171621.26209-1-longman@redhat.com
    Signed-off-by: Waiman Long
    Suggested-by: Shakeel Butt
    Reviewed-by: Shakeel Butt
    Acked-by: Roman Gushchin
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Waiman Long
     
  • Let's reparent non-root kmem_caches on memcg offlining. This allows us to
    release the memory cgroup without waiting for the last outstanding kernel
    object (e.g. dentry used by another application).

    Since the parent cgroup is already charged, all we need to do is splice
    the list of kmem_caches onto the parent's kmem_caches list, swap the
    memcg pointer, drop the css refcounter for each kmem_cache and adjust
    the parent's css refcounter.
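
    A rough sketch of that sequence is shown below; the function name and
    the list field names are illustrative rather than the literal upstream
    code:

    /* Illustrative only: reparent all kmem_caches of an offlined memcg. */
    static void reparent_memcg_caches_sketch(struct mem_cgroup *memcg,
                                             struct mem_cgroup *parent)
    {
        struct kmem_cache *s;

        list_for_each_entry(s, &memcg->kmem_caches,
                            memcg_params.kmem_caches_node) {
            WRITE_ONCE(s->memcg_params.memcg, parent); /* swap the pointer */
            css_get(&parent->css);    /* the parent now pins the cache */
            css_put(&memcg->css);     /* drop the child's reference */
        }
        list_splice_init(&memcg->kmem_caches, &parent->kmem_caches);
    }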

    Please, note that kmem_cache->memcg_params.memcg isn't a stable pointer
    anymore. It's safe to read it under rcu_read_lock(), cgroup_mutex held,
    or any other way that protects the memory cgroup from being released.

    We can race with the slab allocation and deallocation paths. It's not a
    big problem: parent's charge and slab global stats are always correct, and
    we don't care anymore about the child usage and global stats. The child
    cgroup is already offline, so we don't use or show it anywhere.

    Local slab stats (NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE) aren't
    used anywhere except count_shadow_nodes(). But even there it won't break
    anything: after reparenting "nodes" will be 0 on child level (because
    we're already reparenting shrinker lists), and on parent level page stats
    always were 0, and this patch won't change anything.

    [guro@fb.com: properly handle kmem_caches reparented to root_mem_cgroup]
    Link: http://lkml.kernel.org/r/20190620213427.1691847-1-guro@fb.com
    Link: http://lkml.kernel.org/r/20190611231813.3148843-11-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Vladimir Davydov
    Reviewed-by: Shakeel Butt
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Waiman Long
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: Andrei Vagin
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Currently each charged slab page holds a reference to the cgroup to which
    it's charged. Kmem_caches are held by the memcg and are released all
    together with the memory cgroup. It means that none of kmem_caches are
    released unless at least one reference to the memcg exists, which is very
    far from optimal.

    Let's rework it in a way that allows releasing individual kmem_caches as
    soon as the cgroup is offline, the kmem_cache is empty and there are no
    pending allocations.

    To make it possible, let's introduce a new percpu refcounter for non-root
    kmem caches. The counter is initialized to the percpu mode, and is
    switched to the atomic mode during kmem_cache deactivation. The counter
    is bumped for every charged page and also for every running allocation.
    So the kmem_cache can't be released unless all allocations complete.

    To shutdown non-active empty kmem_caches, let's reuse the work queue,
    previously used for the kmem_cache deactivation. Once the reference
    counter reaches 0, let's schedule an asynchronous kmem_cache release.
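
    In terms of the percpu_ref API the lifecycle described above looks
    roughly like this; the refcnt field and the release callback name are
    chosen to match the description, not quoted from the patch:

    static void kmemcg_cache_release(struct percpu_ref *ref)
    {
        /* Last reference gone: schedule the asynchronous release of the
         * kmem_cache on the (reused) deactivation work queue. */
    }

    /* At non-root kmem_cache creation: the counter starts in percpu mode. */
    percpu_ref_init(&s->memcg_params.refcnt, kmemcg_cache_release, 0,
                    GFP_KERNEL);

    /* For every charged slab page and every running allocation: */
    percpu_ref_get(&s->memcg_params.refcnt);
    /* ... */
    percpu_ref_put(&s->memcg_params.refcnt);

    /* On deactivation: switch to atomic mode and drop the base reference. */
    percpu_ref_kill(&s->memcg_params.refcnt);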

    * I used the following simple approach to test the performance
    (stolen from another patchset by T. Harding):

    time find / -name fname-no-exist
    echo 2 > /proc/sys/vm/drop_caches
    repeat 10 times

    Results:

    orig patched

    real 0m1.455s real 0m1.355s
    user 0m0.206s user 0m0.219s
    sys 0m0.855s sys 0m0.807s

    real 0m1.487s real 0m1.699s
    user 0m0.221s user 0m0.256s
    sys 0m0.806s sys 0m0.948s

    real 0m1.515s real 0m1.505s
    user 0m0.183s user 0m0.215s
    sys 0m0.876s sys 0m0.858s

    real 0m1.291s real 0m1.380s
    user 0m0.193s user 0m0.198s
    sys 0m0.843s sys 0m0.786s

    real 0m1.364s real 0m1.374s
    user 0m0.180s user 0m0.182s
    sys 0m0.868s sys 0m0.806s

    real 0m1.352s real 0m1.312s
    user 0m0.201s user 0m0.212s
    sys 0m0.820s sys 0m0.761s

    real 0m1.302s real 0m1.349s
    user 0m0.205s user 0m0.203s
    sys 0m0.803s sys 0m0.792s

    real 0m1.334s real 0m1.301s
    user 0m0.194s user 0m0.201s
    sys 0m0.806s sys 0m0.779s

    real 0m1.426s real 0m1.434s
    user 0m0.216s user 0m0.181s
    sys 0m0.824s sys 0m0.864s

    real 0m1.350s real 0m1.295s
    user 0m0.200s user 0m0.190s
    sys 0m0.842s sys 0m0.811s

    So it looks like the difference is not noticeable in this test.

    [cai@lca.pw: fix an use-after-free in kmemcg_workfn()]
    Link: http://lkml.kernel.org/r/1560977573-10715-1-git-send-email-cai@lca.pw
    Link: http://lkml.kernel.org/r/20190611231813.3148843-9-guro@fb.com
    Signed-off-by: Roman Gushchin
    Signed-off-by: Qian Cai
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Shakeel Butt
    Cc: Waiman Long
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: Andrei Vagin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
    Currently the memcg_params.dying flag and the corresponding workqueue used
    for the asynchronous deactivation of kmem_caches are synchronized using
    the slab_mutex.

    This makes it impossible to check the flag from irq context, which will
    be required in order to implement asynchronous release of kmem_caches.

    So let's switch over to the irq-save flavor of spinlock-based
    synchronization.
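
    Conceptually, the places that touch the dying flag become the
    following (the lock name is my reading of the related code and should
    be treated as an assumption):

    static DEFINE_SPINLOCK(memcg_kmem_wq_lock);

    unsigned long flags;

    spin_lock_irqsave(&memcg_kmem_wq_lock, flags);
    /* it is now safe to test or set s->memcg_params.dying, even from
     * irq context */
    spin_unlock_irqrestore(&memcg_kmem_wq_lock, flags);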

    Link: http://lkml.kernel.org/r/20190611231813.3148843-8-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Vladimir Davydov
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Waiman Long
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: Andrei Vagin
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • There is no point in checking the root_cache->memcg_params.dying flag on
    kmem_cache creation path. New allocations shouldn't be performed using a
    dead root kmem_cache, so no new memcg kmem_cache creation can be scheduled
    after the flag is set. And if it was scheduled before,
    flush_memcg_workqueue() will wait for it anyway.

    So let's drop this check to simplify the code.

    Link: http://lkml.kernel.org/r/20190611231813.3148843-7-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Vladimir Davydov
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Waiman Long
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: Andrei Vagin
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Currently SLUB uses a work scheduled after an RCU grace period to
    deactivate a non-root kmem_cache. This mechanism can be reused for
    kmem_caches release, but requires generalization for SLAB case.

    Introduce kmemcg_cache_deactivate() function, which calls
    allocator-specific __kmem_cache_deactivate() and schedules execution of
    __kmem_cache_deactivate_after_rcu() with all necessary locks in a worker
    context after an rcu grace period.

    Here is the new calling scheme:
    kmemcg_cache_deactivate()
      __kmemcg_cache_deactivate()                  SLAB/SLUB-specific
      kmemcg_rcufn()                               rcu
        kmemcg_workfn()                            work
          __kmemcg_cache_deactivate_after_rcu()    SLAB/SLUB-specific

    instead of:
    __kmemcg_cache_deactivate()                    SLAB/SLUB-specific
    slab_deactivate_memcg_cache_rcu_sched()        SLUB-only
      kmemcg_rcufn()                               rcu
        kmemcg_workfn()                            work
          kmemcg_cache_deact_after_rcu()           SLUB-only

    For consistency, all allocator-specific functions start with "__".

    Link: http://lkml.kernel.org/r/20190611231813.3148843-4-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Vladimir Davydov
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Waiman Long
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: Andrei Vagin
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • The delayed work/rcu deactivation infrastructure of non-root kmem_caches
    can be also used for asynchronous release of these objects. Let's get rid
    of the word "deactivation" in corresponding names to make the code look
    better after generalization.

    It's easier to make the renaming first, so that the generalized code will
    look consistent from scratch.

    Let's rename struct memcg_cache_params fields:
    deact_fn -> work_fn
    deact_rcu_head -> rcu_head
    deact_work -> work

    And RCU/delayed work callbacks in slab common code:
    kmemcg_deactivate_rcufn -> kmemcg_rcufn
    kmemcg_deactivate_workfn -> kmemcg_workfn

    This patch contains no functional changes, only renamings.

    Link: http://lkml.kernel.org/r/20190611231813.3148843-3-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Vladimir Davydov
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Waiman Long
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: Andrei Vagin
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Patch series "mm: reparent slab memory on cgroup removal", v7.

    # Why do we need this?

    We've noticed that the number of dying cgroups is steadily growing on most
    of our hosts in production. The following investigation revealed an issue
    in the userspace memory reclaim code [1], accounting of kernel stacks [2],
    and also the main reason: slab objects.

    The underlying problem is quite simple: any page charged to a cgroup holds
    a reference to it, so the cgroup can't be reclaimed unless all charged
    pages are gone. If a slab object is actively used by other cgroups, it
    won't be reclaimed, and will prevent the origin cgroup from being
    reclaimed.

    Slab objects, and first of all the vfs cache, are shared between cgroups
    which use the same underlying fs and, what's even more important,
    between multiple generations of the same workload. So if something runs
    periodically, each time in a new cgroup (like how systemd works), we
    accumulate multiple dying cgroups.

    Strictly speaking pagecache isn't different here, but there is a key
    difference: we disable protection and apply some extra pressure on LRUs of
    dying cgroups, and these LRUs contain all charged pages. My experiments
    show that with the disabled kernel memory accounting the number of dying
    cgroups stabilizes at a relatively small number (~100, depends on memory
    pressure and cgroup creation rate), and with kernel memory accounting it
    grows pretty steadily up to several thousands.

    Memory cgroups are quite complex and big objects (mostly due to percpu
    stats), so it leads to noticeable memory losses. Memory occupied by dying
    cgroups is measured in hundreds of megabytes. I've even seen a host with
    more than 100Gb of memory wasted for dying cgroups. It leads to a
    degradation of performance with the uptime, and generally limits the usage
    of cgroups.

    My previous attempt [3] to fix the problem by applying extra pressure on
    slab shrinker lists caused regressions with xfs and ext4, and has been
    reverted [4]. The following attempts to find the right balance [5, 6]
    were not successful.

    So instead of trying to find a maybe non-existent balance, let's
    reparent accounted slab caches to the parent cgroup on cgroup removal.

    # Implementation approach

    There is however a significant problem with reparenting of slab memory:
    there is no list of charged pages. Some of them are in shrinker lists,
    but not all. Introducing a new list is really not an option.

    But fortunately there is a way forward: every slab page has a stable
    pointer to the corresponding kmem_cache. So the idea is to reparent
    kmem_caches instead of slab pages.

    It's actually simpler and cheaper, but requires some underlying changes:
    1) Make kmem_caches hold a single reference to the memory cgroup,
    instead of a separate reference for every slab page.
    2) Stop setting page->mem_cgroup pointer for memcg slab pages and use
    page->kmem_cache->memcg indirection instead. It's used only on
    slab page release, so performance overhead shouldn't be a big issue.
    3) Introduce a refcounter for non-root slab caches. It's required to
    be able to destroy kmem_caches when they become empty and release
    the associated memory cgroup.

    There is a bonus: currently we release all memcg kmem_caches all together
    with the memory cgroup itself. This patchset allows individual
    kmem_caches to be released as soon as they become inactive and free.

    Some additional implementation details are provided in corresponding
    commit messages.

    # Results

    Below is the average number of dying cgroups on two groups of our
    production hosts. They run some sort of web frontend workload, and the
    memory pressure is moderate. As we can see, with kernel memory
    reparenting the number stabilizes in the 60s range; with the original
    version it grows almost linearly and doesn't show any signs of
    plateauing. The difference in slab and percpu usage between the patched
    and unpatched versions also grows linearly. In 7 days it exceeded 200Mb.

    day             0    1    2    3    4    5    6    7
    original       56  362  628  752 1070 1250 1490 1560
    patched        23   46   51   55   60   57   67   69
    mem diff(Mb)   22   74  123  152  164  182  214  241

    # Links

    [1]: commit 68600f623d69 ("mm: don't miss the last page because of round-off error")
    [2]: commit 9b6f7e163cd0 ("mm: rework memcg kernel stack accounting")
    [3]: commit 172b06c32b94 ("mm: slowly shrink slabs with a relatively small number of objects")
    [4]: commit a9a238e83fbb ("Revert "mm: slowly shrink slabs with a relatively small number of objects")
    [5]: https://lkml.org/lkml/2019/1/28/1865
    [6]: https://marc.info/?l=linux-mm&m=155064763626437&w=2

    This patch (of 10):

    Initialize kmem_cache->memcg_params.memcg pointer in memcg_link_cache()
    rather than in init_memcg_params().

    Once kmem_cache holds a reference to the memory cgroup, this will
    simplify the refcounting.

    For non-root kmem_caches memcg_link_cache() is always called before the
    kmem_cache becomes visible to a user, so it's safe.
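
    A hedged sketch of the new placement; the linking into the various
    lists is elided:

    void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg)
    {
        if (!is_root_cache(s)) {
            /* Set here, just before the cache becomes visible to users,
             * instead of in init_memcg_params(). */
            s->memcg_params.memcg = memcg;
        }
        /* ... the existing list linking is unchanged ... */
    }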

    Link: http://lkml.kernel.org/r/20190611231813.3148843-2-guro@fb.com
    Signed-off-by: Roman Gushchin
    Reviewed-by: Shakeel Butt
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Waiman Long
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrei Vagin
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • ksize() has been unconditionally unpoisoning the whole shadow memory
    region associated with an allocation. This can lead to various undetected
    bugs, for example, double-kzfree().

    Specifically, kzfree() uses ksize() to determine the actual allocation
    size, and subsequently zeroes the memory. Since ksize() used to just
    unpoison the whole shadow memory region, no invalid free was detected.

    This patch addresses this as follows:

    1. Add a check in ksize(), and only then unpoison the memory region
    (see the sketch after this list).

    2. Preserve kasan_unpoison_slab() semantics by explicitly unpoisoning
    the shadow memory region using the size obtained from __ksize().
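
    A sketch of the resulting ksize(); the exact KASAN check helper used
    upstream may differ, kasan_check_read() is shown here as an
    assumption:

    size_t ksize(const void *objp)
    {
        size_t size;

        /* 1. Validate the pointer first: reading one byte through the
         * KASAN check reports an invalid access such as a double-kzfree()
         * instead of silently unpoisoning the region. */
        if (unlikely(ZERO_OR_NULL_PTR(objp)) || !kasan_check_read(objp, 1))
            return 0;

        size = __ksize(objp);
        /* 2. Only then unpoison, preserving the old semantics that
         * callers may use the whole allocated area. */
        kasan_unpoison_shadow(objp, size);
        return size;
    }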

    Tested:
    1. With SLAB allocator: a) normal boot without warnings; b) verified the
    added double-kzfree() is detected.
    2. With SLUB allocator: a) normal boot without warnings; b) verified the
    added double-kzfree() is detected.

    [elver@google.com: s/BUG_ON/WARN_ON_ONCE/, per Kees]
    Link: http://lkml.kernel.org/r/20190627094445.216365-6-elver@google.com
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199359
    Link: http://lkml.kernel.org/r/20190626142014.141844-6-elver@google.com
    Signed-off-by: Marco Elver
    Acked-by: Kees Cook
    Reviewed-by: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Alexander Potapenko
    Cc: Andrey Konovalov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Mark Rutland
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marco Elver
     
  • This refactors common code of ksize() between the various allocators into
    slab_common.c: __ksize() is the allocator-specific implementation without
    instrumentation, whereas ksize() includes the required KASAN logic.

    Link: http://lkml.kernel.org/r/20190626142014.141844-5-elver@google.com
    Signed-off-by: Marco Elver
    Acked-by: Christoph Lameter
    Reviewed-by: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Alexander Potapenko
    Cc: Andrey Konovalov
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Mark Rutland
    Cc: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marco Elver
     

30 Mar, 2019

1 commit

  • Patch series "iommu/io-pgtable-arm-v7s: Use DMA32 zone for page tables",
    v6.

    This is a followup to the discussion in [1], [2].

    IOMMUs using ARMv7 short-descriptor format require page tables (level 1
    and 2) to be allocated within the first 4GB of RAM, even on 64-bit
    systems.

    For L1 tables that are bigger than a page, we can just use
    __get_free_pages with GFP_DMA32 (on arm64 systems only, arm would still
    use GFP_DMA).

    For L2 tables that only take 1KB, it would be a waste to allocate a full
    page, so we considered 3 approaches:
    1. This series, adding support for GFP_DMA32 slab caches.
    2. genalloc, which requires pre-allocating the maximum number of L2 page
    tables (4096, so 4MB of memory).
    3. page_frag, which is not very memory-efficient as it is unable to reuse
    freed fragments until the whole page is freed. [3]

    This series is the most memory-efficient approach.

    stable@ note:
    We confirmed that this is a regression, and IOMMU errors happen on 4.19
    and linux-next/master on MT8173 (elm, Acer Chromebook R13). The issue
    most likely starts from commit ad67f5a6545f ("arm64: replace ZONE_DMA
    with ZONE_DMA32"), i.e. 4.15, and presumably breaks a number of Mediatek
    platforms (and maybe others?).

    [1] https://lists.linuxfoundation.org/pipermail/iommu/2018-November/030876.html
    [2] https://lists.linuxfoundation.org/pipermail/iommu/2018-December/031696.html
    [3] https://patchwork.codeaurora.org/patch/671639/

    This patch (of 3):

    IOMMUs using ARMv7 short-descriptor format require page tables to be
    allocated within the first 4GB of RAM, even on 64-bit systems. On arm64,
    this is done by passing GFP_DMA32 flag to memory allocation functions.

    For IOMMU L2 tables that only take 1KB, it would be a waste to allocate
    a full page using get_free_pages, so we considered 3 approaches:
    1. This patch, adding support for GFP_DMA32 slab caches.
    2. genalloc, which requires pre-allocating the maximum number of L2
    page tables (4096, so 4MB of memory).
    3. page_frag, which is not very memory-efficient as it is unable
    to reuse freed fragments until the whole page is freed.

    This change makes it possible to create a custom cache in DMA32 zone using
    kmem_cache_create, then allocate memory using kmem_cache_alloc.

    We do not create a DMA32 kmalloc cache array, as there are currently no
    users of kmalloc(..., GFP_DMA32). These calls will continue to trigger a
    warning, as we keep GFP_DMA32 in GFP_SLAB_BUG_MASK.

    This implies that calls to kmem_cache_*alloc on a SLAB_CACHE_DMA32
    kmem_cache must _not_ use GFP_DMA32 (it is anyway redundant and
    unnecessary).
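
    Illustrative usage per the above; the cache name and sizes are made up
    for the example:

    struct kmem_cache *l2_cache;
    void *l2_table;

    /* The cache itself carries the DMA32 constraint... */
    l2_cache = kmem_cache_create("io-pgtable-l2", SZ_1K, SZ_1K,
                                 SLAB_CACHE_DMA32, NULL);

    /* ...so allocations must NOT pass GFP_DMA32 (that would trip
     * GFP_SLAB_BUG_MASK); plain GFP_KERNEL is correct here. */
    l2_table = kmem_cache_alloc(l2_cache, GFP_KERNEL);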

    Link: http://lkml.kernel.org/r/20181210011504.122604-2-drinkcat@chromium.org
    Signed-off-by: Nicolas Boichat
    Acked-by: Vlastimil Babka
    Acked-by: Will Deacon
    Cc: Robin Murphy
    Cc: Joerg Roedel
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Sasha Levin
    Cc: Huaisheng Ye
    Cc: Mike Rapoport
    Cc: Yong Wu
    Cc: Matthias Brugger
    Cc: Tomasz Figa
    Cc: Yingjoe Chen
    Cc: Christoph Hellwig
    Cc: Matthew Wilcox
    Cc: Hsin-Yi Wang
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicolas Boichat
     

06 Mar, 2019

2 commits

    Many kernel-doc comments in mm/ have the return value descriptions
    either misformatted or omitted altogether, which makes the kernel-doc
    script unhappy:

    $ make V=1 htmldocs
    ...
    ./mm/util.c:36: info: Scanning doc for kstrdup
    ./mm/util.c:41: warning: No description found for return value of 'kstrdup'
    ./mm/util.c:57: info: Scanning doc for kstrdup_const
    ./mm/util.c:66: warning: No description found for return value of 'kstrdup_const'
    ./mm/util.c:75: info: Scanning doc for kstrndup
    ./mm/util.c:83: warning: No description found for return value of 'kstrndup'
    ...

    Fixing the formatting and adding the missing return value descriptions
    eliminates ~100 such warnings.
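
    For reference, the expected shape is a dedicated "Return:" section,
    e.g. for kstrdup(), one of the functions flagged above (wording
    approximate):

    /**
     * kstrdup - allocate space for and copy an existing string
     * @s: the string to duplicate
     * @gfp: the GFP mask used in the kmalloc() call when allocating memory
     *
     * Return: newly allocated copy of @s or %NULL in case of error
     */
    char *kstrdup(const char *s, gfp_t gfp);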

    Link: http://lkml.kernel.org/r/1549549644-4903-4-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • This is the start of a series of patches similar to my earlier
    DEFINE_MEMCG_MAX_OR_VAL work, but with less Macro Magic(tm).

    There are a bunch of places we go from seq_file to mem_cgroup, which
    currently requires manually getting the css, then getting the mem_cgroup
    from the css. It's in enough places now that having mem_cgroup_from_seq
    makes sense (and also makes the next patch a bit nicer).
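
    The helper is essentially a one-liner built on the existing css
    accessors (a sketch):

    static inline struct mem_cgroup *mem_cgroup_from_seq(struct seq_file *m)
    {
        return mem_cgroup_from_css(seq_css(m));
    }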

    Link: http://lkml.kernel.org/r/20190124194050.GA31341@chrisdown.name
    Signed-off-by: Chris Down
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Down
     

22 Feb, 2019

2 commits

  • kmemleak keeps two global variables, min_addr and max_addr, which store
    the range of valid (encountered by kmemleak) pointer values, which it
    later uses to speed up pointer lookup when scanning blocks.

    With tagged pointers this range will get bigger than it needs to be. This
    patch makes kmemleak untag pointers before saving them to min_addr and
    max_addr and when performing a lookup.
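
    Conceptually the change boils down to resetting the tag before the
    pointer is recorded or looked up, e.g. (sketch; the variable names are
    illustrative):

    unsigned long untagged_ptr;

    untagged_ptr = (unsigned long)kasan_reset_tag((void *)ptr);
    min_addr = min(min_addr, untagged_ptr);
    max_addr = max(max_addr, untagged_ptr + size);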

    Link: http://lkml.kernel.org/r/16e887d442986ab87fe87a755815ad92fa431a5f.1550066133.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Tested-by: Qian Cai
    Acked-by: Catalin Marinas
    Cc: Alexander Potapenko
    Cc: Andrey Ryabinin
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Dmitry Vyukov
    Cc: Evgeniy Stepanov
    Cc: Joonsoo Kim
    Cc: Kostya Serebryany
    Cc: Pekka Enberg
    Cc: Vincenzo Frascino
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     
    Right now we call kmemleak hooks before assigning tags to pointers in
    KASAN hooks. As a result, when an object gets allocated, kmemleak sees
    a differently tagged pointer compared to the one it sees when the
    object gets freed. Fix it by calling KASAN hooks before kmemleak's ones.
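
    In the allocation hook this amounts to swapping the order of the two
    calls, roughly as in the sketch of slab_post_alloc_hook() below (the
    free path is handled analogously):

    for (i = 0; i < size; i++) {
        /* KASAN first, so the pointer kmemleak records already carries
         * its final tag... */
        p[i] = kasan_slab_alloc(s, p[i], flags);
        /* ...and kmemleak sees the same tagged value it will later see
         * on free. */
        kmemleak_alloc_recursive(p[i], s->object_size, 1,
                                 s->flags, flags);
    }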

    Link: http://lkml.kernel.org/r/cd825aa4897b0fc37d3316838993881daccbe9f5.1549921721.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Reported-by: Qian Cai
    Cc: Alexander Potapenko
    Cc: Andrey Ryabinin
    Cc: Catalin Marinas
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Dmitry Vyukov
    Cc: Evgeniy Stepanov
    Cc: Joonsoo Kim
    Cc: Kostya Serebryany
    Cc: Pekka Enberg
    Cc: Vincenzo Frascino
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     

30 Dec, 2018

1 commit

  • Pull documentation update from Jonathan Corbet:
    "A fairly normal cycle for documentation stuff. We have a new document
    on perf security, more Italian translations, more improvements to the
    memory-management docs, improvements to the pathname lookup
    documentation, and the usual array of smaller fixes.

    As is often the case, there are a few reaches outside of
    Documentation/ to adjust kerneldoc comments"

    * tag 'docs-5.0' of git://git.lwn.net/linux: (38 commits)
    docs: improve pathname-lookup document structure
    configfs: fix wrong name of struct in documentation
    docs/mm-api: link slab_common.c to "The Slab Cache" section
    slab: make kmem_cache_create{_usercopy} description proper kernel-doc
    doc:process: add links where missing
    docs/core-api: make mm-api.rst more structured
    x86, boot: documentation whitespace fixup
    Documentation: devres: note checking needs when converting
    doc:it: add some process/* translations
    doc:it: fixes in process/1.Intro
    Documentation: convert path-lookup from markdown to resturctured text
    Documentation/admin-guide: update admin-guide index.rst
    Documentation/admin-guide: introduce perf-security.rst file
    scripts/kernel-doc: Fix struct and struct field attribute processing
    Documentation: dev-tools: Fix typos in index.rst
    Correct gen_init_cpio tool's documentation
    Document /proc/pid PID reuse behavior
    Documentation: update path-lookup.md for parallel lookups
    Documentation: Use "while" instead of "whilst"
    dmaengine: Add mailing list address to the documentation
    ...

    Linus Torvalds
     

29 Dec, 2018

3 commits

  • WARN_ON() already contains an unlikely(), so it's not necessary to use
    unlikely.

    Also change WARN_ON() back to WARN_ON_ONCE() to avoid potentially
    spamming dmesg with user-triggerable large allocations.
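
    The resulting check in kmalloc_slab() is roughly:

    if (WARN_ON_ONCE(size > KMALLOC_MAX_CACHE_SIZE))
        return NULL;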

    [akpm@linux-foundation.org: s/WARN_ON/WARN_ON_ONCE/, per Vlastimil]
    Link: http://lkml.kernel.org/r/20181104125028.3572-1-tiny.windzz@gmail.com
    Signed-off-by: Yangtao Li
    Acked-by: Vlastimil Babka
    Reviewed-by: Andrew Morton
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yangtao Li
     
    The krealloc function checks whether the same buffer was reused or a new
    one allocated by comparing kernel pointers. Tag-based KASAN changes the
    memory tag on the krealloc'ed chunk of memory and therefore also changes
    the pointer tag of the returned pointer. Therefore we need to perform
    the comparison on untagged (with tags reset) pointers to check whether
    it's the same memory region or not.
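
    In krealloc() that means comparing the pointers with their tags reset
    before deciding whether to free the old buffer, roughly:

    /* Same object, possibly retagged in place: don't free it. */
    if (ret && kasan_reset_tag(p) != kasan_reset_tag(ret))
        kfree(p);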

    Link: http://lkml.kernel.org/r/14f6190d7846186a3506cd66d82446646fe65090.1544099024.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Reviewed-by: Andrey Ryabinin
    Reviewed-by: Dmitry Vyukov
    Cc: Christoph Lameter
    Cc: Mark Rutland
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     
  • Patch series "kasan: add software tag-based mode for arm64", v13.

    This patchset adds a new software tag-based mode to KASAN [1]. (Initially
    this mode was called KHWASAN, but it got renamed, see the naming rationale
    at the end of this section).

    The plan is to implement HWASan [2] for the kernel with the incentive
    that it will have performance comparable to KASAN while consuming much
    less memory, trading that off for somewhat imprecise bug detection and
    support limited to arm64.

    The underlying ideas of the approach used by software tag-based KASAN are:

    1. By using the Top Byte Ignore (TBI) arm64 CPU feature, we can store
    pointer tags in the top byte of each kernel pointer.

    2. Using shadow memory, we can store memory tags for each chunk of kernel
    memory.

    3. On each memory allocation, we can generate a random tag, embed it into
    the returned pointer and set the memory tags that correspond to this
    chunk of memory to the same value.

    4. By using compiler instrumentation, before each memory access we can
    add a check that the pointer tag matches the tag of the memory that is
    being accessed (sketched after this list).

    5. On a tag mismatch we report an error.
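
    Put together, the check the compiler emits before a memory access is
    conceptually the following (purely illustrative pseudo-helpers, not a
    real kernel API):

    /* Illustrative only. */
    u8 ptr_tag = (unsigned long)ptr >> 56;    /* top byte, kept by TBI */
    u8 mem_tag = shadow_byte_for(ptr);        /* hypothetical helper */

    if (ptr_tag != mem_tag)
        report_tag_mismatch(ptr);             /* hypothetical reporter */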

    With this patchset the existing KASAN mode gets renamed to generic KASAN,
    with the word "generic" meaning that the implementation can be supported
    by any architecture as it is purely software.

    The new mode this patchset adds is called software tag-based KASAN. The
    word "tag-based" refers to the fact that this mode uses tags embedded into
    the top byte of kernel pointers and the TBI arm64 CPU feature that allows
    to dereference such pointers. The word "software" here means that shadow
    memory manipulation and tag checking on pointer dereference is done in
    software. As it is the only tag-based implementation right now, "software
    tag-based" KASAN is sometimes referred to as simply "tag-based" in this
    patchset.

    A potential expansion of this mode is a hardware tag-based mode, which
    would use hardware memory tagging support (announced by Arm [3]) instead
    of compiler instrumentation and manual shadow memory manipulation.

    Same as generic KASAN, software tag-based KASAN is strictly a debugging
    feature.

    [1] https://www.kernel.org/doc/html/latest/dev-tools/kasan.html

    [2] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html

    [3] https://community.arm.com/processors/b/blog/posts/arm-a-profile-architecture-2018-developments-armv85a

    ====== Rationale

    On mobile devices generic KASAN's memory usage is a significant problem.
    One of the main reasons to have tag-based KASAN is to be able to perform a
    similar set of checks as the generic one does, but with lower memory
    requirements.

    Comment from Vishwath Mohan :

    I don't have data on-hand, but anecdotally both ASAN and KASAN have proven
    problematic to enable for environments that don't tolerate the increased
    memory pressure well. This includes

    (a) Low-memory form factors - Wear, TV, Things, lower-tier phones like Go,
    (c) Connected components like Pixel's visual core [1].

    These are both places I'd love to have a low(er) memory footprint option at
    my disposal.

    Comment from Evgenii Stepanov :

    Looking at a live Android device under load, slab (according to
    /proc/meminfo) + kernel stack take 8-10% available RAM (~350MB). KASAN's
    overhead of 2x - 3x on top of it is not insignificant.

    Not having this overhead enables near-production use - ex. running
    KASAN/KHWASAN kernel on a personal, daily-use device to catch bugs that do
    not reproduce in test configuration. These are the ones that often cost
    the most engineering time to track down.

    CPU overhead is bad, but generally tolerable. RAM is critical, in our
    experience. Once it gets low enough, OOM-killer makes your life
    miserable.

    [1] https://www.blog.google/products/pixel/pixel-visual-core-image-processing-and-machine-learning-pixel-2/

    ====== Technical details

    Software tag-based KASAN mode is implemented in a very similar way to the
    generic one. This patchset essentially does the following:

    1. TCR_TBI1 is set to enable Top Byte Ignore.

    2. Shadow memory is used (with a different scale, 1:16, so each shadow
    byte corresponds to 16 bytes of kernel memory) to store memory tags.

    3. All slab objects are aligned to shadow scale, which is 16 bytes.

    4. All pointers returned from the slab allocator are tagged with a random
    tag and the corresponding shadow memory is poisoned with the same value.

    5. Compiler instrumentation is used to insert tag checks. Either by
    calling callbacks or by inlining them (CONFIG_KASAN_OUTLINE and
    CONFIG_KASAN_INLINE flags are reused).

    6. When a tag mismatch is detected in callback instrumentation mode
    KASAN simply prints a bug report. In case of inline instrumentation,
    clang inserts a brk instruction, and KASAN has its own brk handler,
    which reports the bug.

    7. The memory in between slab objects is marked with a reserved tag, and
    acts as a redzone.

    8. When a slab object is freed it's marked with a reserved tag.

    Bug detection is imprecise for two reasons:

    1. We won't catch some small out-of-bounds accesses that fall into the
    same shadow cell as the last byte of a slab object.

    2. We only have 1 byte to store tags, which means we have a 1/256
    probability of a tag match for an incorrect access (actually even
    slightly less due to reserved tag values).

    Despite that there's a particular type of bugs that tag-based KASAN can
    detect compared to generic KASAN: use-after-free after the object has been
    allocated by someone else.

    ====== Testing

    Some kernel developers voiced a concern that changing the top byte of
    kernel pointers may lead to subtle bugs that are difficult to discover.
    To address this concern deliberate testing has been performed.

    It doesn't seem feasible to do some kind of static checking to find
    potential issues with pointer tagging, so a dynamic approach was taken.
    All pointer comparisons/subtractions were instrumented with an LLVM
    compiler pass, and a kernel module was used that prints a bug report
    whenever two pointers with different tags are compared or subtracted
    (ignoring comparisons with NULL pointers and with pointers obtained by
    casting an error code to a pointer type). The kernel was then booted in
    QEMU and on an Odroid C2 board, and syzkaller was run.

    This yielded the following results.

    The two places that look interesting are:

    is_vmalloc_addr in include/linux/mm.h
    is_kernel_rodata in mm/util.c

    Here we compare a pointer with some fixed untagged values to make sure
    that the pointer lies in a particular part of the kernel address space.
    Since tag-based KASAN doesn't add tags to pointers that belong to rodata
    or vmalloc regions, this should work as is. To make sure, debug checks
    have been added to those two functions to verify that the result doesn't
    change whether we operate on tagged or untagged pointers.
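
    Purely for illustration, a check in this spirit compares an untagged
    pointer value against fixed untagged bounds; the bound values and the
    function name below are made up:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define VMALLOC_START 0xffff000010000000ULL     /* illustrative bound */
    #define VMALLOC_END   0xffff000020000000ULL     /* illustrative bound */

    static bool is_vmalloc_addr_sketch(uint64_t addr)
    {
            /* works as is because vmalloc pointers carry no tag */
            return addr >= VMALLOC_START && addr < VMALLOC_END;
    }

    int main(void)
    {
            printf("%d\n", is_vmalloc_addr_sketch(0xffff000010001000ULL));  /* 1 */
            printf("%d\n", is_vmalloc_addr_sketch(0xffff0000f0000000ULL));  /* 0 */
            return 0;
    }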

    A few other cases that don't look that interesting:

    Comparing pointers to achieve a unique sorting order of pointee objects
    (e.g. sorting lock addresses before performing a double lock):

    tty_ldisc_lock_pair_timeout in drivers/tty/tty_ldisc.c
    pipe_double_lock in fs/pipe.c
    unix_state_double_lock in net/unix/af_unix.c
    lock_two_nondirectories in fs/inode.c
    mutex_lock_double in kernel/events/core.c

    ep_cmp_ffd in fs/eventpoll.c
    fsnotify_compare_groups fs/notify/mark.c

    Nothing needs to be done here, since the tags embedded into pointers
    don't change, so the sorting order would still be unique.
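
    A hedged sketch of that double-lock pattern, with pthread mutexes
    standing in for kernel locks and a made-up helper name; ordering by
    address stays stable because each pointer keeps its tag:

    #include <pthread.h>
    #include <stdint.h>

    static void lock_pair_sketch(pthread_mutex_t *a, pthread_mutex_t *b)
    {
            if ((uintptr_t)a > (uintptr_t)b) {      /* order by pointer value */
                    pthread_mutex_t *tmp = a;

                    a = b;
                    b = tmp;
            }
            pthread_mutex_lock(a);                  /* lower address first */
            if (a != b)
                    pthread_mutex_lock(b);
    }

    int main(void)
    {
            pthread_mutex_t m1 = PTHREAD_MUTEX_INITIALIZER;
            pthread_mutex_t m2 = PTHREAD_MUTEX_INITIALIZER;

            lock_pair_sketch(&m1, &m2);     /* same lock order for (&m2, &m1) */
            pthread_mutex_unlock(&m2);
            pthread_mutex_unlock(&m1);
            return 0;
    }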

    Checks that a pointer belongs to some particular allocation:

    is_sibling_entry in lib/radix-tree.c
    object_is_on_stack in include/linux/sched/task_stack.h

    Nothing needs to be done here either, since two pointers can only belong
    to the same allocation if they have the same tag.

    Overall, since the kernel boots and works, there are no critical bugs.
    As for the rest, the traditional kernel testing way (use it until it
    fails) is the only one that looks feasible.

    Another point here is that tag-based KASAN is available under a separate
    config option that needs to be deliberately enabled. Even though it might
    be used in a "near-production" environment to find bugs that are not found
    during fuzzing or running tests, it is still a debug tool.

    ====== Benchmarks

    The following numbers were collected on an Odroid C2 board. Both generic and
    tag-based KASAN were used in inline instrumentation mode.

    Boot time [1]:
    * ~1.7 sec for clean kernel
    * ~5.0 sec for generic KASAN
    * ~5.0 sec for tag-based KASAN

    Network performance [2]:
    * 8.33 Gbits/sec for clean kernel
    * 3.17 Gbits/sec for generic KASAN
    * 2.85 Gbits/sec for tag-based KASAN

    Slab memory usage after boot [3]:
    * ~40 kb for clean kernel
    * ~105 kb (~260% overhead) for generic KASAN
    * ~47 kb (~20% overhead) for tag-based KASAN

    KASAN memory overhead consists of three main parts:
    1. Increased slab memory usage due to redzones.
    2. Shadow memory (all of it is reserved once during boot).
    3. Quarantine (grows gradually up to some preset limit; the larger the
    limit, the better the chance of detecting a use-after-free).

    Comparing tag-based vs generic KASAN for each of these points:
    1. 20% vs 260% overhead.
    2. 1/16th vs 1/8th of physical memory.
    3. Tag-based KASAN doesn't require quarantine.

    [1] Time before the ext4 driver is initialized.
    [2] Measured as `iperf -s & iperf -c 127.0.0.1 -t 30`.
    [3] Measured as `cat /proc/meminfo | grep Slab`.

    ====== Some notes

    A few notes:

    1. The patchset can be found here:
    https://github.com/xairy/kasan-prototype/tree/khwasan

    2. Building requires a recent Clang version (7.0.0 or later).

    3. Stack instrumentation is not supported yet and will be added later.

    This patch (of 25):

    Tag-based KASAN changes the value of the top byte of pointers returned
    from the kernel allocation functions (such as kmalloc). This patch
    updates KASAN hooks signatures and their usage in SLAB and SLUB code to
    reflect that.

    Link: http://lkml.kernel.org/r/aec2b5e3973781ff8a6bb6760f8543643202c451.1544099024.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Reviewed-by: Andrey Ryabinin
    Reviewed-by: Dmitry Vyukov
    Cc: Christoph Lameter
    Cc: Mark Rutland
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     

20 Dec, 2018

1 commit


28 Nov, 2018

1 commit

  • Now that synchronize_rcu() waits for preempt-disable regions of code
    as well as RCU read-side critical sections, synchronize_sched() can be
    replaced by synchronize_rcu(). This commit therefore makes this change.

    Signed-off-by: Paul E. McKenney
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrew Morton
    Cc:

    Paul E. McKenney
     

27 Oct, 2018

4 commits

  • Kmalloc cache names can get quite long for large object sizes, when the
    sizes are expressed in bytes. Use 'k' and 'M' suffixes to make the names
    as short as possible, e.g. in /proc/slabinfo. This works, as we mostly
    use power-of-two sizes, with exceptions only below 1k.

    Example: 'kmalloc-4194304' becomes 'kmalloc-4M'
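
    For illustration, a sketch of that naming rule under the stated
    assumption that sizes of 1k and above are powers of two;
    kmalloc_size_name() below is a made-up helper, not the kernel's code:

    #include <stdio.h>

    static void kmalloc_size_name(unsigned int size, char *buf, size_t len)
    {
            if (size >= 1024 * 1024)
                    snprintf(buf, len, "kmalloc-%uM", size / (1024 * 1024));
            else if (size >= 1024)
                    snprintf(buf, len, "kmalloc-%uk", size / 1024);
            else
                    snprintf(buf, len, "kmalloc-%u", size);
    }

    int main(void)
    {
            char name[32];

            kmalloc_size_name(4194304, name, sizeof(name));
            printf("%s\n", name);           /* prints "kmalloc-4M" */
            kmalloc_size_name(96, name, sizeof(name));
            printf("%s\n", name);           /* prints "kmalloc-96" */
            return 0;
    }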

    Link: http://lkml.kernel.org/r/20180731090649.16028-7-vbabka@suse.cz
    Suggested-by: Matthew Wilcox
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Christoph Lameter
    Acked-by: Roman Gushchin
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Laura Abbott
    Cc: Michal Hocko
    Cc: Sumit Semwal
    Cc: Vijayanand Jitta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Kmem caches can be created with a SLAB_RECLAIM_ACCOUNT flag, which
    indicates they contain objects which can be reclaimed under memory
    pressure (typically through a shrinker). This makes the slab pages
    accounted as NR_SLAB_RECLAIMABLE in vmstat, which is also reflected in the
    MemAvailable meminfo counter and in overcommit decisions. The slab pages
    are also allocated with __GFP_RECLAIMABLE, which is good for
    anti-fragmentation through grouping pages by mobility.

    The generic kmalloc-X caches are created without this flag, but are
    sometimes also used for objects that can be reclaimed, which due to their
    varying size cannot have a dedicated kmem cache with the
    SLAB_RECLAIM_ACCOUNT flag. A
    prominent example are dcache external names, which prompted the creation
    of a new, manually managed vmstat counter NR_INDIRECTLY_RECLAIMABLE_BYTES
    in commit f1782c9bc547 ("dcache: account external names as indirectly
    reclaimable memory").

    To better handle this and any other similar cases, this patch introduces
    SLAB_RECLAIM_ACCOUNT variants of kmalloc caches, named kmalloc-rcl-X.
    They are used whenever the kmalloc() call passes __GFP_RECLAIMABLE among
    gfp flags. They are added to the kmalloc_caches array as a new type.
    Allocations with both __GFP_DMA and __GFP_RECLAIMABLE will use a dma type
    cache.
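
    A hedged sketch of that selection, with stand-in flag bits (the real
    __GFP_* values differ) and an enum mirroring the cache types described
    above:

    #include <stdio.h>

    enum kmalloc_cache_type { KMALLOC_NORMAL, KMALLOC_RECLAIM, KMALLOC_DMA };

    #define GFP_DMA_BIT          0x01u      /* stand-in for __GFP_DMA */
    #define GFP_RECLAIMABLE_BIT  0x10u      /* stand-in for __GFP_RECLAIMABLE */

    static enum kmalloc_cache_type kmalloc_type_sketch(unsigned int flags)
    {
            if (flags & GFP_DMA_BIT)        /* dma wins even if both bits are set */
                    return KMALLOC_DMA;
            if (flags & GFP_RECLAIMABLE_BIT)
                    return KMALLOC_RECLAIM;
            return KMALLOC_NORMAL;
    }

    int main(void)
    {
            printf("%d\n", kmalloc_type_sketch(GFP_RECLAIMABLE_BIT));               /* 1 */
            printf("%d\n", kmalloc_type_sketch(GFP_DMA_BIT | GFP_RECLAIMABLE_BIT)); /* 2 */
            return 0;
    }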

    This change only applies to SLAB and SLUB, not SLOB. This is fine, since
    SLOB targets tiny systems, and this patch does add some overhead in the
    form of kmem management objects.

    Link: http://lkml.kernel.org/r/20180731090649.16028-3-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Christoph Lameter
    Acked-by: Roman Gushchin
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Laura Abbott
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Sumit Semwal
    Cc: Vijayanand Jitta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Patch series "kmalloc-reclaimable caches", v4.

    As discussed at LSF/MM [1] here's a patchset that introduces
    kmalloc-reclaimable caches (more details in the second patch) and uses
    them for dcache external names. That allows us to repurpose the
    NR_INDIRECTLY_RECLAIMABLE_BYTES counter later in the series.

    With patch 3/6, dcache external names are allocated from kmalloc-rcl-*
    caches, eliminating the need for manual accounting. More importantly, it
    also ensures the reclaimable kmalloc allocations are grouped in pages
    separate from the regular kmalloc allocations. The need for proper
    accounting of dcache external names has shown it's easy for a misbehaving
    process to allocate lots of them, causing premature OOMs. Without the
    added grouping, it's likely that a similar workload can interleave the
    dcache external name allocations with regular kmalloc allocations (note:
    I haven't searched myself for an example of such a regular kmalloc
    allocation, but I would be very surprised if there wasn't one). A
    pathological case would be e.g. one 64-byte regular allocation with 63
    external dcache names in a page (64x64=4096), which means the page is not
    freed even after all the dcache names are reclaimed, and the process can
    thus "steal" the whole page with a single 64-byte allocation.

    If other kmalloc users similar to dcache external names become identified,
    they can also benefit from the new functionality simply by adding
    __GFP_RECLAIMABLE to the kmalloc calls.

    Side benefits of the patchset (which could also be merged separately)
    include a removed branch for detecting __GFP_DMA kmalloc() and shortened
    kmalloc cache names in /proc/slabinfo output. The latter is potentially
    an ABI break in case there are tools parsing the names and expecting the
    values to be in bytes.

    This is how /proc/slabinfo looks after booting in virtme:

    ...
    kmalloc-rcl-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
    ...
    kmalloc-rcl-96 7 32 128 32 1 : tunables 120 60 8 : slabdata 1 1 0
    kmalloc-rcl-64 25 128 64 64 1 : tunables 120 60 8 : slabdata 2 2 0
    kmalloc-rcl-32 0 0 32 124 1 : tunables 120 60 8 : slabdata 0 0 0
    kmalloc-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
    kmalloc-2M 0 0 2097152 1 512 : tunables 1 1 0 : slabdata 0 0 0
    kmalloc-1M 0 0 1048576 1 256 : tunables 1 1 0 : slabdata 0 0 0
    ...

    /proc/vmstat with renamed nr_indirectly_reclaimable_bytes counter:

    ...
    nr_slab_reclaimable 2817
    nr_slab_unreclaimable 1781
    ...
    nr_kernel_misc_reclaimable 0
    ...

    /proc/meminfo with new KReclaimable counter:

    ...
    Shmem: 564 kB
    KReclaimable: 11260 kB
    Slab: 18368 kB
    SReclaimable: 11260 kB
    SUnreclaim: 7108 kB
    KernelStack: 1248 kB
    ...

    This patch (of 6):

    The kmalloc caches currently maintain a separate (optional) array,
    kmalloc_dma_caches, for __GFP_DMA allocations, and there are tests for
    __GFP_DMA in the allocation hotpaths. We can avoid these branches by
    combining kmalloc_caches and kmalloc_dma_caches into a single
    two-dimensional array where the outer dimension is the cache "type". This
    will also allow adding kmalloc-reclaimable caches as a third type.
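
    A rough sketch of the resulting shape, with made-up type and size-slot
    constants; the point is just that a single lookup indexes by type and
    then by size, with no separate dma-array branch:

    #include <stdio.h>

    struct kmem_cache_sketch { const char *name; };

    enum cache_type { TYPE_NORMAL, TYPE_DMA, NR_TYPES };    /* stand-in types */

    #define SIZE_SLOTS 14                                   /* illustrative */

    /* one two-dimensional array instead of two parallel one-dimensional ones */
    static struct kmem_cache_sketch *caches[NR_TYPES][SIZE_SLOTS];

    static struct kmem_cache_sketch *lookup(enum cache_type type, int index)
    {
            return caches[type][index];
    }

    int main(void)
    {
            static struct kmem_cache_sketch k8 = { "kmalloc-8" };

            caches[TYPE_NORMAL][3] = &k8;
            printf("%s\n", lookup(TYPE_NORMAL, 3)->name);
            return 0;
    }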

    Link: http://lkml.kernel.org/r/20180731090649.16028-2-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Christoph Lameter
    Acked-by: Roman Gushchin
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Matthew Wilcox
    Cc: Laura Abbott
    Cc: Sumit Semwal
    Cc: Vijayanand Jitta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
    Slub does not call kmalloc_slab() for sizes > KMALLOC_MAX_CACHE_SIZE;
    instead, it falls back to kmalloc_large().

    For slab, KMALLOC_MAX_CACHE_SIZE == KMALLOC_MAX_SIZE, and it calls
    kmalloc_slab() for all allocations, relying on a NULL return value for
    over-sized allocations.

    This inconsistency leads to unwanted warnings from kmalloc_slab() for
    over-sized allocations for slab. Returning NULL for failed allocations is
    the expected behavior.

    Make slub and slab code consistent by checking size >
    KMALLOC_MAX_CACHE_SIZE in slab before calling kmalloc_slab().

    While we are here, also fix the check in kmalloc_slab(). We should check
    against KMALLOC_MAX_CACHE_SIZE rather than KMALLOC_MAX_SIZE. It all kinda
    worked because for slab the constants are the same, and slub always checks
    the size against KMALLOC_MAX_CACHE_SIZE before kmalloc_slab(). But if we
    get there with size > KMALLOC_MAX_CACHE_SIZE anyhow, bad things will
    happen. For example, in case of a newly introduced bug in slub code.

    Also move the check in kmalloc_slab() from function entry to the size >
    192 case. This partially compensates for the additional check in slab
    code and makes slub code a bit faster (at least theoretically).

    Also drop __GFP_NOWARN in the warning check. This warning means a bug in
    slab code itself; user-passed flags have nothing to do with it.

    Nothing of this affects slob.
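
    A minimal sketch of the resulting pattern, with plain malloc() standing
    in for both paths and a made-up constant; the point is only that the
    size check happens before the cache lookup rather than relying on a NULL
    return:

    #include <stdlib.h>

    #define KMALLOC_MAX_CACHE_SIZE_SKETCH 8192      /* stand-in constant */

    static void *cache_lookup_alloc(size_t size)
    {
            return malloc(size);    /* would map size to a kmalloc-N cache */
    }

    static void *large_alloc(size_t size)
    {
            return malloc(size);    /* would allocate whole pages instead */
    }

    static void *kmalloc_sketch(size_t size)
    {
            if (size > KMALLOC_MAX_CACHE_SIZE_SKETCH)       /* check comes first */
                    return large_alloc(size);
            return cache_lookup_alloc(size);
    }

    int main(void)
    {
            free(kmalloc_sketch(64));       /* cache path */
            free(kmalloc_sketch(1 << 20));  /* large-object path */
            return 0;
    }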

    Link: http://lkml.kernel.org/r/20180927171502.226522-1-dvyukov@gmail.com
    Signed-off-by: Dmitry Vyukov
    Reported-by: syzbot+87829a10073277282ad1@syzkaller.appspotmail.com
    Reported-by: syzbot+ef4e8fc3a06e9019bb40@syzkaller.appspotmail.com
    Reported-by: syzbot+6e438f4036df52cbb863@syzkaller.appspotmail.com
    Reported-by: syzbot+8574471d8734457d98aa@syzkaller.appspotmail.com
    Reported-by: syzbot+af1504df0807a083dbd9@syzkaller.appspotmail.com
    Acked-by: Christoph Lameter
    Acked-by: Vlastimil Babka
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
     

18 Aug, 2018

1 commit

    Introduce a new config option to replace the repeating
    CONFIG_MEMCG && !CONFIG_SLOB pattern. The next patches add a little more
    memcg+kmem related code, so let's keep the defines clearer.

    Link: http://lkml.kernel.org/r/153063053670.1818.15013136946600481138.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     

29 Jun, 2018

1 commit

  • In kernel 4.17 I removed some code from dm-bufio that did slab cache
    merging (commit 21bb13276768: "dm bufio: remove code that merges slab
    caches") - both slab and slub support merging caches with identical
    attributes, so dm-bufio now just calls kmem_cache_create and relies on
    implicit merging.

    This uncovered a bug in the slub subsystem - if we delete a cache and
    immediately create another cache with the same attributes, it fails
    because of a duplicate filename in /sys/kernel/slab/. The slub subsystem
    offloads freeing the cache to a workqueue - and if we create the new
    cache before the workqueue runs, it complains because of a duplicate
    filename in sysfs.

    This patch fixes the bug by moving the call of kobject_del from
    sysfs_slab_remove_workfn to shutdown_cache. kobject_del must be called
    while we hold slab_mutex - so that the sysfs entry is deleted before a
    cache with the same attributes can be created.

    Running device-mapper-test-suite with:

    dmtest run --suite thin-provisioning -n /commit_failure_causes_fallback/

    triggered:

    Buffer I/O error on dev dm-0, logical block 1572848, async page read
    device-mapper: thin: 253:1: metadata operation 'dm_pool_alloc_data_block' failed: error = -5
    device-mapper: thin: 253:1: aborting current metadata transaction
    sysfs: cannot create duplicate filename '/kernel/slab/:a-0000144'
    CPU: 2 PID: 1037 Comm: kworker/u48:1 Not tainted 4.17.0.snitm+ #25
    Hardware name: Supermicro SYS-1029P-WTR/X11DDW-L, BIOS 2.0a 12/06/2017
    Workqueue: dm-thin do_worker [dm_thin_pool]
    Call Trace:
    dump_stack+0x5a/0x73
    sysfs_warn_dup+0x58/0x70
    sysfs_create_dir_ns+0x77/0x80
    kobject_add_internal+0xba/0x2e0
    kobject_init_and_add+0x70/0xb0
    sysfs_slab_add+0xb1/0x250
    __kmem_cache_create+0x116/0x150
    create_cache+0xd9/0x1f0
    kmem_cache_create_usercopy+0x1c1/0x250
    kmem_cache_create+0x18/0x20
    dm_bufio_client_create+0x1ae/0x410 [dm_bufio]
    dm_block_manager_create+0x5e/0x90 [dm_persistent_data]
    __create_persistent_data_objects+0x38/0x940 [dm_thin_pool]
    dm_pool_abort_metadata+0x64/0x90 [dm_thin_pool]
    metadata_operation_failed+0x59/0x100 [dm_thin_pool]
    alloc_data_block.isra.53+0x86/0x180 [dm_thin_pool]
    process_cell+0x2a3/0x550 [dm_thin_pool]
    do_worker+0x28d/0x8f0 [dm_thin_pool]
    process_one_work+0x171/0x370
    worker_thread+0x49/0x3f0
    kthread+0xf8/0x130
    ret_from_fork+0x35/0x40
    kobject_add_internal failed for :a-0000144 with -EEXIST, don't try to register things with the same name in the same directory.
    kmem_cache_create(dm_bufio_buffer-16) failed with error -17

    Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1806151817130.6333@file01.intranet.prod.int.rdu2.redhat.com
    Signed-off-by: Mikulas Patocka
    Reported-by: Mike Snitzer
    Tested-by: Mike Snitzer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
     

15 Jun, 2018

2 commits

  • mm/*.c files use symbolic and octal styles for permissions.

    Using octal and not symbolic permissions is preferred by many as more
    readable.

    https://lkml.org/lkml/2016/8/2/1945

    Prefer the direct use of octal for permissions.

    Done using
    $ scripts/checkpatch.pl -f --types=SYMBOLIC_PERMS --fix-inplace mm/*.c
    and some typing.

    Before: $ git grep -P -w "0[0-7]{3,3}" mm | wc -l
    44
    After: $ git grep -P -w "0[0-7]{3,3}" mm | wc -l
    86

    Miscellanea:

    o Whitespace neatening around these conversions.

    Link: http://lkml.kernel.org/r/2e032ef111eebcd4c5952bae86763b541d373469.1522102887.git.joe@perches.com
    Signed-off-by: Joe Perches
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • The memcg kmem cache creation and deactivation (SLUB only) is
    asynchronous. If a root kmem cache is destroyed whose memcg cache is in
    the process of creation or deactivation, the kernel may crash.

    Example of one such crash:
    general protection fault: 0000 [#1] SMP PTI
    CPU: 1 PID: 1721 Comm: kworker/14:1 Not tainted 4.17.0-smp
    ...
    Workqueue: memcg_kmem_cache kmemcg_deactivate_workfn
    RIP: 0010:has_cpu_slab
    ...
    Call Trace:
    ? on_each_cpu_cond
    __kmem_cache_shrink
    kmemcg_cache_deact_after_rcu
    kmemcg_deactivate_workfn
    process_one_work
    worker_thread
    kthread
    ret_from_fork+0x35/0x40

    To fix this race, on root kmem cache destruction, mark the cache as
    dying and flush the workqueue used for memcg kmem cache creation and
    deactivation. SLUB's memcg kmem cache deactivation also involves an RCU
    callback, so also make sure that all previously registered RCU callbacks
    have completed.

    [shakeelb@google.com: handle the RCU callbacks for SLUB deactivation]
    Link: http://lkml.kernel.org/r/20180611192951.195727-1-shakeelb@google.com
    [shakeelb@google.com: add more documentation, rename fields for readability]
    Link: http://lkml.kernel.org/r/20180522201336.196994-1-shakeelb@google.com
    [akpm@linux-foundation.org: fix build, per Shakeel]
    [shakeelb@google.com: v3. Instead of refcount, flush the workqueue]
    Link: http://lkml.kernel.org/r/20180530001204.183758-1-shakeelb@google.com
    Link: http://lkml.kernel.org/r/20180521174116.171846-1-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     

06 Apr, 2018

2 commits

  • should_failslab() is a convenient function to hook into for directed
    error injection into kmalloc(). However, it is only available if a
    config flag is set.

    The following BCC script, for example, fails kmalloc() calls after a
    btrfs umount:

    from bcc import BPF

    prog = r"""
    BPF_HASH(flag);

    #include

    int kprobe__btrfs_close_devices(void *ctx) {
            u64 key = 1;
            flag.update(&key, &key);
            return 0;
    }

    int kprobe__should_failslab(struct pt_regs *ctx) {
            u64 key = 1;
            u64 *res;
            res = flag.lookup(&key);
            if (res != 0) {
                    bpf_override_return(ctx, -ENOMEM);
            }
            return 0;
    }
    """
    b = BPF(text=prog)

    while 1:
            b.kprobe_poll()


    This patch refactors the should_failslab implementation so that the
    function is always available for error injection, independent of flags.

    This change would be similar in nature to commit f5490d3ec921 ("block:
    Add should_fail_bio() for bpf error injection").

    Link: http://lkml.kernel.org/r/20180222020320.6944-1-hmclauchlan@fb.com
    Signed-off-by: Howard McLauchlan
    Reviewed-by: Andrew Morton
    Cc: Akinobu Mita
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Josef Bacik
    Cc: Johannes Weiner
    Cc: Alexei Starovoitov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Howard McLauchlan
     
  • Since commit db265eca7700 ("mm/sl[aou]b: Move duping of slab name to
    slab_common.c"), the kernel always duplicates the slab cache name when
    creating a slab cache, so the test of whether the slab name is accessible
    is useless.

    Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1803231133310.22626@file01.intranet.prod.int.rdu2.redhat.com
    Signed-off-by: Mikulas Patocka
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mikulas Patocka