06 Apr, 2019

1 commit

  • [ Upstream commit 92d1d07daad65c300c7d0b68bbef8867e9895d54 ]

    Kmemleak throws endless warnings during boot due to the following code
    in __alloc_alien_cache(),

    alc = kmalloc_node(memsize, gfp, node);
    init_arraycache(&alc->ac, entries, batch);
    kmemleak_no_scan(ac);

    Kmemleak does not track the array cache (alc->ac) but the alien cache
    (alc) instead, so let it track the latter by lifting kmemleak_no_scan()
    out of init_arraycache().
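
    A minimal sketch of the fixed allocation path, reconstructed from the
    description above (the spin_lock_init() call mirrors the surrounding
    upstream code):

    alc = kmalloc_node(memsize, gfp, node);
    if (alc) {
            kmemleak_no_scan(alc);  /* track the alien cache itself, not alc->ac */
            init_arraycache(&alc->ac, entries, batch);
            spin_lock_init(&alc->lock);
    }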

    There is another place that calls init_arraycache(), but
    alloc_kmem_cache_cpus() uses a percpu allocation, which will never be
    considered a leak.

    kmemleak: Found object by alias at 0xffff8007b9aa7e38
    CPU: 190 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc2+ #2
    Call trace:
    dump_backtrace+0x0/0x168
    show_stack+0x24/0x30
    dump_stack+0x88/0xb0
    lookup_object+0x84/0xac
    find_and_get_object+0x84/0xe4
    kmemleak_no_scan+0x74/0xf4
    setup_kmem_cache_node+0x2b4/0x35c
    __do_tune_cpucache+0x250/0x2d4
    do_tune_cpucache+0x4c/0xe4
    enable_cpucache+0xc8/0x110
    setup_cpu_cache+0x40/0x1b8
    __kmem_cache_create+0x240/0x358
    create_cache+0xc0/0x198
    kmem_cache_create_usercopy+0x158/0x20c
    kmem_cache_create+0x50/0x64
    fsnotify_init+0x58/0x6c
    do_one_initcall+0x194/0x388
    kernel_init_freeable+0x668/0x688
    kernel_init+0x18/0x124
    ret_from_fork+0x10/0x18
    kmemleak: Object 0xffff8007b9aa7e00 (size 256):
    kmemleak: comm "swapper/0", pid 1, jiffies 4294697137
    kmemleak: min_count = 1
    kmemleak: count = 0
    kmemleak: flags = 0x1
    kmemleak: checksum = 0
    kmemleak: backtrace:
    kmemleak_alloc+0x84/0xb8
    kmem_cache_alloc_node_trace+0x31c/0x3a0
    __kmalloc_node+0x58/0x78
    setup_kmem_cache_node+0x26c/0x35c
    __do_tune_cpucache+0x250/0x2d4
    do_tune_cpucache+0x4c/0xe4
    enable_cpucache+0xc8/0x110
    setup_cpu_cache+0x40/0x1b8
    __kmem_cache_create+0x240/0x358
    create_cache+0xc0/0x198
    kmem_cache_create_usercopy+0x158/0x20c
    kmem_cache_create+0x50/0x64
    fsnotify_init+0x58/0x6c
    do_one_initcall+0x194/0x388
    kernel_init_freeable+0x668/0x688
    kernel_init+0x18/0x124
    kmemleak: Not scanning unknown object at 0xffff8007b9aa7e38
    CPU: 190 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc2+ #2
    Call trace:
    dump_backtrace+0x0/0x168
    show_stack+0x24/0x30
    dump_stack+0x88/0xb0
    kmemleak_no_scan+0x90/0xf4
    setup_kmem_cache_node+0x2b4/0x35c
    __do_tune_cpucache+0x250/0x2d4
    do_tune_cpucache+0x4c/0xe4
    enable_cpucache+0xc8/0x110
    setup_cpu_cache+0x40/0x1b8
    __kmem_cache_create+0x240/0x358
    create_cache+0xc0/0x198
    kmem_cache_create_usercopy+0x158/0x20c
    kmem_cache_create+0x50/0x64
    fsnotify_init+0x58/0x6c
    do_one_initcall+0x194/0x388
    kernel_init_freeable+0x668/0x688
    kernel_init+0x18/0x124
    ret_from_fork+0x10/0x18

    Link: http://lkml.kernel.org/r/20190129184518.39808-1-cai@lca.pw
    Fixes: 1fe00d50a9e8 ("slab: factor out initialization of array cache")
    Signed-off-by: Qian Cai
    Reviewed-by: Andrew Morton
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Qian Cai
     

03 Apr, 2019

1 commit

  • commit 6d6ea1e967a246f12cfe2f5fb743b70b2e608d4a upstream.

    Patch series "iommu/io-pgtable-arm-v7s: Use DMA32 zone for page tables",
    v6.

    This is a followup to the discussion in [1], [2].

    IOMMUs using ARMv7 short-descriptor format require page tables (level 1
    and 2) to be allocated within the first 4GB of RAM, even on 64-bit
    systems.

    For L1 tables that are bigger than a page, we can just use
    __get_free_pages with GFP_DMA32 (on arm64 systems only, arm would still
    use GFP_DMA).

    For L2 tables that only take 1KB, it would be a waste to allocate a full
    page, so we considered 3 approaches:
    1. This series, adding support for GFP_DMA32 slab caches.
    2. genalloc, which requires pre-allocating the maximum number of L2 page
    tables (4096, so 4MB of memory).
    3. page_frag, which is not very memory-efficient as it is unable to reuse
    freed fragments until the whole page is freed. [3]

    This series is the most memory-efficient approach.

    stable@ note:
    We confirmed that this is a regression, and IOMMU errors happen on 4.19
    and linux-next/master on MT8173 (elm, Acer Chromebook R13). The issue
    most likely starts from commit ad67f5a6545f ("arm64: replace ZONE_DMA
    with ZONE_DMA32"), i.e. 4.15, and presumably breaks a number of Mediatek
    platforms (and maybe others?).

    [1] https://lists.linuxfoundation.org/pipermail/iommu/2018-November/030876.html
    [2] https://lists.linuxfoundation.org/pipermail/iommu/2018-December/031696.html
    [3] https://patchwork.codeaurora.org/patch/671639/

    This patch (of 3):

    IOMMUs using ARMv7 short-descriptor format require page tables to be
    allocated within the first 4GB of RAM, even on 64-bit systems. On arm64,
    this is done by passing GFP_DMA32 flag to memory allocation functions.

    For IOMMU L2 tables that only take 1KB, it would be a waste to allocate
    a full page using get_free_pages, so we considered 3 approaches:
    1. This patch, adding support for GFP_DMA32 slab caches.
    2. genalloc, which requires pre-allocating the maximum number of L2
    page tables (4096, so 4MB of memory).
    3. page_frag, which is not very memory-efficient as it is unable
    to reuse freed fragments until the whole page is freed.

    This change makes it possible to create a custom cache in DMA32 zone using
    kmem_cache_create, then allocate memory using kmem_cache_alloc.

    We do not create a DMA32 kmalloc cache array, as there are currently no
    users of kmalloc(..., GFP_DMA32). These calls will continue to trigger a
    warning, as we keep GFP_DMA32 in GFP_SLAB_BUG_MASK.

    This implies that calls to kmem_cache_*alloc on a SLAB_CACHE_DMA32
    kmem_cache must _not_ use GFP_DMA32 (it is anyway redundant and
    unnecessary).
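
    For illustration, creating and allocating from such a cache might look
    like this (a sketch; the cache name, sizes, and variable names are made
    up for the example):

    struct kmem_cache *l2_table_cache;

    l2_table_cache = kmem_cache_create("io-pgtable-armv7s-l2",
                                       SZ_1K, SZ_1K, /* 1KB L2 tables */
                                       SLAB_CACHE_DMA32, NULL);

    /* Objects come from ZONE_DMA32; GFP_DMA32 must not be passed here. */
    void *table = kmem_cache_zalloc(l2_table_cache, GFP_KERNEL);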

    Link: http://lkml.kernel.org/r/20181210011504.122604-2-drinkcat@chromium.org
    Signed-off-by: Nicolas Boichat
    Acked-by: Vlastimil Babka
    Acked-by: Will Deacon
    Cc: Robin Murphy
    Cc: Joerg Roedel
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Sasha Levin
    Cc: Huaisheng Ye
    Cc: Mike Rapoport
    Cc: Yong Wu
    Cc: Matthias Brugger
    Cc: Tomasz Figa
    Cc: Yingjoe Chen
    Cc: Christoph Hellwig
    Cc: Matthew Wilcox
    Cc: Hsin-Yi Wang
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Nicolas Boichat
     

17 Jan, 2019

1 commit

  • commit 09c2e76ed734a1d36470d257a778aaba28e86531 upstream.

    Callers of __alloc_alien() check for NULL. We must do the same check in
    __alloc_alien_cache to avoid NULL pointer dereferences on allocation
    failures.
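
    A sketch of the added guard, based on the description above:

    alc = kmalloc_node(memsize, gfp, node);
    if (!alc)       /* allocation failed; callers already handle NULL */
            return NULL;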

    Link: http://lkml.kernel.org/r/010001680f42f192-82b4e12e-1565-4ee0-ae1f-1e98974906aa-000000@email.amazonses.com
    Fixes: 49dfc304ba241 ("slab: use the lock on alien_cache, instead of the lock on array_cache")
    Fixes: c8522a3a5832b ("Slab: introduce alloc_alien")
    Signed-off-by: Christoph Lameter
    Reported-by: syzbot+d6ed4ec679652b4fd4e4@syzkaller.appspotmail.com
    Reviewed-by: Andrew Morton
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Christoph Lameter
     

01 Dec, 2018

1 commit

  • commit 61448479a9f2c954cde0cfe778cb6bec5d0a748d upstream.

    Slub does not call kmalloc_slab() for sizes > KMALLOC_MAX_CACHE_SIZE;
    instead it falls back to kmalloc_large().

    For slab, KMALLOC_MAX_CACHE_SIZE == KMALLOC_MAX_SIZE, and it calls
    kmalloc_slab() for all allocations, relying on the NULL return value for
    over-sized allocations.

    This inconsistency leads to unwanted warnings from kmalloc_slab() for
    over-sized allocations for slab. Returning NULL for failed allocations is
    the expected behavior.

    Make slub and slab code consistent by checking size >
    KMALLOC_MAX_CACHE_SIZE in slab before calling kmalloc_slab().

    While we are here, also fix the check in kmalloc_slab(): we should check
    against KMALLOC_MAX_CACHE_SIZE rather than KMALLOC_MAX_SIZE. It all more
    or less worked because for slab the constants are the same, and slub
    always checks the size against KMALLOC_MAX_CACHE_SIZE before calling
    kmalloc_slab(). But if we get there with size > KMALLOC_MAX_CACHE_SIZE
    anyhow, bad things will happen, for example in the case of a newly
    introduced bug in slub code.

    Also move the check in kmalloc_slab() from function entry to the size >
    192 case. This partially compensates for the additional check in slab
    code and makes slub code a bit faster (at least theoretically).

    Also drop __GFP_NOWARN in the warning check. This warning means a bug in
    slab code itself, user-passed flags have nothing to do with it.

    Nothing of this affects slob.
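
    The slab-side check described above amounts to a sketch like this:

    /* in slab's kmalloc paths, before consulting the kmalloc cache array */
    if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
            return NULL;    /* over-sized: plain allocation failure */
    cachep = kmalloc_slab(size, flags);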

    Link: http://lkml.kernel.org/r/20180927171502.226522-1-dvyukov@gmail.com
    Signed-off-by: Dmitry Vyukov
    Reported-by: syzbot+87829a10073277282ad1@syzkaller.appspotmail.com
    Reported-by: syzbot+ef4e8fc3a06e9019bb40@syzkaller.appspotmail.com
    Reported-by: syzbot+6e438f4036df52cbb863@syzkaller.appspotmail.com
    Reported-by: syzbot+8574471d8734457d98aa@syzkaller.appspotmail.com
    Reported-by: syzbot+af1504df0807a083dbd9@syzkaller.appspotmail.com
    Acked-by: Christoph Lameter
    Acked-by: Vlastimil Babka
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dmitry Vyukov
     

13 Jun, 2018

1 commit

  • The kzalloc() function has a 2-factor argument form, kcalloc(). This
    patch replaces cases of:

    kzalloc(a * b, gfp)

    with:
    kcalloc(a, b, gfp)

    as well as handling cases of:

    kzalloc(a * b * c, gfp)

    with:

    kzalloc(array3_size(a, b, c), gfp)

    as it's slightly less ugly than:

    kzalloc_array(array_size(a, b), c, gfp)

    This does, however, attempt to ignore constant size factors like:

    kzalloc(4 * 1024, gfp)

    though any constants defined via macros get caught up in the conversion.

    Any factors with a sizeof() of "unsigned char", "char", and "u8" were
    dropped, since they're redundant.

    The Coccinelle script used for this was:

    // Fix redundant parens around sizeof().
    @@
    type TYPE;
    expression THING, E;
    @@

    (
    kzalloc(
    - (sizeof(TYPE)) * E
    + sizeof(TYPE) * E
    , ...)
    |
    kzalloc(
    - (sizeof(THING)) * E
    + sizeof(THING) * E
    , ...)
    )

    // Drop single-byte sizes and redundant parens.
    @@
    expression COUNT;
    typedef u8;
    typedef __u8;
    @@

    (
    kzalloc(
    - sizeof(u8) * (COUNT)
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(__u8) * (COUNT)
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(char) * (COUNT)
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(unsigned char) * (COUNT)
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(u8) * COUNT
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(__u8) * COUNT
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(char) * COUNT
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(unsigned char) * COUNT
    + COUNT
    , ...)
    )

    // 2-factor product with sizeof(type/expression) and identifier or constant.
    @@
    type TYPE;
    expression THING;
    identifier COUNT_ID;
    constant COUNT_CONST;
    @@

    (
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * (COUNT_ID)
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * COUNT_ID
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * (COUNT_CONST)
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * COUNT_CONST
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * (COUNT_ID)
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * COUNT_ID
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * (COUNT_CONST)
    + COUNT_CONST, sizeof(THING)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * COUNT_CONST
    + COUNT_CONST, sizeof(THING)
    , ...)
    )

    // 2-factor product, only identifiers.
    @@
    identifier SIZE, COUNT;
    @@

    - kzalloc
    + kcalloc
    (
    - SIZE * COUNT
    + COUNT, SIZE
    , ...)

    // 3-factor product with 1 sizeof(type) or sizeof(expression), with
    // redundant parens removed.
    @@
    expression THING;
    identifier STRIDE, COUNT;
    type TYPE;
    @@

    (
    kzalloc(
    - sizeof(TYPE) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kzalloc(
    - sizeof(THING) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kzalloc(
    - sizeof(THING) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kzalloc(
    - sizeof(THING) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kzalloc(
    - sizeof(THING) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    )

    // 3-factor product with 2 sizeof(variable), with redundant parens removed.
    @@
    expression THING1, THING2;
    identifier COUNT;
    type TYPE1, TYPE2;
    @@

    (
    kzalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kzalloc(
    - sizeof(THING1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kzalloc(
    - sizeof(THING1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    )

    // 3-factor product, only identifiers, with redundant parens removed.
    @@
    identifier STRIDE, SIZE, COUNT;
    @@

    (
    kzalloc(
    - (COUNT) * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - COUNT * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - COUNT * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - (COUNT) * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - COUNT * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - (COUNT) * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - (COUNT) * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - COUNT * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    )

    // Any remaining multi-factor products, first at least 3-factor products,
    // when they're not all constants...
    @@
    expression E1, E2, E3;
    constant C1, C2, C3;
    @@

    (
    kzalloc(C1 * C2 * C3, ...)
    |
    kzalloc(
    - (E1) * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kzalloc(
    - (E1) * (E2) * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kzalloc(
    - (E1) * (E2) * (E3)
    + array3_size(E1, E2, E3)
    , ...)
    |
    kzalloc(
    - E1 * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    )

    // And then all remaining 2 factors products when they're not all constants,
    // keeping sizeof() as the second factor argument.
    @@
    expression THING, E1, E2;
    type TYPE;
    constant C1, C2, C3;
    @@

    (
    kzalloc(sizeof(THING) * C2, ...)
    |
    kzalloc(sizeof(TYPE) * C2, ...)
    |
    kzalloc(C1 * C2 * C3, ...)
    |
    kzalloc(C1 * C2, ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * (E2)
    + E2, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * E2
    + E2, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * (E2)
    + E2, sizeof(THING)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * E2
    + E2, sizeof(THING)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - (E1) * E2
    + E1, E2
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - (E1) * (E2)
    + E1, E2
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - E1 * E2
    + E1, E2
    , ...)
    )
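
    For illustration, a typical conversion this script produces (struct foo
    and the variables are hypothetical):

    /* before */
    buf = kzalloc(count * sizeof(struct foo), GFP_KERNEL);
    /* after: the 2-factor form checks the multiplication for overflow */
    buf = kcalloc(count, sizeof(struct foo), GFP_KERNEL);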

    Signed-off-by: Kees Cook

    Kees Cook
     

08 Jun, 2018

2 commits

  • rcu_head may now grow larger than list_head without affecting slab or
    slub.

    Link: http://lkml.kernel.org/r/20180518194519.3820-15-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Christoph Lameter
    Acked-by: Vlastimil Babka
    Cc: Dave Hansen
    Cc: Jérôme Glisse
    Cc: "Kirill A . Shutemov"
    Cc: Lai Jiangshan
    Cc: Martin Schwidefsky
    Cc: Pekka Enberg
    Cc: Randy Dunlap
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • __GFP_ZERO requests that the object be initialised to all-zeroes, while
    the purpose of a constructor is to initialise an object to a particular
    pattern. We cannot do both. Add a warning to catch any users who
    mistakenly pass a __GFP_ZERO flag when allocating a slab with a
    constructor.
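
    The added check boils down to a single warning of this shape in the
    allocators' allocation paths (a sketch, not the exact hunk):

    /* a constructor and __GFP_ZERO are mutually exclusive */
    WARN_ON_ONCE(s->ctor && (flags & __GFP_ZERO));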

    Link: http://lkml.kernel.org/r/20180412191322.GA21205@bombadil.infradead.org
    Fixes: d07dbea46405 ("Slab allocators: support __GFP_ZERO in all allocators")
    Signed-off-by: Matthew Wilcox
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

14 Apr, 2018

1 commit

  • cache_reap() is initially scheduled in start_cpu_timer() via
    schedule_delayed_work_on(). But then the next iterations are scheduled
    via schedule_delayed_work(), i.e. using WORK_CPU_UNBOUND.

    Thus since commit ef557180447f ("workqueue: schedule WORK_CPU_UNBOUND
    work on wq_unbound_cpumask CPUs") there is no guarantee the future
    iterations will run on the originally intended cpu, although it's still
    preferred. I was able to demonstrate this with
    /sys/module/workqueue/parameters/debug_force_rr_cpu. IIUC, it may also
    happen due to migrating timers in nohz context. As a result, some CPUs
    would be calling cache_reap() more frequently and others never.

    This patch uses schedule_delayed_work_on() with the current cpu when
    scheduling the next iteration.
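
    The fix reduces to rescheduling on the CPU the work is currently running
    on, roughly (REAPTIMEOUT_AC is the existing reap interval in mm/slab.c):

    /* in cache_reap(), when scheduling the next iteration */
    schedule_delayed_work_on(smp_processor_id(), work,
                             round_jiffies_relative(REAPTIMEOUT_AC));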

    Link: http://lkml.kernel.org/r/20180411070007.32225-1-vbabka@suse.cz
    Fixes: ef557180447f ("workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask CPUs")
    Signed-off-by: Vlastimil Babka
    Acked-by: Pekka Enberg
    Acked-by: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Tejun Heo
    Cc: Lai Jiangshan
    Cc: John Stultz
    Cc: Thomas Gleixner
    Cc: Stephen Boyd
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

06 Apr, 2018

4 commits

  • The kasan quarantine is designed to delay freeing slab objects to catch
    use-after-free. The quarantine can be large (several percent of machine
    memory size). When kmem_caches are deleted related objects are flushed
    from the quarantine but this requires scanning the entire quarantine
    which can be very slow. We have seen the kernel busily working on this
    while holding slab_mutex and badly affecting cache_reaper, slabinfo
    readers and memcg kmem cache creations.

    It can easily be reproduced by the following script:

    yes . | head -1000000 | xargs stat > /dev/null
    for i in `seq 1 10`; do
    seq 500 | (cd /cg/memory && xargs mkdir)
    seq 500 | xargs -I{} sh -c 'echo $BASHPID > \
    /cg/memory/{}/tasks && exec stat .' > /dev/null
    seq 500 | (cd /cg/memory && xargs rmdir)
    done

    The busy stack:
    kasan_cache_shutdown
    shutdown_cache
    memcg_destroy_kmem_caches
    mem_cgroup_css_free
    css_free_rwork_fn
    process_one_work
    worker_thread
    kthread
    ret_from_fork

    This patch is based on the observation that if the kmem_cache to be
    destroyed is empty then there should not be any objects of this cache in
    the quarantine.
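
    A sketch of the resulting shutdown hook, assuming a helper that reports
    whether the cache still holds any objects:

    void kasan_cache_shutdown(struct kmem_cache *cache)
    {
            if (!__kmem_cache_empty(cache))   /* skip the slow scan if empty */
                    quarantine_remove_cache(cache);
    }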

    Without the patch, the script got stuck for a couple of hours. With the
    patch, it completed within a second.

    Link: http://lkml.kernel.org/r/20180327230603.54721-1-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Reviewed-by: Andrew Morton
    Acked-by: Andrey Ryabinin
    Acked-by: Christoph Lameter
    Cc: Vladimir Davydov
    Cc: Alexander Potapenko
    Cc: Greg Thelen
    Cc: Dmitry Vyukov
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     
    If SLAB doesn't support 4GB+ kmem caches (it never did), KASAN should
    not do so either.

    Link: http://lkml.kernel.org/r/20180305200730.15812-20-adobriyan@gmail.com
    Signed-off-by: Alexey Dobriyan
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Now that all sizes are properly typed, propagate "unsigned int" down the
    callgraph.

    Link: http://lkml.kernel.org/r/20180305200730.15812-19-adobriyan@gmail.com
    Signed-off-by: Alexey Dobriyan
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • struct kmem_cache::size and ::align were always 32-bit.

    Out of curiosity I created a 4GB kmem_cache; it oopsed with division by 0.
    kmem_cache_create(1UL<

    Signed-off-by: Alexey Dobriyan
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

29 Mar, 2018

1 commit

    All the root caches are linked into slab_root_caches, which was
    introduced by commit 510ded33e075 ("slab: implement slab_root_caches
    list"), but it missed adding SLAB's kmem_cache.

    While experimenting with opt-in/opt-out kmem accounting, I noticed
    system crashes due to a NULL dereference inside cache_from_memcg_idx()
    while dereferencing kmem_cache.memcg_params.memcg_caches. A clean
    upstream kernel will not see these crashes, but SLAB should be
    consistent with SLUB, which does link its boot caches (kmem_cache_node
    and kmem_cache) into slab_root_caches.

    Link: http://lkml.kernel.org/r/20180319210020.60289-1-shakeelb@google.com
    Fixes: 510ded33e075c ("slab: implement slab_root_caches list")
    Signed-off-by: Shakeel Butt
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     

07 Feb, 2018

1 commit

  • __builtin_return_address(1) is unreliable without frame pointers.
    With defconfig on kmalloc_pagealloc_invalid_free test I am getting:

    BUG: KASAN: double-free or invalid-free in (null)

    Pass caller PC from callers explicitly.
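
    Concretely, free paths record their own return address and thread it
    through, along these lines (a sketch; the hook signature follows this
    kernel era):

    /* caller side: pass our PC instead of unwinding the stack */
    kasan_slab_free(cachep, objp, _RET_IP_);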

    Link: http://lkml.kernel.org/r/9b01bc2d237a4df74ff8472a3bf6b7635908de01.1514378558.git.dvyukov@google.com
    Signed-off-by: Dmitry Vyukov
    Cc: Andrey Ryabinin a
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
     

04 Feb, 2018

1 commit

  • Pull hardened usercopy whitelisting from Kees Cook:
    "Currently, hardened usercopy performs dynamic bounds checking on slab
    cache objects. This is good, but still leaves a lot of kernel memory
    available to be copied to/from userspace in the face of bugs.

    To further restrict what memory is available for copying, this creates
    a way to whitelist specific areas of a given slab cache object for
    copying to/from userspace, allowing much finer granularity of access
    control.

    Slab caches that are never exposed to userspace can declare no
    whitelist for their objects, thereby keeping them unavailable to
    userspace via dynamic copy operations. (Note, an implicit form of
    whitelisting is the use of constant sizes in usercopy operations and
    get_user()/put_user(); these bypass all hardened usercopy checks since
    these sizes cannot change at runtime.)

    This new check is WARN-by-default, so any mistakes can be found over
    the next several releases without breaking anyone's system.

    The series has roughly the following sections:
    - remove %p and improve reporting with offset
    - prepare infrastructure and whitelist kmalloc
    - update VFS subsystem with whitelists
    - update SCSI subsystem with whitelists
    - update network subsystem with whitelists
    - update process memory with whitelists
    - update per-architecture thread_struct with whitelists
    - update KVM with whitelists and fix ioctl bug
    - mark all other allocations as not whitelisted
    - update lkdtm for more sensible test overage"

    * tag 'usercopy-v4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (38 commits)
    lkdtm: Update usercopy tests for whitelisting
    usercopy: Restrict non-usercopy caches to size 0
    kvm: x86: fix KVM_XEN_HVM_CONFIG ioctl
    kvm: whitelist struct kvm_vcpu_arch
    arm: Implement thread_struct whitelist for hardened usercopy
    arm64: Implement thread_struct whitelist for hardened usercopy
    x86: Implement thread_struct whitelist for hardened usercopy
    fork: Provide usercopy whitelisting for task_struct
    fork: Define usercopy region in thread_stack slab caches
    fork: Define usercopy region in mm_struct slab caches
    net: Restrict unwhitelisted proto caches to size 0
    sctp: Copy struct sctp_sock.autoclose to userspace using put_user()
    sctp: Define usercopy region in SCTP proto slab cache
    caif: Define usercopy region in caif proto slab cache
    ip: Define usercopy region in IP proto slab cache
    net: Define usercopy region in struct proto slab cache
    scsi: Define usercopy region in scsi_sense_cache slab cache
    cifs: Define usercopy region in cifs_request slab cache
    vxfs: Define usercopy region in vxfs_inode slab cache
    ufs: Define usercopy region in ufs_inode_cache slab cache
    ...

    Linus Torvalds
     

01 Feb, 2018

1 commit

  • slab_state is being set to "UP" in create_kmalloc_caches(), and later on
    we set it again in kmem_cache_init_late(), but slab_state does not
    change in the meantime.

    Remove the redundant assignment from kmem_cache_init_late().

    And unless I overlooked anything, the same goes for "slab_state = FULL".
    slab_state is set to "FULL" in kmem_cache_init_late(), but it is later
    being set again in cpucache_init(), which gets called from
    do_initcall_level(). So remove the assignment from cpucache_init() as
    well.

    Link: http://lkml.kernel.org/r/20171215134452.GA1920@techadventures.net
    Signed-off-by: Oscar Salvador
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     

16 Jan, 2018

5 commits

  • Mark the kmalloc slab caches as entirely whitelisted. These caches
    are frequently used to fulfill kernel allocations that contain data
    to be copied to/from userspace. Internal-only uses are also common,
    but are scattered in the kernel. For now, mark all the kmalloc caches
    as whitelisted.
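
    In effect, each kmalloc cache is created with a whitelist covering the
    whole object, something like this (a sketch; the extra
    create_kmalloc_cache() parameters follow the series' description):

    /* useroffset = 0, usersize = size: the entire object is whitelisted */
    kmalloc_caches[i] = create_kmalloc_cache(name, size, flags, 0, size);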

    This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
    whitelisting code in the last public patch of grsecurity/PaX based on my
    understanding of the code. Changes or omissions from the original code are
    mine and don't reflect the original grsecurity/PaX code.

    Signed-off-by: David Windsor
    [kees: merged in moved kmalloc hunks, adjust commit log]
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrew Morton
    Cc: linux-mm@kvack.org
    Cc: linux-xfs@vger.kernel.org
    Signed-off-by: Kees Cook
    Acked-by: Christoph Lameter

    David Windsor
     
  • This introduces CONFIG_HARDENED_USERCOPY_FALLBACK to control the
    behavior of hardened usercopy whitelist violations. By default, whitelist
    violations will continue to WARN() so that any bad or missing usercopy
    whitelists can be discovered without being too disruptive.

    If this config is disabled at build time or a system is booted with
    "slab_common.usercopy_fallback=0", usercopy whitelists will BUG() instead
    of WARN(). This is useful for admins that want to use usercopy whitelists
    immediately.

    Suggested-by: Matthew Garrett
    Signed-off-by: Kees Cook

    Kees Cook
     
  • This patch adds checking of usercopy cache whitelisting, and is modified
    from Brad Spengler/PaX Team's PAX_USERCOPY whitelisting code in the
    last public patch of grsecurity/PaX based on my understanding of the
    code. Changes or omissions from the original code are mine and don't
    reflect the original grsecurity/PaX code.

    The SLAB and SLUB allocators are modified to WARN() on all copy operations
    in which the kernel heap memory being modified falls outside of the cache's
    defined usercopy region.
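
    Simplified, the per-cache check has this shape (a sketch; the field
    names follow the series, and usercopy_warn() is the WARN-mode reporting
    helper):

    /* offset = byte offset of the copy within the slab object */
    if (offset < cachep->useroffset ||
        offset - cachep->useroffset > cachep->usersize ||
        n > cachep->useroffset - offset + cachep->usersize)
            usercopy_warn("SLAB object", cachep->name, to_user, offset, n);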

    Based on an earlier patch from David Windsor.

    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrew Morton
    Cc: Laura Abbott
    Cc: Ingo Molnar
    Cc: Mark Rutland
    Cc: linux-mm@kvack.org
    Cc: linux-xfs@vger.kernel.org
    Signed-off-by: Kees Cook

    Kees Cook
     
  • This patch prepares the slab allocator to handle caches having annotations
    (useroffset and usersize) defining usercopy regions.

    This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
    whitelisting code in the last public patch of grsecurity/PaX based on
    my understanding of the code. Changes or omissions from the original
    code are mine and don't reflect the original grsecurity/PaX code.

    Currently, hardened usercopy performs dynamic bounds checking on slab
    cache objects. This is good, but still leaves a lot of kernel memory
    available to be copied to/from userspace in the face of bugs. To further
    restrict what memory is available for copying, this creates a way to
    whitelist specific areas of a given slab cache object for copying to/from
    userspace, allowing much finer granularity of access control. Slab caches
    that are never exposed to userspace can declare no whitelist for their
    objects, thereby keeping them unavailable to userspace via dynamic copy
    operations. (Note, an implicit form of whitelisting is the use of constant
    sizes in usercopy operations and get_user()/put_user(); these bypass
    hardened usercopy checks since these sizes cannot change at runtime.)

    To support this whitelist annotation, usercopy region offset and size
    members are added to struct kmem_cache. The slab allocator receives a
    new function, kmem_cache_create_usercopy(), that creates a new cache
    with a usercopy region defined, suitable for declaring spans of fields
    within the objects that get copied to/from userspace.

    In this patch, the default kmem_cache_create() marks the entire allocation
    as whitelisted, leaving it semantically unchanged. Once all fine-grained
    whitelists have been added (in subsequent patches), this will be changed
    to a usersize of 0, making caches created with kmem_cache_create() not
    copyable to/from userspace.
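
    For illustration, a cache that exposes only one field to userspace could
    be declared like this (a sketch; struct example_req and its payload
    field are hypothetical):

    struct example_req {
            spinlock_t lock;        /* never copied to/from userspace */
            char payload[64];       /* the only usercopy-able span */
    };

    cache = kmem_cache_create_usercopy("example_req",
                    sizeof(struct example_req), 0, SLAB_HWCACHE_ALIGN,
                    offsetof(struct example_req, payload),     /* useroffset */
                    sizeof_field(struct example_req, payload), /* usersize */
                    NULL);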

    After the entire usercopy whitelist series is applied, less than 15%
    of the slab cache memory remains exposed to potential usercopy bugs
    after a fresh boot:

    Total Slab Memory: 48074720
    Usercopyable Memory: 6367532 13.2%
    task_struct 0.2% 4480/1630720
    RAW 0.3% 300/96000
    RAWv6 2.1% 1408/64768
    ext4_inode_cache 3.0% 269760/8740224
    dentry 11.1% 585984/5273856
    mm_struct 29.1% 54912/188448
    kmalloc-8 100.0% 24576/24576
    kmalloc-16 100.0% 28672/28672
    kmalloc-32 100.0% 81920/81920
    kmalloc-192 100.0% 96768/96768
    kmalloc-128 100.0% 143360/143360
    names_cache 100.0% 163840/163840
    kmalloc-64 100.0% 167936/167936
    kmalloc-256 100.0% 339968/339968
    kmalloc-512 100.0% 350720/350720
    kmalloc-96 100.0% 455616/455616
    kmalloc-8192 100.0% 655360/655360
    kmalloc-1024 100.0% 812032/812032
    kmalloc-4096 100.0% 819200/819200
    kmalloc-2048 100.0% 1310720/1310720

    After some kernel build workloads, the percentage (mainly driven by
    dentry and inode caches expanding) drops under 10%:

    Total Slab Memory: 95516184
    Usercopyable Memory: 8497452 8.8%
    task_struct 0.2% 4000/1456000
    RAW 0.3% 300/96000
    RAWv6 2.1% 1408/64768
    ext4_inode_cache 3.0% 1217280/39439872
    dentry 11.1% 1623200/14608800
    mm_struct 29.1% 73216/251264
    kmalloc-8 100.0% 24576/24576
    kmalloc-16 100.0% 28672/28672
    kmalloc-32 100.0% 94208/94208
    kmalloc-192 100.0% 96768/96768
    kmalloc-128 100.0% 143360/143360
    names_cache 100.0% 163840/163840
    kmalloc-64 100.0% 245760/245760
    kmalloc-256 100.0% 339968/339968
    kmalloc-512 100.0% 350720/350720
    kmalloc-96 100.0% 563520/563520
    kmalloc-8192 100.0% 655360/655360
    kmalloc-1024 100.0% 794624/794624
    kmalloc-4096 100.0% 819200/819200
    kmalloc-2048 100.0% 1257472/1257472

    Signed-off-by: David Windsor
    [kees: adjust commit log, split out a few extra kmalloc hunks]
    [kees: add field names to function declarations]
    [kees: convert BUGs to WARNs and fail closed]
    [kees: add attack surface reduction analysis to commit log]
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrew Morton
    Cc: linux-mm@kvack.org
    Cc: linux-xfs@vger.kernel.org
    Signed-off-by: Kees Cook
    Acked-by: Christoph Lameter

    David Windsor
     
  • This refactors the hardened usercopy code so that failure reporting can
    happen within the checking functions instead of at the top level. This
    simplifies the return value handling and allows more details and offsets
    to be included in the report. Having the offset can be much more helpful
    in understanding hardened usercopy bugs.

    Signed-off-by: Kees Cook

    Kees Cook
     

15 Dec, 2017

1 commit

  • If CONFIG_DEBUG_SLAB/CONFIG_DEBUG_SLAB_LEAK are enabled, the slab code
    prints extra debug information when e.g. corruption is detected. This
    includes pointers, which are not very useful when hashed.

    Fix this by using %px to print unhashed pointers instead where it makes
    sense, and by removing the printing of a last user pointer referring to
    code.

    [geert+renesas@glider.be: v2]
    Link: http://lkml.kernel.org/r/1513179267-2509-1-git-send-email-geert+renesas@glider.be
    Link: http://lkml.kernel.org/r/1512641861-5113-1-git-send-email-geert+renesas@glider.be
    Fixes: ad67b74d2469d9b8 ("printk: hash addresses printed with %p")
    Signed-off-by: Geert Uytterhoeven
    Acked-by: Christoph Lameter
    Acked-by: Linus Torvalds
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: "Tobin C . Harding"
    Cc: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     

16 Nov, 2017

6 commits

  • Convert all allocations that used a NOTRACK flag to stop using it.

    Link: http://lkml.kernel.org/r/20171007030159.22241-3-alexander.levin@verizon.com
    Signed-off-by: Sasha Levin
    Cc: Alexander Potapenko
    Cc: Eric W. Biederman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Steven Rostedt
    Cc: Tim Hansen
    Cc: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Levin, Alexander (Sasha Levin)
     
  • Patch series "kmemcheck: kill kmemcheck", v2.

    As discussed at LSF/MM, kill kmemcheck.

    KASan is a replacement that is able to work without the limitation of
    kmemcheck (single CPU, slow). KASan is already upstream.

    We are also not aware of any users of kmemcheck (or users who don't
    consider KASan as a suitable replacement).

    The only objection was that since KASAN wasn't supported by all GCC
    versions provided by distros at that time we should hold off for 2
    years, and try again.

    Now that 2 years have passed, and all distros provide gcc that supports
    KASAN, kill kmemcheck again for the very same reasons.

    This patch (of 4):

    Remove kmemcheck annotations, and calls to kmemcheck from the kernel.

    [alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs]
    Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com
    Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com
    Signed-off-by: Sasha Levin
    Cc: Alexander Potapenko
    Cc: Eric W. Biederman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Steven Rostedt
    Cc: Tim Hansen
    Cc: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Levin, Alexander (Sasha Levin)
     
  • struct kmem_cache::flags is "unsigned long" which is unnecessary on
    64-bit as no flags are defined in the higher bits.

    Switch the field to 32-bit and save some space on x86_64 until such
    flags appear:

    add/remove: 0/0 grow/shrink: 0/107 up/down: 0/-657 (-657)
    function old new delta
    sysfs_slab_add 720 719 -1
    ...
    check_object 699 676 -23

    [akpm@linux-foundation.org: fix printk warning]
    Link: http://lkml.kernel.org/r/20171021100635.GA8287@avx2
    Signed-off-by: Alexey Dobriyan
    Acked-by: Pekka Enberg
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Add sparse-checked slab_flags_t for struct kmem_cache::flags (SLAB_POISON,
    etc).

    SLAB is temporarily bloated by switching to "unsigned long", but only
    until the companion patch narrows the field to 32 bits.
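
    The new type is a sparse-annotated integer, along these lines (a sketch;
    the value shown matches the traditional SLAB_POISON bit):

    typedef unsigned __bitwise slab_flags_t;
    #define SLAB_POISON ((slab_flags_t __force)0x00000800U)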

    Link: http://lkml.kernel.org/r/20171021100225.GA22428@avx2
    Signed-off-by: Alexey Dobriyan
    Acked-by: Pekka Enberg
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • SLAB_RECLAIM_ACCOUNT is a permanent attribute of a slab cache. Set
    __GFP_RECLAIMABLE as part of its ->allocflags rather than check the
    cachep flag on every page allocation.
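
    Setting the GFP bit once at cache setup looks roughly like this:

    /* done once during cache setup instead of on every page allocation */
    if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
            cachep->allocflags |= __GFP_RECLAIMABLE;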

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1710171527560.140898@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • According to discussion with Christoph
    (https://marc.info/?l=linux-kernel&m=150695909709711&w=2), it sounds like
    it is pointless to keep CONFIG_SLABINFO around.

    This patch removes the CONFIG_SLABINFO config option, but /proc/slabinfo
    is still available.

    [yang.s@alibaba-inc.com: v11]
    Link: http://lkml.kernel.org/r/1507656303-103845-3-git-send-email-yang.s@alibaba-inc.com
    Link: http://lkml.kernel.org/r/1507152550-46205-3-git-send-email-yang.s@alibaba-inc.com
    Signed-off-by: Yang Shi
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if <5
    lines).
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

07 Jul, 2017

3 commits

  • Josef's redesign of the balancing between slab caches and the page cache
    requires slab cache statistics at the lruvec level.

    Link: http://lkml.kernel.org/r/20170530181724.27197-7-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Josef Bacik
    Cc: Michal Hocko
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Patch series "mm: per-lruvec slab stats"

    Josef is working on a new approach to balancing slab caches and the page
    cache. For this to work, he needs slab cache statistics on the lruvec
    level. These patches implement that by adding infrastructure that
    allows updating and reading generic VM stat items per lruvec, then
    switches some existing VM accounting sites, including the slab
    accounting ones, to this new cgroup-aware API.

    I'll follow up with more patches on this, because there is actually
    substantial simplification that can be done to the memory controller
    when we replace private memcg accounting with making the existing VM
    accounting sites cgroup-aware. But this is enough for Josef to base his
    slab reclaim work on, so here goes.

    This patch (of 5):

    To re-implement slab cache vs. page cache balancing, we'll need the
    slab counters at the lruvec level, which, ever since lru reclaim was
    moved from the zone to the node, is the intersection of the node, not
    the zone, and the memcg.

    We could retain the per-zone counters for when the page allocator dumps
    its memory information on failures, and have counters on both levels -
    which on all but NUMA node 0 is usually redundant. But let's keep it
    simple for now and just move them. If anybody complains we can restore
    the per-zone counters.
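
    The new API updates a statistic keyed by both node and memcg, e.g. (a
    sketch; the item name matches the node_stat_item enum of this era):

    /* account a freshly allocated slab page at the lruvec level */
    mod_lruvec_page_state(page, NR_SLAB_RECLAIMABLE, 1 << cachep->gfporder);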

    [hannes@cmpxchg.org: fix oops]
    Link: http://lkml.kernel.org/r/20170605183511.GA8915@cmpxchg.org
    Link: http://lkml.kernel.org/r/20170530181724.27197-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Cc: Josef Bacik
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Link: http://lkml.kernel.org/r/20170616072918epcms5p4ff16c24ef8472b4c3b4371823cd87856@epcms5p4
    Signed-off-by: Canjiang Lu
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Canjiang Lu
     

11 May, 2017

1 commit

  • Pull RCU updates from Ingo Molnar:
    "The main changes are:

    - Debloat RCU headers

    - Parallelize SRCU callback handling (plus overlapping patches)

    - Improve the performance of Tree SRCU on a CPU-hotplug stress test

    - Documentation updates

    - Miscellaneous fixes"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits)
    rcu: Open-code the rcu_cblist_n_lazy_cbs() function
    rcu: Open-code the rcu_cblist_n_cbs() function
    rcu: Open-code the rcu_cblist_empty() function
    rcu: Separately compile large rcu_segcblist functions
    srcu: Debloat the header
    srcu: Adjust default auto-expediting holdoff
    srcu: Specify auto-expedite holdoff time
    srcu: Expedite first synchronize_srcu() when idle
    srcu: Expedited grace periods with reduced memory contention
    srcu: Make rcutorture writer stalls print SRCU GP state
    srcu: Exact tracking of srcu_data structures containing callbacks
    srcu: Make SRCU be built by default
    srcu: Fix Kconfig botch when SRCU not selected
    rcu: Make non-preemptive schedule be Tasks RCU quiescent state
    srcu: Expedite srcu_schedule_cbs_snp() callback invocation
    srcu: Parallelize callback handling
    kvm: Move srcu_struct fields to end of struct kvm
    rcu: Fix typo in PER_RCU_NODE_PERIOD header comment
    rcu: Use true/false in assignment to bool
    rcu: Use bool value directly
    ...

    Linus Torvalds
     

04 May, 2017

1 commit

  • Each slab kmem cache has per cpu array caches. The array caches are
    created when the kmem_cache is created, either via kmem_cache_create()
    or lazily when the first object is allocated in context of a kmem
    enabled memcg. Array caches are replaced by writing to /proc/slabinfo.

    Array caches are protected by holding slab_mutex or disabling
    interrupts. Array cache allocation and replacement is done by
    __do_tune_cpucache() which holds slab_mutex and calls
    kick_all_cpus_sync() to interrupt all remote processors which confirms
    there are no references to the old array caches.

    IPIs are needed when replacing array caches. But when creating a new
    array cache, there's no need to send IPIs because there cannot be any
    references to the new cache. Outside of memcg kmem accounting these
    IPIs occur at boot time, so they're not a problem. But with memcg kmem
    accounting each container can create kmem caches, so the IPIs are
    wasteful.

    Avoid unnecessary IPIs when creating array caches.
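
    The change reduces to making the synchronization conditional on an old
    array cache actually being replaced (a sketch; "prev" stands for the
    previous per-cpu data):

    if (prev)       /* only replacement can leave stale references */
            kick_all_cpus_sync();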

    Test which reports the IPI count of allocating slab in 10000 memcg:

    import os

    def ipi_count():
        with open("/proc/interrupts") as f:
            for l in f:
                if 'Function call interrupts' in l:
                    return int(l.split()[1])

    def echo(val, path):
        with open(path, "w") as f:
            f.write(val)

    n = 10000
    os.chdir("/mnt/cgroup/memory")
    pid = str(os.getpid())
    a = ipi_count()
    for i in range(n):
        os.mkdir(str(i))
        echo("1G\n", "%d/memory.limit_in_bytes" % i)
        echo("1G\n", "%d/memory.kmem.limit_in_bytes" % i)
        echo(pid, "%d/cgroup.procs" % i)
        open("/tmp/x", "w").close()
        os.unlink("/tmp/x")
    b = ipi_count()
    print "%d loops: %d => %d (+%d ipis)" % (n, a, b, b-a)
    echo(pid, "cgroup.procs")
    for i in range(n):
        os.rmdir(str(i))

    patched: 10000 loops: 1069 => 1170 (+101 ipis)
    unpatched: 10000 loops: 1192 => 48933 (+47741 ipis)

    Link: http://lkml.kernel.org/r/20170416214544.109476-1-gthelen@google.com
    Signed-off-by: Greg Thelen
    Acked-by: Joonsoo Kim
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     

19 Apr, 2017

1 commit

  • A group of Linux kernel hackers reported chasing a bug that resulted
    from their assumption that SLAB_DESTROY_BY_RCU provided an existence
    guarantee, that is, that no block from such a slab would be reallocated
    during an RCU read-side critical section. Of course, that is not the
    case. Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
    slab of blocks.

    However, there is a phrase for this, namely "type safety". This commit
    therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
    to avoid future instances of this sort of confusion.

    Signed-off-by: Paul E. McKenney
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrew Morton
    Cc:
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    [ paulmck: Add comments mentioning the old name, as requested by Eric
    Dumazet, in order to help people familiar with the old name find
    the new one. ]
    Acked-by: David Rientjes

    Paul E. McKenney
     

23 Feb, 2017

3 commits

  • __kmem_cache_shrink() is called with %true @deactivate only for memcg
    caches. Remove @deactivate from __kmem_cache_shrink() and introduce
    __kmemcg_cache_deactivate() instead. Each memcg-supporting allocator
    should implement it and it should deactivate and drain the cache.

    This is to allow memcg cache deactivation behavior to further deviate
    from simple shrinking without messing up __kmem_cache_shrink().

    This is pure reorganization and doesn't introduce any observable
    behavior changes.

    v2: Dropped unnecessary ifdef in mm/slab.h as suggested by Vladimir.
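
    After the split, the interfaces look roughly like this (a sketch):

    int __kmem_cache_shrink(struct kmem_cache *s);  /* no @deactivate arg */
    /* implemented by each memcg-aware allocator; deactivates and drains */
    void __kmemcg_cache_deactivate(struct kmem_cache *s);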

    Link: http://lkml.kernel.org/r/20170117235411.9408-8-tj@kernel.org
    Signed-off-by: Tejun Heo
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • Patch series "slab: make memcg slab destruction scalable", v3.

    With kmem cgroup support enabled, kmem_caches can be created and
    destroyed frequently and a great number of near empty kmem_caches can
    accumulate if there are a lot of transient cgroups and the system is not
    under memory pressure. When memory reclaim starts under such
    conditions, it can lead to consecutive deactivation and destruction of
    many kmem_caches, easily hundreds of thousands on moderately large
    systems, exposing scalability issues in the current slab management
    code.

    I've seen machines which end up with hundred thousands of caches and
    many millions of kernfs_nodes. The current code is O(N^2) on the total
    number of caches and has synchronous rcu_barrier() and
    synchronize_sched() in cgroup offline / release path which is executed
    while holding cgroup_mutex. Combined, this leads to very expensive and
    slow cache destruction operations which can easily keep running for half
    a day.

    This also messes up /proc/slabinfo along with other cache iterating
    operations. seq_file operates on 4k chunks and on each 4k boundary
    tries to seek to the last position in the list. With a huge number of
    caches on the list, this becomes very slow and very prone to the list
    content changing underneath it leading to a lot of missing and/or
    duplicate entries.

    This patchset addresses the scalability problem.

    * Add root and per-memcg lists. Update each user to use the
    appropriate list.

    * Make rcu_barrier() for SLAB_DESTROY_BY_RCU caches globally batched
    and asynchronous.

    * For dying empty slub caches, remove the sysfs files after
    deactivation so that we don't end up with millions of sysfs files
    without any useful information on them.

    This patchset contains the following nine patches.

    0001-Revert-slub-move-synchronize_sched-out-of-slab_mutex.patch
    0002-slub-separate-out-sysfs_slab_release-from-sysfs_slab.patch
    0003-slab-remove-synchronous-rcu_barrier-call-in-memcg-ca.patch
    0004-slab-reorganize-memcg_cache_params.patch
    0005-slab-link-memcg-kmem_caches-on-their-associated-memo.patch
    0006-slab-implement-slab_root_caches-list.patch
    0007-slab-introduce-__kmemcg_cache_deactivate.patch
    0008-slab-remove-synchronous-synchronize_sched-from-memcg.patch
    0009-slab-remove-slub-sysfs-interface-files-early-for-emp.patch
    0010-slab-use-memcg_kmem_cache_wq-for-slab-destruction-op.patch

    0001 reverts an existing optimization to prepare for the following
    changes. 0002 is a prep patch. 0003 makes rcu_barrier() in release
    path batched and asynchronous. 0004-0006 separate out the lists.
    0007-0008 replace synchronize_sched() in slub destruction path with
    call_rcu_sched(). 0009 removes sysfs files early for empty dying
    caches. 0010 makes destruction work items use a workqueue with limited
    concurrency.

    This patch (of 10):

    Revert 89e364db71fb5e ("slub: move synchronize_sched out of slab_mutex on
    shrink").

    With kmem cgroup support enabled, kmem_caches can be created and destroyed
    frequently and a great number of near empty kmem_caches can accumulate if
    there are a lot of transient cgroups and the system is not under memory
    pressure. When memory reclaim starts under such conditions, it can lead
    to consecutive deactivation and destruction of many kmem_caches, easily
    hundreds of thousands on moderately large systems, exposing scalability
    issues in the current slab management code. This is one of the patches to
    address the issue.

    Moving synchronize_sched() out of slab_mutex isn't enough as it's still
    inside cgroup_mutex. The whole deactivation / release path will be
    updated to avoid all synchronous RCU operations. Revert this insufficient
    optimization in preparation to ease future changes.

    Link: http://lkml.kernel.org/r/20170117235411.9408-2-tj@kernel.org
    Signed-off-by: Tejun Heo
    Reported-by: Jay Vana
    Cc: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • SLAB as part of its bootstrap pre-creates one kmalloc cache that can fit
    the kmem_cache_node management structure, and puts it into the generic
    kmalloc cache array (e.g. for 128b objects). The name of this cache is
    "kmalloc-node", which is confusing for readers of /proc/slabinfo as the
    cache is used for generic allocations (and not just the kmem_cache_node
    struct) and it appears as the kmalloc-128 cache is missing.

    An easy solution is to use the generic kmalloc-<size> name when
    pre-creating the cache, which we can get from the kmalloc_info array.
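
    The fix then amounts to pulling that name out of kmalloc_info when
    pre-creating the boot cache (a sketch; kmalloc_size() and INDEX_NODE
    follow the SLAB bootstrap code of that era):

    kmalloc_caches[INDEX_NODE] = create_kmalloc_cache(
                    kmalloc_info[INDEX_NODE].name,
                    kmalloc_size(INDEX_NODE), ARCH_KMALLOC_FLAGS);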

    Example /proc/slabinfo before the patch:

    ...
    kmalloc-256 1647 1984 256 16 1 : tunables 120 60 8 : slabdata 124 124 828
    kmalloc-192 1974 1974 192 21 1 : tunables 120 60 8 : slabdata 94 94 133
    kmalloc-96 1332 1344 128 32 1 : tunables 120 60 8 : slabdata 42 42 219
    kmalloc-64 2505 5952 64 64 1 : tunables 120 60 8 : slabdata 93 93 715
    kmalloc-32 4278 4464 32 124 1 : tunables 120 60 8 : slabdata 36 36 346
    kmalloc-node 1352 1376 128 32 1 : tunables 120 60 8 : slabdata 43 43 53
    kmem_cache 132 147 192 21 1 : tunables 120 60 8 : slabdata 7 7 0

    After the patch:

    ...
    kmalloc-256 1672 2160 256 16 1 : tunables 120 60 8 : slabdata 135 135 807
    kmalloc-192 1992 2016 192 21 1 : tunables 120 60 8 : slabdata 96 96 203
    kmalloc-96 1159 1184 128 32 1 : tunables 120 60 8 : slabdata 37 37 116
    kmalloc-64 2561 4864 64 64 1 : tunables 120 60 8 : slabdata 76 76 785
    kmalloc-32 4253 4340 32 124 1 : tunables 120 60 8 : slabdata 35 35 270
    kmalloc-128 1256 1280 128 32 1 : tunables 120 60 8 : slabdata 40 40 39
    kmem_cache 125 147 192 21 1 : tunables 120 60 8 : slabdata 7 7 0

    [vbabka@suse.cz: export the whole kmalloc_info structure instead of just a name accessor, per Christoph Lameter]
    Link: http://lkml.kernel.org/r/54e80303-b814-4232-66d4-95b34d3eb9d0@suse.cz
    Link: http://lkml.kernel.org/r/20170203181008.24898-1-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Matthew Wilcox
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

11 Jan, 2017

1 commit

    This patch fixes a bug in the freelist randomization code. When a high
    random number is used, the freelist will contain duplicate entries,
    resulting in different allocations sharing the same chunk. This leads
    to odd behaviours and crashes. It should be uncommon, but it depends on
    the machine; we saw it happening more often on some machines (every few
    hours of running tests).

    Fixes: c7ce4f60ac19 ("mm: SLAB freelist randomization")
    Link: http://lkml.kernel.org/r/20170103181908.143178-1-thgarnie@google.com
    Signed-off-by: John Sperbeck
    Signed-off-by: Thomas Garnier
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Sperbeck