23 Jan, 2020

1 commit

  • commit 8e57f8acbbd121ecfb0c9dc13b8b030f86c6bd3b upstream.

    Commit 96a2b03f281d ("mm, debug_pagelloc: use static keys to enable
    debugging") has introduced a static key to reduce overhead when
    debug_pagealloc is compiled in but not enabled. It relied on the
    assumption that jump_label_init() is called before parse_early_param()
    as in start_kernel(), so when the "debug_pagealloc=on" option is parsed,
    it is safe to enable the static key.

    However, it turns out that multiple architectures call parse_early_param()
    earlier, from their setup_arch(). x86 also calls jump_label_init() even
    earlier, so no issue was found while testing the commit, but the same is
    not true for e.g. ppc64 and s390, where the kernel would not boot with
    debug_pagealloc=on, as found by our QA.

    To fix this without tricky changes to init code of multiple
    architectures, this patch partially reverts the static key conversion
    from 96a2b03f281d. Init-time and non-fastpath calls (such as in arch
    code) of debug_pagealloc_enabled() will again test a simple bool
    variable. Fastpath mm code is converted to a new
    debug_pagealloc_enabled_static() variant that relies on the static key,
    which is enabled in a well-defined point in mm_init() where it's
    guaranteed that jump_label_init() has been called, regardless of
    architecture.
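
    A minimal sketch of the resulting split (helper names follow the commit
    text where available; init_debug_pagealloc() is an assumed name for the
    mm_init() hook, and the bodies are simplified rather than the exact
    upstream code):

        static bool _debug_pagealloc_enabled_early __read_mostly;
        DEFINE_STATIC_KEY_FALSE(_debug_pagealloc_enabled);

        /* Safe at any point during init; used by arch and init-time code. */
        static inline bool debug_pagealloc_enabled(void)
        {
                return _debug_pagealloc_enabled_early;
        }

        /* Fastpath mm code; valid only once the static key has been set up. */
        static inline bool debug_pagealloc_enabled_static(void)
        {
                return static_branch_unlikely(&_debug_pagealloc_enabled);
        }

        /* Called from mm_init(), after jump_label_init() is guaranteed to
         * have run on every architecture. */
        void __init init_debug_pagealloc(void)
        {
                if (debug_pagealloc_enabled())
                        static_branch_enable(&_debug_pagealloc_enabled);
        }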

    [sfr@canb.auug.org.au: export _debug_pagealloc_enabled_early]
    Link: http://lkml.kernel.org/r/20200106164944.063ac07b@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20191219130612.23171-1-vbabka@suse.cz
    Fixes: 96a2b03f281d ("mm, debug_pagelloc: use static keys to enable debugging")
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Stephen Rothwell
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Qian Cai
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     

15 Oct, 2019

1 commit

  • Fix kernel-doc warning in mm/slab.c:

    mm/slab.c:4215: warning: Function parameter or member 'objp' not described in '__ksize'

    Also add Return: documentation section for this function.

    Link: http://lkml.kernel.org/r/68c9fd7d-f09e-d376-e292-c7b2bdf1774d@infradead.org
    Fixes: 10d1f8cb3965 ("mm/slab: refactor common ksize KASAN logic into slab_common.c")
    Signed-off-by: Randy Dunlap
    Acked-by: Marco Elver
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

13 Jul, 2019

6 commits

  • Patch series "add init_on_alloc/init_on_free boot options", v10.

    Provide init_on_alloc and init_on_free boot options.

    These are aimed at preventing possible information leaks and making the
    control-flow bugs that depend on uninitialized values more deterministic.

    Enabling either of the options guarantees that the memory returned by the
    page allocator and SL[AU]B is initialized with zeroes. SLOB allocator
    isn't supported at the moment, as its emulation of kmem caches complicates
    handling of SLAB_TYPESAFE_BY_RCU caches correctly.

    Enabling init_on_free also guarantees that pages and heap objects are
    initialized right after they're freed, so it won't be possible to access
    stale data by using a dangling pointer.

    As suggested by Michal Hocko, right now we don't let heap users disable
    initialization for certain allocations. There's not enough evidence that
    doing so can speed up real-life cases, and introducing ways to opt out
    may result in things going out of control.

    This patch (of 2):

    The new options are needed to prevent possible information leaks and make
    control-flow bugs that depend on uninitialized values more deterministic.

    This is expected to be on by default on Android and Chrome OS, and it
    also gives anyone else the opportunity to use it on other distros via
    the boot args. (The init_on_free feature is regularly requested by folks
    whose threat models include memory forensics.)

    init_on_alloc=1 makes the kernel initialize newly allocated pages and heap
    objects with zeroes. Initialization is done at allocation time at the
    places where checks for __GFP_ZERO are performed.

    init_on_free=1 makes the kernel initialize freed pages and heap objects
    with zeroes upon their deletion. This helps to ensure sensitive data
    doesn't leak via use-after-free accesses.

    Both init_on_alloc=1 and init_on_free=1 guarantee that the allocator
    returns zeroed memory. The two exceptions are slab caches with
    constructors and SLAB_TYPESAFE_BY_RCU flag. Those are never
    zero-initialized to preserve their semantics.

    Both init_on_alloc and init_on_free default to zero, but those defaults
    can be overridden with CONFIG_INIT_ON_ALLOC_DEFAULT_ON and
    CONFIG_INIT_ON_FREE_DEFAULT_ON.

    If either SLUB poisoning or page poisoning is enabled, those options take
    precedence over init_on_alloc and init_on_free: initialization is only
    applied to unpoisoned allocations.
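
    A hedged sketch of how the options gate zeroing in the allocator hot
    paths, assuming static-key-backed helpers named want_init_on_alloc() and
    want_init_on_free() (names follow the series description; bodies are
    simplified):

        /* Flipped at boot by init_on_alloc=/init_on_free= or the
         * CONFIG_INIT_ON_*_DEFAULT_ON options. */
        DEFINE_STATIC_KEY_FALSE(init_on_alloc);
        DEFINE_STATIC_KEY_FALSE(init_on_free);

        static inline bool want_init_on_alloc(gfp_t flags)
        {
                if (static_branch_unlikely(&init_on_alloc))
                        return true;
                /* Explicit __GFP_ZERO requests are honored as before. */
                return flags & __GFP_ZERO;
        }

        static inline bool want_init_on_free(void)
        {
                return static_branch_unlikely(&init_on_free);
        }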

    Slowdown for the new features compared to init_on_free=0, init_on_alloc=0:

    hackbench, init_on_free=1: +7.62% sys time (st.err 0.74%)
    hackbench, init_on_alloc=1: +7.75% sys time (st.err 2.14%)

    Linux build with -j12, init_on_free=1: +8.38% wall time (st.err 0.39%)
    Linux build with -j12, init_on_free=1: +24.42% sys time (st.err 0.52%)
    Linux build with -j12, init_on_alloc=1: -0.13% wall time (st.err 0.42%)
    Linux build with -j12, init_on_alloc=1: +0.57% sys time (st.err 0.40%)

    The slowdown for init_on_free=0, init_on_alloc=0 compared to the baseline
    is within the standard error.

    The new features are also going to pave the way for hardware memory
    tagging (e.g. arm64's MTE), which will require both on_alloc and on_free
    hooks to set the tags for heap objects. With MTE, tagging will have the
    same cost as memory initialization.

    Although init_on_free is rather costly, there are paranoid use-cases where
    in-memory data lifetime is desired to be minimized. There are various
    arguments for/against the realism of the associated threat models, but
    given that we'll need the infrastructure for MTE anyway, and there are
    people who want wipe-on-free behavior no matter what the performance cost,
    it seems reasonable to include it in this series.

    [glider@google.com: v8]
    Link: http://lkml.kernel.org/r/20190626121943.131390-2-glider@google.com
    [glider@google.com: v9]
    Link: http://lkml.kernel.org/r/20190627130316.254309-2-glider@google.com
    [glider@google.com: v10]
    Link: http://lkml.kernel.org/r/20190628093131.199499-2-glider@google.com
    Link: http://lkml.kernel.org/r/20190617151050.92663-2-glider@google.com
    Signed-off-by: Alexander Potapenko
    Acked-by: Kees Cook
    Acked-by: Michal Hocko [page and dmapool parts]
    Acked-by: James Morris
    Cc: Christoph Lameter
    Cc: Masahiro Yamada
    Cc: "Serge E. Hallyn"
    Cc: Nick Desaulniers
    Cc: Kostya Serebryany
    Cc: Dmitry Vyukov
    Cc: Sandeep Patil
    Cc: Laura Abbott
    Cc: Randy Dunlap
    Cc: Jann Horn
    Cc: Mark Rutland
    Cc: Marco Elver
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • Currently the page accounting code is duplicated in SLAB and SLUB
    internals. Let's move it into new (un)charge_slab_page helpers in the
    slab_common.c file. These helpers will be responsible for statistics
    (global and memcg-aware) and memcg charging. So they are replacing direct
    memcg_(un)charge_slab() calls.
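
    A hedged sketch of what the charge-side helper could look like, based
    only on the description above (memcg charging plus the global and
    memcg-aware statistics); the exact upstream code differs in details:

        static __always_inline int charge_slab_page(struct page *page,
                                                    gfp_t gfp, int order,
                                                    struct kmem_cache *s)
        {
                /* no-op and returns 0 for root caches */
                int ret = memcg_charge_slab(page, gfp, order, s);

                if (ret)
                        return ret;

                /* global / memcg-aware slab page statistics */
                mod_lruvec_page_state(page,
                                      (s->flags & SLAB_RECLAIM_ACCOUNT) ?
                                      NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
                                      1 << order);
                return 0;
        }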

    Link: http://lkml.kernel.org/r/20190611231813.3148843-6-guro@fb.com
    Signed-off-by: Roman Gushchin
    Reviewed-by: Shakeel Butt
    Acked-by: Christoph Lameter
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Waiman Long
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: Andrei Vagin
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Currently SLUB uses a work scheduled after an RCU grace period to
    deactivate a non-root kmem_cache. This mechanism can be reused for
    kmem_caches release, but requires generalization for SLAB case.

    Introduce kmemcg_cache_deactivate() function, which calls
    allocator-specific __kmem_cache_deactivate() and schedules execution of
    __kmem_cache_deactivate_after_rcu() with all necessary locks in a worker
    context after an rcu grace period.

    Here is the new calling scheme:
        kmemcg_cache_deactivate()
            __kmemcg_cache_deactivate()                  SLAB/SLUB-specific
            kmemcg_rcufn()                               rcu
                kmemcg_workfn()                          work
                    __kmemcg_cache_deactivate_after_rcu()  SLAB/SLUB-specific

    instead of:
        __kmemcg_cache_deactivate()                      SLAB/SLUB-specific
        slab_deactivate_memcg_cache_rcu_sched()          SLUB-only
            kmemcg_rcufn()                               rcu
                kmemcg_workfn()                          work
                    kmemcg_cache_deact_after_rcu()       SLUB-only

    For consistency, all allocator-specific functions start with "__".

    Link: http://lkml.kernel.org/r/20190611231813.3148843-4-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Vladimir Davydov
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Waiman Long
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: Andrei Vagin
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Patch series "mm: reparent slab memory on cgroup removal", v7.

    # Why do we need this?

    We've noticed that the number of dying cgroups is steadily growing on most
    of our hosts in production. The following investigation revealed an issue
    in the userspace memory reclaim code [1], accounting of kernel stacks [2],
    and also the main reason: slab objects.

    The underlying problem is quite simple: any page charged to a cgroup holds
    a reference to it, so the cgroup can't be reclaimed unless all charged
    pages are gone. If a slab object is actively used by other cgroups, it
    won't be reclaimed, and will prevent the origin cgroup from being
    reclaimed.

    Slab objects, and first of all the vfs cache, are shared between cgroups
    that use the same underlying fs and, what's even more important, between
    multiple generations of the same workload. So if something runs
    periodically in a new cgroup each time (as systemd does), we accumulate
    multiple dying cgroups.

    Strictly speaking pagecache isn't different here, but there is a key
    distinction: we disable protection and apply some extra pressure on the
    LRUs of dying cgroups, and these LRUs contain all charged pages. My
    experiments show that with kernel memory accounting disabled the number
    of dying cgroups stabilizes at a relatively small number (~100, depending
    on memory pressure and cgroup creation rate), while with kernel memory
    accounting it grows pretty steadily up to several thousand.

    Memory cgroups are quite complex and big objects (mostly due to percpu
    stats), so it leads to noticeable memory losses. Memory occupied by dying
    cgroups is measured in hundreds of megabytes. I've even seen a host with
    more than 100Gb of memory wasted for dying cgroups. It leads to a
    degradation of performance with the uptime, and generally limits the usage
    of cgroups.

    My previous attempt [3] to fix the problem by applying extra pressure on
    slab shrinker lists caused regressions with xfs and ext4, and has been
    reverted [4]. The following attempts to find the right balance [5, 6]
    were not successful.

    So instead of trying to find a possibly non-existent balance, let's
    reparent accounted slab caches to the parent cgroup on cgroup removal.

    # Implementation approach

    There is however a significant problem with reparenting of slab memory:
    there is no list of charged pages. Some of them are in shrinker lists,
    but not all. Introducing a new list is really not an option.

    But fortunately there is a way forward: every slab page has a stable
    pointer to the corresponding kmem_cache. So the idea is to reparent
    kmem_caches instead of slab pages.

    It's actually simpler and cheaper, but requires some underlying changes:
    1) Make kmem_caches hold a single reference to the memory cgroup,
       instead of a separate reference for every slab page.
    2) Stop setting the page->mem_cgroup pointer for memcg slab pages and
       use the page->kmem_cache->memcg indirection instead. It's used only
       on slab page release, so the performance overhead shouldn't be a big
       issue.
    3) Introduce a refcounter for non-root slab caches. It's required to
       be able to destroy kmem_caches when they become empty and to release
       the associated memory cgroup.

    There is a bonus: currently we release all memcg kmem_caches all together
    with the memory cgroup itself. This patchset allows individual
    kmem_caches to be released as soon as they become inactive and free.

    Some additional implementation details are provided in corresponding
    commit messages.

    # Results

    Below is the average number of dying cgroups on two groups of our
    production hosts. They run some sort of web frontend workload; the
    memory pressure is moderate. As we can see, with kernel memory
    reparenting the number stabilizes in the 60s range; with the original
    version, however, it grows almost linearly and doesn't show any signs of
    plateauing. The difference in slab and percpu usage between the patched
    and unpatched versions also grows linearly. In 7 days it exceeded 200Mb.

    day               0    1    2    3    4    5    6    7
    original         56  362  628  752 1070 1250 1490 1560
    patched          23   46   51   55   60   57   67   69
    mem diff(Mb)     22   74  123  152  164  182  214  241

    # Links

    [1]: commit 68600f623d69 ("mm: don't miss the last page because of round-off error")
    [2]: commit 9b6f7e163cd0 ("mm: rework memcg kernel stack accounting")
    [3]: commit 172b06c32b94 ("mm: slowly shrink slabs with a relatively small number of objects")
    [4]: commit a9a238e83fbb ("Revert "mm: slowly shrink slabs with a relatively small number of objects")
    [5]: https://lkml.org/lkml/2019/1/28/1865
    [6]: https://marc.info/?l=linux-mm&m=155064763626437&w=2

    This patch (of 10):

    Initialize kmem_cache->memcg_params.memcg pointer in memcg_link_cache()
    rather than in init_memcg_params().

    Once a kmem_cache holds a reference to the memory cgroup, this will
    simplify the refcounting.

    For non-root kmem_caches memcg_link_cache() is always called before the
    kmem_cache becomes visible to a user, so it's safe.
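
    A minimal sketch of the change, assuming the memcg_link_cache()
    signature simply gains a memcg argument (the list-linking details are
    elided; only memcg_params.memcg is taken from the text above):

        void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg)
        {
                if (!is_root_cache(s)) {
                        /*
                         * Non-root caches are linked before they become
                         * visible to users, so setting the owning memcg
                         * here (instead of in init_memcg_params()) is safe.
                         */
                        s->memcg_params.memcg = memcg;
                }
                /* ... existing root/non-root list linking ... */
        }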

    Link: http://lkml.kernel.org/r/20190611231813.3148843-2-guro@fb.com
    Signed-off-by: Roman Gushchin
    Reviewed-by: Shakeel Butt
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Waiman Long
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrei Vagin
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • This refactors common code of ksize() between the various allocators into
    slab_common.c: __ksize() is the allocator-specific implementation without
    instrumentation, whereas ksize() includes the required KASAN logic.
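
    A minimal sketch of the resulting split, per the description above (the
    allocator-specific __ksize() does the size lookup, the common wrapper
    adds the KASAN handling):

        /* mm/slab_common.c, shared by all allocators */
        size_t ksize(const void *objp)
        {
                /* allocator-specific, without instrumentation */
                size_t size = __ksize(objp);

                /*
                 * Callers may legitimately use the whole reported area,
                 * so unpoison it for KASAN before returning.
                 */
                kasan_unpoison_shadow(objp, size);
                return size;
        }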

    Link: http://lkml.kernel.org/r/20190626142014.141844-5-elver@google.com
    Signed-off-by: Marco Elver
    Acked-by: Christoph Lameter
    Reviewed-by: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Alexander Potapenko
    Cc: Andrey Konovalov
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Mark Rutland
    Cc: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marco Elver
     
  • This avoids any possible type confusion when looking up an object. For
    example, if a non-slab were to be passed to kfree(), the invalid
    slab_cache pointer (i.e. overlapped with some other value from the
    struct page union) would be used for subsequent slab manipulations that
    could lead to further memory corruption.

    Since the page is already in cache, adding the PageSlab() check will
    have nearly zero cost, so add a check and WARN() to virt_to_cache().
    Additionally replaces an open-coded virt_to_cache(). To support the
    failure mode this also updates all callers of virt_to_cache() and
    cache_from_obj() to handle a NULL cache pointer return value (though
    note that several already handle this case gracefully).
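
    A sketch of the hardened lookup as described above (warn and return
    NULL when the page is not a slab page; callers now tolerate NULL):

        static inline struct kmem_cache *virt_to_cache(const void *obj)
        {
                struct page *page = virt_to_head_page(obj);

                if (WARN_ONCE(!PageSlab(page),
                              "%s: Object is not a Slab page!\n", __func__))
                        return NULL;

                return page->slab_cache;
        }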

    [dan.carpenter@oracle.com: restore IRQs in kfree()]
    Link: http://lkml.kernel.org/r/20190613065637.GE16334@mwanda
    Link: http://lkml.kernel.org/r/20190530045017.15252-3-keescook@chromium.org
    Signed-off-by: Kees Cook
    Signed-off-by: Dan Carpenter
    Cc: Alexander Popov
    Cc: Alexander Potapenko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Greg Kroah-Hartman
    Cc: Joonsoo Kim
    Cc: Matthew Wilcox
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

17 May, 2019

1 commit

  • It turned out that DEBUG_SLAB_LEAK is still broken even after recent
    rescue efforts: when there is a large number of objects, like
    kmemleak_object, which is normal on a debug kernel,

    # grep kmemleak /proc/slabinfo
    kmemleak_object 2243606 3436210 ...

    reading /proc/slab_allocators could easily loop forever while processing
    the kmemleak_object cache, and any additional freeing or allocating of
    objects will trigger a reprocessing. To make the situation worse, soft
    lockups could easily happen in this situation, which will call printk()
    to allocate more kmemleak objects, guaranteeing an infinite loop.

    Also, since it seems no one noticed when it was totally broken more than
    2 years ago - see commit fcf88917dd43 ("slab: fix a crash by reading
    /proc/slab_allocators") - probably nobody cares about it anymore due to
    the decline of SLAB. Just remove it entirely.

    Suggested-by: Vlastimil Babka
    Suggested-by: Linus Torvalds
    Signed-off-by: Qian Cai
    Signed-off-by: Linus Torvalds

    Qian Cai
     

15 May, 2019

3 commits

  • "cat /proc/slab_allocators" could hang forever on SMP machines with
    kmemleak or object debugging enabled due to other CPUs running do_drain()
    will keep making kmemleak_object or debug_objects_cache dirty and unable
    to escape the first loop in leaks_show(),

        do {
                set_store_user_clean(cachep);
                drain_cpu_caches(cachep);
                ...
        } while (!is_store_user_clean(cachep));

    For example,

    do_drain
    slabs_destroy
    slab_destroy
    kmem_cache_free
    __cache_free
    ___cache_free
    kmemleak_free_recursive
    delete_object_full
    __delete_object
    put_object
    free_object_rcu
    kmem_cache_free
    cache_free_debugcheck --> dirty kmemleak_object

    One approach is to check cachep->name and skip both kmemleak_object and
    debug_objects_cache in leaks_show(). The other is to set
    store_user_clean after drain_cpu_caches(), which leaves a small window
    between drain_cpu_caches() and set_store_user_clean() where the per-CPU
    caches could become dirty again; this may lead to slightly stale
    information being stored, but it also speeds things up significantly,
    which sounds like a good compromise. For example,

    # cat /proc/slab_allocators
    0m42.778s # 1st approach
    0m0.737s # 2nd approach
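
    A sketch of the second approach in leaks_show(), mirroring the loop
    quoted above (context simplified):

        do {
                drain_cpu_caches(cachep);
                /*
                 * Draining can free objects and dirty other caches (e.g.
                 * kmemleak_object), so only mark this cache clean after
                 * the drain; the remaining race window is the accepted
                 * trade-off.
                 */
                set_store_user_clean(cachep);
                ...
        } while (!is_store_user_clean(cachep));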

    [akpm@linux-foundation.org: tweak comment]
    Link: http://lkml.kernel.org/r/20190411032635.10325-1-cai@lca.pw
    Fixes: d31676dfde25 ("mm/slab: alternative implementation for DEBUG_SLAB_LEAK")
    Signed-off-by: Qian Cai
    Reviewed-by: Andrew Morton
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • nc is a member of the percpu allocation memory and cannot be NULL.

    Link: http://lkml.kernel.org/r/1553159353-5056-1-git-send-email-lirongqing@baidu.com
    Signed-off-by: Li RongQing
    Reviewed-by: Andrew Morton
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li RongQing
     
  • Currently we use the page->lru list for maintaining lists of slabs. We
    have a list in the page structure (slab_list) that can be used for this
    purpose. Doing so makes the code cleaner since we are not overloading the
    lru list.

    Use the slab_list instead of the lru list for maintaining lists of slabs.
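
    An illustrative before/after of the conversion described above (the
    slabs_free list used here is just one example of such a list in SLAB's
    kmem_cache_node):

        /* before */
        list_add(&page->lru, &n->slabs_free);

        /* after */
        list_add(&page->slab_list, &n->slabs_free);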

    Link: http://lkml.kernel.org/r/20190402230545.2929-7-tobin@kernel.org
    Signed-off-by: Tobin C. Harding
    Acked-by: Christoph Lameter
    Reviewed-by: Roman Gushchin
    Acked-by: Vlastimil Babka
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Matthew Wilcox
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tobin C. Harding
     

07 May, 2019

1 commit

  • Pull x86 irq updates from Ingo Molnar:
    "Here are the main changes in this tree:

    - Introduce x86-64 IRQ/exception/debug stack guard pages to detect
    stack overflows immediately and deterministically.

    - Clean up over a decade worth of cruft accumulated.

    The outcome of this should be more clear-cut faults/crashes when any
    of the low level x86 CPU stacks overflow, instead of silent memory
    corruption and sporadic failures much later on"

    * 'x86-irq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
    x86/irq: Fix outdated comments
    x86/irq/64: Remove stack overflow debug code
    x86/irq/64: Remap the IRQ stack with guard pages
    x86/irq/64: Split the IRQ stack into its own pages
    x86/irq/64: Init hardirq_stack_ptr during CPU hotplug
    x86/irq/32: Handle irq stack allocation failure proper
    x86/irq/32: Invoke irq_ctx_init() from init_IRQ()
    x86/irq/64: Rename irq_stack_ptr to hardirq_stack_ptr
    x86/irq/32: Rename hard/softirq_stack to hard/softirq_stack_ptr
    x86/irq/32: Make irq stack a character array
    x86/irq/32: Define IRQ_STACK_SIZE
    x86/dumpstack/64: Speedup in_exception_stack()
    x86/exceptions: Split debug IST stack
    x86/exceptions: Enable IST guard pages
    x86/exceptions: Disconnect IST index and stack order
    x86/cpu: Remove orig_ist array
    x86/cpu: Prepare TSS.IST setup for guard pages
    x86/dumpstack/64: Use cpu_entry_area instead of orig_ist
    x86/irq/64: Use cpu entry area instead of orig_ist
    x86/traps: Use cpu_entry_area instead of orig_ist
    ...

    Linus Torvalds
     

20 Apr, 2019

1 commit

  • Commit 51dedad06b5f ("kasan, slab: make freelist stored without tags")
    calls kasan_reset_tag() for the off-slab slab management object, leading
    to the freelist being stored untagged.

    However, cache_grow_begin() calls alloc_slabmgmt(), which calls
    kmem_cache_alloc_node(), which assigns a tag for the address and stores
    it in the shadow memory. As a result, this causes the endless errors
    below during boot, because drain_freelist() -> slab_destroy() ->
    kasan_slab_free() compares the already untagged freelist against the tag
    stored in the shadow memory.

    Since the off-slab slab management object freelist is such a special
    case, just store it tagged. The non-off-slab management object freelist
    is still stored untagged; it has not been assigned a tag and should not
    cause any other trouble from this inconsistency.

    BUG: KASAN: double-free or invalid-free in slab_destroy+0x84/0x88
    Pointer tag: [ff], memory tag: [99]

    CPU: 0 PID: 1376 Comm: kworker/0:4 Tainted: G W 5.1.0-rc3+ #8
    Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.0.6 07/10/2018
    Workqueue: cgroup_destroy css_killed_work_fn
    Call trace:
    print_address_description+0x74/0x2a4
    kasan_report_invalid_free+0x80/0xc0
    __kasan_slab_free+0x204/0x208
    kasan_slab_free+0xc/0x18
    kmem_cache_free+0xe4/0x254
    slab_destroy+0x84/0x88
    drain_freelist+0xd0/0x104
    __kmem_cache_shrink+0x1ac/0x224
    __kmemcg_cache_deactivate+0x1c/0x28
    memcg_deactivate_kmem_caches+0xa0/0xe8
    memcg_offline_kmem+0x8c/0x3d4
    mem_cgroup_css_offline+0x24c/0x290
    css_killed_work_fn+0x154/0x618
    process_one_work+0x9cc/0x183c
    worker_thread+0x9b0/0xe38
    kthread+0x374/0x390
    ret_from_fork+0x10/0x18

    Allocated by task 1625:
    __kasan_kmalloc+0x168/0x240
    kasan_slab_alloc+0x18/0x20
    kmem_cache_alloc_node+0x1f8/0x3a0
    cache_grow_begin+0x4fc/0xa24
    cache_alloc_refill+0x2f8/0x3e8
    kmem_cache_alloc+0x1bc/0x3bc
    sock_alloc_inode+0x58/0x334
    alloc_inode+0xb8/0x164
    new_inode_pseudo+0x20/0xec
    sock_alloc+0x74/0x284
    __sock_create+0xb0/0x58c
    sock_create+0x98/0xb8
    __sys_socket+0x60/0x138
    __arm64_sys_socket+0xa4/0x110
    el0_svc_handler+0x2c0/0x47c
    el0_svc+0x8/0xc

    Freed by task 1625:
    __kasan_slab_free+0x114/0x208
    kasan_slab_free+0xc/0x18
    kfree+0x1a8/0x1e0
    single_release+0x7c/0x9c
    close_pdeo+0x13c/0x43c
    proc_reg_release+0xec/0x108
    __fput+0x2f8/0x784
    ____fput+0x1c/0x28
    task_work_run+0xc0/0x1b0
    do_notify_resume+0xb44/0x1278
    work_pending+0x8/0x10

    The buggy address belongs to the object at ffff809681b89e00
    which belongs to the cache kmalloc-128 of size 128
    The buggy address is located 0 bytes inside of
    128-byte region [ffff809681b89e00, ffff809681b89e80)
    The buggy address belongs to the page:
    page:ffff7fe025a06e00 count:1 mapcount:0 mapping:01ff80082000fb00
    index:0xffff809681b8fe04
    flags: 0x17ffffffc000200(slab)
    raw: 017ffffffc000200 ffff7fe025a06d08 ffff7fe022ef7b88 01ff80082000fb00
    raw: ffff809681b8fe04 ffff809681b80000 00000001000000e0 0000000000000000
    page dumped because: kasan: bad access detected
    page allocated via order 0, migratetype Unmovable, gfp_mask
    0x2420c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE)
    prep_new_page+0x4e0/0x5e0
    get_page_from_freelist+0x4ce8/0x50d4
    __alloc_pages_nodemask+0x738/0x38b8
    cache_grow_begin+0xd8/0xa24
    ____cache_alloc_node+0x14c/0x268
    __kmalloc+0x1c8/0x3fc
    ftrace_free_mem+0x408/0x1284
    ftrace_free_init_mem+0x20/0x28
    kernel_init+0x24/0x548
    ret_from_fork+0x10/0x18

    Memory state around the buggy address:
    ffff809681b89c00: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
    ffff809681b89d00: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
    >ffff809681b89e00: 99 99 99 99 99 99 99 99 fe fe fe fe fe fe fe fe
    ^
    ffff809681b89f00: 43 43 43 43 43 fe fe fe fe fe fe fe fe fe fe fe
    ffff809681b8a000: 6d fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe

    Link: http://lkml.kernel.org/r/20190403022858.97584-1-cai@lca.pw
    Fixes: 51dedad06b5f ("kasan, slab: make freelist stored without tags")
    Signed-off-by: Qian Cai
    Reviewed-by: Andrey Konovalov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     

17 Apr, 2019

1 commit

  • store_stackinfo() does not seem used in actual SLAB debugging.
    Potentially, it could be added to check_poison_obj() to provide more
    information but this seems like an overkill due to the declining
    popularity of SLAB, so just remove it instead.

    Signed-off-by: Qian Cai
    Signed-off-by: Borislav Petkov
    Acked-by: Thomas Gleixner
    Acked-by: Vlastimil Babka
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Josh Poimboeuf
    Cc: linux-mm
    Cc: Pekka Enberg
    Cc: rientjes@google.com
    Cc: sean.j.christopherson@intel.com
    Link: https://lkml.kernel.org/r/20190416142258.18694-1-cai@lca.pw

    Qian Cai
     

08 Apr, 2019

1 commit

  • Commit 510ded33e075 ("slab: implement slab_root_caches list") changed
    the name of the list node within "struct kmem_cache" from "list" to
    "root_caches_node", but leaks_show() still uses "list", which causes a
    crash when reading /proc/slab_allocators.

    You need to have CONFIG_SLAB=y and CONFIG_MEMCG=y to see the problem,
    because without MEMCG all slab caches are root caches, and the "list"
    node happens to be the right one.

    Fixes: 510ded33e075 ("slab: implement slab_root_caches list")
    Signed-off-by: Qian Cai
    Reviewed-by: Tobin C. Harding
    Cc: Tejun Heo
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     

30 Mar, 2019

1 commit

  • Patch series "iommu/io-pgtable-arm-v7s: Use DMA32 zone for page tables",
    v6.

    This is a followup to the discussion in [1], [2].

    IOMMUs using ARMv7 short-descriptor format require page tables (level 1
    and 2) to be allocated within the first 4GB of RAM, even on 64-bit
    systems.

    For L1 tables that are bigger than a page, we can just use
    __get_free_pages with GFP_DMA32 (on arm64 systems only, arm would still
    use GFP_DMA).

    For L2 tables that only take 1KB, it would be a waste to allocate a full
    page, so we considered 3 approaches:
    1. This series, adding support for GFP_DMA32 slab caches.
    2. genalloc, which requires pre-allocating the maximum number of L2 page
    tables (4096, so 4MB of memory).
    3. page_frag, which is not very memory-efficient as it is unable to reuse
    freed fragments until the whole page is freed. [3]

    This series is the most memory-efficient approach.

    stable@ note:
    We confirmed that this is a regression, and IOMMU errors happen on 4.19
    and linux-next/master on MT8173 (elm, Acer Chromebook R13). The issue
    most likely starts from commit ad67f5a6545f ("arm64: replace ZONE_DMA
    with ZONE_DMA32"), i.e. 4.15, and presumably breaks a number of Mediatek
    platforms (and maybe others?).

    [1] https://lists.linuxfoundation.org/pipermail/iommu/2018-November/030876.html
    [2] https://lists.linuxfoundation.org/pipermail/iommu/2018-December/031696.html
    [3] https://patchwork.codeaurora.org/patch/671639/

    This patch (of 3):

    IOMMUs using ARMv7 short-descriptor format require page tables to be
    allocated within the first 4GB of RAM, even on 64-bit systems. On arm64,
    this is done by passing GFP_DMA32 flag to memory allocation functions.

    For IOMMU L2 tables that only take 1KB, it would be a waste to allocate
    a full page using get_free_pages, so we considered 3 approaches:
    1. This patch, adding support for GFP_DMA32 slab caches.
    2. genalloc, which requires pre-allocating the maximum number of L2
    page tables (4096, so 4MB of memory).
    3. page_frag, which is not very memory-efficient as it is unable
    to reuse freed fragments until the whole page is freed.

    This change makes it possible to create a custom cache in DMA32 zone using
    kmem_cache_create, then allocate memory using kmem_cache_alloc.

    We do not create a DMA32 kmalloc cache array, as there are currently no
    users of kmalloc(..., GFP_DMA32). These calls will continue to trigger a
    warning, as we keep GFP_DMA32 in GFP_SLAB_BUG_MASK.

    This implies that calls to kmem_cache_*alloc on a SLAB_CACHE_DMA32
    kmem_cache must _not_ use GFP_DMA32 (it is anyway redundant and
    unnecessary).
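
    A hypothetical usage sketch under the rules above (cache name and sizes
    are illustrative only):

        static struct kmem_cache *l2_table_cache;

        static int __init l2_table_cache_init(void)
        {
                /* 1KB objects, backed by pages from ZONE_DMA32 */
                l2_table_cache = kmem_cache_create("io-pgtable-l2",
                                                   SZ_1K, SZ_1K,
                                                   SLAB_CACHE_DMA32, NULL);
                return l2_table_cache ? 0 : -ENOMEM;
        }

        static void *l2_table_alloc(gfp_t gfp)
        {
                /* plain GFP flags here; passing GFP_DMA32 would trip
                 * GFP_SLAB_BUG_MASK */
                return kmem_cache_alloc(l2_table_cache, gfp);
        }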

    Link: http://lkml.kernel.org/r/20181210011504.122604-2-drinkcat@chromium.org
    Signed-off-by: Nicolas Boichat
    Acked-by: Vlastimil Babka
    Acked-by: Will Deacon
    Cc: Robin Murphy
    Cc: Joerg Roedel
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Sasha Levin
    Cc: Huaisheng Ye
    Cc: Mike Rapoport
    Cc: Yong Wu
    Cc: Matthias Brugger
    Cc: Tomasz Figa
    Cc: Yingjoe Chen
    Cc: Christoph Hellwig
    Cc: Matthew Wilcox
    Cc: Hsin-Yi Wang
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicolas Boichat
     

06 Mar, 2019

3 commits

  • Many kernel-doc comments in mm/ have the return value descriptions
    either misformatted or omitted altogether, which makes the kernel-doc
    script unhappy:

    $ make V=1 htmldocs
    ...
    ./mm/util.c:36: info: Scanning doc for kstrdup
    ./mm/util.c:41: warning: No description found for return value of 'kstrdup'
    ./mm/util.c:57: info: Scanning doc for kstrdup_const
    ./mm/util.c:66: warning: No description found for return value of 'kstrdup_const'
    ./mm/util.c:75: info: Scanning doc for kstrndup
    ./mm/util.c:83: warning: No description found for return value of 'kstrndup'
    ...

    Fixing the formatting and adding the missing return value descriptions
    eliminates ~100 such warnings.

    Link: http://lkml.kernel.org/r/1549549644-4903-4-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Number of NUMA nodes can't be negative.

    This saves a few bytes on x86_64:

    add/remove: 0/0 grow/shrink: 4/21 up/down: 27/-265 (-238)
    Function old new delta
    hv_synic_alloc.cold 88 110 +22
    prealloc_shrinker 260 262 +2
    bootstrap 249 251 +2
    sched_init_numa 1566 1567 +1
    show_slab_objects 778 777 -1
    s_show 1201 1200 -1
    kmem_cache_init 346 345 -1
    __alloc_workqueue_key 1146 1145 -1
    mem_cgroup_css_alloc 1614 1612 -2
    __do_sys_swapon 4702 4699 -3
    __list_lru_init 655 651 -4
    nic_probe 2379 2374 -5
    store_user_store 118 111 -7
    red_zone_store 106 99 -7
    poison_store 106 99 -7
    wq_numa_init 348 338 -10
    __kmem_cache_empty 75 65 -10
    task_numa_free 186 173 -13
    merge_across_nodes_store 351 336 -15
    irq_create_affinity_masks 1261 1246 -15
    do_numa_crng_init 343 321 -22
    task_numa_fault 4760 4737 -23
    swapfile_init 179 156 -23
    hv_synic_alloc 536 492 -44
    apply_wqattrs_prepare 746 695 -51

    Link: http://lkml.kernel.org/r/20190201223029.GA15820@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Kmemleak throws endless warnings during boot due to this code in
    __alloc_alien_cache():

    alc = kmalloc_node(memsize, gfp, node);
    init_arraycache(&alc->ac, entries, batch);
    kmemleak_no_scan(ac);

    Kmemleak does not track the array cache (alc->ac) but the alien cache
    (alc) instead, so let it track the latter by lifting kmemleak_no_scan()
    out of init_arraycache().

    There is another place that calls init_arraycache(), but
    alloc_kmem_cache_cpus() uses the percpu allocation, which will never be
    considered a leak.

    kmemleak: Found object by alias at 0xffff8007b9aa7e38
    CPU: 190 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc2+ #2
    Call trace:
    dump_backtrace+0x0/0x168
    show_stack+0x24/0x30
    dump_stack+0x88/0xb0
    lookup_object+0x84/0xac
    find_and_get_object+0x84/0xe4
    kmemleak_no_scan+0x74/0xf4
    setup_kmem_cache_node+0x2b4/0x35c
    __do_tune_cpucache+0x250/0x2d4
    do_tune_cpucache+0x4c/0xe4
    enable_cpucache+0xc8/0x110
    setup_cpu_cache+0x40/0x1b8
    __kmem_cache_create+0x240/0x358
    create_cache+0xc0/0x198
    kmem_cache_create_usercopy+0x158/0x20c
    kmem_cache_create+0x50/0x64
    fsnotify_init+0x58/0x6c
    do_one_initcall+0x194/0x388
    kernel_init_freeable+0x668/0x688
    kernel_init+0x18/0x124
    ret_from_fork+0x10/0x18
    kmemleak: Object 0xffff8007b9aa7e00 (size 256):
    kmemleak: comm "swapper/0", pid 1, jiffies 4294697137
    kmemleak: min_count = 1
    kmemleak: count = 0
    kmemleak: flags = 0x1
    kmemleak: checksum = 0
    kmemleak: backtrace:
    kmemleak_alloc+0x84/0xb8
    kmem_cache_alloc_node_trace+0x31c/0x3a0
    __kmalloc_node+0x58/0x78
    setup_kmem_cache_node+0x26c/0x35c
    __do_tune_cpucache+0x250/0x2d4
    do_tune_cpucache+0x4c/0xe4
    enable_cpucache+0xc8/0x110
    setup_cpu_cache+0x40/0x1b8
    __kmem_cache_create+0x240/0x358
    create_cache+0xc0/0x198
    kmem_cache_create_usercopy+0x158/0x20c
    kmem_cache_create+0x50/0x64
    fsnotify_init+0x58/0x6c
    do_one_initcall+0x194/0x388
    kernel_init_freeable+0x668/0x688
    kernel_init+0x18/0x124
    kmemleak: Not scanning unknown object at 0xffff8007b9aa7e38
    CPU: 190 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc2+ #2
    Call trace:
    dump_backtrace+0x0/0x168
    show_stack+0x24/0x30
    dump_stack+0x88/0xb0
    kmemleak_no_scan+0x90/0xf4
    setup_kmem_cache_node+0x2b4/0x35c
    __do_tune_cpucache+0x250/0x2d4
    do_tune_cpucache+0x4c/0xe4
    enable_cpucache+0xc8/0x110
    setup_cpu_cache+0x40/0x1b8
    __kmem_cache_create+0x240/0x358
    create_cache+0xc0/0x198
    kmem_cache_create_usercopy+0x158/0x20c
    kmem_cache_create+0x50/0x64
    fsnotify_init+0x58/0x6c
    do_one_initcall+0x194/0x388
    kernel_init_freeable+0x668/0x688
    kernel_init+0x18/0x124
    ret_from_fork+0x10/0x18
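
    A sketch of the fix as described above: kmemleak_no_scan() is lifted out
    of init_arraycache() and applied to the alien cache object itself
    (simplified):

        static struct alien_cache *__alloc_alien_cache(int node, int entries,
                                                       int batch, gfp_t gfp)
        {
                size_t memsize = sizeof(void *) * entries +
                                 sizeof(struct alien_cache);
                struct alien_cache *alc;

                alc = kmalloc_node(memsize, gfp, node);
                if (alc) {
                        /* track alc itself, not the embedded array cache */
                        kmemleak_no_scan(alc);
                        init_arraycache(&alc->ac, entries, batch);
                        spin_lock_init(&alc->lock);
                }
                return alc;
        }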

    Link: http://lkml.kernel.org/r/20190129184518.39808-1-cai@lca.pw
    Fixes: 1fe00d50a9e8 ("slab: factor out initialization of array cache")
    Signed-off-by: Qian Cai
    Reviewed-by: Andrew Morton
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     

22 Feb, 2019

3 commits

  • kasan_slab_alloc() calls in kmem_cache_alloc() and kmem_cache_alloc_node()
    are redundant as they are already called via slab_alloc/slab_alloc_node()->
    slab_post_alloc_hook()->kasan_slab_alloc(). Remove them.

    Link: http://lkml.kernel.org/r/4ca1655cdcfc4379c49c50f7bf80f81c4ad01485.1550602886.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Tested-by: Qian Cai
    Cc: Alexander Potapenko
    Cc: Andrey Ryabinin
    Cc: Catalin Marinas
    Cc: Dmitry Vyukov
    Cc: Evgeniy Stepanov
    Cc: Kostya Serebryany
    Cc: Vincenzo Frascino
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     
  • Similarly to "kasan, slub: move kasan_poison_slab hook before
    page_address", move kasan_poison_slab() before alloc_slabmgmt(), which
    calls page_address(), to make page_address() return value to be
    non-tagged. This, combined with calling kasan_reset_tag() for off-slab
    slab management object, leads to freelist being stored non-tagged.

    Link: http://lkml.kernel.org/r/dfb53b44a4d00de3879a05a9f04c1f55e584f7a1.1550602886.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Tested-by: Qian Cai
    Cc: Alexander Potapenko
    Cc: Andrey Ryabinin
    Cc: Catalin Marinas
    Cc: Dmitry Vyukov
    Cc: Evgeniy Stepanov
    Cc: Kostya Serebryany
    Cc: Vincenzo Frascino
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     
  • Similarly to commit 96fedce27e13 ("kasan: make tag based mode work with
    CONFIG_HARDENED_USERCOPY"), we need to reset pointer tags in
    __check_heap_object() in mm/slab.c before doing any pointer math.

    Link: http://lkml.kernel.org/r/9a5c0f958db10e69df5ff9f2b997866b56b7effc.1550602886.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Tested-by: Qian Cai
    Cc: Alexander Potapenko
    Cc: Andrey Ryabinin
    Cc: Catalin Marinas
    Cc: Dmitry Vyukov
    Cc: Evgeniy Stepanov
    Cc: Kostya Serebryany
    Cc: Vincenzo Frascino
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     

09 Jan, 2019

1 commit

  • Callers of __alloc_alien() check for NULL. We must do the same check in
    __alloc_alien_cache to avoid NULL pointer dereferences on allocation
    failures.

    Link: http://lkml.kernel.org/r/010001680f42f192-82b4e12e-1565-4ee0-ae1f-1e98974906aa-000000@email.amazonses.com
    Fixes: 49dfc304ba241 ("slab: use the lock on alien_cache, instead of the lock on array_cache")
    Fixes: c8522a3a5832b ("Slab: introduce alloc_alien")
    Signed-off-by: Christoph Lameter
    Reported-by: syzbot+d6ed4ec679652b4fd4e4@syzkaller.appspotmail.com
    Reviewed-by: Andrew Morton
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

29 Dec, 2018

5 commits

  • totalram_pages and totalhigh_pages are made static inline functions.

    The main motivation was that managed_page_count_lock handling was
    complicating things. It was discussed at length here,
    https://lore.kernel.org/patchwork/patch/995739/#1181785, so it seems
    better to remove the lock and convert the variables to atomic, with
    preventing potential store-to-read tearing as a bonus.
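
    A sketch of the resulting interface, per the description above (the
    counter becomes atomic behind accessor helpers; only totalram_pages is
    shown, totalhigh_pages is analogous):

        extern atomic_long_t _totalram_pages;

        static inline unsigned long totalram_pages(void)
        {
                return (unsigned long)atomic_long_read(&_totalram_pages);
        }

        static inline void totalram_pages_add(long count)
        {
                atomic_long_add(count, &_totalram_pages);
        }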

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org
    Signed-off-by: Arun KS
    Suggested-by: Michal Hocko
    Suggested-by: Vlastimil Babka
    Reviewed-by: Konstantin Khlebnikov
    Reviewed-by: Pavel Tatashin
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun KS
     
  • Tag-based KASAN doesn't check memory accesses through pointers tagged
    with 0xff. When page_address is used to get a pointer to memory that
    corresponds to some page, the tag of the resulting pointer gets set to
    0xff, even though the allocated memory might have been tagged
    differently.

    For slab pages it's impossible to recover the correct tag to return from
    page_address, since the page might contain multiple slab objects tagged
    with different values, and we can't know in advance which one of them is
    going to get accessed. For non slab pages however, we can recover the tag
    in page_address, since the whole page was marked with the same tag.

    This patch adds tagging to non slab memory allocated with pagealloc. To
    set the tag of the pointer returned from page_address, the tag gets stored
    to page->flags when the memory gets allocated.

    Link: http://lkml.kernel.org/r/d758ddcef46a5abc9970182b9137e2fbee202a2c.1544099024.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Reviewed-by: Andrey Ryabinin
    Reviewed-by: Dmitry Vyukov
    Acked-by: Will Deacon
    Cc: Christoph Lameter
    Cc: Mark Rutland
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     
  • While with SLUB we can actually preassign tags for caches with
    constructors and store them in pointers in the freelist, SLAB doesn't
    allow that, since the freelist is stored as an array of indexes, so
    there are no pointers in which to store the tags.

    Instead we compute the tag twice: once when a slab is created, before
    calling the constructor, and then again each time an object is allocated
    with kmalloc. The tag is computed simply by taking the lowest byte of
    the index that corresponds to the object. However, in kasan_kmalloc we
    only have access to the object's pointer, so we need a way to find out
    which index this object corresponds to.

    This patch moves obj_to_index from slab.c to include/linux/slab_def.h to
    be reused by KASAN.
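
    For reference, a sketch of obj_to_index() as it is used for the tag
    computation described above (it divides the object's offset within the
    slab by the object size):

        static inline unsigned int obj_to_index(const struct kmem_cache *cache,
                                                const struct page *page,
                                                void *obj)
        {
                u32 offset = (obj - page->s_mem);

                return reciprocal_divide(offset, cache->reciprocal_buffer_size);
        }

        /*
         * The assigned tag is then simply the lowest byte:
         *   (u8)obj_to_index(cache, virt_to_page(object), (void *)object)
         */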

    Link: http://lkml.kernel.org/r/c02cd9e574cfd93858e43ac94b05e38f891fef64.1544099024.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Reviewed-by: Andrey Ryabinin
    Reviewed-by: Dmitry Vyukov
    Acked-by: Christoph Lameter
    Cc: Mark Rutland
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     
  • An object constructor can initialize pointers within the object based on
    the address of the object. Since the object address might be tagged, we
    need to assign a tag before calling the constructor.

    The implemented approach is to assign tags to objects with constructors
    when a slab is allocated and to call the constructors once, as usual.
    The downside is that such an object would always have the same tag when
    it is reallocated, so we won't catch use-after-frees on it.

    Also preassign tags for objects from SLAB_TYPESAFE_BY_RCU caches, since
    they can be validly accessed after having been freed.

    Link: http://lkml.kernel.org/r/f158a8a74a031d66f0a9398a5b0ed453c37ba09a.1544099024.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Reviewed-by: Andrey Ryabinin
    Reviewed-by: Dmitry Vyukov
    Cc: Christoph Lameter
    Cc: Mark Rutland
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     
  • Patch series "kasan: add software tag-based mode for arm64", v13.

    This patchset adds a new software tag-based mode to KASAN [1]. (Initially
    this mode was called KHWASAN, but it got renamed, see the naming rationale
    at the end of this section).

    The plan is to implement HWASan [2] for the kernel with the expectation
    that it will have performance comparable to KASAN but at the same time
    consume much less memory, trading that off for somewhat imprecise bug
    detection and for being supported only on arm64.

    The underlying ideas of the approach used by software tag-based KASAN are:

    1. By using the Top Byte Ignore (TBI) arm64 CPU feature, we can store
    pointer tags in the top byte of each kernel pointer.

    2. Using shadow memory, we can store memory tags for each chunk of kernel
    memory.

    3. On each memory allocation, we can generate a random tag, embed it into
    the returned pointer and set the memory tags that correspond to this
    chunk of memory to the same value.

    4. By using compiler instrumentation, before each memory access we can add
    a check that the pointer tag matches the tag of the memory that is being
    accessed.

    5. On a tag mismatch we report an error.
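
    An illustrative sketch of point 1 above, i.e. how a tag can be carried
    in the top byte of a TBI-enabled arm64 kernel pointer (the macro and
    helper names here are hypothetical, not the kernel's):

        #define PTR_TAG_SHIFT   56
        #define PTR_TAG_MASK    (0xffUL << PTR_TAG_SHIFT)

        static inline void *ptr_set_tag(const void *addr, u8 tag)
        {
                return (void *)(((u64)addr & ~PTR_TAG_MASK) |
                                ((u64)tag << PTR_TAG_SHIFT));
        }

        static inline u8 ptr_get_tag(const void *addr)
        {
                return (u8)((u64)addr >> PTR_TAG_SHIFT);
        }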

    With this patchset the existing KASAN mode gets renamed to generic KASAN,
    with the word "generic" meaning that the implementation can be supported
    by any architecture as it is purely software.

    The new mode this patchset adds is called software tag-based KASAN. The
    word "tag-based" refers to the fact that this mode uses tags embedded into
    the top byte of kernel pointers and the TBI arm64 CPU feature that allows
    to dereference such pointers. The word "software" here means that shadow
    memory manipulation and tag checking on pointer dereference is done in
    software. As it is the only tag-based implementation right now, "software
    tag-based" KASAN is sometimes referred to as simply "tag-based" in this
    patchset.

    A potential expansion of this mode is a hardware tag-based mode, which
    would use hardware memory tagging support (announced by Arm [3]) instead
    of compiler instrumentation and manual shadow memory manipulation.

    Same as generic KASAN, software tag-based KASAN is strictly a debugging
    feature.

    [1] https://www.kernel.org/doc/html/latest/dev-tools/kasan.html

    [2] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html

    [3] https://community.arm.com/processors/b/blog/posts/arm-a-profile-architecture-2018-developments-armv85a

    ====== Rationale

    On mobile devices, generic KASAN's memory usage is a significant
    problem. One of the main reasons to have tag-based KASAN is to be able
    to perform a similar set of checks as the generic one does, but with
    lower memory requirements.

    Comment from Vishwath Mohan :

    I don't have data on-hand, but anecdotally both ASAN and KASAN have proven
    problematic to enable for environments that don't tolerate the increased
    memory pressure well. This includes

    (a) Low-memory form factors - Wear, TV, Things, lower-tier phones like Go,
    (c) Connected components like Pixel's visual core [1].

    These are both places I'd love to have a low(er) memory footprint option at
    my disposal.

    Comment from Evgenii Stepanov :

    Looking at a live Android device under load, slab (according to
    /proc/meminfo) + kernel stack take 8-10% available RAM (~350MB). KASAN's
    overhead of 2x - 3x on top of it is not insignificant.

    Not having this overhead enables near-production use - ex. running
    KASAN/KHWASAN kernel on a personal, daily-use device to catch bugs that do
    not reproduce in test configuration. These are the ones that often cost
    the most engineering time to track down.

    CPU overhead is bad, but generally tolerable. RAM is critical, in our
    experience. Once it gets low enough, OOM-killer makes your life
    miserable.

    [1] https://www.blog.google/products/pixel/pixel-visual-core-image-processing-and-machine-learning-pixel-2/

    ====== Technical details

    Software tag-based KASAN mode is implemented in a very similar way to the
    generic one. This patchset essentially does the following:

    1. TCR_TBI1 is set to enable Top Byte Ignore.

    2. Shadow memory is used (with a different scale, 1:16, so each shadow
    byte corresponds to 16 bytes of kernel memory) to store memory tags.

    3. All slab objects are aligned to shadow scale, which is 16 bytes.

    4. All pointers returned from the slab allocator are tagged with a random
    tag and the corresponding shadow memory is poisoned with the same value.

    5. Compiler instrumentation is used to insert tag checks. Either by
    calling callbacks or by inlining them (CONFIG_KASAN_OUTLINE and
    CONFIG_KASAN_INLINE flags are reused).

    6. When a tag mismatch is detected in callback instrumentation mode,
    KASAN simply prints a bug report. In the case of inline instrumentation,
    clang inserts a brk instruction, and KASAN has its own brk handler,
    which reports the bug.

    7. The memory in between slab objects is marked with a reserved tag, and
    acts as a redzone.

    8. When a slab object is freed it's marked with a reserved tag.

    Bug detection is imprecise for two reasons:

    1. We won't catch some small out-of-bounds accesses, that fall into the
    same shadow cell, as the last byte of a slab object.

    2. We only have 1 byte to store tags, which means we have a 1/256
    probability of a tag match for an incorrect access (actually even
    slightly less due to reserved tag values).

    Despite that there's a particular type of bugs that tag-based KASAN can
    detect compared to generic KASAN: use-after-free after the object has been
    allocated by someone else.

    ====== Testing

    Some kernel developers voiced a concern that changing the top byte of
    kernel pointers may lead to subtle bugs that are difficult to discover.
    To address this concern deliberate testing has been performed.

    It doesn't seem feasible to do some kind of static checking to find
    potential issues with pointer tagging, so a dynamic approach was taken.
    All pointer comparisons/subtractions have been instrumented in an LLVM
    compiler pass, and a kernel module has been used that prints a bug
    report whenever two pointers with different tags are compared/subtracted
    (ignoring comparisons with NULL pointers and with pointers obtained by
    casting an error code to a pointer type). Then the kernel has been
    booted in QEMU and on an Odroid C2 board, and syzkaller has been run.

    This yielded the following results.

    The two places that look interesting are:

    is_vmalloc_addr in include/linux/mm.h
    is_kernel_rodata in mm/util.c

    Here we compare a pointer with some fixed untagged values to make sure
    that the pointer lies in a particular part of the kernel address space.
    Since tag-based KASAN doesn't add tags to pointers that belong to rodata
    or vmalloc regions, this should work as is. To make sure, debug checks
    have been added to those two functions to verify that the result doesn't
    change whether we operate on pointers with or without untagging.

    A few other cases that don't look that interesting:

    Comparing pointers to achieve unique sorting order of pointee objects
    (e.g. sorting locks addresses before performing a double lock):

    tty_ldisc_lock_pair_timeout in drivers/tty/tty_ldisc.c
    pipe_double_lock in fs/pipe.c
    unix_state_double_lock in net/unix/af_unix.c
    lock_two_nondirectories in fs/inode.c
    mutex_lock_double in kernel/events/core.c

    ep_cmp_ffd in fs/eventpoll.c
    fsnotify_compare_groups fs/notify/mark.c

    Nothing needs to be done here, since the tags embedded into pointers
    don't change, so the sorting order would still be unique.

    Checks that a pointer belongs to some particular allocation:

    is_sibling_entry in lib/radix-tree.c
    object_is_on_stack in include/linux/sched/task_stack.h

    Nothing needs to be done here either, since two pointers can only belong
    to the same allocation if they have the same tag.

    Overall, since the kernel boots and works, there are no critical bugs.
    As for the rest, the traditional kernel testing way (use until fails) is
    the only one that looks feasible.

    Another point here is that tag-based KASAN is available under a separate
    config option that needs to be deliberately enabled. Even though it might
    be used in a "near-production" environment to find bugs that are not found
    during fuzzing or running tests, it is still a debug tool.

    ====== Benchmarks

    The following numbers were collected on Odroid C2 board. Both generic and
    tag-based KASAN were used in inline instrumentation mode.

    Boot time [1]:
    * ~1.7 sec for clean kernel
    * ~5.0 sec for generic KASAN
    * ~5.0 sec for tag-based KASAN

    Network performance [2]:
    * 8.33 Gbits/sec for clean kernel
    * 3.17 Gbits/sec for generic KASAN
    * 2.85 Gbits/sec for tag-based KASAN

    Slab memory usage after boot [3]:
    * ~40 kb for clean kernel
    * ~105 kb (~260% overhead) for generic KASAN
    * ~47 kb (~20% overhead) for tag-based KASAN

    KASAN memory overhead consists of three main parts:
    1. Increased slab memory usage due to redzones.
    2. Shadow memory (the whole reserved once during boot).
    3. Quarantine (grows gradually until some preset limit; the larger the
    limit, the greater the chance of detecting a use-after-free).

    Comparing tag-based vs generic KASAN for each of these points:
    1. 20% vs 260% overhead.
    2. 1/16th vs 1/8th of physical memory.
    3. Tag-based KASAN doesn't require quarantine.

    [1] Time before the ext4 driver is initialized.
    [2] Measured as `iperf -s & iperf -c 127.0.0.1 -t 30`.
    [3] Measured as `cat /proc/meminfo | grep Slab`.

    ====== Some notes

    A few notes:

    1. The patchset can be found here:
    https://github.com/xairy/kasan-prototype/tree/khwasan

    2. Building requires a recent Clang version (7.0.0 or later).

    3. Stack instrumentation is not supported yet and will be added later.

    This patch (of 25):

    Tag-based KASAN changes the value of the top byte of pointers returned
    from the kernel allocation functions (such as kmalloc). This patch
    updates KASAN hooks signatures and their usage in SLAB and SLUB code to
    reflect that.

    Link: http://lkml.kernel.org/r/aec2b5e3973781ff8a6bb6760f8543643202c451.1544099024.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Reviewed-by: Andrey Ryabinin
    Reviewed-by: Dmitry Vyukov
    Cc: Christoph Lameter
    Cc: Mark Rutland
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     

28 Nov, 2018

1 commit

  • Now that synchronize_rcu() waits for preempt-disable regions of code
    as well as RCU read-side critical sections, synchronize_sched() can be
    replaced by synchronize_rcu(). This commit therefore makes this change.

    Signed-off-by: Paul E. McKenney
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrew Morton
    Cc:

    Paul E. McKenney
     

27 Oct, 2018

2 commits

  • Patch series "kmalloc-reclaimable caches", v4.

    As discussed at LSF/MM [1] here's a patchset that introduces
    kmalloc-reclaimable caches (more details in the second patch) and uses
    them for dcache external names. That allows us to repurpose the
    NR_INDIRECTLY_RECLAIMABLE_BYTES counter later in the series.

    With patch 3/6, dcache external names are allocated from kmalloc-rcl-*
    caches, eliminating the need for manual accounting. More importantly, it
    also ensures the reclaimable kmalloc allocations are grouped in pages
    separate from the regular kmalloc allocations. The need for proper
    accounting of dcache external names has shown it's easy for a
    misbehaving process to allocate lots of them, causing premature OOMs.
    Without the added grouping, it's likely that a similar workload can
    interleave the dcache external name allocations with regular kmalloc
    allocations (note: I haven't searched myself for an example of such a
    regular kmalloc allocation, but I would be very surprised if there
    wasn't some). A pathological case would be e.g. one 64-byte regular
    allocation with 63 external dcache names in a page (64x64=4096), which
    means the page is not freed even after all the dcache names are
    reclaimed, and the process can thus "steal" the whole page with a single
    64-byte allocation.

    If other kmalloc users similar to dcache external names become identified,
    they can also benefit from the new functionality simply by adding
    __GFP_RECLAIMABLE to the kmalloc calls.
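
    For example (a hypothetical call site, shown only to illustrate the
    opt-in; "name" and "len" are not from the original patch):

    /* Objects that the owner can free under memory pressure get grouped  */
    /* with other reclaimable allocations simply by passing the flag:     */
    name = kmalloc(len, GFP_KERNEL | __GFP_RECLAIMABLE);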

    Side benefits of the patchset (which could also be merged separately)
    include removing the branch for detecting __GFP_DMA in kmalloc(), and
    shortening the kmalloc cache names in the /proc/slabinfo output. The
    latter is potentially an ABI break in case there are tools parsing the
    names and expecting the values to be in bytes.

    This is how /proc/slabinfo looks after booting in virtme:

    ...
    kmalloc-rcl-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
    ...
    kmalloc-rcl-96 7 32 128 32 1 : tunables 120 60 8 : slabdata 1 1 0
    kmalloc-rcl-64 25 128 64 64 1 : tunables 120 60 8 : slabdata 2 2 0
    kmalloc-rcl-32 0 0 32 124 1 : tunables 120 60 8 : slabdata 0 0 0
    kmalloc-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
    kmalloc-2M 0 0 2097152 1 512 : tunables 1 1 0 : slabdata 0 0 0
    kmalloc-1M 0 0 1048576 1 256 : tunables 1 1 0 : slabdata 0 0 0
    ...

    /proc/vmstat with renamed nr_indirectly_reclaimable_bytes counter:

    ...
    nr_slab_reclaimable 2817
    nr_slab_unreclaimable 1781
    ...
    nr_kernel_misc_reclaimable 0
    ...

    /proc/meminfo with new KReclaimable counter:

    ...
    Shmem: 564 kB
    KReclaimable: 11260 kB
    Slab: 18368 kB
    SReclaimable: 11260 kB
    SUnreclaim: 7108 kB
    KernelStack: 1248 kB
    ...

    This patch (of 6):

    The kmalloc caches currently maintain a separate (optional) array,
    kmalloc_dma_caches, for __GFP_DMA allocations, and there are tests for
    __GFP_DMA in the allocation hotpaths. We can avoid these branches by
    combining kmalloc_caches and kmalloc_dma_caches into a single
    two-dimensional array where the outer dimension is the cache "type". This
    will also allow adding kmalloc-reclaimable caches as a third type.
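
    A simplified sketch of the resulting layout and type selection (the names
    here follow the description above and may not match the final code
    exactly):

    enum kmalloc_cache_type {
        KMALLOC_NORMAL = 0,
        KMALLOC_RECLAIM,        /* added later in the series */
        KMALLOC_DMA,            /* only populated with CONFIG_ZONE_DMA */
        NR_KMALLOC_TYPES
    };

    /* One row per type instead of a separate kmalloc_dma_caches[] array. */
    struct kmem_cache *kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1];

    static __always_inline enum kmalloc_cache_type kmalloc_type(gfp_t flags)
    {
        /* A single test covers both special cases in the common path. */
        if (likely((flags & (__GFP_DMA | __GFP_RECLAIMABLE)) == 0))
            return KMALLOC_NORMAL;
        return (flags & __GFP_DMA) ? KMALLOC_DMA : KMALLOC_RECLAIM;
    }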

    Link: http://lkml.kernel.org/r/20180731090649.16028-2-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Christoph Lameter
    Acked-by: Roman Gushchin
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Matthew Wilcox
    Cc: Laura Abbott
    Cc: Sumit Semwal
    Cc: Vijayanand Jitta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Slub does not call kmalloc_slab() for sizes > KMALLOC_MAX_CACHE_SIZE;
    instead it falls back to kmalloc_large().

    For slab, KMALLOC_MAX_CACHE_SIZE == KMALLOC_MAX_SIZE, and it calls
    kmalloc_slab() for all allocations, relying on a NULL return value for
    over-sized allocations.

    This inconsistency leads to unwanted warnings from kmalloc_slab() for
    over-sized allocations with slab. Returning NULL for failed allocations
    is the expected behavior.

    Make slub and slab code consistent by checking size >
    KMALLOC_MAX_CACHE_SIZE in slab before calling kmalloc_slab().

    While we are here, also fix the check in kmalloc_slab(): we should check
    against KMALLOC_MAX_CACHE_SIZE rather than KMALLOC_MAX_SIZE. It all more
    or less worked because for slab the constants are the same, and slub
    always checks the size against KMALLOC_MAX_CACHE_SIZE before calling
    kmalloc_slab(). But if we ever get there with size >
    KMALLOC_MAX_CACHE_SIZE anyway, bad things will happen, for example in the
    case of a newly introduced bug in the slub code.

    Also move the check in kmalloc_slab() from the function entry to the
    size > 192 case. This partially compensates for the additional check in
    the slab code and makes the slub code a bit faster (at least
    theoretically).

    Also drop __GFP_NOWARN in the warning check. This warning indicates a bug
    in the slab code itself; user-passed flags have nothing to do with it.

    Nothing of this affects slob.
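
    A sketch of the shape of the resulting slab-side check (simplified, not
    the exact hunk):

    /* Refuse over-sized requests before consulting kmalloc_slab(),        */
    /* mirroring the KMALLOC_MAX_CACHE_SIZE check that slub already does:  */
    if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
        return NULL;
    cachep = kmalloc_slab(size, flags);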

    Link: http://lkml.kernel.org/r/20180927171502.226522-1-dvyukov@gmail.com
    Signed-off-by: Dmitry Vyukov
    Reported-by: syzbot+87829a10073277282ad1@syzkaller.appspotmail.com
    Reported-by: syzbot+ef4e8fc3a06e9019bb40@syzkaller.appspotmail.com
    Reported-by: syzbot+6e438f4036df52cbb863@syzkaller.appspotmail.com
    Reported-by: syzbot+8574471d8734457d98aa@syzkaller.appspotmail.com
    Reported-by: syzbot+af1504df0807a083dbd9@syzkaller.appspotmail.com
    Acked-by: Christoph Lameter
    Acked-by: Vlastimil Babka
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
     

13 Jun, 2018

1 commit

  • The kzalloc() function has a 2-factor argument form, kcalloc(). This
    patch replaces cases of:

    kzalloc(a * b, gfp)

    with:
    kcalloc(a, b, gfp)

    as well as handling cases of:

    kzalloc(a * b * c, gfp)

    with:

    kzalloc(array3_size(a, b, c), gfp)

    as it's slightly less ugly than:

    kcalloc(array_size(a, b), c, gfp)

    This does, however, attempt to ignore constant size factors like:

    kzalloc(4 * 1024, gfp)

    though any constants defined via macros get caught up in the conversion.

    Any factors with a sizeof() of "unsigned char", "char", and "u8" were
    dropped, since they're redundant.
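
    A typical conversion produced by such a script looks like this (the
    variable names are hypothetical, for illustration only):

    /* before: the open-coded multiplication can overflow silently */
    buf = kzalloc(nr_entries * sizeof(*buf), GFP_KERNEL);

    /* after: kcalloc() checks the multiplication for overflow */
    buf = kcalloc(nr_entries, sizeof(*buf), GFP_KERNEL);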

    The Coccinelle script used for this was:

    // Fix redundant parens around sizeof().
    @@
    type TYPE;
    expression THING, E;
    @@

    (
    kzalloc(
    - (sizeof(TYPE)) * E
    + sizeof(TYPE) * E
    , ...)
    |
    kzalloc(
    - (sizeof(THING)) * E
    + sizeof(THING) * E
    , ...)
    )

    // Drop single-byte sizes and redundant parens.
    @@
    expression COUNT;
    typedef u8;
    typedef __u8;
    @@

    (
    kzalloc(
    - sizeof(u8) * (COUNT)
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(__u8) * (COUNT)
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(char) * (COUNT)
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(unsigned char) * (COUNT)
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(u8) * COUNT
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(__u8) * COUNT
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(char) * COUNT
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(unsigned char) * COUNT
    + COUNT
    , ...)
    )

    // 2-factor product with sizeof(type/expression) and identifier or constant.
    @@
    type TYPE;
    expression THING;
    identifier COUNT_ID;
    constant COUNT_CONST;
    @@

    (
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * (COUNT_ID)
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * COUNT_ID
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * (COUNT_CONST)
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * COUNT_CONST
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * (COUNT_ID)
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * COUNT_ID
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * (COUNT_CONST)
    + COUNT_CONST, sizeof(THING)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * COUNT_CONST
    + COUNT_CONST, sizeof(THING)
    , ...)
    )

    // 2-factor product, only identifiers.
    @@
    identifier SIZE, COUNT;
    @@

    - kzalloc
    + kcalloc
    (
    - SIZE * COUNT
    + COUNT, SIZE
    , ...)

    // 3-factor product with 1 sizeof(type) or sizeof(expression), with
    // redundant parens removed.
    @@
    expression THING;
    identifier STRIDE, COUNT;
    type TYPE;
    @@

    (
    kzalloc(
    - sizeof(TYPE) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kzalloc(
    - sizeof(THING) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kzalloc(
    - sizeof(THING) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kzalloc(
    - sizeof(THING) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kzalloc(
    - sizeof(THING) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    )

    // 3-factor product with 2 sizeof(variable), with redundant parens removed.
    @@
    expression THING1, THING2;
    identifier COUNT;
    type TYPE1, TYPE2;
    @@

    (
    kzalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kzalloc(
    - sizeof(THING1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kzalloc(
    - sizeof(THING1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    )

    // 3-factor product, only identifiers, with redundant parens removed.
    @@
    identifier STRIDE, SIZE, COUNT;
    @@

    (
    kzalloc(
    - (COUNT) * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - COUNT * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - COUNT * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - (COUNT) * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - COUNT * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - (COUNT) * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - (COUNT) * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - COUNT * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    )

    // Any remaining multi-factor products, first at least 3-factor products,
    // when they're not all constants...
    @@
    expression E1, E2, E3;
    constant C1, C2, C3;
    @@

    (
    kzalloc(C1 * C2 * C3, ...)
    |
    kzalloc(
    - (E1) * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kzalloc(
    - (E1) * (E2) * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kzalloc(
    - (E1) * (E2) * (E3)
    + array3_size(E1, E2, E3)
    , ...)
    |
    kzalloc(
    - E1 * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    )

    // And then all remaining 2 factors products when they're not all constants,
    // keeping sizeof() as the second factor argument.
    @@
    expression THING, E1, E2;
    type TYPE;
    constant C1, C2, C3;
    @@

    (
    kzalloc(sizeof(THING) * C2, ...)
    |
    kzalloc(sizeof(TYPE) * C2, ...)
    |
    kzalloc(C1 * C2 * C3, ...)
    |
    kzalloc(C1 * C2, ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * (E2)
    + E2, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * E2
    + E2, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * (E2)
    + E2, sizeof(THING)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * E2
    + E2, sizeof(THING)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - (E1) * E2
    + E1, E2
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - (E1) * (E2)
    + E1, E2
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - E1 * E2
    + E1, E2
    , ...)
    )

    Signed-off-by: Kees Cook

    Kees Cook
     

08 Jun, 2018

2 commits

  • rcu_head may now grow larger than list_head without affecting slab or
    slub.

    Link: http://lkml.kernel.org/r/20180518194519.3820-15-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Christoph Lameter
    Acked-by: Vlastimil Babka
    Cc: Dave Hansen
    Cc: Jérôme Glisse
    Cc: "Kirill A . Shutemov"
    Cc: Lai Jiangshan
    Cc: Martin Schwidefsky
    Cc: Pekka Enberg
    Cc: Randy Dunlap
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • __GFP_ZERO requests that the object be initialised to all-zeroes, while
    the purpose of a constructor is to initialise an object to a particular
    pattern. We cannot do both. Add a warning to catch any users who
    mistakenly pass a __GFP_ZERO flag when allocating a slab with a
    constructor.
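
    A minimal sketch of such a check (the placement and exact form in the
    real patch may differ):

    /* Zeroing the object would wipe out whatever the constructor set up. */
    WARN_ON_ONCE(cachep->ctor && (flags & __GFP_ZERO));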

    Link: http://lkml.kernel.org/r/20180412191322.GA21205@bombadil.infradead.org
    Fixes: d07dbea46405 ("Slab allocators: support __GFP_ZERO in all allocators")
    Signed-off-by: Matthew Wilcox
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

14 Apr, 2018

1 commit

  • cache_reap() is initially scheduled in start_cpu_timer() via
    schedule_delayed_work_on(). But then the next iterations are scheduled
    via schedule_delayed_work(), i.e. using WORK_CPU_UNBOUND.

    Thus since commit ef557180447f ("workqueue: schedule WORK_CPU_UNBOUND
    work on wq_unbound_cpumask CPUs") there is no guarantee the future
    iterations will run on the originally intended cpu, although it's still
    preferred. I was able to demonstrate this with
    /sys/module/workqueue/parameters/debug_force_rr_cpu. IIUC, it may also
    happen due to migrating timers in nohz context. As a result, some cpus
    would be calling cache_reap() more frequently and others never.

    This patch uses schedule_delayed_work_on() with the current cpu when
    scheduling the next iteration.
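
    A sketch of the change at the end of cache_reap() (a simplified view,
    assuming the usual mm/slab.c names such as REAPTIMEOUT_AC; not a verbatim
    hunk):

    /* Re-arm the reaper on this CPU instead of letting WORK_CPU_UNBOUND */
    /* pick an arbitrary one:                                            */
    schedule_delayed_work_on(smp_processor_id(), work,
                             round_jiffies_relative(REAPTIMEOUT_AC));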

    Link: http://lkml.kernel.org/r/20180411070007.32225-1-vbabka@suse.cz
    Fixes: ef557180447f ("workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask CPUs")
    Signed-off-by: Vlastimil Babka
    Acked-by: Pekka Enberg
    Acked-by: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Tejun Heo
    Cc: Lai Jiangshan
    Cc: John Stultz
    Cc: Thomas Gleixner
    Cc: Stephen Boyd
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

06 Apr, 2018

4 commits

  • The kasan quarantine is designed to delay freeing slab objects to catch
    use-after-free. The quarantine can be large (several percent of machine
    memory size). When kmem_caches are deleted, related objects are flushed
    from the quarantine, but this requires scanning the entire quarantine,
    which can be very slow. We have seen the kernel busily working on this
    while holding slab_mutex and badly affecting cache_reaper, slabinfo
    readers and memcg kmem cache creations.

    It can easily be reproduced by the following script:

    yes . | head -1000000 | xargs stat > /dev/null
    for i in `seq 1 10`; do
    seq 500 | (cd /cg/memory && xargs mkdir)
    seq 500 | xargs -I{} sh -c 'echo $BASHPID > \
    /cg/memory/{}/tasks && exec stat .' > /dev/null
    seq 500 | (cd /cg/memory && xargs rmdir)
    done

    The busy stack:
    kasan_cache_shutdown
    shutdown_cache
    memcg_destroy_kmem_caches
    mem_cgroup_css_free
    css_free_rwork_fn
    process_one_work
    worker_thread
    kthread
    ret_from_fork

    This patch is based on the observation that if the kmem_cache to be
    destroyed is empty then there should not be any objects of this cache in
    the quarantine.

    Without the patch the script got stuck for a couple of hours. With the
    patch the script completed within a second.
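
    A simplified sketch of the idea (the helper naming is illustrative and
    may differ from the actual patch):

    void kasan_cache_shutdown(struct kmem_cache *cache)
    {
        /* An empty cache cannot have objects sitting in the quarantine,  */
        /* so the expensive full-quarantine scan can be skipped entirely. */
        if (!__kmem_cache_empty(cache))
            quarantine_remove_cache(cache);
    }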

    Link: http://lkml.kernel.org/r/20180327230603.54721-1-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Reviewed-by: Andrew Morton
    Acked-by: Andrey Ryabinin
    Acked-by: Christoph Lameter
    Cc: Vladimir Davydov
    Cc: Alexander Potapenko
    Cc: Greg Thelen
    Cc: Dmitry Vyukov
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     
  • If SLAB doesn't support 4GB+ kmem caches (it never did), KASAN should
    not do it as well.
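
    In practice that means the KASAN hook takes the same 32-bit size type as
    the slab core; a sketch of the prototype change (assuming the series has
    already switched cache sizes to unsigned int):

    /* before */
    void kasan_cache_create(struct kmem_cache *cache, size_t *size,
                            slab_flags_t *flags);

    /* after */
    void kasan_cache_create(struct kmem_cache *cache, unsigned int *size,
                            slab_flags_t *flags);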

    Link: http://lkml.kernel.org/r/20180305200730.15812-20-adobriyan@gmail.com
    Signed-off-by: Alexey Dobriyan
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Now that all sizes are properly typed, propagate "unsigned int" down the
    callgraph.

    Link: http://lkml.kernel.org/r/20180305200730.15812-19-adobriyan@gmail.com
    Signed-off-by: Alexey Dobriyan
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • struct kmem_cache::size and ::align were always 32-bit.

    Out of curiosity I created a 4GB kmem_cache; it oopsed with division by 0
    in kmem_cache_create(1UL << 32, ...).
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan