22 Feb, 2018

2 commits

  • commit 75f296d93bcebcfe375884ddac79e30263a31766 upstream.

    Convert all allocations that used a NOTRACK flag to stop using it.

    Link: http://lkml.kernel.org/r/20171007030159.22241-3-alexander.levin@verizon.com
    Signed-off-by: Sasha Levin
    Cc: Alexander Potapenko
    Cc: Eric W. Biederman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Steven Rostedt
    Cc: Tim Hansen
    Cc: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Levin, Alexander (Sasha Levin)
     
  • commit 4950276672fce5c241857540f8561c440663673d upstream.

    Patch series "kmemcheck: kill kmemcheck", v2.

    As discussed at LSF/MM, kill kmemcheck.

    KASan is a replacement that is able to work without the limitations of
    kmemcheck (single CPU, slow). KASan is already upstream.

    We are also not aware of any users of kmemcheck (or users who don't
    consider KASan as a suitable replacement).

    The only objection was that since KASAN wasn't supported by all GCC
    versions provided by distros at that time we should hold off for 2
    years, and try again.

    Now that 2 years have passed, and all distros provide gcc that supports
    KASAN, kill kmemcheck again for the very same reasons.

    This patch (of 4):

    Remove kmemcheck annotations, and calls to kmemcheck from the kernel.

    [alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs]
    Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com
    Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com
    Signed-off-by: Sasha Levin
    Cc: Alexander Potapenko
    Cc: Eric W. Biederman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Steven Rostedt
    Cc: Tim Hansen
    Cc: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Levin, Alexander (Sasha Levin)
     

25 Dec, 2017

1 commit

  • commit 3382290ed2d5e275429cef510ab21889d3ccd164 upstream.

    [ Note, this is a Git cherry-pick of the following commit:

    506458efaf15 ("locking/barriers: Convert users of lockless_dereference() to READ_ONCE()")

    ... for easier x86 PTI code testing and back-porting. ]

    READ_ONCE() now has an implicit smp_read_barrier_depends() call, so it
    can be used instead of lockless_dereference() without any change in
    semantics.
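
    As an illustration of the conversion (the structure and field names below
    are made up, not code from the patch), the change is typically a one-line
    substitution:

    #include <linux/compiler.h>
    #include <linux/rcupdate.h>

    struct foo {
            int value;
    };

    static struct foo *global_foo;

    /* Before: lockless_dereference() supplied the dependency barrier. */
    static int read_value_old(void)
    {
            struct foo *f = lockless_dereference(global_foo);

            return f ? f->value : 0;
    }

    /* After: READ_ONCE() now implies smp_read_barrier_depends(), so the
     * semantics are unchanged. */
    static int read_value_new(void)
    {
            struct foo *f = READ_ONCE(global_foo);

            return f ? f->value : 0;
    }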

    Signed-off-by: Will Deacon
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1508840570-22169-4-git-send-email-will.deacon@arm.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Will Deacon
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boilerplate text.
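
    For reference, the added identifier is a single comment on the first line
    of each file; for a C source file under the kernel's default license it is:

    // SPDX-License-Identifier: GPL-2.0

    Headers and other file types use the comment style appropriate to them,
    e.g. /* SPDX-License-Identifier: GPL-2.0 */.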

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier should be applied
    to a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files, created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few thousand files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file-by-file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    should be applied to the file. She confirmed any determination that was
    not immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source.
    - File already had some variant of a license header in it (even if <5
    lines).
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

10 Aug, 2017

1 commit

  • A while ago someone, and I cannot find the email just now, asked if we
    could not implement the RECLAIM_FS inversion stuff with a 'fake' lock
    like we use for other things like workqueues etc. I think this should
    be possible which allows reducing the 'irq' states and will reduce the
    amount of __bfs() lookups we do.

    Removing the 1 IRQ state results in 4 fewer __bfs() walks per
    dependency, improving lockdep performance. And by moving this
    annotation out of the lockdep code it becomes easier for the mm people
    to extend.
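
    With lockdep enabled, such a 'fake' lock is just a lockdep_map that is
    acquired and released around the interesting region; a rough sketch
    (names loosely follow the eventual fs_reclaim annotation, which adds GFP
    checks and recursion protection on top of this, so treat it as an
    illustration only):

    #include <linux/lockdep.h>

    /* A lockdep-only pseudo-lock: no real locking, just dependency tracking. */
    static struct lockdep_map __fs_reclaim_map =
            STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map);

    /* Taken around any context that may enter FS/IO reclaim. */
    static void fs_reclaim_acquire_sketch(void)
    {
            lock_map_acquire(&__fs_reclaim_map);
    }

    static void fs_reclaim_release_sketch(void)
    {
            lock_map_release(&__fs_reclaim_map);
    }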

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Byungchul Park
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Nikolay Borisov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: akpm@linux-foundation.org
    Cc: boqun.feng@gmail.com
    Cc: iamjoonsoo.kim@lge.com
    Cc: kernel-team@lge.com
    Cc: kirill@shutemov.name
    Cc: npiggin@gmail.com
    Cc: walken@google.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

07 Jul, 2017

3 commits

  • Josef's redesign of the balancing between slab caches and the page cache
    requires slab cache statistics at the lruvec level.

    Link: http://lkml.kernel.org/r/20170530181724.27197-7-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Josef Bacik
    Cc: Michal Hocko
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The kmem-specific functions do the same thing. Switch and drop.

    Link: http://lkml.kernel.org/r/20170530181724.27197-5-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Josef Bacik
    Cc: Michal Hocko
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Now that the slab counters are moved from the zone to the node level we
    can drop the private memcg node stats and use the official ones.

    Link: http://lkml.kernel.org/r/20170530181724.27197-4-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Josef Bacik
    Cc: Michal Hocko
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

19 Apr, 2017

1 commit

  • A group of Linux kernel hackers reported chasing a bug that resulted
    from their assumption that SLAB_DESTROY_BY_RCU provided an existence
    guarantee, that is, that no block from such a slab would be reallocated
    during an RCU read-side critical section. Of course, that is not the
    case. Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
    slab of blocks.

    However, there is a phrase for this, namely "type safety". This commit
    therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
    to avoid future instances of this sort of confusion.
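
    The usage pattern the new name is meant to suggest looks roughly like
    this (hypothetical object type; the point is the identity re-check after
    taking the object's lock):

    #include <linux/spinlock.h>

    /* Hypothetical object type whose cache uses SLAB_TYPESAFE_BY_RCU. */
    struct conn {
            spinlock_t lock;
            int key;
    };

    /*
     * Called with rcu_read_lock() held; 'c' was found via an RCU lookup.
     * The cache flag only guarantees the memory stays an object of this
     * type for the duration of the read-side critical section; it may have
     * been recycled into a *new* object, so re-validate it under the lock.
     */
    static bool conn_get_stable(struct conn *c, int key)
    {
            spin_lock(&c->lock);
            if (c->key != key) {            /* object was recycled: not ours */
                    spin_unlock(&c->lock);
                    return false;
            }
            return true;                    /* caller now holds c->lock */
    }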

    Signed-off-by: Paul E. McKenney
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    [ paulmck: Add comments mentioning the old name, as requested by Eric
    Dumazet, in order to help people familiar with the old name find
    the new one. ]
    Acked-by: David Rientjes

    Paul E. McKenney
     

23 Feb, 2017

7 commits

  • With kmem cgroup support enabled, kmem_caches can be created and
    destroyed frequently and a great number of near empty kmem_caches can
    accumulate if there are a lot of transient cgroups and the system is not
    under memory pressure. When memory reclaim starts under such
    conditions, it can lead to consecutive deactivation and destruction of
    many kmem_caches, easily hundreds of thousands on moderately large
    systems, exposing scalability issues in the current slab management
    code. This is one of the patches to address the issue.

    slub uses synchronize_sched() to deactivate a memcg cache.
    synchronize_sched() is an expensive and slow operation and doesn't scale
    when a huge number of caches are destroyed back-to-back. While there
    used to be a simple batching mechanism, the batching was too restricted
    to be helpful.

    This patch implements slab_deactivate_memcg_cache_rcu_sched(), which slub
    can use to schedule a sched RCU callback instead of performing
    synchronize_sched() synchronously while holding cgroup_mutex. While
    this adds online cpus, mems and slab_mutex operations, operating on
    these locks back-to-back from the same kworker, which is what is going to
    happen when there are many caches to deactivate, isn't expensive at all,
    and this gets rid of the scalability problem completely.
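
    A minimal sketch of the deferral pattern, assuming a separate helper
    allocation (the real patch instead embeds the rcu_head in the cache's
    memcg parameters):

    #include <linux/errno.h>
    #include <linux/kernel.h>
    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    struct deferred_deactivation {          /* illustrative helper struct */
            struct rcu_head rcu;
            struct kmem_cache *cache;
    };

    static void deactivate_cache_cb(struct rcu_head *head)
    {
            struct deferred_deactivation *d =
                    container_of(head, struct deferred_deactivation, rcu);

            /* ... finish deactivating d->cache here ... */
            kfree(d);
    }

    static int schedule_cache_deactivation(struct kmem_cache *cache)
    {
            struct deferred_deactivation *d = kmalloc(sizeof(*d), GFP_KERNEL);

            if (!d)
                    return -ENOMEM;
            d->cache = cache;
            /* Returns immediately; the callback runs after a sched-RCU grace
             * period, so nothing blocks while cgroup_mutex is held. */
            call_rcu_sched(&d->rcu, deactivate_cache_cb);
            return 0;
    }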

    Link: http://lkml.kernel.org/r/20170117235411.9408-9-tj@kernel.org
    Signed-off-by: Tejun Heo
    Reported-by: Jay Vana
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • __kmem_cache_shrink() is called with %true @deactivate only for memcg
    caches. Remove @deactivate from __kmem_cache_shrink() and introduce
    __kmemcg_cache_deactivate() instead. Each memcg-supporting allocator
    should implement it and it should deactivate and drain the cache.

    This is to allow memcg cache deactivation behavior to further deviate
    from simple shrinking without messing up __kmem_cache_shrink().

    This is pure reorganization and doesn't introduce any observable
    behavior changes.

    v2: Dropped unnecessary ifdef in mm/slab.h as suggested by Vladimir.

    Link: http://lkml.kernel.org/r/20170117235411.9408-8-tj@kernel.org
    Signed-off-by: Tejun Heo
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • With kmem cgroup support enabled, kmem_caches can be created and
    destroyed frequently and a great number of near empty kmem_caches can
    accumulate if there are a lot of transient cgroups and the system is not
    under memory pressure. When memory reclaim starts under such
    conditions, it can lead to consecutive deactivation and destruction of
    many kmem_caches, easily hundreds of thousands on moderately large
    systems, exposing scalability issues in the current slab management
    code. This is one of the patches to address the issue.

    slab_caches currently lists all caches including root and memcg ones.
    This is the only data structure which lists the root caches and
    iterating root caches can only be done by walking the list while
    skipping over memcg caches. As there can be a huge number of memcg
    caches, this can become very expensive.

    This also can make /proc/slabinfo behave very badly. seq_file processes
    reads in 4k chunks and seeks to the previous Nth position on slab_caches
    list to resume after each chunk. With a lot of memcg cache churns on
    the list, reading /proc/slabinfo can become very slow and its content
    often ends up with duplicate and/or missing entries.

    This patch adds a new list slab_root_caches which lists only the root
    caches. When memcg is not enabled, it becomes just an alias of
    slab_caches. memcg specific list operations are collected into
    memcg_[un]link_cache().
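
    A sketch of the resulting structure, using a stand-in struct (field and
    list names may differ from the final patch):

    #include <linux/list.h>

    /* Stand-in for struct kmem_cache, showing only the relevant links. */
    struct cache {
            struct list_head list;                  /* on slab_caches (all caches) */
            struct list_head root_caches_node;      /* on slab_root_caches (roots) */
    };

    static LIST_HEAD(slab_root_caches);

    /* Root-cache walkers no longer scan slab_caches and skip memcg children. */
    static void for_each_root_cache(void (*fn)(struct cache *))
    {
            struct cache *s;

            list_for_each_entry(s, &slab_root_caches, root_caches_node)
                    fn(s);
    }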

    Link: http://lkml.kernel.org/r/20170117235411.9408-7-tj@kernel.org
    Signed-off-by: Tejun Heo
    Reported-by: Jay Vana
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • With kmem cgroup support enabled, kmem_caches can be created and
    destroyed frequently and a great number of near empty kmem_caches can
    accumulate if there are a lot of transient cgroups and the system is not
    under memory pressure. When memory reclaim starts under such
    conditions, it can lead to consecutive deactivation and destruction of
    many kmem_caches, easily hundreds of thousands on moderately large
    systems, exposing scalability issues in the current slab management
    code. This is one of the patches to address the issue.

    While a memcg kmem_cache is listed on its root cache's ->children list,
    there is no direct way to iterate all kmem_caches which are associated
    with a memory cgroup. The only way to iterate them is walking all
    caches while filtering out caches which don't match, which would be most
    of them.

    This makes memcg destruction operations O(N^2) where N is the total
    number of slab caches which can be huge. This combined with the
    synchronous RCU operations can tie up a CPU and affect the whole machine
    for many hours when memory reclaim triggers offlining and destruction of
    the stale memcgs.

    This patch adds mem_cgroup->kmem_caches list which goes through
    memcg_cache_params->kmem_caches_node of all kmem_caches which are
    associated with the memcg. All memcg specific iterations, including
    stat file access, are updated to use the new list instead.

    Link: http://lkml.kernel.org/r/20170117235411.9408-6-tj@kernel.org
    Signed-off-by: Tejun Heo
    Reported-by: Jay Vana
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • We're going to change how memcg caches are iterated. In preparation,
    clean up and reorganize memcg_cache_params.

    * The shared ->list is replaced by ->children in root and
    ->children_node in children.

    * ->is_root_cache is removed. Instead ->root_cache is moved out of
    the child union and is now used by both root and children. NULL
    indicates a root cache; non-NULL, a memcg one (see the sketch below).

    This patch doesn't cause any observable behavior changes.
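
    A simplified sketch of the reorganized layout described above (other
    fields omitted; exact member names may differ from the patch):

    #include <linux/list.h>

    struct kmem_cache;
    struct mem_cgroup;

    struct memcg_cache_params {
            struct kmem_cache *root_cache;  /* NULL for a root cache */
            union {
                    struct {                /* used by root caches */
                            struct list_head children;
                    };
                    struct {                /* used by memcg child caches */
                            struct mem_cgroup *memcg;
                            struct list_head children_node;
                    };
            };
    };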

    Link: http://lkml.kernel.org/r/20170117235411.9408-5-tj@kernel.org
    Signed-off-by: Tejun Heo
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • Patch series "slab: make memcg slab destruction scalable", v3.

    With kmem cgroup support enabled, kmem_caches can be created and
    destroyed frequently and a great number of near empty kmem_caches can
    accumulate if there are a lot of transient cgroups and the system is not
    under memory pressure. When memory reclaim starts under such
    conditions, it can lead to consecutive deactivation and destruction of
    many kmem_caches, easily hundreds of thousands on moderately large
    systems, exposing scalability issues in the current slab management
    code.

    I've seen machines which end up with hundreds of thousands of caches and
    many millions of kernfs_nodes. The current code is O(N^2) on the total
    number of caches and has synchronous rcu_barrier() and
    synchronize_sched() in the cgroup offline / release path, which is
    executed while holding cgroup_mutex. Combined, this leads to very
    expensive and slow cache destruction operations which can easily keep
    running for half a day.

    This also messes up /proc/slabinfo along with other cache iterating
    operations. seq_file operates on 4k chunks and on each 4k boundary
    tries to seek to the last position in the list. With a huge number of
    caches on the list, this becomes very slow and very prone to the list
    content changing underneath it leading to a lot of missing and/or
    duplicate entries.

    This patchset addresses the scalability problem.

    * Add root and per-memcg lists. Update each user to use the
    appropriate list.

    * Make rcu_barrier() for SLAB_DESTROY_BY_RCU caches globally batched
    and asynchronous.

    * For dying empty slub caches, remove the sysfs files after
    deactivation so that we don't end up with millions of sysfs files
    without any useful information on them.

    This patchset contains the following ten patches.

    0001-Revert-slub-move-synchronize_sched-out-of-slab_mutex.patch
    0002-slub-separate-out-sysfs_slab_release-from-sysfs_slab.patch
    0003-slab-remove-synchronous-rcu_barrier-call-in-memcg-ca.patch
    0004-slab-reorganize-memcg_cache_params.patch
    0005-slab-link-memcg-kmem_caches-on-their-associated-memo.patch
    0006-slab-implement-slab_root_caches-list.patch
    0007-slab-introduce-__kmemcg_cache_deactivate.patch
    0008-slab-remove-synchronous-synchronize_sched-from-memcg.patch
    0009-slab-remove-slub-sysfs-interface-files-early-for-emp.patch
    0010-slab-use-memcg_kmem_cache_wq-for-slab-destruction-op.patch

    0001 reverts an existing optimization to prepare for the following
    changes. 0002 is a prep patch. 0003 makes rcu_barrier() in release
    path batched and asynchronous. 0004-0006 separate out the lists.
    0007-0008 replace synchronize_sched() in slub destruction path with
    call_rcu_sched(). 0009 removes sysfs files early for empty dying
    caches. 0010 makes destruction work items use a workqueue with limited
    concurrency.

    This patch (of 10):

    Revert 89e364db71fb5e ("slub: move synchronize_sched out of slab_mutex on
    shrink").

    With kmem cgroup support enabled, kmem_caches can be created and destroyed
    frequently and a great number of near empty kmem_caches can accumulate if
    there are a lot of transient cgroups and the system is not under memory
    pressure. When memory reclaim starts under such conditions, it can lead
    to consecutive deactivation and destruction of many kmem_caches, easily
    hundreds of thousands on moderately large systems, exposing scalability
    issues in the current slab management code. This is one of the patches to
    address the issue.

    Moving synchronize_sched() out of slab_mutex isn't enough as it's still
    inside cgroup_mutex. The whole deactivation / release path will be
    updated to avoid all synchronous RCU operations. Revert this insufficient
    optimization in preparation to ease future changes.

    Link: http://lkml.kernel.org/r/20170117235411.9408-2-tj@kernel.org
    Signed-off-by: Tejun Heo
    Reported-by: Jay Vana
    Cc: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • SLAB as part of its bootstrap pre-creates one kmalloc cache that can fit
    the kmem_cache_node management structure, and puts it into the generic
    kmalloc cache array (e.g. for 128b objects). The name of this cache is
    "kmalloc-node", which is confusing for readers of /proc/slabinfo as the
    cache is used for generic allocations (and not just the kmem_cache_node
    struct) and it appears as if the kmalloc-128 cache is missing.

    An easy solution is to use the kmalloc-<size> name when pre-creating the
    cache, which we can get from the kmalloc_info array.

    Example /proc/slabinfo before the patch:

    ...
    kmalloc-256 1647 1984 256 16 1 : tunables 120 60 8 : slabdata 124 124 828
    kmalloc-192 1974 1974 192 21 1 : tunables 120 60 8 : slabdata 94 94 133
    kmalloc-96 1332 1344 128 32 1 : tunables 120 60 8 : slabdata 42 42 219
    kmalloc-64 2505 5952 64 64 1 : tunables 120 60 8 : slabdata 93 93 715
    kmalloc-32 4278 4464 32 124 1 : tunables 120 60 8 : slabdata 36 36 346
    kmalloc-node 1352 1376 128 32 1 : tunables 120 60 8 : slabdata 43 43 53
    kmem_cache 132 147 192 21 1 : tunables 120 60 8 : slabdata 7 7 0

    After the patch:

    ...
    kmalloc-256 1672 2160 256 16 1 : tunables 120 60 8 : slabdata 135 135 807
    kmalloc-192 1992 2016 192 21 1 : tunables 120 60 8 : slabdata 96 96 203
    kmalloc-96 1159 1184 128 32 1 : tunables 120 60 8 : slabdata 37 37 116
    kmalloc-64 2561 4864 64 64 1 : tunables 120 60 8 : slabdata 76 76 785
    kmalloc-32 4253 4340 32 124 1 : tunables 120 60 8 : slabdata 35 35 270
    kmalloc-128 1256 1280 128 32 1 : tunables 120 60 8 : slabdata 40 40 39
    kmem_cache 125 147 192 21 1 : tunables 120 60 8 : slabdata 7 7 0

    [vbabka@suse.cz: export the whole kmalloc_info structure instead of just a name accessor, per Christoph Lameter]
    Link: http://lkml.kernel.org/r/54e80303-b814-4232-66d4-95b34d3eb9d0@suse.cz
    Link: http://lkml.kernel.org/r/20170203181008.24898-1-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Matthew Wilcox
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

13 Dec, 2016

4 commits

  • Rather than tracking the number of active slabs for each node, track the
    total number of slabs. This is a minor improvement that avoids active
    slab tracking when a slab goes from free to partial or partial to free.

    For slab debugging, this also removes an explicit free count since it
    can easily be inferred from the difference between the number of total
    objects and the number of active objects.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1612042020110.115755@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Suggested-by: Joonsoo Kim
    Cc: Greg Thelen
    Cc: Aruna Ramakrishna
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Reading /proc/slabinfo or monitoring slabtop(1) can become very
    expensive if there are many slab caches and if there are very lengthy
    per-node partial and/or free lists.

    Commit 07a63c41fa1f ("mm/slab: improve performance of gathering slabinfo
    stats") addressed the per-node full lists which showed a significant
    improvement when no objects were freed. This patch has the same
    motivation and optimizes the remainder of the usecases where there are
    very lengthy partial and free lists.

    This patch maintains per-node active_slabs (full and partial) and
    free_slabs rather than iterating the lists at runtime when reading
    /proc/slabinfo.
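
    Conceptually, the bookkeeping replaces list walks with counter updates at
    the points where a slab changes state; a sketch with illustrative names
    (not the exact fields added by the patch):

    /* Per-node counters, updated under the node's list_lock. */
    struct node_slab_counts {
            unsigned long active_slabs;     /* slabs on full + partial lists */
            unsigned long free_slabs;       /* slabs on the free list        */
    };

    /* Example transition: a partial slab becomes completely free. */
    static void note_slab_now_free(struct node_slab_counts *n)
    {
            n->active_slabs--;
            n->free_slabs++;
    }

    /*
     * get_slabinfo() can then report active_slabs + free_slabs directly
     * instead of walking slabs_partial and slabs_free with interrupts off.
     */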

    When allocating 100GB of slab from a test cache where every slab page is
    on the partial list, reading /proc/slabinfo (includes all other slab
    caches on the system) takes ~247ms on average with 48 samples.

    As a result of this patch, the same read takes ~0.856ms on average.

    [rientjes@google.com: changelog]
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1611081505240.13403@chino.kir.corp.google.com
    Signed-off-by: Greg Thelen
    Signed-off-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     
  • Verify that kmem_cache_create() flags are not allocator specific. This
    is done before removing flags that are not available with the current
    configuration.

    The current kmem_cache_create() removes incorrect flags but does not
    validate that callers are using them correctly. This change ensures that
    callers are not trying to create caches with flags that will be ignored
    because they are allocator specific.
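
    The check amounts to rejecting any flag outside a mask of flags the
    kernel knows about, roughly like this (the mask name and its contents
    here are illustrative; the real mask covers every supported SLAB_* flag):

    #include <linux/bug.h>
    #include <linux/errno.h>
    #include <linux/slab.h>

    #define CACHE_CREATE_MASK (SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT)

    static int check_cache_flags(unsigned long flags)
    {
            if (flags & ~CACHE_CREATE_MASK) {
                    /* Caller passed a flag no allocator configuration accepts. */
                    WARN_ON_ONCE(1);
                    return -EINVAL;
            }
            return 0;
    }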

    Link: http://lkml.kernel.org/r/1478553075-120242-2-git-send-email-thgarnie@google.com
    Signed-off-by: Thomas Garnier
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Garnier
     
  • synchronize_sched() is a heavy operation and calling it for each cache
    owned by a memory cgroup being destroyed may take quite some time. What
    is worse, it's currently called under the slab_mutex, stalling all work
    items doing cache creation/destruction.

    Actually, there isn't much point in calling synchronize_sched() for each
    cache - it's enough to call it just once - after setting cpu_partial for
    all caches and before shrinking them. This way, we can also move it out
    of the slab_mutex, which we have to hold for iterating over the slab
    cache list.
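
    Conceptually the change batches the grace period; a sketch with stand-in
    helper names and list (not the real code):

    #include <linux/list.h>
    #include <linux/rcupdate.h>

    struct cache {                          /* stand-in for struct kmem_cache */
            struct list_head list;
    };

    static LIST_HEAD(memcg_caches);

    static void disable_cpu_partial(struct cache *c) { /* ... */ }
    static void shrink_cache(struct cache *c)        { /* ... */ }

    static void deactivate_all_caches(void)
    {
            struct cache *c;

            /* First pass: switch every cache off cpu_partial ... */
            list_for_each_entry(c, &memcg_caches, list)
                    disable_cpu_partial(c);

            /* ... then one grace period for all of them, outside slab_mutex ... */
            synchronize_sched();

            /* ... and only then shrink each cache. */
            list_for_each_entry(c, &memcg_caches, list)
                    shrink_cache(c);
    }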

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=172991
    Link: http://lkml.kernel.org/r/0a10d71ecae3db00fb4421bcd3f82bcc911f4be4.1475329751.git.vdavydov.dev@gmail.com
    Signed-off-by: Vladimir Davydov
    Reported-by: Doug Smythies
    Acked-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

28 Oct, 2016

1 commit

  • On large systems, when some slab caches grow to millions of objects (and
    many gigabytes), running 'cat /proc/slabinfo' can take up to 1-2
    seconds. During this time, interrupts are disabled while walking the
    slab lists (slabs_full, slabs_partial, and slabs_free) for each node,
    and this sometimes causes timeouts in other drivers (for instance,
    Infiniband).

    This patch optimizes 'cat /proc/slabinfo' by maintaining a counter for
    total number of allocated slabs per node, per cache. This counter is
    updated when a slab is created or destroyed. This enables us to skip
    traversing the slabs_full list while gathering slabinfo statistics, and
    since slabs_full tends to be the biggest list when the cache is large,
    it results in a dramatic performance improvement. Getting slabinfo
    statistics now only requires walking the slabs_free and slabs_partial
    lists, and those lists are usually much smaller than slabs_full.

    We tested this after growing the dentry cache to 70GB, and the
    performance improved from 2s to 5ms.

    Link: http://lkml.kernel.org/r/1472517876-26814-1-git-send-email-aruna.ramakrishna@oracle.com
    Signed-off-by: Aruna Ramakrishna
    Acked-by: David Rientjes
    Cc: Mike Kravetz
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aruna Ramakrishna
     

29 Jul, 2016

1 commit

  • For KASAN builds:
    - switch SLUB allocator to using stackdepot instead of storing the
    allocation/deallocation stacks in the objects;
    - change the freelist hook so that parts of the freelist can be put
    into the quarantine.

    [aryabinin@virtuozzo.com: fixes]
    Link: http://lkml.kernel.org/r/1468601423-28676-1-git-send-email-aryabinin@virtuozzo.com
    Link: http://lkml.kernel.org/r/1468347165-41906-3-git-send-email-glider@google.com
    Signed-off-by: Alexander Potapenko
    Cc: Andrey Konovalov
    Cc: Christoph Lameter
    Cc: Dmitry Vyukov
    Cc: Steven Rostedt (Red Hat)
    Cc: Joonsoo Kim
    Cc: Kostya Serebryany
    Cc: Andrey Ryabinin
    Cc: Kuthonuzo Luruo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     

27 Jul, 2016

2 commits

  • - Move the memcg_kmem_enabled() check out to the caller. This reduces
    the number of function definitions, making the code easier to follow. At
    the same time it doesn't result in code bloat, because all of these
    functions are used only in one or two places.

    - Move __GFP_ACCOUNT check to the caller as well so that one wouldn't
    have to dive deep into memcg implementation to see which allocations
    are charged and which are not.

    - Refresh comments.

    Link: http://lkml.kernel.org/r/52882a28b542c1979fd9a033b4dc8637fc347399.1464079537.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Eric Dumazet
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • The kernel heap allocators use a sequential freelist, making their
    allocations predictable. This predictability makes kernel heap overflows
    easier to exploit: an attacker can carefully prepare the kernel heap to
    control which chunk follows the one being overflowed.

    For example these attacks exploit the predictability of the heap:
    - Linux Kernel CAN SLUB overflow (https://goo.gl/oMNWkU)
    - Exploiting Linux Kernel Heap corruptions (http://goo.gl/EXLn95)

    ***Problems that needed solving:
    - Randomize the freelist (singly linked) used in the SLUB allocator.
    - Ensure good performance to encourage usage.
    - Get the best entropy possible in the early boot stage.

    ***Parts:
    - 01/02 Reorganize the SLAB freelist randomization to share elements
    with the SLUB implementation.
    - 02/02 The SLUB freelist randomization implementation. A similar
    approach to the SLAB one, but tailored to the singly linked freelist
    used in SLUB.

    ***Performance data:

    The slab_test impact is between 3% and 4% on average for 100000 attempts
    without smp. It is a very focused test; kernbench shows that the overall
    impact on the system is much lower.

    Before:

    Single thread testing
    =====================
    1. Kmalloc: Repeatedly allocate then free test
    100000 times kmalloc(8) -> 49 cycles kfree -> 77 cycles
    100000 times kmalloc(16) -> 51 cycles kfree -> 79 cycles
    100000 times kmalloc(32) -> 53 cycles kfree -> 83 cycles
    100000 times kmalloc(64) -> 62 cycles kfree -> 90 cycles
    100000 times kmalloc(128) -> 81 cycles kfree -> 97 cycles
    100000 times kmalloc(256) -> 98 cycles kfree -> 121 cycles
    100000 times kmalloc(512) -> 95 cycles kfree -> 122 cycles
    100000 times kmalloc(1024) -> 96 cycles kfree -> 126 cycles
    100000 times kmalloc(2048) -> 115 cycles kfree -> 140 cycles
    100000 times kmalloc(4096) -> 149 cycles kfree -> 171 cycles
    2. Kmalloc: alloc/free test
    100000 times kmalloc(8)/kfree -> 70 cycles
    100000 times kmalloc(16)/kfree -> 70 cycles
    100000 times kmalloc(32)/kfree -> 70 cycles
    100000 times kmalloc(64)/kfree -> 70 cycles
    100000 times kmalloc(128)/kfree -> 70 cycles
    100000 times kmalloc(256)/kfree -> 69 cycles
    100000 times kmalloc(512)/kfree -> 70 cycles
    100000 times kmalloc(1024)/kfree -> 73 cycles
    100000 times kmalloc(2048)/kfree -> 72 cycles
    100000 times kmalloc(4096)/kfree -> 71 cycles

    After:

    Single thread testing
    =====================
    1. Kmalloc: Repeatedly allocate then free test
    100000 times kmalloc(8) -> 57 cycles kfree -> 78 cycles
    100000 times kmalloc(16) -> 61 cycles kfree -> 81 cycles
    100000 times kmalloc(32) -> 76 cycles kfree -> 93 cycles
    100000 times kmalloc(64) -> 83 cycles kfree -> 94 cycles
    100000 times kmalloc(128) -> 106 cycles kfree -> 107 cycles
    100000 times kmalloc(256) -> 118 cycles kfree -> 117 cycles
    100000 times kmalloc(512) -> 114 cycles kfree -> 116 cycles
    100000 times kmalloc(1024) -> 115 cycles kfree -> 118 cycles
    100000 times kmalloc(2048) -> 147 cycles kfree -> 131 cycles
    100000 times kmalloc(4096) -> 214 cycles kfree -> 161 cycles
    2. Kmalloc: alloc/free test
    100000 times kmalloc(8)/kfree -> 66 cycles
    100000 times kmalloc(16)/kfree -> 66 cycles
    100000 times kmalloc(32)/kfree -> 66 cycles
    100000 times kmalloc(64)/kfree -> 66 cycles
    100000 times kmalloc(128)/kfree -> 65 cycles
    100000 times kmalloc(256)/kfree -> 67 cycles
    100000 times kmalloc(512)/kfree -> 67 cycles
    100000 times kmalloc(1024)/kfree -> 64 cycles
    100000 times kmalloc(2048)/kfree -> 67 cycles
    100000 times kmalloc(4096)/kfree -> 67 cycles

    Kernbench, before:

    Average Optimal load -j 12 Run (std deviation):
    Elapsed Time 101.873 (1.16069)
    User Time 1045.22 (1.60447)
    System Time 88.969 (0.559195)
    Percent CPU 1112.9 (13.8279)
    Context Switches 189140 (2282.15)
    Sleeps 99008.6 (768.091)

    After:

    Average Optimal load -j 12 Run (std deviation):
    Elapsed Time 102.47 (0.562732)
    User Time 1045.3 (1.34263)
    System Time 88.311 (0.342554)
    Percent CPU 1105.8 (6.49444)
    Context Switches 189081 (2355.78)
    Sleeps 99231.5 (800.358)

    This patch (of 2):

    This commit reorganizes the previous SLAB freelist randomization to
    prepare for the SLUB implementation. It moves functions that will be
    shared to slab_common.

    The entropy functions are changed to align with the SLUB implementation,
    now using the get_random_(int|long) functions. These functions were
    chosen because they provide a bit more entropy early in boot and better
    performance when specific arch instructions are not available.
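
    A sketch of the shared randomization step, assuming a precomputed array
    of freelist indices (the in-tree helper differs in details; this only
    illustrates the get_random_int()-driven shuffle):

    #include <linux/random.h>

    /* Fill 'list' with 0..count-1 and shuffle it (Fisher-Yates). */
    static void shuffle_freelist_indices(unsigned int *list, unsigned int count)
    {
            unsigned int i, swap, tmp;

            for (i = 0; i < count; i++)
                    list[i] = i;

            if (count < 2)
                    return;

            for (i = count - 1; i > 0; i--) {
                    swap = get_random_int() % (i + 1);
                    tmp = list[i];
                    list[i] = list[swap];
                    list[swap] = tmp;
            }
    }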

    [akpm@linux-foundation.org: fix build]
    Link: http://lkml.kernel.org/r/1464295031-26375-2-git-send-email-thgarnie@google.com
    Signed-off-by: Thomas Garnier
    Reviewed-by: Kees Cook
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Garnier
     

21 May, 2016

1 commit

  • Quarantine isolates freed objects in a separate queue. The objects are
    returned to the allocator later, which helps to detect use-after-free
    errors.

    When the object is freed, its state changes from KASAN_STATE_ALLOC to
    KASAN_STATE_QUARANTINE. The object is poisoned and put into quarantine
    instead of being returned to the allocator, therefore every subsequent
    access to that object triggers a KASAN error, and the error handler is
    able to say where the object has been allocated and deallocated.

    When it's time for the object to leave quarantine, its state becomes
    KASAN_STATE_FREE and it's returned to the allocator. From now on the
    allocator may reuse it for another allocation. Before that happens,
    it's still possible to detect a use-after-free on that object (it
    retains the allocation/deallocation stacks).

    When the allocator reuses this object, the shadow is unpoisoned and the
    old allocation/deallocation stacks are wiped. Therefore a use of this
    object, even an incorrect one, won't trigger a KASAN warning.

    Without the quarantine, it's not guaranteed that objects aren't reused
    immediately; that's why the probability of catching a use-after-free is
    lower than with the quarantine in place.

    Freed objects are first added to per-cpu quarantine queues. When a
    cache is destroyed or memory shrinking is requested, the objects are
    moved into the global quarantine queue. Whenever a kmalloc call allows
    memory reclaiming, the oldest objects are popped out of the global queue
    until the total size of objects in quarantine is less than 3/4 of the
    maximum quarantine size (which is a fraction of installed physical
    memory).

    As long as an object remains in the quarantine, KASAN is able to report
    accesses to it, so the chance of reporting a use-after-free is
    increased. Once the object leaves quarantine, the allocator may reuse
    it, in which case the object is unpoisoned and KASAN can't detect
    incorrect accesses to it.
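
    The reclaim step described above amounts to popping the oldest entries
    until the quarantine shrinks below the watermark; a simplified sketch
    (the 3/4 watermark is from the description, everything else is
    illustrative, not the real data structures):

    #include <linux/list.h>
    #include <linux/types.h>

    struct quarantined_object {             /* illustrative bookkeeping entry */
            struct list_head node;
            size_t size;
    };

    static void release_to_allocator(struct quarantined_object *q) { /* ... */ }

    /* Pop the oldest objects until total size drops below 3/4 of the limit. */
    static void quarantine_reduce_sketch(struct list_head *global_q,
                                         size_t *q_size, size_t q_max)
    {
            while (*q_size > q_max / 4 * 3 && !list_empty(global_q)) {
                    struct quarantined_object *oldest =
                            list_first_entry(global_q, struct quarantined_object, node);

                    list_del(&oldest->node);
                    *q_size -= oldest->size;
                    release_to_allocator(oldest);
            }
    }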

    Right now quarantine support is only enabled for the SLAB allocator.
    Unification of KASAN features in SLAB and SLUB will be done later.

    This patch is based on the "mm: kasan: quarantine" patch originally
    prepared by Dmitry Chernenkov. A number of improvements have been
    suggested by Andrey Ryabinin.

    [glider@google.com: v9]
    Link: http://lkml.kernel.org/r/1462987130-144092-1-git-send-email-glider@google.com
    Signed-off-by: Alexander Potapenko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Steven Rostedt
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     

26 Mar, 2016

1 commit

  • Add GFP flags to KASAN hooks for future patches to use.

    This patch is based on the "mm: kasan: unified support for SLUB and SLAB
    allocators" patch originally prepared by Dmitry Chernenkov.

    Signed-off-by: Alexander Potapenko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Steven Rostedt
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     

18 Mar, 2016

1 commit


16 Mar, 2016

4 commits

  • SLAB_DEBUG_FREE allows expensive consistency checks at free to be turned
    on or off. Expand its use to be able to turn off all consistency
    checks. This gives a nice speed up if you only want features such as
    poisoning or tracing.

    Credit to Mathias Krause for the original work which inspired this
    series.

    Signed-off-by: Laura Abbott
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Kees Cook
    Cc: Mathias Krause
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laura Abbott
     
  • Fix up trivial spelling errors, noticed while reading the code.

    Signed-off-by: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     
  • Remove the SLAB specific function slab_should_failslab(), by moving the
    check against fault-injection for the bootstrap slab, into the shared
    function should_failslab() (used by both SLAB and SLUB).

    This is a step towards sharing alloc_hook's between SLUB and SLAB.

    This bootstrap slab "kmem_cache" is used for allocating struct
    kmem_cache objects to the allocator itself.

    Signed-off-by: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     
  • First step towards sharing alloc_hook's between SLUB and SLAB
    allocators. Move the SLUB allocators *_alloc_hook to the common
    mm/slab.h for internal slab definitions.

    Signed-off-by: Jesper Dangaard Brouer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     

19 Feb, 2016

1 commit

  • When slub_debug's alloc_calls_show is enabled, we try to track the
    location and user of each slab object on every online node, so the
    kmem_cache_node structures and cpu_cache/cpu_slab must not be freed
    until the last reference to the sysfs file is dropped.

    This fixes the following panic:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
    IP: list_locations+0x169/0x4e0
    PGD 257304067 PUD 438456067 PMD 0
    Oops: 0000 [#1] SMP
    CPU: 3 PID: 973074 Comm: cat ve: 0 Not tainted 3.10.0-229.7.2.ovz.9.30-00007-japdoll-dirty #2 9.30
    Hardware name: DEPO Computers To Be Filled By O.E.M./H67DE3, BIOS L1.60c 07/14/2011
    task: ffff88042a5dc5b0 ti: ffff88037f8d8000 task.ti: ffff88037f8d8000
    RIP: list_locations+0x169/0x4e0
    Call Trace:
    alloc_calls_show+0x1d/0x30
    slab_attr_show+0x1b/0x30
    sysfs_read_file+0x9a/0x1a0
    vfs_read+0x9c/0x170
    SyS_read+0x58/0xb0
    system_call_fastpath+0x16/0x1b
    Code: 5e 07 12 00 b9 00 04 00 00 3d 00 04 00 00 0f 4f c1 3d 00 04 00 00 89 45 b0 0f 84 c3 00 00 00 48 63 45 b0 49 8b 9c c4 f8 00 00 00 8b 43 20 48 85 c0 74 b6 48 89 df e8 46 37 44 00 48 8b 53 10
    CR2: 0000000000000020

    Separated __kmem_cache_release from __kmem_cache_shutdown, which is now
    called from slab_kmem_cache_release (after the last reference to the
    sysfs file object has been dropped).

    Reintroduced locking in free_partial, as the sysfs file might access the
    cache's partial list after shutdown; this is a partial revert of commit
    69cb8e6b7c29 ("slub: free slabs without holding locks"). Zap
    __remove_partial and use remove_partial (w/o underscores), as
    free_partial now takes list_lock; this is a partial revert of commit
    1e4dd9461fab ("slub: do not assert not having lock in removing freed
    partial").

    Signed-off-by: Dmitry Safonov
    Suggested-by: Vladimir Davydov
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Safonov
     

21 Jan, 2016

1 commit


15 Jan, 2016

1 commit

  • Currently, if we want to account all objects of a particular kmem cache,
    we have to pass __GFP_ACCOUNT to each kmem_cache_alloc call, which is
    inconvenient. This patch introduces SLAB_ACCOUNT flag which if passed
    to kmem_cache_create will force accounting for every allocation from
    this cache even if __GFP_ACCOUNT is not passed.
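
    A usage sketch (cache name and object type are hypothetical):

    #include <linux/errno.h>
    #include <linux/init.h>
    #include <linux/slab.h>

    struct my_request { int id; };          /* hypothetical object type */

    static struct kmem_cache *req_cachep;

    static int __init my_cache_init(void)
    {
            /* Accounting is requested once, at cache creation ... */
            req_cachep = kmem_cache_create("my_requests", sizeof(struct my_request),
                                           0, SLAB_ACCOUNT, NULL);
            return req_cachep ? 0 : -ENOMEM;
    }

    static struct my_request *alloc_request(void)
    {
            /* ... so call sites no longer need to pass __GFP_ACCOUNT. */
            return kmem_cache_alloc(req_cachep, GFP_KERNEL);
    }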

    This patch does not make any of the existing caches use this flag - it
    will be done later in the series.

    Note, a cache with SLAB_ACCOUNT cannot be merged with a cache w/o
    SLAB_ACCOUNT, because merged caches share the same kmem_cache struct and
    hence cannot have different sets of SLAB_* flags. Thus using this flag
    will probably reduce the number of merged slabs even if kmem accounting
    is not used (only compiled in).

    Signed-off-by: Vladimir Davydov
    Suggested-by: Tejun Heo
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

23 Nov, 2015

1 commit

  • Adjust kmem_cache_alloc_bulk API before we have any real users.

    Adjust API to return type 'int' instead of previously type 'bool'. This
    is done to allow future extension of the bulk alloc API.

    A future extension could be to allow SLUB to stop at a page boundary, when
    specified by a flag, and then return the number of objects.

    The advantage of this approach is that it would make it easier to make
    bulk alloc run without local IRQs disabled, using a cmpxchg approach
    that "steals" the entire c->freelist or page->freelist. To avoid
    overshooting we would stop processing at a slab-page boundary; else we
    would always end up returning some objects at the cost of another
    cmpxchg.

    To keep compatible with future users of this API linking against an older
    kernel when using the new flag, we need to return the number of allocated
    objects with this API change.
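
    A caller-side sketch under the adjusted API (the helper name and usage
    are made up):

    #include <linux/errno.h>
    #include <linux/slab.h>

    /* Allocate up to 'want' objects in one call and report how many we got. */
    static int grab_objects(struct kmem_cache *cachep, void **objs, size_t want)
    {
            int got;

            /* Now returns the number of objects allocated (0 on failure)
             * instead of the old bool. */
            got = kmem_cache_alloc_bulk(cachep, GFP_KERNEL, want, objs);
            if (!got)
                    return -ENOMEM;

            /* ... objs[0..got-1] are valid here ... */

            kmem_cache_free_bulk(cachep, got, objs);
            return got;
    }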

    Signed-off-by: Jesper Dangaard Brouer
    Cc: Vladimir Davydov
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     

06 Nov, 2015

2 commits

  • We have memcg_kmem_charge and memcg_kmem_uncharge methods for charging and
    uncharging kmem pages to memcg, but currently they are not used for
    charging slab pages (i.e. they are only used for charging pages allocated
    with alloc_kmem_pages). The only reason why the slab subsystem uses
    special helpers, memcg_charge_slab and memcg_uncharge_slab, is that it
    needs to charge to the memcg of the kmem cache, while memcg_charge_kmem
    charges to the memcg that the current task belongs to.

    To remove this diversity, this patch adds an extra argument to
    __memcg_kmem_charge that can be a pointer to a memcg or NULL. If it is
    not NULL, the function tries to charge to the memcg it points to,
    otherwise it charges to the current context. Next, it makes the slab
    subsystem use this function to charge slab pages.

    Since memcg_charge_kmem and memcg_uncharge_kmem helpers are now used only
    in __memcg_kmem_charge and __memcg_kmem_uncharge, they are inlined. Since
    __memcg_kmem_charge stores a pointer to the memcg in the page struct, we
    don't need memcg_uncharge_slab anymore and can use free_kmem_pages.
    Besides, one can now detect which memcg a slab page belongs to by reading
    /proc/kpagecgroup.

    Note, this patch switches slab to charge-after-alloc design. Since this
    design is already used for all other memcg charges, it should not make any
    difference.

    [hannes@cmpxchg.org: better to have an outer function than a magic parameter for the memcg lookup]
    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Johannes Weiner
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, we do not clear pointers to per memcg caches in the
    memcg_params.memcg_caches array when a global cache is destroyed with
    kmem_cache_destroy.

    This is fine if the global cache does get destroyed. However, a cache can
    be left on the list if it still has active objects when kmem_cache_destroy
    is called (due to a memory leak). If this happens, the entries in the
    array will point to already freed areas, which is likely to result in data
    corruption when the cache is reused (via slab merging).

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

05 Sep, 2015

2 commits

  • While debugging a networking issue, I hit a condition that triggered an
    object to be freed into the wrong kmem cache, and thus triggered the
    warning in cache_from_obj().

    The arguments in the error message are in the wrong order: the location
    of the object's kmem cache is in cachep, not s.

    Signed-off-by: Daniel Borkmann
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Borkmann
     
  • Add the basic infrastructure for alloc/free operations on pointer arrays.
    It includes a generic function in the common slab code that is used in
    this infrastructure patch to create the unoptimized functionality for slab
    bulk operations.

    Allocators can then provide optimized allocation functions for situations
    in which large numbers of objects are needed. These optimizations may
    avoid taking locks repeatedly and bypass metadata creation if all objects
    in slab pages can be used to provide the objects required.

    Allocators can extend the skeletons provided and add their own code to the
    bulk alloc and free functions. They can keep the generic allocation and
    freeing and just fall back to those if optimizations would not work (like
    for example when debugging is on).
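
    The unoptimized fallback is essentially a loop over the existing
    single-object entry points; roughly (simplified from the description
    above, not the exact in-tree code):

    #include <linux/slab.h>

    /* Generic bulk free: just free each object individually. */
    static void generic_free_bulk(struct kmem_cache *s, size_t nr, void **p)
    {
            size_t i;

            for (i = 0; i < nr; i++)
                    kmem_cache_free(s, p[i]);
    }

    /* Generic bulk alloc: allocate one by one, undoing everything on failure. */
    static bool generic_alloc_bulk(struct kmem_cache *s, gfp_t flags,
                                   size_t nr, void **p)
    {
            size_t i;

            for (i = 0; i < nr; i++) {
                    p[i] = kmem_cache_alloc(s, flags);
                    if (!p[i]) {
                            generic_free_bulk(s, i, p);
                            return false;
                    }
            }
            return true;
    }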

    Signed-off-by: Christoph Lameter
    Signed-off-by: Jesper Dangaard Brouer
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

25 Jun, 2015

1 commit

  • This patch moves the initialization of the size_index table slightly
    earlier so that the first few kmem_cache_node's can be safely allocated
    when KMALLOC_MIN_SIZE is large.

    There are currently two ways to generate indices into kmalloc_caches (via
    kmalloc_index() and via the size_index table in slab_common.c) and on some
    arches (possibly only MIPS) they potentially disagree with each other
    until create_kmalloc_caches() has been called. It seems that the
    intention is that the size_index table is a fast equivalent to
    kmalloc_index() and that create_kmalloc_caches() patches the table to
    return the correct value for the cases where kmalloc_index()'s
    if-statements apply.

    The failing sequence was:
    * kmalloc_caches contains NULL elements
    * kmem_cache_init initialises the element that 'struct
    kmem_cache_node' will be allocated to. For 32-bit Mips, this is a
    56-byte struct and kmalloc_index returns KMALLOC_SHIFT_LOW (7).
    * init_list is called which calls kmalloc_node to allocate a 'struct
    kmem_cache_node'.
    * kmalloc_slab selects the kmem_caches element using
    size_index[size_index_elem(size)]. For MIPS, size is 56, and the
    expression returns 6.
    * This element of kmalloc_caches is NULL and allocation fails.
    * If it had not already failed, it would have called
    create_kmalloc_caches() at this point which would have changed
    size_index[size_index_elem(size)] to 7.

    I don't believe the bug to be LLVM specific but GCC doesn't normally
    encounter the problem. I haven't been able to identify exactly what GCC
    is doing better (probably inlining) but it seems that GCC is managing to
    optimize to the point that it eliminates the problematic allocations.
    This theory is supported by the fact that GCC can be made to fail in the
    same way by changing inline, __inline, __inline__, and __always_inline in
    include/linux/compiler-gcc.h such that they don't actually inline things.

    Signed-off-by: Daniel Sanders
    Acked-by: Pekka Enberg
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Sanders