23 Oct, 2015

1 commit

  • commit 03a2d2a3eafe4015412cf4e9675ca0e2d9204074 upstream.

    Commit description is copied from the original post of this bug:

    http://comments.gmane.org/gmane.linux.kernel.mm/135349

    Kernels after v3.9 use kmalloc_size(INDEX_NODE + 1) to get the next
    cache size larger than the one that index INDEX_NODE maps to. In
    kernels 3.9 and earlier we used malloc_sizes[INDEX_L3 + 1].cs_size.

    However, kmalloc_size(INDEX_NODE + 1) sometimes does not return the
    size we expect, which leads to a BUG().

    The mapping table in the latest kernel is:
    index = {0,  1,  2,   3, 4,  5,  6,  n}
    size  = {0,  96, 192, 8, 16, 32, 64, 2^n}

    The mapping table before 3.10 was:
    index = {0,  1,  2,  3,   4,   5,   6,   n}
    size  = {32, 64, 96, 128, 192, 256, 512, 2^(n+3)}

    The problem on my mips64 machine is as follows:

    (1) When configured with DEBUG_SLAB && DEBUG_PAGEALLOC && DEBUG_LOCK_ALLOC
    && DEBUG_SPINLOCK, sizeof(struct kmem_cache_node) is 150, and the macro
    INDEX_NODE turns out to be 2:
    #define INDEX_NODE kmalloc_index(sizeof(struct kmem_cache_node))

    (2) Then the result of kmalloc_size(INDEX_NODE + 1) is 8.

    (3) Then the "if (size >= kmalloc_size(INDEX_NODE + 1))" test leads to
    "size = PAGE_SIZE".

    (4) Then the "if (size >= (PAGE_SIZE >> 3))" test is satisfied and
    "flags |= CFLGS_OFF_SLAB" is executed.

    (5) The "if (flags & CFLGS_OFF_SLAB)" test is satisfied, so we reach
    "cachep->slabp_cache = kmalloc_slab(slab_size, 0u)", whose result may be
    NULL during kernel bootup.

    (6) Finally, "BUG_ON(ZERO_OR_NULL_PTR(cachep->slabp_cache));" triggers
    the BUG (perhaps only mips64 has this problem).

    This patch fixes the kmalloc_size(INDEX_NODE + 1) problem and removes
    the BUG by adding a 'size >= 256' check, which guarantees that all the
    necessary small-sized slabs are initialized regardless of the order of
    slab sizes in the mapping table.

    Fixes: e33660165c90 ("slab: Use common kmalloc_index/kmalloc_size...")
    Signed-off-by: Joonsoo Kim
    Reported-by: Liuhailong
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Joonsoo Kim
     

30 Sep, 2015

1 commit

  • commit 2f064f3485cd29633ad1b3cfb00cc519509a3d72 upstream.

    Commit c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb") added
    checks for page->pfmemalloc to __skb_fill_page_desc():

        if (page->pfmemalloc && !page->mapping)
                skb->pfmemalloc = true;

    It assumes page->mapping == NULL implies that page->pfmemalloc can be
    trusted. However, __delete_from_page_cache() can set page->mapping to
    NULL while leaving page->index alone. Because the two fields share a
    union, a non-zero page->index is then interpreted as page->pfmemalloc
    being true.

    So the assumption is invalid if the networking code can see such a page,
    and it seems it can. We have encountered this with an NFS-over-loopback
    setup when such a page is attached to a new skbuff. There is no copying
    going on in this case, so the page confuses __skb_fill_page_desc, which
    interprets the index as the pfmemalloc flag. The network stack then drops
    packets that were allocated from the reserves unless they are to be
    queued on sockets handling the swapping, which is the case here, and that
    leads to hangs when the NFS client waits for a response from the server
    that has been dropped and thus never arrives.

    The struct page is already heavily packed, so rather than finding another
    hole to put the flag in, let's do a trick instead. We can reuse the index
    field again but define it to an impossible value (-1UL): this is a page
    index, so it should never hold a value that large. Replace all direct
    users of page->pfmemalloc with page_is_pfmemalloc, which hides this
    nastiness from unspoiled eyes.

    The information will obviously be lost if somebody wants to use
    page->index, but that was already the case before, and the original code
    expected that the information be persisted somewhere else if it is really
    needed (e.g. what SLAB and SLUB do).
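
    A minimal userspace sketch of the trick (struct toy_page stands in for
    the relevant corner of struct page; page_is_pfmemalloc() below mirrors
    what the new helper is described to do, not necessarily its exact kernel
    form):

    #include <stdbool.h>
    #include <stdio.h>

    /* The old pfmemalloc flag shared a union with 'index', so a page removed
     * from the page cache (mapping == NULL, index left alone) looked
     * "pfmemalloc" to the old check. */
    struct toy_page {
            void *mapping;
            unsigned long index;   /* shares storage with the old flag */
    };

    /* pfmemalloc pages are now marked by storing the impossible index -1UL. */
    static bool page_is_pfmemalloc(const struct toy_page *page)
    {
            return page->index == (unsigned long)-1;
    }

    static void set_page_pfmemalloc(struct toy_page *page)
    {
            page->index = (unsigned long)-1;
    }

    int main(void)
    {
            /* Page-cache page after __delete_from_page_cache(): mapping
             * cleared, stale non-zero index left behind. */
            struct toy_page page = { .mapping = NULL, .index = 42 };

            bool old_check = page.index != 0 && !page.mapping; /* old test */

            printf("old check misfires: %d\n", old_check);                  /* 1 */
            printf("page_is_pfmemalloc: %d\n", page_is_pfmemalloc(&page));  /* 0 */

            set_page_pfmemalloc(&page);
            printf("after explicit marking: %d\n", page_is_pfmemalloc(&page)); /* 1 */
            return 0;
    }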

    [akpm@linux-foundation.org: fix blooper in slub]
    Fixes: c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb")
    Signed-off-by: Michal Hocko
    Debugged-by: Vlastimil Babka
    Debugged-by: Jiri Bohac
    Cc: Eric Dumazet
    Cc: David Miller
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

15 Apr, 2015

1 commit

  • NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE.

    GFP_THISNODE is a secret combination of gfp bits that have different
    behavior than expected. It is a combination of __GFP_THISNODE,
    __GFP_NORETRY, and __GFP_NOWARN and is special-cased in the page
    allocator slowpath to fail without trying reclaim even though it may be
    used in combination with __GFP_WAIT.

    An example of the problem this creates: commit e97ca8e5b864 ("mm: fix
    GFP_THISNODE callers and clarify") fixed up many users of GFP_THISNODE
    that really just wanted __GFP_THISNODE. The problem doesn't end there,
    however, because even that was a no-op for alloc_misplaced_dst_page(),
    which also sets __GFP_NORETRY and __GFP_NOWARN, and for
    migrate_misplaced_transhuge_page(), where __GFP_NORETRY and __GFP_NOWAIT
    are set in GFP_TRANSHUGE. Converting GFP_THISNODE to __GFP_THISNODE is a
    no-op in these cases since the page allocator special-cases
    __GFP_THISNODE && __GFP_NORETRY && __GFP_NOWARN.

    It's time to just remove GFP_THISNODE entirely. We leave __GFP_THISNODE
    to restrict an allocation to a local node, but remove GFP_THISNODE and
    its obscurity. Instead, we require that a caller clear __GFP_WAIT if it
    wants to avoid reclaim.

    This allows the aforementioned functions to actually reclaim as they
    should. It also enables any future callers that want to do
    __GFP_THISNODE but also __GFP_NORETRY && __GFP_NOWARN to reclaim. The
    rule is simple: if you don't want to reclaim, then don't set __GFP_WAIT.

    Aside: ovs_flow_stats_update() really wants to avoid reclaim as well, so
    it is unchanged.
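
    A toy model of the behavioral change (the bit values and helper names are
    illustrative, not the kernel's; old_may_reclaim() models the slowpath
    special case described above):

    #include <stdio.h>

    #define __GFP_WAIT     0x01u
    #define __GFP_THISNODE 0x02u
    #define __GFP_NORETRY  0x04u
    #define __GFP_NOWARN   0x08u

    /* The old secret combination that the slowpath special-cased. */
    #define GFP_THISNODE   (__GFP_THISNODE | __GFP_NORETRY | __GFP_NOWARN)

    /* Old rule: fail without reclaim whenever all three bits are present,
     * regardless of __GFP_WAIT. */
    static int old_may_reclaim(unsigned int gfp)
    {
            if ((gfp & GFP_THISNODE) == GFP_THISNODE)
                    return 0;
            return gfp & __GFP_WAIT;
    }

    /* New rule: reclaim if and only if the caller left __GFP_WAIT set. */
    static int new_may_reclaim(unsigned int gfp)
    {
            return gfp & __GFP_WAIT;
    }

    int main(void)
    {
            unsigned int gfp = __GFP_WAIT | __GFP_THISNODE |
                               __GFP_NORETRY | __GFP_NOWARN;

            printf("old rule reclaims: %d\n", !!old_may_reclaim(gfp)); /* 0: surprising */
            printf("new rule reclaims: %d\n", !!new_may_reclaim(gfp)); /* 1: as requested */
            return 0;
    }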

    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Acked-by: Pekka Enberg
    Cc: Joonsoo Kim
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Pravin Shelar
    Cc: Jarno Rajahalme
    Cc: Li Zefan
    Cc: Greg Thelen
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

13 Feb, 2015

2 commits

  • To speed up further allocations, SLUB may store empty slabs on per-cpu/node
    partial lists instead of freeing them immediately. This prevents the
    destruction of per-memcg caches, because kmem caches created for a memory
    cgroup are only destroyed after the last page charged to the cgroup is
    freed.

    To fix this issue, this patch resurrects the approach first proposed in
    [1]. It forbids SLUB from caching empty slabs after the memory cgroup
    that the cache belongs to has been destroyed. This is achieved by setting
    the kmem_cache's cpu_partial and min_partial constants to 0 and tuning
    put_cpu_partial() so that it drops frozen empty slabs immediately if
    cpu_partial is 0.

    The runtime overhead is minimal. Of all the hot functions, we only touch
    the relatively cold put_cpu_partial(): we make it call
    unfreeze_partials() after freezing a slab that belongs to an offline
    memory cgroup. Since slab freezing exists to avoid moving slabs from/to a
    partial list on free/alloc, and there can't be allocations from dead
    caches, it shouldn't cause any overhead. We do have to disable preemption
    in put_cpu_partial() to achieve that, though.
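
    A toy sketch of the mechanism (the structures and the put_cpu_partial()
    below are illustrative stand-ins for SLUB's, reduced to the
    cpu_partial == 0 decision):

    #include <stdio.h>
    #include <stdlib.h>

    struct toy_slab { struct toy_slab *next; };

    struct toy_cache {
            unsigned int cpu_partial;   /* 0 => cache belongs to an offline memcg */
            struct toy_slab *partial;   /* per-cpu partial list */
            unsigned int nr_partial;
    };

    static void put_cpu_partial(struct toy_cache *c, struct toy_slab *slab)
    {
            if (c->cpu_partial == 0) {
                    free(slab);             /* dead cache: drop the empty slab now */
                    return;
            }
            slab->next = c->partial;        /* live cache: park it for reuse */
            c->partial = slab;
            c->nr_partial++;
    }

    int main(void)
    {
            struct toy_cache live = { .cpu_partial = 30 };
            struct toy_cache dead = { .cpu_partial = 0 };
            struct toy_slab *a = calloc(1, sizeof(*a));
            struct toy_slab *b = calloc(1, sizeof(*b));

            if (!a || !b)
                    return 1;
            put_cpu_partial(&live, a);
            put_cpu_partial(&dead, b);
            printf("live keeps %u slab(s), dead keeps %u\n",
                   live.nr_partial, dead.nr_partial);   /* 1 and 0 */
            free(live.partial);
            return 0;
    }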

    The original patch was accepted well and even merged to the mm tree.
    However, I decided to withdraw it due to changes happening to the memcg
    core at that time. I had an idea of introducing per-memcg shrinkers for
    kmem caches, but now, as memcg has finally settled down, I do not see it
    as an option, because SLUB shrinker would be too costly to call since SLUB
    does not keep free slabs on a separate list. Besides, we currently do not
    even call per-memcg shrinkers for offline memcgs. Overall, it would
    introduce much more complexity to both SLUB and memcg than this small
    patch.

    As for SLAB, there is no problem, because it shrinks per-cpu/node caches
    periodically. Thanks to list_lru reparenting, we no longer keep entries
    for offline cgroups in per-memcg arrays (such as
    memcg_cache_params->memcg_caches), so we do not have to bother if a
    per-memcg cache is shrunk a bit later than it could be.

    [1] http://thread.gmane.org/gmane.linux.kernel.mm/118649/focus=118650

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Sometimes, we need to iterate over all memcg copies of a particular root
    kmem cache. Currently, we use memcg_cache_params->memcg_caches array for
    that, because it contains all existing memcg caches.

    However, it's bad practice to keep all caches, including those that
    belong to offline cgroups, in this array, because it would then grow
    without bound. I'm going to wipe dead caches away from it to save space.
    To still be able to iterate over all memcg caches of the same kind, let
    us link them into a list.

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

14 Dec, 2014

2 commits

  • fallback_alloc is called on kmalloc if the preferred node doesn't have
    free or partial slabs and there are no pages on the node's free list
    (GFP_THISNODE allocations fail). Before invoking the reclaimer it tries
    to locate a free or partial slab on other allowed nodes' lists. While
    iterating over the preferred node's zonelist it skips those zones for
    which the hardwall cpuset check returns false. That means that for a task
    bound to a specific node using cpusets, fallback_alloc will always ignore
    free slabs on other nodes and go directly to the reclaimer, which,
    however, may allocate from other nodes if cpuset.mem_hardwall is unset
    (the default). As a result, the lists of free slabs on other nodes may
    grow without bound, which is bad, because inactive slabs are only evicted
    by cache_reap at a very slow rate and cannot be dropped forcefully.

    To reproduce the issue, run a process that will walk over a directory tree
    with lots of files inside a cpuset bound to a node that constantly
    experiences memory pressure. Look at num_slabs vs active_slabs growth as
    reported by /proc/slabinfo.

    To avoid this we should use softwall cpuset check in fallback_alloc.

    Signed-off-by: Vladimir Davydov
    Acked-by: Zefan Li
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Suppose task @t that belongs to a memory cgroup @memcg is going to
    allocate an object from a kmem cache @c. The copy of @c corresponding to
    @memcg, @mc, is empty. Then if kmem_cache_alloc races with the memory
    cgroup destruction we can access the memory cgroup's copy of the cache
    after it was destroyed:

    CPU0                                CPU1
    ----                                ----
    [ current=@t
      @mc->memcg_params->nr_pages=0 ]

    kmem_cache_alloc(@c):
      call memcg_kmem_get_cache(@c);
      proceed to allocation from @mc:
        alloc a page for @mc:
          ...
                                        move @t from @memcg
                                        destroy @memcg:
                                          mem_cgroup_css_offline(@memcg):
                                            memcg_unregister_all_caches(@memcg):
                                              kmem_cache_destroy(@mc)

        add page to @mc

    We could fix this issue by taking a reference to a per-memcg cache, but
    that would require adding a per-cpu reference counter to per-memcg
    caches, which would look cumbersome.

    Instead, let's take a reference to the memory cgroup, which already has a
    per-cpu reference counter, at the beginning of kmem_cache_alloc and drop
    it at the end, and move per-memcg cache destruction from css offline to
    css free. As a side effect, per-memcg caches will be destroyed not one by
    one, but all at once when the last page accounted to the memory cgroup is
    freed. This doesn't sound like a high price for code readability though.

    Note, this patch does add some overhead to the kmem_cache_alloc hot path,
    but it is pretty negligible - just a function call plus a per-cpu counter
    decrement, which is comparable to what we already have in
    memcg_kmem_get_cache. Besides, it's only relevant if there are memory
    cgroups with kmem accounting enabled. I don't think we can find a way to
    handle this race without it, because alloc_page called from
    kmem_cache_alloc may sleep, so we can't flush all pending kmallocs
    without reference counting.

    Signed-off-by: Vladimir Davydov
    Acked-by: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

12 Dec, 2014

1 commit

  • Pull cgroup update from Tejun Heo:
    "cpuset got simplified a bit. cgroup core got a fix on unified
    hierarchy and grew some effective css related interfaces which will be
    used for blkio support for writeback IO traffic which is currently
    being worked on"

    * 'for-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: implement cgroup_get_e_css()
    cgroup: add cgroup_subsys->css_e_css_changed()
    cgroup: add cgroup_subsys->css_released()
    cgroup: fix the async css offline wait logic in cgroup_subtree_control_write()
    cgroup: restructure child_subsys_mask handling in cgroup_subtree_control_write()
    cgroup: separate out cgroup_calc_child_subsys_mask() from cgroup_refresh_child_subsys_mask()
    cpuset: lock vs unlock typo
    cpuset: simplify cpuset_node_allowed API
    cpuset: convert callback_mutex to a spinlock

    Linus Torvalds
     

11 Dec, 2014

3 commits

  • The code goes BUG, but doesn't tell us which bits were unexpectedly set.
    Print that out.

    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Currently we print the slabinfo header in the seq start method, which
    makes it unusable for showing leaks, so we have leaks_show, which does
    practically the same as s_show except it doesn't show the header.

    However, we can print the header in the seq show method - we only need
    to check if the current element is the first on the list. This will
    allow us to use the same set of seq iterators for both leaks and
    slabinfo reporting, which is nice.
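
    A toy sketch of that check (the structures and output format are
    illustrative, not slabinfo's actual layout):

    #include <stdio.h>

    struct cache { const char *name; unsigned long active, total; };

    static void show(const struct cache *caches, size_t i)
    {
            if (i == 0)   /* first element: emit the header exactly once */
                    printf("# name            <active_objs> <num_objs>\n");
            printf("%-17s %13lu %10lu\n", caches[i].name,
                   caches[i].active, caches[i].total);
    }

    int main(void)
    {
            struct cache caches[] = {
                    { "kmalloc-64",  120, 128 },
                    { "kmalloc-128",  40,  64 },
            };

            for (size_t i = 0; i < sizeof(caches) / sizeof(caches[0]); i++)
                    show(caches, i);
            return 0;
    }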

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Some code in mm/slab.c and mm/slub.c uses spaces for indentation.
    Clean them up.

    Signed-off-by: LQYMGT
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    LQYMGT
     

04 Dec, 2014

1 commit

  • The bounds check for nodeid in ____cache_alloc_node gives false
    positives on machines where the node IDs are not contiguous, leading to
    a panic at boot time. For example, on a POWER8 machine the node IDs are
    typically 0, 1, 16 and 17. This means that num_online_nodes() returns
    4, so when ____cache_alloc_node is called with nodeid = 16 the VM_BUG_ON
    triggers, like this:

    kernel BUG at /home/paulus/kernel/kvm/mm/slab.c:3079!
    Call Trace:
    .____cache_alloc_node+0x5c/0x270 (unreliable)
    .kmem_cache_alloc_node_trace+0xdc/0x360
    .init_list+0x3c/0x128
    .kmem_cache_init+0x1dc/0x258
    .start_kernel+0x2a0/0x568
    start_here_common+0x20/0xa8

    To fix this, we instead compare the nodeid with MAX_NUMNODES, and
    additionally make sure it isn't negative (since nodeid is an int). The
    check is there mainly to protect the array dereference in the get_node()
    call in the next line, and the array being dereferenced is of size
    MAX_NUMNODES. If the nodeid is in range but invalid (for example if the
    node is off-line), the BUG_ON in the next line will catch that.
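
    A small model of why the old check misfires on sparse node IDs (the
    MAX_NUMNODES value is illustrative; it depends on CONFIG_NODES_SHIFT):

    #include <stdio.h>

    #define MAX_NUMNODES 256   /* illustrative; depends on CONFIG_NODES_SHIFT */

    int main(void)
    {
            /* POWER8-style sparse node IDs: 0, 1, 16, 17 are online. */
            int num_online_nodes = 4;
            int nodeid = 16;        /* valid, online node */

            /* Old check: compares against the count of online nodes. */
            int old_bug_on = nodeid < 0 || nodeid >= num_online_nodes;

            /* New check: only guards the MAX_NUMNODES-sized array dereference. */
            int new_bug_on = nodeid < 0 || nodeid >= MAX_NUMNODES;

            printf("old VM_BUG_ON fires: %d\n", old_bug_on);  /* 1: false positive */
            printf("new VM_BUG_ON fires: %d\n", new_bug_on);  /* 0 */
            return 0;
    }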

    Fixes: 14e50c6a9bc2 ("mm: slab: Verify the nodeid passed to ____cache_alloc_node")
    Signed-off-by: Paul Mackerras
    Reviewed-by: Yasuaki Ishimatsu
    Reviewed-by: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Mackerras
     

27 Oct, 2014

1 commit

  • The current cpuset API for checking whether a zone/node is allowed to be
    allocated from looks rather awkward. We have hardwall and softwall
    versions of cpuset_node_allowed, with the softwall version doing
    literally the same as the hardwall version if __GFP_HARDWALL is passed to
    it in the gfp flags. If it isn't, the softwall version may check the
    given node against the enclosing hardwall cpuset, which it needs to take
    the callback lock to do.

    Such a distinction was introduced by commit 02a0e53d8227 ("cpuset:
    rework cpuset_zone_allowed api"). Before, we had the only version with
    the __GFP_HARDWALL flag determining its behavior. The purpose of the
    commit was to avoid sleep-in-atomic bugs when someone would mistakenly
    call the function without the __GFP_HARDWALL flag for an atomic
    allocation. The suffixes introduced were intended to make the callers
    think before using the function.

    However, since the callback lock was converted from mutex to spinlock by
    the previous patch, the softwall check function cannot sleep, and these
    precautions are no longer necessary.

    So let's simplify the API back to the single check.
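
    A toy sketch of the single check (cpuset nesting is reduced to two flat
    node sets; the function body is illustrative, not kernel/cpuset.c's):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    #define __GFP_HARDWALL 0x1u   /* illustrative bit value */

    static const int own_cpuset_nodes[]        = { 0 };
    static const int hardwall_ancestor_nodes[] = { 0, 1 };

    static bool node_in(const int *nodes, size_t n, int node)
    {
            for (size_t i = 0; i < n; i++)
                    if (nodes[i] == node)
                            return true;
            return false;
    }

    /* One check; __GFP_HARDWALL in the flags selects the strict behavior. */
    static bool cpuset_node_allowed(int node, unsigned int gfp_mask)
    {
            if (node_in(own_cpuset_nodes, 1, node))
                    return true;
            if (gfp_mask & __GFP_HARDWALL)
                    return false;                       /* hardwall semantics */
            return node_in(hardwall_ancestor_nodes, 2, node);  /* softwall fallback */
    }

    int main(void)
    {
            printf("node 1, hardwall: %d\n", cpuset_node_allowed(1, __GFP_HARDWALL)); /* 0 */
            printf("node 1, softwall: %d\n", cpuset_node_allowed(1, 0));              /* 1 */
            return 0;
    }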

    Suggested-by: David Rientjes
    Signed-off-by: Vladimir Davydov
    Acked-by: Christoph Lameter
    Acked-by: Zefan Li
    Signed-off-by: Tejun Heo

    Vladimir Davydov
     

14 Oct, 2014

1 commit

  • Commit bf0dea23a9c0 ("mm/slab: use percpu allocator for cpu cache")
    changed the allocation method for the cpu cache array from the slab
    allocator to the percpu allocator. The alignment should be specified when
    asking the percpu allocator for aligned memory, but that commit
    mistakenly set this alignment to 0, so the percpu allocator returned
    unaligned memory addresses. This doesn't cause any problem on x86, which
    permits unaligned access, but it does on sparc64, which requires a strong
    alignment guarantee.

    The following bug report came from David Miller:

    I'm getting tons of the following on sparc64:

    [603965.383447] Kernel unaligned access at TPC[546b58] free_block+0x98/0x1a0
    [603965.396987] Kernel unaligned access at TPC[546b60] free_block+0xa0/0x1a0
    ...
    [603970.554394] log_unaligned: 333 callbacks suppressed
    ...

    This patch provides a proper alignment parameter when allocating the cpu
    cache, fixing the unaligned memory access problem on sparc64.
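
    A userspace illustration of why the alignment argument matters
    (aligned_alloc() stands in for passing a real alignment to the percpu
    allocator; the exact alignment value used by the patch is not shown
    here):

    #include <stdalign.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
            /* Models "alignment = 0": no guarantee beyond byte granularity. */
            char *blob = malloc(64);
            void *unaligned = blob ? blob + 1 : NULL;

            /* Models passing a real alignment to the allocator. */
            void *aligned = aligned_alloc(alignof(void *), 64);

            if (!blob || !aligned)
                    return 1;

            /* sparc64 traps on a pointer-sized load through the first address;
             * x86 silently tolerates it, which is why the bug hid there. */
            printf("unaligned low bits: %lu\n",
                   (unsigned long)((uintptr_t)unaligned & (sizeof(void *) - 1)));
            printf("aligned   low bits: %lu\n",
                   (unsigned long)((uintptr_t)aligned & (sizeof(void *) - 1)));

            free(blob);
            free(aligned);
            return 0;
    }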

    Reported-by: David Miller
    Tested-by: David Miller
    Tested-by: Meelis Roos
    Signed-off-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

10 Oct, 2014

7 commits

  • Using __seq_open_private() removes boilerplate code from slabstats_open().

    The resultant code is shorter and easier to follow.

    This patch does not change any functionality.

    Signed-off-by: Rob Jones
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rob Jones
     
  • Because of a chicken-and-egg problem, initialization of SLAB is really
    complicated. We need to allocate the cpu cache through SLAB to make
    kmem_cache work, but before kmem_cache is initialized, allocation through
    SLAB is impossible.

    On the other hand, SLUB does its initialization in a simpler way: it uses
    the percpu allocator to allocate the cpu cache, so there is no
    chicken-and-egg problem.

    So, this patch tries to use the percpu allocator in SLAB. This simplifies
    the initialization step in SLAB so that we can maintain the SLAB code
    more easily.

    In my testing there is no performance difference.

    This implementation relies on the percpu allocator. Because the percpu
    allocator uses vmalloc address space, vmalloc address space could be
    exhausted by this change on a many-cpu system with a *32-bit* kernel.
    This implementation can cover 1024 cpus in the worst case, per the
    following calculation.

    Worst:  1024 cpus * 4 bytes for pointer * 300 kmem_caches *
            120 objects per cpu_cache = 140 MB
    Normal: 1024 cpus * 4 bytes for pointer * 150 kmem_caches (slab merge) *
            80 objects per cpu_cache = 46 MB
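
    The figures above can be recomputed directly:

    #include <stdio.h>

    int main(void)
    {
            /* cpus * pointer size * kmem_caches * objects per cpu cache */
            long worst  = 1024L * 4 * 300 * 120;
            long normal = 1024L * 4 * 150 * 80;

            printf("worst : %ld bytes (~%ld MB)\n", worst,  worst  >> 20);  /* ~140 MB */
            printf("normal: %ld bytes (~%ld MB)\n", normal, normal >> 20);  /* ~46 MB */
            return 0;
    }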

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Jeremiah Mahler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Slab merging is a good feature for reducing fragmentation. If a newly
    created slab cache has a similar size and properties to an existing one,
    this feature reuses the existing cache rather than creating a new one. As
    a result, objects are packed into fewer slabs, so fragmentation is
    reduced.

    Below is the result of my testing.

    * After boot, sleep 20; cat /proc/meminfo | grep Slab

    Before: Slab: 25136 kB
    After:  Slab: 24364 kB

    We save about 3% of the memory used by slab.

    To support this feature in SLAB, we need to implement SLAB-specific
    kmem_cache_flags() and __kmem_cache_alias(), because SLUB implements some
    SLUB-specific processing related to the debug flags and object size
    changes in these functions.
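
    A toy sketch of the merge decision (the criteria are simplified; the
    kernel's find_mergeable() also considers alignment, constructors and
    debug flags):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    struct toy_cache { const char *name; size_t object_size; unsigned int flags; };

    static struct toy_cache caches[] = {
            { "kmalloc-96",   96, 0 },
            { "kmalloc-192", 192, 0 },
    };

    /* Reuse an existing cache when the size fits and the flags match. */
    static struct toy_cache *find_mergeable(size_t size, unsigned int flags)
    {
            for (size_t i = 0; i < sizeof(caches) / sizeof(caches[0]); i++) {
                    struct toy_cache *c = &caches[i];

                    if (c->flags == flags && size <= c->object_size)
                            return c;   /* close enough: reuse it */
            }
            return NULL;                /* no match: create a new cache */
    }

    int main(void)
    {
            struct toy_cache *c = find_mergeable(88, 0);

            printf("request for 88 bytes reuses: %s\n", c ? c->name : "(new cache)");
            return 0;
    }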

    Signed-off-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • cache_free_alien() is a rarely used function, needed only on a node
    mismatch. But it is defined with the inline attribute, so it gets inlined
    into __cache_free(), the core free function of the slab allocator. This
    needlessly makes the kmem_cache_free()/kfree() functions large. What we
    really need to inline is just the node-match check, so this patch factors
    the other parts of cache_free_alien() out to reduce the code size of
    kmem_cache_free()/kfree().

    nm -S mm/slab.o | grep -e "T kfree" -e "T kmem_cache_free"
    00000000000011e0 0000000000000228 T kfree
    0000000000000670 0000000000000216 T kmem_cache_free

    nm -S mm/slab.o | grep -e "T kfree" -e "T kmem_cache_free"
    0000000000001110 00000000000001b5 T kfree
    0000000000000750 0000000000000181 T kmem_cache_free

    You can see slightly reduced size of text: 0x228->0x1b5, 0x216->0x181.
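
    A toy sketch of the resulting shape (names are illustrative, not
    mm/slab.c's): only the cheap node-match test stays inline on the hot
    path, while the rare cross-node handling is kept out of line.

    #include <stdbool.h>
    #include <stdio.h>

    /* Rare cross-node path, deliberately kept out of line. */
    static __attribute__((noinline)) void free_obj_to_alien_cache(int node, void *obj)
    {
            printf("slow path: queue %p for node %d\n", obj, node);
    }

    /* The only part worth inlining into the hot free path. */
    static inline bool free_needs_alien(int obj_node, int local_node)
    {
            return obj_node != local_node;
    }

    int main(void)
    {
            int local_node = 0;
            int dummy;
            void *obj = &dummy;

            if (free_needs_alien(1, local_node))
                    free_obj_to_alien_cache(1, obj);
            if (!free_needs_alien(0, local_node))
                    printf("fast path: free %p locally\n", obj);
            return 0;
    }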

    Signed-off-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Our intention for __ac_put_obj() is that it should not affect anything
    when sk_memalloc_socks() is disabled. But because __ac_put_obj() is so
    small, the compiler inlines it into ac_put_obj() and affects the code
    size of the free path. This patch adds the noinline keyword to
    __ac_put_obj() so that it does not disturb the normal free path at all.

    nm -S slab-orig.o |
    grep -e "t cache_alloc_refill" -e "T kfree" -e "T kmem_cache_free"

    0000000000001e80 00000000000002f5 t cache_alloc_refill
    0000000000001230 0000000000000258 T kfree
    0000000000000690 000000000000024c T kmem_cache_free

    nm -S slab-patched.o |
    grep -e "t cache_alloc_refill" -e "T kfree" -e "T kmem_cache_free"

    0000000000001e00 00000000000002e5 t cache_alloc_refill
    00000000000011e0 0000000000000228 T kfree
    0000000000000670 0000000000000216 T kmem_cache_free

    cache_alloc_refill: 0x2f5->0x2e5
    kfree: 0x258->0x228
    kmem_cache_free: 0x24c->0x216

    The code size of each function is reduced slightly.

    Signed-off-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Currently, because of a likely() annotation, the compiled code of
    cache_flusharray() is placed in the unlikely (cold) text section.
    Although flushing the array is less common than freeing to the cpu
    cache, it is more common than free_block(), yet free_block() sits in the
    normal text section. This patch removes the likely keyword to fix this
    odd situation.

    Signed-off-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Currently, we track the caller only if tracing or slab debugging is
    enabled. If they are disabled, we could save the overhead of passing one
    extra argument by calling __kmalloc(_node)(), but I think the saving
    would be marginal. Furthermore, the default slab allocator, SLUB, doesn't
    use this technique, so I think it's okay to change this situation.

    After this change, we can turn CONFIG_DEBUG_SLAB on and off without a
    full kernel rebuild and remove some complicated '#if' definitions. That
    looks more beneficial to me.

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

28 Sep, 2014

1 commit

  • Pull cgroup fixes from Tejun Heo:
    "This is quite late but these need to be backported anyway.

    This is the fix for a long-standing cpuset bug which has existed since
    2009. cpuset makes use of the PF_SPREAD_{PAGE|SLAB} flags to modify the
    task's memory allocation behavior according to the settings of the
    cpuset it belongs to; unfortunately, when those flags have to be
    changed, cpuset did so directly even while the target task was running,
    which is obviously racy as task->flags may be modified by the task
    itself at any time. This obscure bug manifested as a corrupt
    PF_USED_MATH flag leading to a weird crash.

    The bug is fixed by moving the flag to task->atomic_flags. The first
    two patches are preparatory ones to help define the atomic_flags
    accessors, and the third one is the actual fix"

    * 'for-3.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cpuset: PF_SPREAD_PAGE and PF_SPREAD_SLAB should be atomic flags
    sched: add macros to define bitops for task atomic flags
    sched: fix confusing PFA_NO_NEW_PRIVS constant

    Linus Torvalds
     

26 Sep, 2014

1 commit

  • Since commit 4590685546a3 ("mm/sl[aou]b: Common alignment code"), the
    "ralign" automatic variable in __kmem_cache_create() may be used
    uninitialized.

    The proper alignment defaults to BYTES_PER_WORD and can be overridden by
    SLAB_RED_ZONE or the alignment specified by the caller.

    This fixes https://bugzilla.kernel.org/show_bug.cgi?id=85031

    Signed-off-by: David Rientjes
    Reported-by: Andrei Elovikov
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

25 Sep, 2014

1 commit

  • When we change cpuset.memory_spread_{page,slab}, cpuset will flip
    PF_SPREAD_{PAGE,SLAB} bit of tsk->flags for each task in that cpuset.
    This should be done using atomic bitops, but currently we don't,
    which is broken.

    Tetsuo reported a hard-to-reproduce kernel crash on RHEL6, which happened
    when one thread tried to clear PF_USED_MATH while at the same time another
    thread tried to flip PF_SPREAD_PAGE/PF_SPREAD_SLAB. They both operate on
    the same task.

    Here's the full report:
    https://lkml.org/lkml/2014/9/19/230

    To fix this, we make PF_SPREAD_PAGE and PF_SPREAD_SLAB atomic flags.

    v4:
    - updated mm/slab.c. (Fengguang Wu)
    - updated Documentation.

    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Miao Xie
    Cc: Kees Cook
    Fixes: 950592f7b991 ("cpusets: update tasks' page/slab spread flags in time")
    Cc: # 2.6.31+
    Reported-by: Tetsuo Handa
    Signed-off-by: Zefan Li
    Signed-off-by: Tejun Heo

    Zefan Li
     

09 Aug, 2014

1 commit

  • This reverts commit a640616822b2 ("slab: remove BAD_ALIEN_MAGIC").

    Commit a640616822b2 ("slab: remove BAD_ALIEN_MAGIC") assumes that a
    system with !CONFIG_NUMA has only one memory node. But Geert's report
    shows this to be false: his system, m68k, has many memory nodes and is
    configured with !CONFIG_NUMA, so it couldn't boot with the above change.

    Here is his failure report.

    With latest mainline, I'm getting a crash during bootup on m68k/ARAnyM:

    enable_cpucache failed for radix_tree_node, error 12.
    kernel BUG at /scratch/geert/linux/linux-m68k/mm/slab.c:1522!
    *** TRAP #7 *** FORMAT=0
    Current process id is 0
    BAD KERNEL TRAP: 00000000
    Modules linked in:
    PC: [] kmem_cache_init_late+0x70/0x8c
    SR: 2200 SP: 00345f90 a2: 0034c2e8
    d0: 0000003d d1: 00000000 d2: 00000000 d3: 003ac942
    d4: 00000000 d5: 00000000 a0: 0034f686 a1: 0034f682
    Process swapper (pid: 0, task=0034c2e8)
    Frame format=0
    Stack from 00345fc4:
    002f69ef 002ff7e5 000005f2 000360fa 0017d806 003921d4 00000000
    00000000 00000000 00000000 00000000 00000000 003ac942 00000000
    003912d6
    Call Trace: [] parse_args+0x0/0x2ca
    [] strlen+0x0/0x1a
    [] start_kernel+0x23c/0x428
    [] _sinittext+0x2d6/0x95e

    Code: f7e5 4879 002f 69ef 61ff ffca 462a 4e47 0035 4b1c 61ff
    fff0 0cc4 7005 23c0 0037 fd20 588f 265f 285f 4e75 48e7 301c
    Disabling lock debugging due to kernel taint
    Kernel panic - not syncing: Attempted to kill the idle task!

    Although there is an alternative way to fix this issue, such as disabling
    the use of the alien cache on !CONFIG_NUMA, reverting the offending
    commit seems better to me at this time.

    Signed-off-by: Joonsoo Kim
    Reported-by: Geert Uytterhoeven
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

07 Aug, 2014

13 commits

  • The current struct kmem_cache has no 'lock' field; slab pages are managed
    by struct kmem_cache_node, which has a 'list_lock' field.

    Clean up the related comment.

    Signed-off-by: Wang Sheng-Hui
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Sheng-Hui
     
  • It is better to represent an allocation size as a size_t rather than an
    int, so change it.

    Signed-off-by: Joonsoo Kim
    Suggested-by: Andrew Morton
    Cc: Christoph Lameter
    Reviewed-by: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • BAD_ALIEN_MAGIC value isn't used anymore. So remove it.

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Now there is no code that holds two locks simultaneously, since we don't
    call slab_destroy() while holding any lock. So the lockdep annotation is
    useless now. Remove it.

    v2: don't remove BAD_ALIEN_MAGIC in this patch. It will be removed
    in the following patch.

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • I haven't heard that this alien cache lock is contended, but reducing the
    chance of contention is generally a good thing. And with this change, we
    can simplify the complex lockdep annotation in the slab code; that is
    done in the following patch.

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Now that we have a separate alien_cache structure, it's better to hold the
    alien_cache lock while manipulating the alien_cache. After that, we no
    longer need the lock on array_cache, so remove it.

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Currently, we use array_cache for the alien cache. Although the two are
    mostly similar, there is one difference: the need for a spinlock. We
    don't need a spinlock for array_cache itself, but to use array_cache for
    the alien cache, the array_cache structure has to carry a spinlock. This
    is needless overhead, so removing it would be better. This patch prepares
    for that by introducing alien_cache and using it. In the following patch,
    we remove the spinlock from array_cache.

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Factor out the initialization of the array cache so that it can be reused
    in the following patch.

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • In free_block(), if freeing an object produces a new free slab and the
    number of free_objects exceeds free_limit, we destroy this new free slab
    while holding the kmem_cache node lock. Holding the lock here is
    unnecessary and, in general, holding a lock for as short a time as
    possible is a good thing. I never measured the performance effect of
    this, but we are better off not holding the lock longer than we have to.

    Commented by Christoph:
    This is also good because kmem_cache_free is no longer called while
    holding the node lock. So we avoid one case of recursion.

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • The node isn't changed, so we don't need to retrieve this structure every
    time we move an object. Maybe the compiler already does this
    optimization, but making it explicit is better.

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • This patchset does some cleanup and tries to remove lockdep annotation.

    Patches 1~2 are really minor improvements.
    Patches 3~9 are for cleaning up and removing the lockdep annotation.

    There are two cases where lockdep annotation is needed in SLAB:
    1) holding two node locks
    2) holding two array cache (alien cache) locks

    I looked at the code and found that we can avoid these cases without any
    negative effect.

    1) occurs if freeing an object produces a new free slab and we decide to
    destroy it. Although we don't need to hold the lock while destroying a
    slab, the current code does so. Destroying a slab without holding the
    lock would help reduce lock contention. To do this, I change the
    implementation so that the new free slab is destroyed after releasing
    the lock.

    2) occurs in a similar situation. When we free an object from a
    non-local node, we put the object into the alien cache while holding the
    alien cache lock. If the alien cache is full, we try to flush it to the
    proper node cache, and at this point a new free slab could be produced.
    Destroying it means freeing a metadata object that comes from another
    node, and to free that object we need the other node's alien cache lock.
    This forces us to hold two alien cache locks, which requires lockdep
    annotation even though the locks are always different and deadlock is
    not possible. To prevent this situation, I use the same approach as in
    1).

    In this way, we can avoid cases 1) and 2) and thus remove the lockdep
    annotation. As the diff shortstat shows, this makes the SLAB code much
    simpler.

    This patch (of 9):

    slab_should_failslab() is called on every allocation, so optimizing it is
    reasonable. We normally don't allocate from the kmem_cache cache; it is
    only used when a new kmem_cache is created, which is a very rare case.
    Therefore, add the unlikely macro to help compiler optimization.
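
    A toy sketch of the annotation (unlikely() expands to __builtin_expect as
    in the kernel; the function body is a simplification, since the real
    slab_should_failslab() also consults the fault-injection framework):

    #include <stdbool.h>
    #include <stdio.h>

    /* unlikely() is the kernel's wrapper around __builtin_expect. */
    #define unlikely(x) __builtin_expect(!!(x), 0)

    struct toy_cache { const char *name; };
    static struct toy_cache kmem_cache_cache = { "kmem_cache" };

    /* Returns whether this allocation is even eligible for fault injection. */
    static bool toy_should_failslab(const struct toy_cache *cachep)
    {
            if (unlikely(cachep == &kmem_cache_cache))
                    return false;   /* rare: creating a new kmem_cache itself */
            return true;            /* the common, straight-line path */
    }

    int main(void)
    {
            struct toy_cache dentry = { "dentry" };

            printf("%d %d\n", toy_should_failslab(&kmem_cache_cache),
                   toy_should_failslab(&dentry));   /* 0 1 */
            return 0;
    }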

    Signed-off-by: Joonsoo Kim
    Acked-by: David Rientjes
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Use the two functions to simplify the code, avoiding numerous explicitly
    coded checks for whether a certain node is online.

    Get rid of various repeated calculations of kmem_cache_node structures.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Christoph Lameter
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • init_lock_keys is only called by __init kmem_cache_init_late

    Signed-off-by: Fabian Frederick
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     

24 Jun, 2014

1 commit

  • Commit b1cb0982bdd6 ("change the management method of free objects of
    the slab") introduced a bug in the slab leak detector
    ('/proc/slab_allocators'). This detector works as described below:

    1. traverse all objects on all the slabs.
    2. determine whether each object is active or not.
    3. if active, print who allocated this object.

    But that commit changed the way free objects are managed, so the logic
    that determines whether an object is active also changed. Before, we
    regarded objects in cpu caches as inactive, but with this commit we
    mistakenly regard objects in cpu caches as active.

    This introduces a kernel oops if DEBUG_PAGEALLOC is enabled. With
    DEBUG_PAGEALLOC, kernel_map_pages() is used to detect who corrupts free
    memory in the slab: it unmaps the page table mapping if an object is
    free and maps it if the object is active. When the slab leak detector
    checks an object in a cpu cache, it mistakenly thinks the object is
    active and tries to access the object's memory to retrieve the caller of
    the allocation. At this point, no page table mapping for this object
    exists, so an oops occurs.

    The following is the oops message reported by Dave:

    It blew up when something tried to read /proc/slab_allocators
    (Just cat it, and you should see the oops below)

    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Modules linked in:
    [snip...]
    CPU: 1 PID: 9386 Comm: trinity-c33 Not tainted 3.14.0-rc5+ #131
    task: ffff8801aa46e890 ti: ffff880076924000 task.ti: ffff880076924000
    RIP: 0010:[] [] handle_slab+0x8a/0x180
    RSP: 0018:ffff880076925de0 EFLAGS: 00010002
    RAX: 0000000000001000 RBX: 0000000000000000 RCX: 000000005ce85ce7
    RDX: ffffea00079be100 RSI: 0000000000001000 RDI: ffff880107458000
    RBP: ffff880076925e18 R08: 0000000000000001 R09: 0000000000000000
    R10: 0000000000000000 R11: 000000000000000f R12: ffff8801e6f84000
    R13: ffffea00079be100 R14: ffff880107458000 R15: ffff88022bb8d2c0
    FS: 00007fb769e45740(0000) GS:ffff88024d040000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffff8801e6f84ff8 CR3: 00000000a22db000 CR4: 00000000001407e0
    DR0: 0000000002695000 DR1: 0000000002695000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000070602
    Call Trace:
    leaks_show+0xce/0x240
    seq_read+0x28e/0x490
    proc_reg_read+0x3d/0x80
    vfs_read+0x9b/0x160
    SyS_read+0x58/0xb0
    tracesys+0xd4/0xd9
    Code: f5 00 00 00 0f 1f 44 00 00 48 63 c8 44 3b 0c 8a 0f 84 e3 00 00 00 83 c0 01 44 39 c0 72 eb 41 f6 47 1a 01 0f 84 e9 00 00 00 89 f0 8b 4c 04 f8 4d 85 c9 0f 84 88 00 00 00 49 8b 7e 08 4d 8d 46
    RIP handle_slab+0x8a/0x180

    To fix the problem, I introduce an object status buffer on each slab.
    With this, we can track object status precisely, so the slab leak
    detector will not access objects that are actually free and no kernel
    oops will occur. The memory overhead of this fix is only imposed on
    CONFIG_DEBUG_SLAB_LEAK, which is mainly used for debugging, so the
    overhead isn't a big problem.
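
    A toy model of the approach (OBJECT_FREE/OBJECT_ACTIVE and handle_slab()
    mirror the idea, not the exact mm/slab.c implementation; the point is
    that the scanner never touches objects marked free):

    #include <stdio.h>

    enum obj_status { OBJECT_FREE, OBJECT_ACTIVE };

    #define OBJS_PER_SLAB 4

    struct toy_slab {
            enum obj_status status[OBJS_PER_SLAB];
            const char *caller[OBJS_PER_SLAB];   /* stand-in for the stored caller */
    };

    static void handle_slab(const struct toy_slab *slab)
    {
            for (int i = 0; i < OBJS_PER_SLAB; i++) {
                    if (slab->status[i] != OBJECT_ACTIVE)
                            continue;   /* free object: never touch its memory */
                    printf("object %d allocated by %s\n", i, slab->caller[i]);
            }
    }

    int main(void)
    {
            struct toy_slab slab = {
                    .status = { OBJECT_ACTIVE, OBJECT_FREE, OBJECT_ACTIVE, OBJECT_FREE },
                    .caller = { "alloc_inode", NULL, "getname", NULL },
            };

            handle_slab(&slab);
            return 0;
    }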

    Signed-off-by: Joonsoo Kim
    Reported-by: Dave Jones
    Reported-by: Tetsuo Handa
    Reviewed-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

05 Jun, 2014

1 commit

  • Currently we have two pairs of kmemcg-related functions that are called on
    slab alloc/free. The first is memcg_{bind,release}_pages that count the
    total number of pages allocated on a kmem cache. The second is
    memcg_{un}charge_slab that {un}charge slab pages to kmemcg resource
    counter. Let's just merge them to keep the code clean.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov