14 Oct, 2014

1 commit

  • Commit bf0dea23a9c0 ("mm/slab: use percpu allocator for cpu cache")
    changed the allocation method for the cpu cache array from the slab
    allocator to the percpu allocator. The percpu allocator must be given an
    alignment to return suitably aligned memory, but that commit mistakenly
    passed an alignment of 0, so the percpu allocator returns unaligned
    addresses. This causes no problem on x86, which permits unaligned access,
    but it breaks sparc64, which requires strict alignment.

    The following bug report came from David Miller.

    I'm getting tons of the following on sparc64:

    [603965.383447] Kernel unaligned access at TPC[546b58] free_block+0x98/0x1a0
    [603965.396987] Kernel unaligned access at TPC[546b60] free_block+0xa0/0x1a0
    ...
    [603970.554394] log_unaligned: 333 callbacks suppressed
    ...

    This patch passes a proper alignment when allocating the cpu cache,
    fixing the unaligned memory accesses on sparc64.
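
    As a rough sketch of the fix (the call site is the cpu cache allocation
    helper in mm/slab.c; treat the surrounding context as illustrative), the
    point is simply to request pointer alignment instead of 0:

        size_t size;
        struct array_cache __percpu *cpu_cache;

        size = sizeof(void *) * entries + sizeof(struct array_cache);
        /* was __alloc_percpu(size, 0): an alignment of 0 lets the percpu
         * allocator hand back addresses that are not even pointer-aligned */
        cpu_cache = __alloc_percpu(size, sizeof(void *));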

    Reported-by: David Miller
    Tested-by: David Miller
    Tested-by: Meelis Roos
    Signed-off-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

10 Oct, 2014

7 commits

  • Using __seq_open_private() removes boilerplate code from slabstats_open()

    The resultant code is shorter and easier to follow.

    This patch does not change any functionality.
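
    For reference, the simplified open routine looks roughly like this
    (slabstats_op is the existing seq_operations for /proc/slab_allocators;
    the private buffer size follows the original code):

        static int slabstats_open(struct inode *inode, struct file *file)
        {
                unsigned long *n;

                /* allocates, zeroes and attaches the private area in one call */
                n = __seq_open_private(file, &slabstats_op, PAGE_SIZE);
                if (!n)
                        return -ENOMEM;

                *n = PAGE_SIZE;
                return 0;
        }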

    Signed-off-by: Rob Jones
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rob Jones
     
  • Because of a chicken-and-egg problem, SLAB initialization is quite
    complicated: the cpu cache must be allocated through SLAB to make a
    kmem_cache work, but before the kmem_cache is initialized, allocating
    through SLAB is impossible.

    SLUB, on the other hand, initializes in a simpler way: it uses the percpu
    allocator for its cpu caches, so there is no chicken-and-egg problem.

    So this patch uses the percpu allocator in SLAB as well. This simplifies
    the initialization steps so that the SLAB code becomes easier to
    maintain.

    In my testing there is no performance difference.

    This implementation relies on the percpu allocator, which uses vmalloc
    address space, so on a *32 bit* kernel with many cpus this change could
    exhaust the vmalloc address space. Even in the worst case it can cover
    1024 cpus, by the following calculation:

    Worst: 1024 cpus * 4 bytes for pointer * 300 kmem_caches *
    120 objects per cpu_cache = 140 MB
    Normal: 1024 cpus * 4 bytes for pointer * 150 kmem_caches(slab merge) *
    80 objects per cpu_cache = 46 MB
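
    A minimal sketch of the resulting allocation helper (this mirrors the
    shape of alloc_kmem_cache_cpus() in mm/slab.c; error handling is trimmed,
    and the alignment argument is the one the 14 Oct fix above corrects):

        static struct array_cache __percpu *alloc_kmem_cache_cpus(
                        struct kmem_cache *cachep, int entries, int batchcount)
        {
                int cpu;
                size_t size;
                struct array_cache __percpu *cpu_cache;

                /* one array_cache header plus 'entries' object pointers per cpu */
                size = sizeof(void *) * entries + sizeof(struct array_cache);
                cpu_cache = __alloc_percpu(size, sizeof(void *));
                if (!cpu_cache)
                        return NULL;

                for_each_possible_cpu(cpu)
                        init_arraycache(per_cpu_ptr(cpu_cache, cpu),
                                        entries, batchcount);

                return cpu_cache;
        }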

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Jeremiah Mahler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Slab merge is a good feature for reducing fragmentation. If a newly
    created cache has a similar size and properties to an existing one, this
    feature reuses the existing cache rather than creating a new one. As a
    result, objects are packed into fewer slabs and fragmentation is reduced.

    Below are my test results.

    * After boot, sleep 20; cat /proc/meminfo | grep Slab

    Before: Slab: 25136 kB

    After:  Slab: 24364 kB

    This saves about 3% of the memory used by slab.

    To support this feature in SLAB, we need SLAB-specific implementations of
    kmem_cache_flags() and __kmem_cache_alias(), because SLUB performs some
    SLUB-specific processing related to debug flags and object size changes
    in these functions.
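
    For illustration, a hedged sketch of the SLAB-side __kmem_cache_alias():
    it leans on the common find_mergeable() helper and only bumps the
    reference count and object size of the cache it reuses (treat the body as
    an approximation of the actual mm/slab.c code):

        struct kmem_cache *
        __kmem_cache_alias(const char *name, size_t size, size_t align,
                           unsigned long flags, void (*ctor)(void *))
        {
                struct kmem_cache *cachep;

                cachep = find_mergeable(size, align, flags, name, ctor);
                if (cachep) {
                        cachep->refcount++;
                        /* clear the whole object on kzalloc of the merged cache */
                        cachep->object_size = max_t(int, cachep->object_size, size);
                }
                return cachep;
        }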

    Signed-off-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • cache_free_alien() is rarely needed; it only does real work when the node
    of the object being freed does not match the local node. But it is
    defined with the inline attribute, so it is inlined into __cache_free(),
    the core free function of the slab allocator, and needlessly bloats
    kmem_cache_free()/kfree(). All we really need to inline is the node-match
    check, so this patch factors the rest of cache_free_alien() out of line
    to reduce the code size of kmem_cache_free()/kfree().

    Before:
    nm -S mm/slab.o | grep -e "T kfree" -e "T kmem_cache_free"
    00000000000011e0 0000000000000228 T kfree
    0000000000000670 0000000000000216 T kmem_cache_free

    After:
    nm -S mm/slab.o | grep -e "T kfree" -e "T kmem_cache_free"
    0000000000001110 00000000000001b5 T kfree
    0000000000000750 0000000000000181 T kmem_cache_free

    The text size is slightly reduced: 0x228->0x1b5 for kfree and
    0x216->0x181 for kmem_cache_free.
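
    A sketch of the resulting split (the shape follows mm/slab.c; the body of
    the out-of-line helper is elided here):

        static noinline int __cache_free_alien(struct kmem_cache *cachep,
                                               void *objp, int node,
                                               int page_node)
        {
                /* rare cross-node free: push into the alien cache (out of line) */
                return 1;
        }

        static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)
        {
                int page_node = page_to_nid(virt_to_page(objp));
                int node = numa_mem_id();

                /* the only part worth inlining: is this a local-node free? */
                if (likely(node == page_node))
                        return 0;

                return __cache_free_alien(cachep, objp, node, page_node);
        }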

    Signed-off-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • The intention of __ac_put_obj() is that it should have no effect when
    sk_memalloc_socks() is disabled. But because __ac_put_obj() is so small,
    the compiler inlines it into ac_put_obj() and inflates the code size of
    the free path. This patch adds the noinline keyword to __ac_put_obj() so
    that it does not disturb the normal free path at all.

    nm -S slab-orig.o |
    grep -e "t cache_alloc_refill" -e "T kfree" -e "T kmem_cache_free"

    0000000000001e80 00000000000002f5 t cache_alloc_refill
    0000000000001230 0000000000000258 T kfree
    0000000000000690 000000000000024c T kmem_cache_free

    nm -S slab-patched.o |
    grep -e "t cache_alloc_refill" -e "T kfree" -e "T kmem_cache_free"

    0000000000001e00 00000000000002e5 t cache_alloc_refill
    00000000000011e0 0000000000000228 T kfree
    0000000000000670 0000000000000216 T kmem_cache_free

    cache_alloc_refill: 0x2f5->0x2e5
    kfree: 0x258->0x228
    kmem_cache_free: 0x24c->0x216

    The code size of each function is reduced slightly.
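
    Roughly, the pattern looks like this (simplified: the real helpers also
    deal with encoding the pfmemalloc state into the object pointer):

        /* kept out of line so the rare pfmemalloc case doesn't bloat callers */
        static noinline void *__ac_put_obj(struct kmem_cache *cachep,
                                           struct array_cache *ac, void *objp)
        {
                /* pfmemalloc bookkeeping for swap-over-network, rarely taken */
                if (unlikely(PageSlabPfmemalloc(virt_to_head_page(objp))))
                        set_obj_pfmemalloc(&objp);
                return objp;
        }

        static inline void ac_put_obj(struct kmem_cache *cachep,
                                      struct array_cache *ac, void *objp)
        {
                if (unlikely(sk_memalloc_socks()))
                        objp = __ac_put_obj(cachep, ac, objp);

                ac->entry[ac->avail++] = objp;
        }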

    Signed-off-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Due to a likely keyword, the compiled code of cache_flusharray() currently
    ends up in the unlikely text section. Although flushing is an uncommon
    case compared to freeing into the cpu cache, it is more common than
    free_block(), and yet free_block() is in the normal text section. This
    patch removes the likely keyword to fix this odd situation.
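
    The call site in question looks roughly like this (a sketch of the tail
    of the free fast path in mm/slab.c):

        if (ac->avail < ac->limit) {            /* was: likely(ac->avail < ac->limit) */
                STATS_INC_FREEHIT(cachep);
        } else {
                STATS_INC_FREEMISS(cachep);
                cache_flusharray(cachep, ac);   /* no longer pushed into unlikely text */
        }

        ac_put_obj(cachep, ac, objp);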

    Signed-off-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Currently, we only track the caller when tracing or slab debugging is
    enabled. When they are disabled, we save the overhead of passing one
    extra argument by calling __kmalloc(_node)() instead, but that saving is
    marginal. Furthermore, the default slab allocator, SLUB, doesn't use this
    technique, so it seems fine to change this behaviour.

    After this change, we can turn CONFIG_DEBUG_SLAB on/off without a full
    kernel rebuild and remove some complicated '#if' definitions. That looks
    more beneficial to me.
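
    A hedged sketch of the resulting header-side definition (the shape of
    include/linux/slab.h after the change; previously the #define fell back
    to plain __kmalloc() unless debugging or tracing was configured):

        extern void *__kmalloc_track_caller(size_t size, gfp_t flags,
                                            unsigned long caller);
        #define kmalloc_track_caller(size, flags) \
                __kmalloc_track_caller(size, flags, _RET_IP_)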

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

28 Sep, 2014

1 commit

  • Pull cgroup fixes from Tejun Heo:
    "This is quite late but these need to be backported anyway.

    This is the fix for a long-standing cpuset bug which has existed since
    2009. cpuset makes use of the PF_SPREAD_{PAGE|SLAB} flags to modify the
    task's memory allocation behavior according to the settings of the
    cpuset it belongs to; unfortunately, when those flags have to be
    changed, cpuset did so directly even while the target task is running,
    which is obviously racy as task->flags may be modified by the task
    itself at any time. This obscure bug manifested as a corrupt
    PF_USED_MATH flag leading to a weird crash.

    The bug is fixed by moving the flags to task->atomic_flags. The first
    two patches are preparatory ones to help define atomic_flags accessors
    and the third one is the actual fix"

    * 'for-3.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cpuset: PF_SPREAD_PAGE and PF_SPREAD_SLAB should be atomic flags
    sched: add macros to define bitops for task atomic flags
    sched: fix confusing PFA_NO_NEW_PRIVS constant

    Linus Torvalds
     

26 Sep, 2014

1 commit

  • Since commit 4590685546a3 ("mm/sl[aou]b: Common alignment code"), the
    "ralign" automatic variable in __kmem_cache_create() may be used
    uninitialized.

    The proper alignment defaults to BYTES_PER_WORD and can be overridden by
    SLAB_RED_ZONE or the alignment specified by the caller.

    This fixes https://bugzilla.kernel.org/show_bug.cgi?id=85031
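
    A minimal sketch of the intended behaviour described above (illustrative,
    not the exact diff): ralign starts from the word-size default and is only
    ever raised, never left indeterminate:

        size_t ralign = BYTES_PER_WORD;         /* default object alignment  */

        if (flags & SLAB_RED_ZONE)
                ralign = REDZONE_ALIGN;         /* red zoning needs more     */

        if (ralign < cachep->align)
                ralign = cachep->align;         /* caller-mandated alignment */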

    Signed-off-by: David Rientjes
    Reported-by: Andrei Elovikov
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

25 Sep, 2014

1 commit

  • When we change cpuset.memory_spread_{page,slab}, cpuset will flip the
    PF_SPREAD_{PAGE,SLAB} bit of tsk->flags for each task in that cpuset.
    This should be done using atomic bitops, but currently it isn't, which
    is broken.

    Tetsuo reported a hard-to-reproduce kernel crash on RHEL6, which happened
    when one thread tried to clear PF_USED_MATH while at the same time another
    thread tried to flip PF_SPREAD_PAGE/PF_SPREAD_SLAB. They both operate on
    the same task.

    Here's the full report:
    https://lkml.org/lkml/2014/9/19/230

    To fix this, we make PF_SPREAD_PAGE and PF_SPREAD_SLAB atomic flags.

    v4:
    - updated mm/slab.c. (Fengguang Wu)
    - updated Documentation.
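
    For reference, the accessors added by the series have roughly this shape
    (test/set/clear wrappers generated over the new task->atomic_flags word;
    treat the exact macro bodies as an approximation):

        #define TASK_PFA_TEST(name, func)                                   \
                static inline bool task_##func(struct task_struct *p)       \
                { return test_bit(PFA_##name, &p->atomic_flags); }

        #define TASK_PFA_SET(name, func)                                    \
                static inline void task_set_##func(struct task_struct *p)   \
                { set_bit(PFA_##name, &p->atomic_flags); }

        #define TASK_PFA_CLEAR(name, func)                                  \
                static inline void task_clear_##func(struct task_struct *p) \
                { clear_bit(PFA_##name, &p->atomic_flags); }

        TASK_PFA_TEST(SPREAD_PAGE, spread_page)
        TASK_PFA_SET(SPREAD_PAGE, spread_page)
        TASK_PFA_CLEAR(SPREAD_PAGE, spread_page)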

    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Miao Xie
    Cc: Kees Cook
    Fixes: 950592f7b991 ("cpusets: update tasks' page/slab spread flags in time")
    Cc: # 2.6.31+
    Reported-by: Tetsuo Handa
    Signed-off-by: Zefan Li
    Signed-off-by: Tejun Heo

    Zefan Li
     

09 Aug, 2014

1 commit

  • This reverts commit a640616822b2 ("slab: remove BAD_ALIEN_MAGIC").

    Commit a640616822b2 ("slab: remove BAD_ALIEN_MAGIC") assumed that a
    system with !CONFIG_NUMA has only one memory node. But a report from
    Geert shows that this is false: his m68k system has many memory nodes
    and is configured with !CONFIG_NUMA, so it could not boot with the above
    change.

    His failure report follows.

    With latest mainline, I'm getting a crash during bootup on m68k/ARAnyM:

    enable_cpucache failed for radix_tree_node, error 12.
    kernel BUG at /scratch/geert/linux/linux-m68k/mm/slab.c:1522!
    *** TRAP #7 *** FORMAT=0
    Current process id is 0
    BAD KERNEL TRAP: 00000000
    Modules linked in:
    PC: [] kmem_cache_init_late+0x70/0x8c
    SR: 2200 SP: 00345f90 a2: 0034c2e8
    d0: 0000003d d1: 00000000 d2: 00000000 d3: 003ac942
    d4: 00000000 d5: 00000000 a0: 0034f686 a1: 0034f682
    Process swapper (pid: 0, task=0034c2e8)
    Frame format=0
    Stack from 00345fc4:
    002f69ef 002ff7e5 000005f2 000360fa 0017d806 003921d4 00000000
    00000000 00000000 00000000 00000000 00000000 003ac942 00000000
    003912d6
    Call Trace: [] parse_args+0x0/0x2ca
    [] strlen+0x0/0x1a
    [] start_kernel+0x23c/0x428
    [] _sinittext+0x2d6/0x95e

    Code: f7e5 4879 002f 69ef 61ff ffca 462a 4e47 0035 4b1c 61ff
    fff0 0cc4 7005 23c0 0037 fd20 588f 265f 285f 4e75 48e7 301c
    Disabling lock debugging due to kernel taint
    Kernel panic - not syncing: Attempted to kill the idle task!

    Although there is an alternative way to fix this issue, such as disabling
    use of the alien cache on !CONFIG_NUMA, reverting the offending commit
    seems better for now.

    Signed-off-by: Joonsoo Kim
    Reported-by: Geert Uytterhoeven
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

07 Aug, 2014

13 commits

  • The current struct kmem_cache has no 'lock' field; slab pages are managed
    by struct kmem_cache_node, which has a 'list_lock' field.

    Clean up the related comment.

    Signed-off-by: Wang Sheng-Hui
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Sheng-Hui
     
  • It is better to represent an allocation size as size_t rather than int,
    so change it.

    Signed-off-by: Joonsoo Kim
    Suggested-by: Andrew Morton
    Cc: Christoph Lameter
    Reviewed-by: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • BAD_ALIEN_MAGIC value isn't used anymore. So remove it.

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Now there is no code that holds two of these locks simultaneously, since
    we no longer call slab_destroy() while holding any lock. The lockdep
    annotation is therefore useless; remove it.

    v2: don't remove BAD_ALIEN_MAGIC in this patch. It will be removed
    in the following patch.

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • I haven't heard that this alien cache lock is contended, but reducing the
    chance of contention is generally better. With this change we can also
    simplify the complex lockdep annotation in the slab code; that is done in
    the following patch.

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Now that we have a separate alien_cache structure, it is better to hold
    the alien_cache lock while manipulating the alien_cache. After that, the
    lock on array_cache is no longer needed, so remove it.

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Currently, we use array_cache for the alien cache. Although the two are
    mostly similar, there is one difference: the need for a spinlock. We
    don't need a spinlock for array_cache itself, but as long as array_cache
    is reused for the alien cache, the array_cache structure has to carry
    one. That is needless overhead, so removing it would be better. This
    patch prepares for that by introducing an alien_cache structure and using
    it; the spinlock is removed from array_cache in the following patch.
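
    The new wrapper is small; something like this (matching the shape of the
    structure added to mm/slab.c):

        struct alien_cache {
                spinlock_t lock;        /* protects the embedded array_cache */
                struct array_cache ac;  /* the lock-free per-node cache      */
        };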

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Factor out the initialization of the array cache so it can be used in the
    following patch.

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • In free_block(), if freeing an object produces a newly free slab and the
    number of free_objects exceeds free_limit, we start destroying this new
    free slab while holding the kmem_cache node lock. Holding the lock there
    is unnecessary and, generally, holding a lock for as short a time as
    possible is a good thing. I never measured the performance effect of
    this, but it is better not to hold the lock longer than needed.

    Commented by Christoph:
    This is also good because kmem_cache_free is no longer called while
    holding the node lock. So we avoid one case of recursion.
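
    A sketch of the resulting call pattern (free_block() now collects empty
    slabs on a caller-provided list and slabs_destroy() runs after the node
    lock is dropped; argument lists are abbreviated):

        LIST_HEAD(list);

        spin_lock(&n->list_lock);
        free_block(cachep, ac->entry, ac->avail, node, &list);
        spin_unlock(&n->list_lock);

        /* tear down the collected empty slabs without the node lock held */
        slabs_destroy(cachep, &list);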

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • The node isn't changed, so we don't need to retrieve this structure every
    time we move an object. The compiler may already do this optimization,
    but making it explicit is better.

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • This patchset does some cleanup and removes the lockdep annotation.

    Patches 1~2 are just really minor improvements.
    Patches 3~9 are for clean-up and for removing the lockdep annotation.

    There are two cases where lockdep annotation is needed in SLAB:
    1) holding two node locks
    2) holding two array cache (alien cache) locks

    I looked at the code and found that we can avoid these cases without any
    negative effect.

    1) occurs when freeing an object produces a new free slab and we decide
    to destroy it. Although we don't need to hold the lock while destroying
    a slab, the current code does. Destroying a slab without holding the
    lock helps reduce lock contention. To do this, I change the
    implementation so that a new free slab is destroyed after releasing the
    lock.

    2) occurs in a similar situation. When we free an object from a
    non-local node, we put the object into the alien cache while holding the
    alien cache lock. If the alien cache is full, we flush it to the proper
    node cache, and at this point a new free slab can be produced.
    Destroying it may mean freeing a metadata object that comes from yet
    another node, and freeing that object requires the other node's alien
    cache lock. This forces us to hold two alien cache locks, which needs
    lockdep annotation even though the locks are always different and
    deadlock is impossible. To prevent this situation, I use the same
    approach as in 1).

    In this way, we can avoid cases 1) and 2) and then remove the lockdep
    annotation. As the short stat shows, this makes the SLAB code much
    simpler.

    This patch (of 9):

    slab_should_failslab() is called on every allocation, so it is reasonable
    to optimize it. We normally don't allocate from kmem_cache itself; that
    only happens when a new kmem_cache is created, which is a very rare case.
    Therefore, add the unlikely macro to help the compiler optimize.
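
    The change amounts to roughly this (a sketch of the helper in mm/slab.c;
    the should_failslab() argument list is as of that era):

        static bool slab_should_failslab(struct kmem_cache *cachep, gfp_t flags)
        {
                /* allocating the kmem_cache cache itself is a very rare event */
                if (unlikely(cachep == kmem_cache))
                        return false;

                return should_failslab(cachep->object_size, flags, cachep->flags);
        }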

    Signed-off-by: Joonsoo Kim
    Acked-by: David Rientjes
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Use the two helper functions to simplify the code, avoiding numerous
    explicit checks for whether a certain node is online.

    Get rid of various repeated lookups of kmem_cache_node structures.
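
    Assuming the helpers in question are get_node() and
    for_each_kmem_cache_node() from mm/slab.h, usage looks roughly like:

        struct kmem_cache_node *n;
        int node;
        unsigned long free_objects = 0;

        /* old style: n = cachep->node[node], with explicit online checks */
        n = get_node(cachep, numa_mem_id());

        /* iterate only over nodes that actually have a kmem_cache_node */
        for_each_kmem_cache_node(cachep, node, n)
                free_objects += n->free_objects;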

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Christoph Lameter
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • init_lock_keys is only called by __init kmem_cache_init_late

    Signed-off-by: Fabian Frederick
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     

24 Jun, 2014

1 commit

  • Commit b1cb0982bdd6 ("change the management method of free objects of
    the slab") introduced a bug in the slab leak detector
    ('/proc/slab_allocators'). The detector works as follows:

    1. traverse all objects on all the slabs.
    2. determine whether each one is active or not.
    3. if active, print who allocated this object.

    But that commit changed how free objects are managed, so the logic that
    determines whether an object is active also changed. Before, we regarded
    objects in cpu caches as inactive; with that commit, we mistakenly
    regard objects in cpu caches as active.

    This introduces a kernel oops if DEBUG_PAGEALLOC is enabled. With
    DEBUG_PAGEALLOC, kernel_map_pages() is used to detect who corrupts free
    memory in the slab: it removes the page table mapping when an object is
    free and restores it when the object is active. When the slab leak
    detector checks an object in a cpu cache, it mistakenly thinks the
    object is active and tries to access the object's memory to retrieve the
    caller of the allocation. At that point no page table mapping for the
    object exists, so an oops occurs.

    The following oops was reported by Dave:

    It blew up when something tried to read /proc/slab_allocators
    (Just cat it, and you should see the oops below)

    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Modules linked in:
    [snip...]
    CPU: 1 PID: 9386 Comm: trinity-c33 Not tainted 3.14.0-rc5+ #131
    task: ffff8801aa46e890 ti: ffff880076924000 task.ti: ffff880076924000
    RIP: 0010:[] [] handle_slab+0x8a/0x180
    RSP: 0018:ffff880076925de0 EFLAGS: 00010002
    RAX: 0000000000001000 RBX: 0000000000000000 RCX: 000000005ce85ce7
    RDX: ffffea00079be100 RSI: 0000000000001000 RDI: ffff880107458000
    RBP: ffff880076925e18 R08: 0000000000000001 R09: 0000000000000000
    R10: 0000000000000000 R11: 000000000000000f R12: ffff8801e6f84000
    R13: ffffea00079be100 R14: ffff880107458000 R15: ffff88022bb8d2c0
    FS: 00007fb769e45740(0000) GS:ffff88024d040000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffff8801e6f84ff8 CR3: 00000000a22db000 CR4: 00000000001407e0
    DR0: 0000000002695000 DR1: 0000000002695000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000070602
    Call Trace:
    leaks_show+0xce/0x240
    seq_read+0x28e/0x490
    proc_reg_read+0x3d/0x80
    vfs_read+0x9b/0x160
    SyS_read+0x58/0xb0
    tracesys+0xd4/0xd9
    Code: f5 00 00 00 0f 1f 44 00 00 48 63 c8 44 3b 0c 8a 0f 84 e3 00 00 00 83 c0 01 44 39 c0 72 eb 41 f6 47 1a 01 0f 84 e9 00 00 00 89 f0 8b 4c 04 f8 4d 85 c9 0f 84 88 00 00 00 49 8b 7e 08 4d 8d 46
    RIP handle_slab+0x8a/0x180

    To fix the problem, I introduce an object status buffer on each slab.
    With this, we can track object status precisely, so the slab leak
    detector no longer touches objects that are actually free and no kernel
    oops occurs. The memory overhead of this fix applies only to
    CONFIG_DEBUG_SLAB_LEAK, which is mainly used for debugging, so it isn't
    a big problem.
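
    The status buffer sits right after the freelist inside the slab's
    management area; a hedged sketch of the accessor (close to, but not
    necessarily identical with, the mm/slab.c code under
    CONFIG_DEBUG_SLAB_LEAK):

        #define OBJECT_FREE     (0)
        #define OBJECT_ACTIVE   (1)

        static void set_obj_status(struct page *page, int idx, int val)
        {
                struct kmem_cache *cachep = page->slab_cache;
                int freelist_size = cachep->num * sizeof(freelist_idx_t);
                char *status = (char *)page->freelist + freelist_size;

                status[idx] = val;      /* one status byte per object */
        }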

    Signed-off-by: Joonsoo Kim
    Reported-by: Dave Jones
    Reported-by: Tetsuo Handa
    Reviewed-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

05 Jun, 2014

4 commits

  • Currently we have two pairs of kmemcg-related functions that are called
    on slab alloc/free. The first is memcg_{bind,release}_pages, which counts
    the total number of pages allocated for a kmem cache. The second is
    memcg_{un}charge_slab, which {un}charges slab pages against the kmemcg
    resource counter. Let's just merge them to keep the code clean.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • When we create a sl[au]b cache, we allocate kmem_cache_node structures
    for each online NUMA node. To handle nodes going online/offline, we
    register a memory hotplug notifier and, for each kmem cache,
    allocate/free the kmem_cache_node corresponding to the node that changes
    state.

    To synchronize the two paths we hold the slab_mutex both during the
    cache creation/destruction path and while tuning the per-node parts of
    kmem caches in the memory hotplug handler. But that's not quite right,
    because it does not guarantee that a newly created cache will have all
    of its kmem_cache_nodes initialized if it races with memory hotplug.
    For instance, in the case of slub:

         CPU0                           CPU1
         ----                           ----
    kmem_cache_create:                  online_pages:
     __kmem_cache_create:                slab_memory_callback:
                                          slab_mem_going_online_callback:
                                           lock slab_mutex
                                           for each slab_caches list entry
                                               allocate kmem_cache node
                                           unlock slab_mutex
      lock slab_mutex
      init_kmem_cache_nodes:
       for_each_node_state(node, N_NORMAL_MEMORY)
        allocate kmem_cache node
      add kmem_cache to slab_caches list
      unlock slab_mutex
                                        online_pages (continued):
                                         node_states_set_node

    As a result we'll get a kmem cache with not all kmem_cache_nodes
    allocated.

    To avoid issues like that we should hold get/put_online_mems() during
    the whole kmem cache creation/destruction/shrink paths, just like we
    deal with cpu hotplug. This patch does the trick.

    Note that after it is applied, there is no need to take the slab_mutex
    for kmem_cache_shrink any more, so it is removed from there.
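
    A sketch of the resulting pattern at cache creation (mirroring how the
    cpu hotplug case is already handled; the mems helpers are
    get_online_mems()/put_online_mems()):

        get_online_cpus();
        get_online_mems();

        mutex_lock(&slab_mutex);
        /* create (or destroy/shrink) the cache while the set of online
         * memory nodes cannot change under us */
        mutex_unlock(&slab_mutex);

        put_online_mems();
        put_online_cpus();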

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tang Chen
    Cc: Zhang Yanfei
    Cc: Toshi Kani
    Cc: Xishi Qiu
    Cc: Jiang Liu
    Cc: Rafael J. Wysocki
    Cc: David Rientjes
    Cc: Wen Congyang
    Cc: Yasuaki Ishimatsu
    Cc: Lai Jiangshan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • We have only a few places where we actually want to charge kmem, so
    instead of intruding into the general page allocation path with
    __GFP_KMEMCG it's better to explicitly charge kmem there. All kmem
    charges will be easier to follow that way.

    This is a step towards removing __GFP_KMEMCG. It removes __GFP_KMEMCG
    from memcg caches' allocflags. Instead it makes slab allocation path
    call memcg_charge_kmem directly getting memcg to charge from the cache's
    memcg params.

    This also eliminates any possibility of misaccounting an allocation
    going from one memcg's cache to another memcg, because now we always
    charge slabs against the memcg the cache belongs to. That's why this
    patch removes the big comment to memcg_kmem_get_cache.
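
    A rough sketch of the idea only (the helper name here is illustrative,
    not the exact one added by the patch): at slab page allocation time the
    cache charges its own memcg, taken from its memcg params, instead of
    relying on a __GFP_KMEMCG-tagged gfp mask:

        static int charge_slab_page(struct kmem_cache *s, gfp_t gfp, int order)
        {
                if (is_root_cache(s))   /* root caches are not accounted */
                        return 0;

                return memcg_charge_kmem(s->memcg_params->memcg, gfp,
                                         PAGE_SIZE << order);
        }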

    Signed-off-by: Vladimir Davydov
    Acked-by: Greg Thelen
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Glauber Costa
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • When the slab or slub allocators cannot allocate additional slab pages,
    they emit diagnostic information to the kernel log such as current
    number of slabs, number of objects, active objects, etc. This is always
    coupled with a page allocation failure warning since it is controlled by
    !__GFP_NOWARN.

    Suppress this out of memory warning if the allocator is configured
    without debug support. The page allocation failure warning will still
    indicate that it is a failed slab allocation, along with the order and
    the gfp mask, so the extra output is only useful for diagnosing
    allocator issues.

    Since CONFIG_SLUB_DEBUG is already enabled by default for the slub
    allocator, there is no functional change with this patch. If debug is
    disabled, however, the warnings are now suppressed.

    Signed-off-by: David Rientjes
    Cc: Pekka Enberg
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

06 May, 2014

2 commits

  • If freelist_idx_t is a byte, SLAB_OBJ_MAX_NUM should be 255 not 256, and
    likewise if freelist_idx_t is a short, then it should be 65535 not
    65536.

    This was leading to all kinds of random crashes on sparc64 where
    PAGE_SIZE is 8192. One problem shown was that if spinlock debugging was
    enabled, we'd get deadlocks in copy_pte_range() or do_wp_page() with the
    same cpu already holding a lock it shouldn't hold, or the lock belonging
    to a completely unrelated process.
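
    In code terms, the fix amounts to the off-by-one in the limit macro
    (sketch; the surrounding typedef selection in mm/slab.c is elided):

        typedef unsigned char freelist_idx_t;   /* or unsigned short on big pages */

        /* the largest representable index, not the number of representable values */
        #define SLAB_OBJ_MAX_NUM \
                ((1 << sizeof(freelist_idx_t) * BITS_PER_BYTE) - 1)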

    Fixes: a41adfaa23df ("slab: introduce byte sized index for the freelist of a slab")
    Signed-off-by: David S. Miller
    Signed-off-by: Linus Torvalds

    David Miller
     
  • Commit a41adfaa23df ("slab: introduce byte sized index for the freelist
    of a slab") changed the size of the freelist index and also changed the
    prototypes of the freelist-index accessor functions, and it contained a
    mistake.

    The mistake is that, although it changed the size of the freelist index
    itself correctly, it changed the type of the position used to index into
    the freelist incorrectly. With that patch, a freelist index can be 1 or
    2 bytes, which means the number of objects on a slab can exceed 255, so
    more than 1 byte is needed for the position used to find a free object
    on the freelist. But the patch made that position a 1-byte type, so a
    slab with more than 255 objects cannot work properly and, as a
    consequence, the system cannot boot.

    This issue was reported by Steven King on m68knommu, which would use a
    2-byte freelist index:

    https://lkml.org/lkml/2014/4/16/433

    The fix is easy: changing the type of the index parameter in the
    accessor functions is enough. Although 2 bytes would suffice, I use 4
    bytes since it has no bad effect and keeps things simpler. This fix was
    suggested and tested by Steven in his original report.
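
    The corrected accessors look roughly like this (the position parameter is
    a plain unsigned int, while the stored value keeps the narrow
    freelist_idx_t type):

        static inline freelist_idx_t get_free_obj(struct page *page,
                                                  unsigned int idx)
        {
                return ((freelist_idx_t *)page->freelist)[idx];
        }

        static inline void set_free_obj(struct page *page,
                                        unsigned int idx, freelist_idx_t val)
        {
                ((freelist_idx_t *)page->freelist)[idx] = val;
        }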

    Signed-off-by: Joonsoo Kim
    Reported-and-acked-by: Steven King
    Acked-by: Christoph Lameter
    Tested-by: James Hogan
    Tested-by: David Miller
    Cc: Pekka Enberg
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

14 Apr, 2014

1 commit

  • Pull slab changes from Pekka Enberg:
    "The biggest change is byte-sized freelist indices which reduces slab
    freelist memory usage:

    https://lkml.org/lkml/2013/12/2/64"

    * 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
    mm: slab/slub: use page->list consistently instead of page->lru
    mm/slab.c: cleanup outdated comments and unify variables naming
    slab: fix wrongly used macro
    slub: fix high order page allocation problem with __GFP_NOFAIL
    slab: Make allocations with GFP_ZERO slightly more efficient
    slab: make more slab management structure off the slab
    slab: introduce byte sized index for the freelist of a slab
    slab: restrict the number of objects in a slab
    slab: introduce helper functions to get/set free object
    slab: factor out calculate nr objects in cache_estimate

    Linus Torvalds
     

11 Apr, 2014

1 commit

  • 'struct page' has two list_head fields: 'lru' and 'list'. Conveniently,
    they are unioned together. This means that code can use them
    interchangeably, which gets horribly confusing, as with this nugget from
    slab.c:

    > list_del(&page->lru);
    > if (page->active == cachep->num)
    > list_add(&page->list, &n->slabs_full);

    This patch makes the slab and slub code use page->lru universally instead
    of mixing ->list and ->lru.

    So, the new rule is: page->lru is what you use if you want to keep your
    page on a list. Don't like the fact that it's not called ->list? Too
    bad.
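
    For comparison, the same snippet after the conversion (only the list_head
    name changes):

        list_del(&page->lru);
        if (page->active == cachep->num)
                list_add(&page->lru, &n->slabs_full);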

    Signed-off-by: Dave Hansen
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Pekka Enberg

    Dave Hansen
     

08 Apr, 2014

2 commits

  • PF_MEMPOLICY is an unnecessary optimization for CONFIG_SLAB users.
    There's no significant performance degradation to checking
    current->mempolicy rather than current->flags & PF_MEMPOLICY in the
    allocation path, especially since this is considered unlikely().

    Running TCP_RR with netperf-2.4.5 through localhost on 16 cpu machine with
    64GB of memory and without a mempolicy:

    threads before after
    16 1249409 1244487
    32 1281786 1246783
    48 1239175 1239138
    64 1244642 1241841
    80 1244346 1248918
    96 1266436 1254316
    112 1307398 1312135
    128 1327607 1326502

    Per-process flags are a scarce resource so we should free them up whenever
    possible and make them available. We'll be using it shortly for memcg oom
    reserves.
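
    The check in the SLAB allocation path then becomes, roughly (a hedged
    sketch of the condition change; the surrounding code is simplified):

        /* was: if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY))) */
        if (current->mempolicy || unlikely(current->flags & PF_SPREAD_SLAB))
                objp = alternate_node_alloc(cachep, flags);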

    Signed-off-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Jianguo Wu
    Cc: Tim Hockin
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • slab_node() is actually a mempolicy function, so rename it to
    mempolicy_slab_node() to make it clearer that it is used for processes
    with mempolicies.

    At the same time, clean up its code by saving numa_mem_id() in a local
    variable (since we require a node with memory, not just any node) and
    remove an obsolete comment that assumes the mempolicy is actually passed
    into the function.

    Signed-off-by: David Rientjes
    Acked-by: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Jianguo Wu
    Cc: Tim Hockin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

04 Apr, 2014

1 commit

  • Since put_mems_allowed() is strictly optional (it is a seqcount retry),
    we don't need to evaluate the function if the allocation was in fact
    successful, saving an smp_rmb(), some loads, and comparisons on some
    relatively hot fast paths.

    Since the naming get/put_mems_allowed() does suggest a mandatory
    pairing, rename the interface, as suggested by Mel, to resemble the
    seqcount interface.

    This gives us: read_mems_allowed_begin() and read_mems_allowed_retry(),
    where it is important to note that the return value of the latter call
    is inverted from its previous incarnation.
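
    Typical usage after the rename looks roughly like this (a sketch of the
    retry-loop pattern in an allocation slow path):

        struct page *page;
        unsigned int cpuset_mems_cookie;

        do {
                cpuset_mems_cookie = read_mems_allowed_begin();
                page = alloc_pages(gfp, order);
                /* only re-check the cookie if the allocation failed */
        } while (!page && read_mems_allowed_retry(cpuset_mems_cookie));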

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

01 Apr, 2014

1 commit


08 Feb, 2014

2 commits

  • Use the likely() mechanism already present around the valid-pointer tests
    to better choose when to memset allocations to 0 for __GFP_ZERO.
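
    The resulting shape is roughly (a sketch of the allocation tail in
    mm/slab.c; the __GFP_ZERO memset moves under the existing likely(objp)
    test instead of re-checking the pointer):

        if (likely(objp)) {
                kmemcheck_slab_alloc(cachep, flags, objp, cachep->object_size);
                if (unlikely(flags & __GFP_ZERO))
                        memset(objp, 0, cachep->object_size);
        }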

    Acked-by: Christoph Lameter
    Signed-off-by: Joe Perches
    Signed-off-by: Pekka Enberg

    Joe Perches
     
  • Now that the size of the freelist used for slab management has shrunk,
    the on-slab management structure can waste a lot of space when the
    slab's objects are large.

    Consider a cache with 128 byte objects (on a 4 KB page). If on-slab
    management is used, 31 objects fit in the slab. The freelist in this
    case takes 31 bytes, so 97 bytes, that is, more than 75% of an object's
    size, are wasted.

    With 64 byte objects, no space is wasted if we use on-slab management.
    So set the off-slab determining constraint to 128 bytes.

    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Pekka Enberg

    Joonsoo Kim