04 Oct, 2018

1 commit

  • commit e5d9998f3e09359b372a037a6ac55ba235d95d57 upstream.

    /*
     * cpu_partial determined the maximum number of objects
     * kept in the per cpu partial lists of a processor.
     */

    Can't be negative.
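
    For illustration, the change amounts to flipping the signedness of the
    field in struct kmem_cache (layout reproduced from memory, so treat this
    as a sketch rather than the exact hunk):

    /* include/linux/slub_def.h */
    struct kmem_cache {
        /* ... */
    #ifdef CONFIG_SLUB_CPU_PARTIAL
        /* Number of per cpu partial objects to keep around */
        unsigned int cpu_partial;   /* was: int cpu_partial; */
    #endif
        /* ... */
    };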

    Link: http://lkml.kernel.org/r/20180305200730.15812-15-adobriyan@gmail.com
    Signed-off-by: Alexey Dobriyan
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: zhong jiang
    Signed-off-by: Greg Kroah-Hartman

    Alexey Dobriyan
     

03 Aug, 2018

1 commit

  • [ Upstream commit a38965bf941b7c2af50de09c96bc5f03e136caef ]

    __printf is useful to verify format and arguments. Remove the following
    warning (with W=1):

    mm/slub.c:721:2: warning: function might be possible candidate for `gnu_printf' format attribute [-Wsuggest-attribute=format]
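
    __printf(f, a) is the kernel's shorthand for
    __attribute__((format(printf, f, a))), so annotating the variadic helper
    lets GCC cross-check the format string against its arguments. A minimal
    sketch of the annotated declaration (parameter indices assumed from
    slab_err()'s signature, where the format string is the third parameter):

    static __printf(3, 4) void slab_err(struct kmem_cache *s, struct page *page,
                                        const char *fmt, ...);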

    Link: http://lkml.kernel.org/r/20180505200706.19986-1-malat@debian.org
    Signed-off-by: Mathieu Malaterre
    Reviewed-by: Andrew Morton
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Mathieu Malaterre
     

03 Jul, 2018

1 commit

  • commit d50d82faa0c964e31f7a946ba8aba7c715ca7ab0 upstream.

    In kernel 4.17 I removed some code from dm-bufio that did slab cache
    merging (commit 21bb13276768: "dm bufio: remove code that merges slab
    caches") - both slab and slub support merging caches with identical
    attributes, so dm-bufio now just calls kmem_cache_create and relies on
    implicit merging.

    This uncovered a bug in the slub subsystem - if we delete a cache and
    immediately create another cache with the same attributes, it fails
    because of duplicate filename in /sys/kernel/slab/. The slub subsystem
    offloads freeing the cache to a workqueue - and if we create the new
    cache before the workqueue runs, it complains because of duplicate
    filename in sysfs.

    This patch fixes the bug by moving the call of kobject_del from
    sysfs_slab_remove_workfn to shutdown_cache. kobject_del must be called
    while we hold slab_mutex - so that the sysfs entry is deleted before a
    cache with the same attributes could be created.
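
    Roughly, the fix splits "delete the sysfs entry" from "drop the last
    kobject reference": a small helper performs kobject_del() synchronously
    from shutdown_cache() while slab_mutex is held, and the workqueue keeps
    only the final kobject_put(). A sketch of the helper, with the name
    reproduced from memory:

    /* mm/slub.c -- called from shutdown_cache() with slab_mutex held */
    void sysfs_slab_unlink(struct kmem_cache *s)
    {
        if (slab_state >= FULL)
            kobject_del(&s->kobj);
    }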

    Running device-mapper-test-suite with:

    dmtest run --suite thin-provisioning -n /commit_failure_causes_fallback/

    triggered:

    Buffer I/O error on dev dm-0, logical block 1572848, async page read
    device-mapper: thin: 253:1: metadata operation 'dm_pool_alloc_data_block' failed: error = -5
    device-mapper: thin: 253:1: aborting current metadata transaction
    sysfs: cannot create duplicate filename '/kernel/slab/:a-0000144'
    CPU: 2 PID: 1037 Comm: kworker/u48:1 Not tainted 4.17.0.snitm+ #25
    Hardware name: Supermicro SYS-1029P-WTR/X11DDW-L, BIOS 2.0a 12/06/2017
    Workqueue: dm-thin do_worker [dm_thin_pool]
    Call Trace:
    dump_stack+0x5a/0x73
    sysfs_warn_dup+0x58/0x70
    sysfs_create_dir_ns+0x77/0x80
    kobject_add_internal+0xba/0x2e0
    kobject_init_and_add+0x70/0xb0
    sysfs_slab_add+0xb1/0x250
    __kmem_cache_create+0x116/0x150
    create_cache+0xd9/0x1f0
    kmem_cache_create_usercopy+0x1c1/0x250
    kmem_cache_create+0x18/0x20
    dm_bufio_client_create+0x1ae/0x410 [dm_bufio]
    dm_block_manager_create+0x5e/0x90 [dm_persistent_data]
    __create_persistent_data_objects+0x38/0x940 [dm_thin_pool]
    dm_pool_abort_metadata+0x64/0x90 [dm_thin_pool]
    metadata_operation_failed+0x59/0x100 [dm_thin_pool]
    alloc_data_block.isra.53+0x86/0x180 [dm_thin_pool]
    process_cell+0x2a3/0x550 [dm_thin_pool]
    do_worker+0x28d/0x8f0 [dm_thin_pool]
    process_one_work+0x171/0x370
    worker_thread+0x49/0x3f0
    kthread+0xf8/0x130
    ret_from_fork+0x35/0x40
    kobject_add_internal failed for :a-0000144 with -EEXIST, don't try to register things with the same name in the same directory.
    kmem_cache_create(dm_bufio_buffer-16) failed with error -17

    Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1806151817130.6333@file01.intranet.prod.int.rdu2.redhat.com
    Signed-off-by: Mikulas Patocka
    Reported-by: Mike Snitzer
    Tested-by: Mike Snitzer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     

22 Feb, 2018

4 commits

  • commit 4675ff05de2d76d167336b368bd07f3fef6ed5a6 upstream.

    Fix up makefiles, remove references, and git rm kmemcheck.

    Link: http://lkml.kernel.org/r/20171007030159.22241-4-alexander.levin@verizon.com
    Signed-off-by: Sasha Levin
    Cc: Steven Rostedt
    Cc: Vegard Nossum
    Cc: Pekka Enberg
    Cc: Michal Hocko
    Cc: Eric W. Biederman
    Cc: Alexander Potapenko
    Cc: Tim Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Levin, Alexander (Sasha Levin)
     
  • commit d8be75663cec0069b85f80191abd2682ce4a512f upstream.

    Now that kmemcheck is gone, we don't need the NOTRACK flags.

    Link: http://lkml.kernel.org/r/20171007030159.22241-5-alexander.levin@verizon.com
    Signed-off-by: Sasha Levin
    Cc: Alexander Potapenko
    Cc: Eric W. Biederman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Steven Rostedt
    Cc: Tim Hansen
    Cc: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Levin, Alexander (Sasha Levin)
     
  • commit 75f296d93bcebcfe375884ddac79e30263a31766 upstream.

    Convert all allocations that used a NOTRACK flag to stop using it.

    Link: http://lkml.kernel.org/r/20171007030159.22241-3-alexander.levin@verizon.com
    Signed-off-by: Sasha Levin
    Cc: Alexander Potapenko
    Cc: Eric W. Biederman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Steven Rostedt
    Cc: Tim Hansen
    Cc: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Levin, Alexander (Sasha Levin)
     
  • commit 4950276672fce5c241857540f8561c440663673d upstream.

    Patch series "kmemcheck: kill kmemcheck", v2.

    As discussed at LSF/MM, kill kmemcheck.

    KASan is a replacement that is able to work without the limitation of
    kmemcheck (single CPU, slow). KASan is already upstream.

    We are also not aware of any users of kmemcheck (or users who don't
    consider KASan as a suitable replacement).

    The only objection was that since KASAN wasn't supported by all GCC
    versions provided by distros at that time we should hold off for 2
    years, and try again.

    Now that 2 years have passed, and all distros provide gcc that supports
    KASAN, kill kmemcheck again for the very same reasons.

    This patch (of 4):

    Remove kmemcheck annotations, and calls to kmemcheck from the kernel.

    [alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs]
    Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com
    Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com
    Signed-off-by: Sasha Levin
    Cc: Alexander Potapenko
    Cc: Eric W. Biederman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Steven Rostedt
    Cc: Tim Hansen
    Cc: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Levin, Alexander (Sasha Levin)
     

14 Dec, 2017

1 commit

  • [ Upstream commit 11066386efa692f77171484c32ea30f6e5a0d729 ]

    When slub_debug=O is set, it is possible to clear debug flags for an
    "unmergeable" slab cache in kmem_cache_open(). That makes the "unmergeable"
    cache become "mergeable" in sysfs_slab_add().

    These caches will generate their "unique IDs" by create_unique_id(), but
    it is possible to create identical unique IDs. In my experiment,
    sgpool-128, names_cache, biovec-256 generate the same ID ":Ft-0004096" and
    the kernel reports "sysfs: cannot create duplicate filename
    '/kernel/slab/:Ft-0004096'".

    To repeat my experiment, set disable_higher_order_debug=1,
    CONFIG_SLUB_DEBUG_ON=y in kernel-4.14.

    Fix this issue by setting unmergeable=1 if slub_debug=O and the
    default slub_debug contains any no-merge flags.

    call path:
    kmem_cache_create()
    __kmem_cache_alias() -> we set SLAB_NEVER_MERGE flags here
    create_cache()
    __kmem_cache_create()
    kmem_cache_open() -> clear DEBUG_METADATA_FLAGS
    sysfs_slab_add() -> the slab cache is mergeable now
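
    A sketch of the check described above, as it would sit near the top of
    sysfs_slab_add() (condition reproduced from memory, so treat it as
    illustrative rather than the exact hunk):

    if (!unmergeable && disable_higher_order_debug &&
            (slub_debug & DEBUG_METADATA_FLAGS))
        unmergeable = 1;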

    sysfs: cannot create duplicate filename '/kernel/slab/:Ft-0004096'
    ------------[ cut here ]------------
    WARNING: CPU: 0 PID: 1 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x60/0x7c
    Modules linked in:
    CPU: 0 PID: 1 Comm: swapper/0 Tainted: G W 4.14.0-rc7ajb-00131-gd4c2e9f-dirty #123
    Hardware name: linux,dummy-virt (DT)
    task: ffffffc07d4e0080 task.stack: ffffff8008008000
    PC is at sysfs_warn_dup+0x60/0x7c
    LR is at sysfs_warn_dup+0x60/0x7c
    pc : lr : pstate: 60000145
    Call trace:
    sysfs_warn_dup+0x60/0x7c
    sysfs_create_dir_ns+0x98/0xa0
    kobject_add_internal+0xa0/0x294
    kobject_init_and_add+0x90/0xb4
    sysfs_slab_add+0x90/0x200
    __kmem_cache_create+0x26c/0x438
    kmem_cache_create+0x164/0x1f4
    sg_pool_init+0x60/0x100
    do_one_initcall+0x38/0x12c
    kernel_init_freeable+0x138/0x1d4
    kernel_init+0x10/0xfc
    ret_from_fork+0x10/0x18

    Link: http://lkml.kernel.org/r/1510365805-5155-1-git-send-email-miles.chen@mediatek.com
    Signed-off-by: Miles Chen
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Miles Chen
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

14 Sep, 2017

1 commit

  • GFP_TEMPORARY was introduced by commit e12ba74d8ff3 ("Group short-lived
    and reclaimable kernel allocations") along with __GFP_RECLAIMABLE. It's
    primary motivation was to allow users to tell that an allocation is
    short lived and so the allocator can try to place such allocations close
    together and prevent long term fragmentation. As much as this sounds
    like a reasonable semantic it becomes much less clear when to use the
    high-level GFP_TEMPORARY allocation flag. How long is temporary? Can the
    context holding that memory sleep? Can it take locks? It seems there is
    no good answer for those questions.

    The current implementation of GFP_TEMPORARY is basically GFP_KERNEL |
    __GFP_RECLAIMABLE which in itself is tricky because basically none of
    the existing callers provide a way to reclaim the allocated memory. So
    this is rather misleading and hard to evaluate for any benefits.

    I have checked some random users and none of them has added the flag
    with a specific justification. I suspect most of them just copied from
    other existing users and others just thought it might be a good idea to
    use without any measuring. This suggests that GFP_TEMPORARY just
    motivates for cargo cult usage without any reasoning.

    I believe that our gfp flags are quite complex already and especially
    those with high-level semantics should be clearly defined to prevent
    confusion and abuse. Therefore I propose dropping GFP_TEMPORARY and
    converting all existing users to simply use GFP_KERNEL. Please note that
    SLAB users with shrinkers will still get __GFP_RECLAIMABLE heuristic and
    so they will be placed properly for memory fragmentation prevention.

    I can see reasons we might want some gfp flag to reflect short-term
    allocations but I propose starting from a clear semantic definition and
    only then add users with proper justification.

    This was brought up before LSF this year by Matthew [1] and it
    turned out that GFP_TEMPORARY really doesn't have a clear semantic. It
    seems to be a heuristic without any measured advantage for most (if not
    all) its current users. The follow up discussion has revealed that
    opinions on what might be temporary allocation differ a lot between
    developers. So rather than trying to tweak existing users into a
    semantic which they haven't expected I propose to simply remove the flag
    and start from scratch if we really need a semantic for short term
    allocations.

    [1] http://lkml.kernel.org/r/20170118054945.GD18349@bombadil.infradead.org

    [akpm@linux-foundation.org: fix typo]
    [akpm@linux-foundation.org: coding-style fixes]
    [sfr@canb.auug.org.au: drm/i915: fix up]
    Link: http://lkml.kernel.org/r/20170816144703.378d4f4d@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20170728091904.14627-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Stephen Rothwell
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Matthew Wilcox
    Cc: Neil Brown
    Cc: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

09 Sep, 2017

1 commit

  • First, number of CPUs can't be negative number.

    Second, different signedness leads to suboptimal code in the following
    cases:

    1)
    kmalloc(nr_cpu_ids * sizeof(X));

    "int" has to be sign extended to size_t.

    2)
    while (loff_t *pos < nr_cpu_ids)

    MOVSXD is 1 byte longer than the same MOV.

    Other cases exist as well. Basically the compiler is told that nr_cpu_ids
    can't be negative, which can't be deduced if it is "int".
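
    The change itself is just a signedness flip on the declaration (header
    location assumed):

    /* include/linux/cpumask.h */
    extern unsigned int nr_cpu_ids;    /* was: extern int nr_cpu_ids; */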

    Code savings on allyesconfig kernel: -3KB

    add/remove: 0/0 grow/shrink: 25/264 up/down: 261/-3631 (-3370)
    function old new delta
    coretemp_cpu_online 450 512 +62
    rcu_init_one 1234 1272 +38
    pci_device_probe 374 399 +25

    ...

    pgdat_reclaimable_pages 628 556 -72
    select_fallback_rq 446 369 -77
    task_numa_find_cpu 1923 1807 -116

    Link: http://lkml.kernel.org/r/20170819114959.GA30580@avx2
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

07 Sep, 2017

4 commits

    attribute_group structures are not supposed to change at runtime. All
    functions working with attribute_group provided by <linux/sysfs.h> work
    with const attribute_group. So mark the non-const structs as const.
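
    For slub this is the attribute_group backing each cache's sysfs
    directory; a sketch of the constified definition (identifier names
    reproduced from memory):

    static struct attribute *slab_attrs[] = {
        /* &<name>_attr.attr entries ... */
        NULL,
    };

    static const struct attribute_group slab_attr_group = {
        .attrs = slab_attrs,
    };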

    Link: http://lkml.kernel.org/r/1501157186-3749-1-git-send-email-arvind.yadav.cs@gmail.com
    Signed-off-by: Arvind Yadav
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arvind Yadav
     
  • Add an assertion similar to "fasttop" check in GNU C Library allocator
    as a part of SLAB_FREELIST_HARDENED feature. An object added to a
    singly linked freelist should not point to itself. That helps to detect
    some double free errors (e.g. CVE-2017-2636) without slub_debug and
    KASAN.
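
    The assertion sits where a freelist pointer is written into a freed
    object; a sketch of set_freepointer() with the check (reproduced from
    memory):

    static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
    {
        unsigned long freeptr_addr = (unsigned long)object + s->offset;

    #ifdef CONFIG_SLAB_FREELIST_HARDENED
        BUG_ON(object == fp); /* naive detection of double free or corruption */
    #endif

        *(void **)freeptr_addr = freelist_ptr(s, fp, freeptr_addr);
    }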

    Link: http://lkml.kernel.org/r/1502468246-1262-1-git-send-email-alex.popov@linux.com
    Signed-off-by: Alexander Popov
    Acked-by: Christoph Lameter
    Cc: Kees Cook
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Paul E McKenney
    Cc: Ingo Molnar
    Cc: Tejun Heo
    Cc: Andy Lutomirski
    Cc: Nicolas Pitre
    Cc: Rik van Riel
    Cc: Tycho Andersen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Popov
     
  • This SLUB free list pointer obfuscation code is modified from Brad
    Spengler/PaX Team's code in the last public patch of grsecurity/PaX
    based on my understanding of the code. Changes or omissions from the
    original code are mine and don't reflect the original grsecurity/PaX
    code.

    This adds a per-cache random value to SLUB caches that is XORed with
    their freelist pointer address and value. This adds nearly zero
    overhead and frustrates the very common heap overflow exploitation
    method of overwriting freelist pointers.
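
    The obfuscation is one XOR applied whenever a freelist pointer is stored
    or fetched; a sketch of the helper (reproduced from memory; s->random is
    a per-cache value chosen at cache creation):

    #ifdef CONFIG_SLAB_FREELIST_HARDENED
    static inline void *freelist_ptr(const struct kmem_cache *s, void *ptr,
                                     unsigned long ptr_addr)
    {
        return (void *)((unsigned long)ptr ^ s->random ^ ptr_addr);
    }
    #else
    static inline void *freelist_ptr(const struct kmem_cache *s, void *ptr,
                                     unsigned long ptr_addr)
    {
        return ptr;
    }
    #endif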

    A recent example of the attack is written up here:

    http://cyseclabs.com/blog/cve-2016-6187-heap-off-by-one-exploit

    and there is a section dedicated to the technique in the book "A Guide to
    Kernel Exploitation: Attacking the Core".

    This is based on patches by Daniel Micay, and refactored to minimize the
    use of #ifdef.

    With 200-count cycles of "hackbench -g 20 -l 1000" I saw the following
    run times:

    before:
    mean 10.11882499999999999995
    variance .03320378329145728642
    stdev .18221905304181911048

    after:
    mean 10.12654000000000000014
    variance .04700556623115577889
    stdev .21680767106160192064

    The difference gets lost in the noise, but if the above is to be taken
    literally, using CONFIG_SLAB_FREELIST_HARDENED is 0.07% slower.

    Link: http://lkml.kernel.org/r/20170802180609.GA66807@beast
    Signed-off-by: Kees Cook
    Suggested-by: Daniel Micay
    Cc: Rik van Riel
    Cc: Tycho Andersen
    Cc: Alexander Popov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • - free_kmem_cache_nodes() frees the cache node before nulling out a
    reference to it

    - init_kmem_cache_nodes() publishes the cache node before initializing
    it

    Neither of these matter at runtime because the cache nodes cannot be
    looked up by any other thread. But it's neater and more consistent to
    reorder these.
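
    A sketch of the reordered free path (the init path gets the mirror-image
    change: initialize the node first, then publish it):

    for_each_kmem_cache_node(s, node, n) {
        s->node[node] = NULL;                   /* unpublish first ... */
        kmem_cache_free(kmem_cache_node, n);    /* ... then free */
    }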

    Link: http://lkml.kernel.org/r/20170707083408.40410-1-glider@google.com
    Signed-off-by: Alexander Potapenko
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     

19 Aug, 2017

1 commit

  • To avoid a possible deadlock, sysfs_slab_remove() schedules an
    asynchronous work to delete sysfs entries corresponding to the kmem
    cache. To ensure the cache isn't freed before the work function is
    called, it takes a reference to the cache kobject. The reference is
    supposed to be released by the work function.

    However, the work function (sysfs_slab_remove_workfn()) does nothing in
    case the cache sysfs entry has already been deleted, leaking the kobject
    and the corresponding cache.

    This may happen on a per memcg cache destruction, because sysfs entries
    of a per memcg cache are deleted on memcg offline if the cache is empty
    (see __kmemcg_cache_deactivate()).

    The kmemleak report looks like this:

    unreferenced object 0xffff9f798a79f540 (size 32):
    comm "kworker/1:4", pid 15416, jiffies 4307432429 (age 28687.554s)
    hex dump (first 32 bytes):
    6b 6d 61 6c 6c 6f 63 2d 31 36 28 31 35 39 39 3a kmalloc-16(1599:
    6e 65 77 72 6f 6f 74 29 00 23 6b c0 ff ff ff ff newroot).#k.....
    backtrace:
    kmemleak_alloc+0x4a/0xa0
    __kmalloc_track_caller+0x148/0x2c0
    kvasprintf+0x66/0xd0
    kasprintf+0x49/0x70
    memcg_create_kmem_cache+0xe6/0x160
    memcg_kmem_cache_create_func+0x20/0x110
    process_one_work+0x205/0x5d0
    worker_thread+0x4e/0x3a0
    kthread+0x109/0x140
    ret_from_fork+0x2a/0x40
    unreferenced object 0xffff9f79b6136840 (size 416):
    comm "kworker/1:4", pid 15416, jiffies 4307432429 (age 28687.573s)
    hex dump (first 32 bytes):
    40 fb 80 c2 3e 33 00 00 00 00 00 40 00 00 00 00 @...>3.....@....
    00 00 00 00 00 00 00 00 10 00 00 00 10 00 00 00 ................
    backtrace:
    kmemleak_alloc+0x4a/0xa0
    kmem_cache_alloc+0x128/0x280
    create_cache+0x3b/0x1e0
    memcg_create_kmem_cache+0x118/0x160
    memcg_kmem_cache_create_func+0x20/0x110
    process_one_work+0x205/0x5d0
    worker_thread+0x4e/0x3a0
    kthread+0x109/0x140
    ret_from_fork+0x2a/0x40

    Fix the leak by adding the missing call to kobject_put() to
    sysfs_slab_remove_workfn().
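
    Roughly, the early-return path in the work function now falls through to
    the reference drop instead of skipping it (sketch reproduced from
    memory):

    static void sysfs_slab_remove_workfn(struct work_struct *work)
    {
        struct kmem_cache *s =
            container_of(work, struct kmem_cache, kobj_remove_work);

        if (!s->kobj.state_in_sysfs)
            /* entry already deleted (e.g. on memcg offline); still drop
             * the reference taken by sysfs_slab_remove() */
            goto out;

        kobject_uevent(&s->kobj, KOBJ_REMOVE);
        kobject_del(&s->kobj);
    out:
        kobject_put(&s->kobj);
    }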

    Link: http://lkml.kernel.org/r/20170812181134.25027-1-vdavydov.dev@gmail.com
    Fixes: 3b7b314053d02 ("slub: make sysfs file removal asynchronous")
    Signed-off-by: Vladimir Davydov
    Reported-by: Andrei Vagin
    Tested-by: Andrei Vagin
    Acked-by: Tejun Heo
    Acked-by: David Rientjes
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc: [4.12.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

07 Jul, 2017

6 commits

  • Josef's redesign of the balancing between slab caches and the page cache
    requires slab cache statistics at the lruvec level.

    Link: http://lkml.kernel.org/r/20170530181724.27197-7-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Josef Bacik
    Cc: Michal Hocko
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Patch series "mm: per-lruvec slab stats"

    Josef is working on a new approach to balancing slab caches and the page
    cache. For this to work, he needs slab cache statistics on the lruvec
    level. These patches implement that by adding infrastructure that
    allows updating and reading generic VM stat items per lruvec, then
    switches some existing VM accounting sites, including the slab
    accounting ones, to this new cgroup-aware API.

    I'll follow up with more patches on this, because there is actually
    substantial simplification that can be done to the memory controller
    when we replace private memcg accounting with making the existing VM
    accounting sites cgroup-aware. But this is enough for Josef to base his
    slab reclaim work on, so here goes.

    This patch (of 5):

    To re-implement slab cache vs. page cache balancing, we'll need the
    slab counters at the lruvec level, which, ever since lru reclaim was
    moved from the zone to the node, is the intersection of the node, not
    the zone, and the memcg.

    We could retain the per-zone counters for when the page allocator dumps
    its memory information on failures, and have counters on both levels -
    which on all but NUMA node 0 is usually redundant. But let's keep it
    simple for now and just move them. If anybody complains we can restore
    the per-zone counters.

    [hannes@cmpxchg.org: fix oops]
    Link: http://lkml.kernel.org/r/20170605183511.GA8915@cmpxchg.org
    Link: http://lkml.kernel.org/r/20170530181724.27197-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Cc: Josef Bacik
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    kmem_cache->cpu_partial is only used when CONFIG_SLUB_CPU_PARTIAL is
    set, so wrapping it in CONFIG_SLUB_CPU_PARTIAL saves some space on
    32-bit arches.

    This patch wraps kmem_cache->cpu_partial in CONFIG_SLUB_CPU_PARTIAL and
    wraps its sysfs use too.
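
    Call sites outside the #ifdef go through small accessors so they compile
    either way; a sketch of the helpers (macro names as added to
    include/linux/slub_def.h, bodies simplified):

    #ifdef CONFIG_SLUB_CPU_PARTIAL
    #define slub_cpu_partial(s)             ((s)->cpu_partial)
    #define slub_set_cpu_partial(s, n)      (slub_cpu_partial(s) = (n))
    #else
    #define slub_cpu_partial(s)             (0)
    #define slub_set_cpu_partial(s, n)
    #endif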

    Link: http://lkml.kernel.org/r/20170502144533.10729-4-richard.weiyang@gmail.com
    Signed-off-by: Wei Yang
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
    cpu_slab's partial field is only used when CONFIG_SLUB_CPU_PARTIAL is
    set, which means we can otherwise save a pointer's worth of space in
    each per-cpu structure.

    This patch wraps cpu_slab->partial in CONFIG_SLUB_CPU_PARTIAL and wraps
    its sysfs use too.

    [akpm@linux-foundation.org: avoid strange 80-col tricks]
    Link: http://lkml.kernel.org/r/20170502144533.10729-3-richard.weiyang@gmail.com
    Signed-off-by: Wei Yang
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • Each time a slab is deactivated, the page and freelist pointer should be
    reset.

    This patch just merges these two operations into deactivate_slab().

    Link: http://lkml.kernel.org/r/20170507031215.3130-2-richard.weiyang@gmail.com
    Signed-off-by: Wei Yang
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • When the code comes to this point, there are two cases:
    1. cpu_slab is deactivated
    2. cpu_slab is empty

    In both cases, cpu_slab->freelist is NULL at this moment.

    This patch removes the redundant assignment of cpu_slab->freelist.

    Link: http://lkml.kernel.org/r/20170507031215.3130-1-richard.weiyang@gmail.com
    Signed-off-by: Wei Yang
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     

24 Jun, 2017

1 commit

  • Commit bf5eb3de3847 ("slub: separate out sysfs_slab_release() from
    sysfs_slab_remove()") made slub sysfs file removals synchronous to
    kmem_cache shutdown.

    Unfortunately, this created a possible ABBA deadlock between slab_mutex
    and sysfs draining mechanism triggering the following lockdep warning.

    ======================================================
    [ INFO: possible circular locking dependency detected ]
    4.10.0-test+ #48 Not tainted
    -------------------------------------------------------
    rmmod/1211 is trying to acquire lock:
    (s_active#120){++++.+}, at: [] kernfs_remove+0x23/0x40

    but task is already holding lock:
    (slab_mutex){+.+.+.}, at: [] kmem_cache_destroy+0x41/0x2d0

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (slab_mutex){+.+.+.}:
    lock_acquire+0xf6/0x1f0
    __mutex_lock+0x75/0x950
    mutex_lock_nested+0x1b/0x20
    slab_attr_store+0x75/0xd0
    sysfs_kf_write+0x45/0x60
    kernfs_fop_write+0x13c/0x1c0
    __vfs_write+0x28/0x120
    vfs_write+0xc8/0x1e0
    SyS_write+0x49/0xa0
    entry_SYSCALL_64_fastpath+0x1f/0xc2

    -> #0 (s_active#120){++++.+}:
    __lock_acquire+0x10ed/0x1260
    lock_acquire+0xf6/0x1f0
    __kernfs_remove+0x254/0x320
    kernfs_remove+0x23/0x40
    sysfs_remove_dir+0x51/0x80
    kobject_del+0x18/0x50
    __kmem_cache_shutdown+0x3e6/0x460
    kmem_cache_destroy+0x1fb/0x2d0
    kvm_exit+0x2d/0x80 [kvm]
    vmx_exit+0x19/0xa1b [kvm_intel]
    SyS_delete_module+0x198/0x1f0
    entry_SYSCALL_64_fastpath+0x1f/0xc2

    other info that might help us debug this:

    Possible unsafe locking scenario:

    CPU0 CPU1
    ---- ----
    lock(slab_mutex);
    lock(s_active#120);
    lock(slab_mutex);
    lock(s_active#120);

    *** DEADLOCK ***

    2 locks held by rmmod/1211:
    #0: (cpu_hotplug.dep_map){++++++}, at: [] get_online_cpus+0x37/0x80
    #1: (slab_mutex){+.+.+.}, at: [] kmem_cache_destroy+0x41/0x2d0

    stack backtrace:
    CPU: 3 PID: 1211 Comm: rmmod Not tainted 4.10.0-test+ #48
    Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012
    Call Trace:
    print_circular_bug+0x1be/0x210
    __lock_acquire+0x10ed/0x1260
    lock_acquire+0xf6/0x1f0
    __kernfs_remove+0x254/0x320
    kernfs_remove+0x23/0x40
    sysfs_remove_dir+0x51/0x80
    kobject_del+0x18/0x50
    __kmem_cache_shutdown+0x3e6/0x460
    kmem_cache_destroy+0x1fb/0x2d0
    kvm_exit+0x2d/0x80 [kvm]
    vmx_exit+0x19/0xa1b [kvm_intel]
    SyS_delete_module+0x198/0x1f0
    ? SyS_delete_module+0x5/0x1f0
    entry_SYSCALL_64_fastpath+0x1f/0xc2

    It'd be the cleanest to deal with the issue by removing sysfs files
    without holding slab_mutex before the rest of shutdown; however, given
    the current code structure, it is pretty difficult to do so.

    This patch punts sysfs file removal to a work item. Before commit
    bf5eb3de3847, the removal was punted to a RCU delayed work item which is
    executed after release. Now, we're punting to a different work item on
    shutdown, which still maintains the goal of removing the sysfs files
    earlier when destroying kmem_caches.

    Link: http://lkml.kernel.org/r/20170620204512.GI21326@htj.duckdns.org
    Fixes: bf5eb3de3847 ("slub: separate out sysfs_slab_release() from sysfs_slab_remove()")
    Signed-off-by: Tejun Heo
    Reported-by: Steven Rostedt (VMware)
    Tested-by: Steven Rostedt (VMware)
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

03 Jun, 2017

1 commit

  • memcg_propagate_slab_attrs() abuses the sysfs attribute file functions
    to propagate settings from the root kmem_cache to a newly created
    kmem_cache. It does that with:

    attr->show(root, buf);
    attr->store(new, buf, strlen(buf));

    Aside from being lazy and absurd hackery, this is broken because it does
    not check the return value of the show() function.

    Some of the show() functions return 0 w/o touching the buffer. That
    means in such a case the store function is called with the stale content
    of the previous show(). That causes nonsense like invoking
    kmem_cache_shrink() on a newly created kmem_cache. In the worst case it
    would cause handing in an uninitialized buffer.

    This should be rewritten properly by adding a propagate() callback to
    those slub_attributes which must be propagated and avoid that insane
    conversion to and from ASCII, but that's too large for a hot fix.

    Check at least the return value of the show() function, so calling
    store() with stale content is prevented.
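
    A sketch of the guarded propagation after the fix (variable names
    assumed):

    len = attr->show(root_cache, buf);
    if (len > 0)
        attr->store(s, buf, len);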

    Steven said:
    "It can cause a deadlock with get_online_cpus() that has been uncovered
    by recent cpu hotplug and lockdep changes that Thomas and Peter have
    been doing.

    Possible unsafe locking scenario:

    CPU0 CPU1
    ---- ----
    lock(cpu_hotplug.lock);
    lock(slab_mutex);
    lock(cpu_hotplug.lock);
    lock(slab_mutex);

    *** DEADLOCK ***"

    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1705201244540.2255@nanos
    Signed-off-by: Thomas Gleixner
    Reported-by: Steven Rostedt
    Acked-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc: Christoph Hellwig
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     

19 Apr, 2017

1 commit

  • A group of Linux kernel hackers reported chasing a bug that resulted
    from their assumption that SLAB_DESTROY_BY_RCU provided an existence
    guarantee, that is, that no block from such a slab would be reallocated
    during an RCU read-side critical section. Of course, that is not the
    case. Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
    slab of blocks.

    However, there is a phrase for this, namely "type safety". This commit
    therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
    to avoid future instances of this sort of confusion.

    Signed-off-by: Paul E. McKenney
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrew Morton
    Cc:
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    [ paulmck: Add comments mentioning the old name, as requested by Eric
    Dumazet, in order to help people familiar with the old name find
    the new one. ]
    Acked-by: David Rientjes

    Paul E. McKenney
     

23 Feb, 2017

8 commits

  • SLUB creates a per-cache directory under /sys/kernel/slab which hosts a
    bunch of debug files. Usually, there aren't that many caches on a
    system and this doesn't really matter; however, if memcg is in use, each
    cache can have per-cgroup sub-caches. SLUB creates the same directories
    for these sub-caches under /sys/kernel/slab/$CACHE/cgroup.

    Unfortunately, because there can be a lot of cgroups, active or
    draining, the product of the numbers of caches, cgroups and files in
    each directory can reach a very high number - hundreds of thousands is
    commonplace. Millions and beyond aren't difficult to reach either.

    What's under /sys/kernel/slab is primarily for debugging and the
    information and control on a root cache already cover its
    sub-caches. While having a separate directory for each sub-cache can be
    helpful for development, it doesn't make much sense to pay this amount
    of overhead by default.

    This patch introduces a boot parameter slub_memcg_sysfs which determines
    whether to create sysfs directories for per-memcg sub-caches. It also
    adds CONFIG_SLUB_MEMCG_SYSFS_ON which determines the boot parameter's
    default value and defaults to 0.
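
    Usage is either on the kernel command line or at build time (spelling of
    the parameter as documented for this option):

    # enable per-memcg /sys/kernel/slab directories at boot
    slub_memcg_sysfs=1

    # or change the built-in default
    CONFIG_SLUB_MEMCG_SYSFS_ON=y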

    [akpm@linux-foundation.org: kset_unregister(NULL) is legal]
    Link: http://lkml.kernel.org/r/20170204145203.GB26958@mtj.duckdns.org
    Signed-off-by: Tejun Heo
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • With kmem cgroup support enabled, kmem_caches can be created and
    destroyed frequently and a great number of near empty kmem_caches can
    accumulate if there are a lot of transient cgroups and the system is not
    under memory pressure. When memory reclaim starts under such
    conditions, it can lead to consecutive deactivation and destruction of
    many kmem_caches, easily hundreds of thousands on moderately large
    systems, exposing scalability issues in the current slab management
    code. This is one of the patches to address the issue.

    Each cache has a number of sysfs interface files under /sys/kernel/slab.
    On a system with a lot of memory and transient memcgs, the number of
    interface files which have to be removed once memory reclaim kicks in
    can reach millions.

    Link: http://lkml.kernel.org/r/20170117235411.9408-10-tj@kernel.org
    Signed-off-by: Tejun Heo
    Reported-by: Jay Vana
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • With kmem cgroup support enabled, kmem_caches can be created and
    destroyed frequently and a great number of near empty kmem_caches can
    accumulate if there are a lot of transient cgroups and the system is not
    under memory pressure. When memory reclaim starts under such
    conditions, it can lead to consecutive deactivation and destruction of
    many kmem_caches, easily hundreds of thousands on moderately large
    systems, exposing scalability issues in the current slab management
    code. This is one of the patches to address the issue.

    slub uses synchronize_sched() to deactivate a memcg cache.
    synchronize_sched() is an expensive and slow operation and doesn't scale
    when a huge number of caches are destroyed back-to-back. While there
    used to be a simple batching mechanism, the batching was too restricted
    to be helpful.

    This patch implements slab_deactivate_memcg_cache_rcu_sched() which slub
    can use to schedule sched RCU callback instead of performing
    synchronize_sched() synchronously while holding cgroup_mutex. While
    this adds online cpus, mems and slab_mutex operations, operating on
    these locks back-to-back from the same kworker, which is what's gonna
    happen when there are many to deactivate, isn't expensive at all and
    this gets rid of the scalability problem completely.
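
    The helper takes the cache plus the deactivation callback to invoke from
    the sched-RCU callback; a sketch of the assumed prototype:

    /* mm/slab.h */
    void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s,
                                    void (*deact_fn)(struct kmem_cache *));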

    Link: http://lkml.kernel.org/r/20170117235411.9408-9-tj@kernel.org
    Signed-off-by: Tejun Heo
    Reported-by: Jay Vana
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • __kmem_cache_shrink() is called with %true @deactivate only for memcg
    caches. Remove @deactivate from __kmem_cache_shrink() and introduce
    __kmemcg_cache_deactivate() instead. Each memcg-supporting allocator
    should implement it and it should deactivate and drain the cache.

    This is to allow memcg cache deactivation behavior to further deviate
    from simple shrinking without messing up __kmem_cache_shrink().

    This is pure reorganization and doesn't introduce any observable
    behavior changes.

    v2: Dropped unnecessary ifdef in mm/slab.h as suggested by Vladimir.

    Link: http://lkml.kernel.org/r/20170117235411.9408-8-tj@kernel.org
    Signed-off-by: Tejun Heo
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • With kmem cgroup support enabled, kmem_caches can be created and
    destroyed frequently and a great number of near empty kmem_caches can
    accumulate if there are a lot of transient cgroups and the system is not
    under memory pressure. When memory reclaim starts under such
    conditions, it can lead to consecutive deactivation and destruction of
    many kmem_caches, easily hundreds of thousands on moderately large
    systems, exposing scalability issues in the current slab management
    code. This is one of the patches to address the issue.

    slab_caches currently lists all caches including root and memcg ones.
    This is the only data structure which lists the root caches and
    iterating root caches can only be done by walking the list while
    skipping over memcg caches. As there can be a huge number of memcg
    caches, this can become very expensive.

    This also can make /proc/slabinfo behave very badly. seq_file processes
    reads in 4k chunks and seeks to the previous Nth position on slab_caches
    list to resume after each chunk. With a lot of memcg cache churns on
    the list, reading /proc/slabinfo can become very slow and its content
    often ends up with duplicate and/or missing entries.

    This patch adds a new list slab_root_caches which lists only the root
    caches. When memcg is not enabled, it becomes just an alias of
    slab_caches. memcg specific list operations are collected into
    memcg_[un]link_cache().

    Link: http://lkml.kernel.org/r/20170117235411.9408-7-tj@kernel.org
    Signed-off-by: Tejun Heo
    Reported-by: Jay Vana
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • Separate out slub sysfs removal and release, and call the former earlier
    from __kmem_cache_shutdown(). There's no reason to defer sysfs removal
    through RCU and this will later allow us to remove sysfs files way
    earlier during memory cgroup offline instead of release.

    Link: http://lkml.kernel.org/r/20170117235411.9408-3-tj@kernel.org
    Signed-off-by: Tejun Heo
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • Patch series "slab: make memcg slab destruction scalable", v3.

    With kmem cgroup support enabled, kmem_caches can be created and
    destroyed frequently and a great number of near empty kmem_caches can
    accumulate if there are a lot of transient cgroups and the system is not
    under memory pressure. When memory reclaim starts under such
    conditions, it can lead to consecutive deactivation and destruction of
    many kmem_caches, easily hundreds of thousands on moderately large
    systems, exposing scalability issues in the current slab management
    code.

    I've seen machines which end up with hundreds of thousands of caches and
    many millions of kernfs_nodes. The current code is O(N^2) on the total
    number of caches and has synchronous rcu_barrier() and
    synchronize_sched() in cgroup offline / release path which is executed
    while holding cgroup_mutex. Combined, this leads to very expensive and
    slow cache destruction operations which can easily keep running for half
    a day.

    This also messes up /proc/slabinfo along with other cache iterating
    operations. seq_file operates on 4k chunks and on each 4k boundary
    tries to seek to the last position in the list. With a huge number of
    caches on the list, this becomes very slow and very prone to the list
    content changing underneath it leading to a lot of missing and/or
    duplicate entries.

    This patchset addresses the scalability problem.

    * Add root and per-memcg lists. Update each user to use the
    appropriate list.

    * Make rcu_barrier() for SLAB_DESTROY_BY_RCU caches globally batched
    and asynchronous.

    * For dying empty slub caches, remove the sysfs files after
    deactivation so that we don't end up with millions of sysfs files
    without any useful information on them.

    This patchset contains the following ten patches.

    0001-Revert-slub-move-synchronize_sched-out-of-slab_mutex.patch
    0002-slub-separate-out-sysfs_slab_release-from-sysfs_slab.patch
    0003-slab-remove-synchronous-rcu_barrier-call-in-memcg-ca.patch
    0004-slab-reorganize-memcg_cache_params.patch
    0005-slab-link-memcg-kmem_caches-on-their-associated-memo.patch
    0006-slab-implement-slab_root_caches-list.patch
    0007-slab-introduce-__kmemcg_cache_deactivate.patch
    0008-slab-remove-synchronous-synchronize_sched-from-memcg.patch
    0009-slab-remove-slub-sysfs-interface-files-early-for-emp.patch
    0010-slab-use-memcg_kmem_cache_wq-for-slab-destruction-op.patch

    0001 reverts an existing optimization to prepare for the following
    changes. 0002 is a prep patch. 0003 makes rcu_barrier() in release
    path batched and asynchronous. 0004-0006 separate out the lists.
    0007-0008 replace synchronize_sched() in slub destruction path with
    call_rcu_sched(). 0009 removes sysfs files early for empty dying
    caches. 0010 makes destruction work items use a workqueue with limited
    concurrency.

    This patch (of 10):

    Revert 89e364db71fb5e ("slub: move synchronize_sched out of slab_mutex on
    shrink").

    With kmem cgroup support enabled, kmem_caches can be created and destroyed
    frequently and a great number of near empty kmem_caches can accumulate if
    there are a lot of transient cgroups and the system is not under memory
    pressure. When memory reclaim starts under such conditions, it can lead
    to consecutive deactivation and destruction of many kmem_caches, easily
    hundreds of thousands on moderately large systems, exposing scalability
    issues in the current slab management code. This is one of the patches to
    address the issue.

    Moving synchronize_sched() out of slab_mutex isn't enough as it's still
    inside cgroup_mutex. The whole deactivation / release path will be
    updated to avoid all synchronous RCU operations. Revert this insufficient
    optimization in preparation to ease future changes.

    Link: http://lkml.kernel.org/r/20170117235411.9408-2-tj@kernel.org
    Signed-off-by: Tejun Heo
    Reported-by: Jay Vana
    Cc: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • We wish to know who is doing such a thing. slab.c does this.

    Link: http://lkml.kernel.org/r/20170116091643.15260-1-bp@alien8.de
    Signed-off-by: Borislav Petkov
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Borislav Petkov
     

09 Feb, 2017

1 commit

  • Commit 210e7a43fa90 ("mm: SLUB freelist randomization") broke USB hub
    initialisation as described in

    https://bugzilla.kernel.org/show_bug.cgi?id=177551.

    Bail out early from init_cache_random_seq if s->random_seq is already
    initialised. This prevents destroying the previously computed
    random_seq offsets later in the function.

    If the offsets are destroyed, then shuffle_freelist will truncate
    page->freelist to just the first object (orphaning the rest).
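
    The whole fix is an early return at the top of init_cache_random_seq();
    a sketch (field name as used in mm/slub.c):

    /* Bailout if already initialised */
    if (s->random_seq)
        return 0;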

    Fixes: 210e7a43fa90 ("mm: SLUB freelist randomization")
    Link: http://lkml.kernel.org/r/20170207140707.20824-1-sean@erifax.org
    Signed-off-by: Sean Rees
    Reported-by:
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Thomas Garnier
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sean Rees
     

25 Jan, 2017

1 commit

  • Currently when trace is enabled (e.g. slub_debug=T,kmalloc-128 ) the
    trace messages are mostly output at KERN_INFO. However the trace code
    also calls print_section() to hexdump the head of a free object. This
    is hard coded to use KERN_ERR, meaning the console is deluged with trace
    messages even if we've asked for quiet.

    Fix this the obvious way by adding a level parameter to
    print_section(), allowing calls from the trace code to use the same
    trace level as other trace messages.
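
    A sketch of the reworked helper and its trace-path caller (signature and
    call reproduced from memory):

    static void print_section(char *level, char *text, u8 *addr,
                              unsigned int length);

    /* trace path: emit the hexdump at the same level as the trace message */
    print_section(KERN_INFO, "object ", (void *)object, s->object_size);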

    Link: http://lkml.kernel.org/r/20170113154850.518-1-daniel.thompson@linaro.org
    Signed-off-by: Daniel Thompson
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Thompson
     

13 Dec, 2016

2 commits

  • The slub allocator gives us some incorrect warnings when
    CONFIG_PROFILE_ANNOTATED_BRANCHES is set, as the unlikely() macro
    prevents it from seeing that the return code matches what it was before:

    mm/slub.c: In function `kmem_cache_free_bulk':
    mm/slub.c:262:23: error: `df.s' may be used uninitialized in this function [-Werror=maybe-uninitialized]
    mm/slub.c:2943:3: error: `df.cnt' may be used uninitialized in this function [-Werror=maybe-uninitialized]
    mm/slub.c:2933:4470: error: `df.freelist' may be used uninitialized in this function [-Werror=maybe-uninitialized]
    mm/slub.c:2943:3: error: `df.tail' may be used uninitialized in this function [-Werror=maybe-uninitialized]

    I have not been able to come up with a perfect way of dealing with
    this; the three options I see are:

    - add a bogus initialization, which would increase the runtime overhead
    - replace unlikely() with unlikely_notrace()
    - remove the unlikely() annotation completely

    I checked the object code for a typical x86 configuration and the last
    two cases produce the same result, so I went for the last one, which is
    the simplest.

    Link: http://lkml.kernel.org/r/20161024155704.3114445-1-arnd@arndb.de
    Signed-off-by: Arnd Bergmann
    Acked-by: Jesper Dangaard Brouer
    Cc: Arnd Bergmann
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Laura Abbott
    Cc: Alexander Potapenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     
  • synchronize_sched() is a heavy operation and calling it per each cache
    owned by a memory cgroup being destroyed may take quite some time. What
    is worse, it's currently called under the slab_mutex, stalling all works
    doing cache creation/destruction.

    Actually, there isn't much point in calling synchronize_sched() for each
    cache - it's enough to call it just once - after setting cpu_partial for
    all caches and before shrinking them. This way, we can also move it out
    of the slab_mutex, which we have to hold for iterating over the slab
    cache list.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=172991
    Link: http://lkml.kernel.org/r/0a10d71ecae3db00fb4421bcd3f82bcc911f4be4.1475329751.git.vdavydov.dev@gmail.com
    Signed-off-by: Vladimir Davydov
    Reported-by: Doug Smythies
    Acked-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

07 Sep, 2016

1 commit

  • Install the callbacks via the state machine.

    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Pekka Enberg
    Cc: linux-mm@kvack.org
    Cc: rt@linutronix.de
    Cc: David Rientjes
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20160818125731.27256-5-bigeasy@linutronix.de
    Signed-off-by: Thomas Gleixner

    Sebastian Andrzej Siewior
     

11 Aug, 2016

1 commit

  • With debugobjects enabled and using SLAB_DESTROY_BY_RCU, when a
    kmem_cache_node is destroyed the call_rcu() may trigger a slab
    allocation to fill the debug object pool (__debug_object_init:fill_pool).

    Everywhere but during kmem_cache_destroy(), discard_slab() is performed
    outside of the kmem_cache_node->list_lock and avoids a lockdep warning
    about potential recursion:

    =============================================
    [ INFO: possible recursive locking detected ]
    4.8.0-rc1-gfxbench+ #1 Tainted: G U
    ---------------------------------------------
    rmmod/8895 is trying to acquire lock:
    (&(&n->list_lock)->rlock){-.-...}, at: [] get_partial_node.isra.63+0x47/0x430

    but task is already holding lock:
    (&(&n->list_lock)->rlock){-.-...}, at: [] __kmem_cache_shutdown+0x54/0x320

    other info that might help us debug this:
    Possible unsafe locking scenario:
    CPU0
    ----
    lock(&(&n->list_lock)->rlock);
    lock(&(&n->list_lock)->rlock);

    *** DEADLOCK ***
    May be due to missing lock nesting notation
    5 locks held by rmmod/8895:
    #0: (&dev->mutex){......}, at: driver_detach+0x42/0xc0
    #1: (&dev->mutex){......}, at: driver_detach+0x50/0xc0
    #2: (cpu_hotplug.dep_map){++++++}, at: get_online_cpus+0x2d/0x80
    #3: (slab_mutex){+.+.+.}, at: kmem_cache_destroy+0x3c/0x220
    #4: (&(&n->list_lock)->rlock){-.-...}, at: __kmem_cache_shutdown+0x54/0x320

    stack backtrace:
    CPU: 6 PID: 8895 Comm: rmmod Tainted: G U 4.8.0-rc1-gfxbench+ #1
    Hardware name: Gigabyte Technology Co., Ltd. H87M-D3H/H87M-D3H, BIOS F11 08/18/2015
    Call Trace:
    __lock_acquire+0x1646/0x1ad0
    lock_acquire+0xb2/0x200
    _raw_spin_lock+0x36/0x50
    get_partial_node.isra.63+0x47/0x430
    ___slab_alloc.constprop.67+0x1a7/0x3b0
    __slab_alloc.isra.64.constprop.66+0x43/0x80
    kmem_cache_alloc+0x236/0x2d0
    __debug_object_init+0x2de/0x400
    debug_object_activate+0x109/0x1e0
    __call_rcu.constprop.63+0x32/0x2f0
    call_rcu+0x12/0x20
    discard_slab+0x3d/0x40
    __kmem_cache_shutdown+0xdb/0x320
    shutdown_cache+0x19/0x60
    kmem_cache_destroy+0x1ae/0x220
    i915_gem_load_cleanup+0x14/0x40 [i915]
    i915_driver_unload+0x151/0x180 [i915]
    i915_pci_remove+0x14/0x20 [i915]
    pci_device_remove+0x34/0xb0
    __device_release_driver+0x95/0x140
    driver_detach+0xb6/0xc0
    bus_remove_driver+0x53/0xd0
    driver_unregister+0x27/0x50
    pci_unregister_driver+0x25/0x70
    i915_exit+0x1a/0x1e2 [i915]
    SyS_delete_module+0x193/0x1f0
    entry_SYSCALL_64_fastpath+0x1c/0xac

    Fixes: 52b4b950b507 ("mm: slab: free kmem_cache_node after destroy sysfs file")
    Link: http://lkml.kernel.org/r/1470759070-18743-1-git-send-email-chris@chris-wilson.co.uk
    Reported-by: Dave Gordon
    Signed-off-by: Chris Wilson
    Reviewed-by: Vladimir Davydov
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Dmitry Safonov
    Cc: Daniel Vetter
    Cc: Dave Gordon
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Wilson
     

09 Aug, 2016

1 commit

  • Pull usercopy protection from Kees Cook:
    "Tbhis implements HARDENED_USERCOPY verification of copy_to_user and
    copy_from_user bounds checking for most architectures on SLAB and
    SLUB"

    * tag 'usercopy-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    mm: SLUB hardened usercopy support
    mm: SLAB hardened usercopy support
    s390/uaccess: Enable hardened usercopy
    sparc/uaccess: Enable hardened usercopy
    powerpc/uaccess: Enable hardened usercopy
    ia64/uaccess: Enable hardened usercopy
    arm64/uaccess: Enable hardened usercopy
    ARM: uaccess: Enable hardened usercopy
    x86/uaccess: Enable hardened usercopy
    mm: Hardened usercopy
    mm: Implement stack frame object validation
    mm: Add is_migrate_cma_page

    Linus Torvalds