06 Mar, 2019

2 commits

  • Many kernel-doc comments in mm/ have the return value descriptions
    either misformatted or omitted altogether, which makes the kernel-doc
    script unhappy:

    $ make V=1 htmldocs
    ...
    ./mm/util.c:36: info: Scanning doc for kstrdup
    ./mm/util.c:41: warning: No description found for return value of 'kstrdup'
    ./mm/util.c:57: info: Scanning doc for kstrdup_const
    ./mm/util.c:66: warning: No description found for return value of 'kstrdup_const'
    ./mm/util.c:75: info: Scanning doc for kstrndup
    ./mm/util.c:83: warning: No description found for return value of 'kstrndup'
    ...

    Fixing the formatting and adding the missing return value descriptions
    eliminates ~100 such warnings.
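
    For reference, a kernel-doc comment with the "Return:" section the script
    expects looks like this (a sketch modeled on kstrdup's documentation in
    mm/util.c):

    /**
     * kstrdup - allocate space for and copy an existing string
     * @s: the string to duplicate
     * @gfp: the GFP mask used in the kmalloc() call when allocating memory
     *
     * Return: newly allocated copy of @s or %NULL in case of error
     */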

    Link: http://lkml.kernel.org/r/1549549644-4903-4-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • This is the start of a series of patches similar to my earlier
    DEFINE_MEMCG_MAX_OR_VAL work, but with less Macro Magic(tm).

    There are a bunch of places we go from seq_file to mem_cgroup, which
    currently requires manually getting the css, then getting the mem_cgroup
    from the css. It's in enough places now that having mem_cgroup_from_seq
    makes sense (and also makes the next patch a bit nicer).
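
    A minimal sketch of such a helper, built on the existing seq_css() and
    mem_cgroup_from_css() accessors:

    /* sketch: go from a seq_file directly to its mem_cgroup */
    static inline struct mem_cgroup *mem_cgroup_from_seq(struct seq_file *m)
    {
            return mem_cgroup_from_css(seq_css(m));
    }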

    Link: http://lkml.kernel.org/r/20190124194050.GA31341@chrisdown.name
    Signed-off-by: Chris Down
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Down
     

22 Feb, 2019

2 commits

  • kmemleak keeps two global variables, min_addr and max_addr, which store
    the range of valid (encountered by kmemleak) pointer values, which it
    later uses to speed up pointer lookup when scanning blocks.

    With tagged pointers this range will get bigger than it needs to be. This
    patch makes kmemleak untag pointers before saving them to min_addr and
    max_addr and when performing a lookup.
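
    A sketch of the idea; update_pointer_range() is a hypothetical name,
    while kasan_reset_tag() is the real untagging helper:

    /* sketch: strip the tag before updating kmemleak's tracked range */
    static void update_pointer_range(unsigned long ptr)
    {
            ptr = (unsigned long)kasan_reset_tag((void *)ptr);

            if (ptr < min_addr)
                    min_addr = ptr;
            if (ptr > max_addr)
                    max_addr = ptr;
    }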

    Link: http://lkml.kernel.org/r/16e887d442986ab87fe87a755815ad92fa431a5f.1550066133.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Tested-by: Qian Cai
    Acked-by: Catalin Marinas
    Cc: Alexander Potapenko
    Cc: Andrey Ryabinin
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Dmitry Vyukov
    Cc: Evgeniy Stepanov
    Cc: Joonsoo Kim
    Cc: Kostya Serebryany
    Cc: Pekka Enberg
    Cc: Vincenzo Frascino
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     
  • Right now we call kmemleak hooks before assigning tags to pointers in
    KASAN hooks. As a result, when an object gets allocated, kmemleak sees a
    differently tagged pointer compared to the one it sees when the object
    gets freed. Fix it by calling KASAN hooks before kmemleak's ones.
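
    A sketch of the intended ordering in the allocation path; the wrapper
    name is hypothetical, the two hooks are real:

    static inline void *slab_alloc_done(struct kmem_cache *s, gfp_t flags,
                                        void *object)
    {
            /* let KASAN assign the pointer tag first... */
            object = kasan_slab_alloc(s, object, flags);
            /* ...so kmemleak records the same tagged pointer it sees on free */
            kmemleak_alloc_recursive(object, s->object_size, 1, s->flags, flags);
            return object;
    }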

    Link: http://lkml.kernel.org/r/cd825aa4897b0fc37d3316838993881daccbe9f5.1549921721.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Reported-by: Qian Cai
    Cc: Alexander Potapenko
    Cc: Andrey Ryabinin
    Cc: Catalin Marinas
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Dmitry Vyukov
    Cc: Evgeniy Stepanov
    Cc: Joonsoo Kim
    Cc: Kostya Serebryany
    Cc: Pekka Enberg
    Cc: Vincenzo Frascino
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     

30 Dec, 2018

1 commit

  • Pull documentation update from Jonathan Corbet:
    "A fairly normal cycle for documentation stuff. We have a new document
    on perf security, more Italian translations, more improvements to the
    memory-management docs, improvements to the pathname lookup
    documentation, and the usual array of smaller fixes.

    As is often the case, there are a few reaches outside of
    Documentation/ to adjust kerneldoc comments"

    * tag 'docs-5.0' of git://git.lwn.net/linux: (38 commits)
    docs: improve pathname-lookup document structure
    configfs: fix wrong name of struct in documentation
    docs/mm-api: link slab_common.c to "The Slab Cache" section
    slab: make kmem_cache_create{_usercopy} description proper kernel-doc
    doc:process: add links where missing
    docs/core-api: make mm-api.rst more structured
    x86, boot: documentation whitespace fixup
    Documentation: devres: note checking needs when converting
    doc:it: add some process/* translations
    doc:it: fixes in process/1.Intro
    Documentation: convert path-lookup from markdown to restructured text
    Documentation/admin-guide: update admin-guide index.rst
    Documentation/admin-guide: introduce perf-security.rst file
    scripts/kernel-doc: Fix struct and struct field attribute processing
    Documentation: dev-tools: Fix typos in index.rst
    Correct gen_init_cpio tool's documentation
    Document /proc/pid PID reuse behavior
    Documentation: update path-lookup.md for parallel lookups
    Documentation: Use "while" instead of "whilst"
    dmaengine: Add mailing list address to the documentation
    ...

    Linus Torvalds
     

29 Dec, 2018

3 commits

  • WARN_ON() already contains an unlikely(), so it's not necessary to wrap
    the condition in another unlikely().

    Also change WARN_ON() back to WARN_ON_ONCE() to avoid potentially
    spamming dmesg with user-triggerable large allocations.

    [akpm@linux-foundation.org: s/WARN_ON/WARN_ON_ONCE/, per Vlastimil]
    Link: http://lkml.kernel.org/r/20181104125028.3572-1-tiny.windzz@gmail.com
    Signed-off-by: Yangtao Li
    Acked-by: Vlastimil Babka
    Reviewed-by: Andrew Morton
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yangtao Li
     
  • The krealloc function checks whether the same buffer was reused or a new
    one was allocated by comparing kernel pointers. Tag-based KASAN changes
    the memory tag on the krealloc'ed chunk of memory and therefore also
    changes the pointer tag of the returned pointer. Therefore we need to
    perform the comparison on untagged (with tags reset) pointers to check
    whether it's the same memory region or not.
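
    A sketch of the comparison with tags reset; kasan_reset_tag() is the real
    helper, the surrounding function is illustrative:

    /* sketch: did krealloc() reuse the same underlying buffer? */
    static bool krealloc_reused(const void *old_ptr, const void *new_ptr)
    {
            return kasan_reset_tag(old_ptr) == kasan_reset_tag(new_ptr);
    }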

    Link: http://lkml.kernel.org/r/14f6190d7846186a3506cd66d82446646fe65090.1544099024.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Reviewed-by: Andrey Ryabinin
    Reviewed-by: Dmitry Vyukov
    Cc: Christoph Lameter
    Cc: Mark Rutland
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     
  • Patch series "kasan: add software tag-based mode for arm64", v13.

    This patchset adds a new software tag-based mode to KASAN [1]. (Initially
    this mode was called KHWASAN, but it got renamed, see the naming rationale
    at the end of this section).

    The plan is to implement HWASan [2] for the kernel with the incentive
    that it will have performance comparable to KASAN but at the same time
    consume much less memory, trading that off for somewhat imprecise bug
    detection and being supported only for arm64.

    The underlying ideas of the approach used by software tag-based KASAN are:

    1. By using the Top Byte Ignore (TBI) arm64 CPU feature, we can store
    pointer tags in the top byte of each kernel pointer.

    2. Using shadow memory, we can store memory tags for each chunk of kernel
    memory.

    3. On each memory allocation, we can generate a random tag, embed it into
    the returned pointer and set the memory tags that correspond to this
    chunk of memory to the same value.

    4. By using compiler instrumentation, before each memory access we can add
    a check that the pointer tag matches the tag of the memory that is being
    accessed.

    5. On a tag mismatch we report an error.
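
    A rough C illustration of ideas 1 and 4; all names and the exact tag
    layout are assumptions made for this sketch:

    #define TAG_SHIFT 56    /* top byte of a 64-bit pointer (TBI) */

    static inline void *set_tag(void *ptr, u8 tag)
    {
            return (void *)(((u64)ptr & ~(0xffULL << TAG_SHIFT)) |
                            ((u64)tag << TAG_SHIFT));
    }

    static inline u8 get_tag(const void *ptr)
    {
            return (u8)((u64)ptr >> TAG_SHIFT);
    }

    /* idea 4: the instrumented check before a memory access */
    static inline bool tags_match(const void *ptr, u8 memory_tag)
    {
            return get_tag(ptr) == memory_tag;
    }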

    With this patchset the existing KASAN mode gets renamed to generic KASAN,
    with the word "generic" meaning that the implementation can be supported
    by any architecture as it is purely software.

    The new mode this patchset adds is called software tag-based KASAN. The
    word "tag-based" refers to the fact that this mode uses tags embedded into
    the top byte of kernel pointers and the TBI arm64 CPU feature that allows
    such pointers to be dereferenced. The word "software" here means that shadow
    memory manipulation and tag checking on pointer dereference is done in
    software. As it is the only tag-based implementation right now, "software
    tag-based" KASAN is sometimes referred to as simply "tag-based" in this
    patchset.

    A potential expansion of this mode is a hardware tag-based mode, which
    would use hardware memory tagging support (announced by Arm [3]) instead
    of compiler instrumentation and manual shadow memory manipulation.

    Same as generic KASAN, software tag-based KASAN is strictly a debugging
    feature.

    [1] https://www.kernel.org/doc/html/latest/dev-tools/kasan.html

    [2] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html

    [3] https://community.arm.com/processors/b/blog/posts/arm-a-profile-architecture-2018-developments-armv85a

    ====== Rationale

    On mobile devices generic KASAN's memory usage is a significant problem.
    One of the main reasons to have tag-based KASAN is to be able to perform a
    similar set of checks as the generic one does, but with lower memory
    requirements.

    Comment from Vishwath Mohan :

    I don't have data on-hand, but anecdotally both ASAN and KASAN have proven
    problematic to enable for environments that don't tolerate the increased
    memory pressure well. This includes

    (a) Low-memory form factors - Wear, TV, Things, lower-tier phones like Go,
    (b) Connected components like Pixel's visual core [1].

    These are both places I'd love to have a low(er) memory footprint option at
    my disposal.

    Comment from Evgenii Stepanov :

    Looking at a live Android device under load, slab (according to
    /proc/meminfo) + kernel stack take 8-10% of available RAM (~350MB).
    KASAN's overhead of 2x-3x on top of it is not insignificant.

    Not having this overhead enables near-production use - ex. running
    KASAN/KHWASAN kernel on a personal, daily-use device to catch bugs that do
    not reproduce in test configuration. These are the ones that often cost
    the most engineering time to track down.

    CPU overhead is bad, but generally tolerable. RAM is critical, in our
    experience. Once it gets low enough, OOM-killer makes your life
    miserable.

    [1] https://www.blog.google/products/pixel/pixel-visual-core-image-processing-and-machine-learning-pixel-2/

    ====== Technical details

    Software tag-based KASAN mode is implemented in a very similar way to the
    generic one. This patchset essentially does the following:

    1. TCR_TBI1 is set to enable Top Byte Ignore.

    2. Shadow memory is used (with a different scale, 1:16, so each shadow
    byte corresponds to 16 bytes of kernel memory) to store memory tags.

    3. All slab objects are aligned to the shadow scale, which is 16 bytes.

    4. All pointers returned from the slab allocator are tagged with a random
    tag and the corresponding shadow memory is poisoned with the same value.

    5. Compiler instrumentation is used to insert tag checks, either by
    calling callbacks or by inlining them (the CONFIG_KASAN_OUTLINE and
    CONFIG_KASAN_INLINE flags are reused).

    6. When a tag mismatch is detected in callback instrumentation mode,
    KASAN simply prints a bug report. In case of inline instrumentation,
    clang inserts a brk instruction, and KASAN has its own brk handler,
    which reports the bug.

    7. The memory in between slab objects is marked with a reserved tag and
    acts as a redzone.

    8. When a slab object is freed, it's marked with a reserved tag.
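
    A sketch of steps 2 and 4 with the 1:16 shadow scale; mem_to_tag_shadow()
    is a hypothetical helper:

    #define SHADOW_SCALE_SHIFT 4    /* one shadow byte per 16 bytes */

    /* hypothetical: write the object's random tag into its shadow bytes */
    static void poison_shadow_with_tag(const void *addr, size_t size, u8 tag)
    {
            u8 *shadow = mem_to_tag_shadow(addr);

            memset(shadow, tag, size >> SHADOW_SCALE_SHIFT);
    }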

    Bug detection is imprecise for two reasons:

    1. We won't catch some small out-of-bounds accesses that fall into the
    same shadow cell as the last byte of a slab object.

    2. We only have 1 byte to store tags, which means we have a 1/256
    probability of a tag match for an incorrect access (actually even
    slightly less due to reserved tag values).

    Despite that, there's a particular type of bug that tag-based KASAN can
    detect but generic KASAN cannot: a use-after-free after the object has
    been reallocated to someone else.

    ====== Testing

    Some kernel developers voiced a concern that changing the top byte of
    kernel pointers may lead to subtle bugs that are difficult to discover.
    To address this concern deliberate testing has been performed.

    It doesn't seem feasible to do some kind of static checking to find
    potential issues with pointer tagging, so a dynamic approach was taken.
    All pointer comparisons/subtractions were instrumented in an LLVM
    compiler pass, and a kernel module was used to print a bug report
    whenever two pointers with different tags were compared/subtracted
    (ignoring comparisons with NULL pointers and with pointers obtained by
    casting an error code to a pointer type). Then the kernel was booted in
    QEMU and on an Odroid C2 board, and syzkaller was run.

    This yielded the following results.

    The two places that look interesting are:

    is_vmalloc_addr in include/linux/mm.h
    is_kernel_rodata in mm/util.c

    Here we compare a pointer with some fixed untagged values to make sure
    that the pointer lies in a particular part of the kernel address space.
    Since tag-based KASAN doesn't add tags to pointers that belong to rodata
    or vmalloc regions, this should work as is. To make sure, debug checks
    have been added to those two functions, verifying that the result doesn't
    change whether we operate on pointers with or without tags.

    A few other cases that don't look that interesting:

    Comparing pointers to achieve unique sorting order of pointee objects
    (e.g. sorting locks addresses before performing a double lock):

    tty_ldisc_lock_pair_timeout in drivers/tty/tty_ldisc.c
    pipe_double_lock in fs/pipe.c
    unix_state_double_lock in net/unix/af_unix.c
    lock_two_nondirectories in fs/inode.c
    mutex_lock_double in kernel/events/core.c

    ep_cmp_ffd in fs/eventpoll.c
    fsnotify_compare_groups in fs/notify/mark.c

    Nothing needs to be done here, since the tags embedded into pointers
    don't change, so the sorting order would still be unique.

    Checks that a pointer belongs to some particular allocation:

    is_sibling_entry in lib/radix-tree.c
    object_is_on_stack in include/linux/sched/task_stack.h

    Nothing needs to be done here either, since two pointers can only belong
    to the same allocation if they have the same tag.

    Overall, since the kernel boots and works, there are no critical bugs.
    As for the rest, the traditional kernel testing way (use until it fails)
    is the only one that looks feasible.

    Another point here is that tag-based KASAN is available under a separate
    config option that needs to be deliberately enabled. Even though it might
    be used in a "near-production" environment to find bugs that are not found
    during fuzzing or running tests, it is still a debug tool.

    ====== Benchmarks

    The following numbers were collected on Odroid C2 board. Both generic and
    tag-based KASAN were used in inline instrumentation mode.

    Boot time [1]:
    * ~1.7 sec for clean kernel
    * ~5.0 sec for generic KASAN
    * ~5.0 sec for tag-based KASAN

    Network performance [2]:
    * 8.33 Gbits/sec for clean kernel
    * 3.17 Gbits/sec for generic KASAN
    * 2.85 Gbits/sec for tag-based KASAN

    Slab memory usage after boot [3]:
    * ~40 kb for clean kernel
    * ~105 kb (~260% overhead) for generic KASAN
    * ~47 kb (~20% overhead) for tag-based KASAN

    KASAN memory overhead consists of three main parts:
    1. Increased slab memory usage due to redzones.
    2. Shadow memory (the whole of it is reserved once during boot).
    3. Quarantine (grows gradually up to some preset limit; the larger the
    limit, the greater the chance to detect a use-after-free).

    Comparing tag-based vs generic KASAN for each of these points:
    1. 20% vs 260% overhead.
    2. 1/16th vs 1/8th of physical memory.
    3. Tag-based KASAN doesn't require quarantine.

    [1] Time before the ext4 driver is initialized.
    [2] Measured as `iperf -s & iperf -c 127.0.0.1 -t 30`.
    [3] Measured as `cat /proc/meminfo | grep Slab`.

    ====== Some notes

    A few notes:

    1. The patchset can be found here:
    https://github.com/xairy/kasan-prototype/tree/khwasan

    2. Building requires a recent Clang version (7.0.0 or later).

    3. Stack instrumentation is not supported yet and will be added later.

    This patch (of 25):

    Tag-based KASAN changes the value of the top byte of pointers returned
    from the kernel allocation functions (such as kmalloc). This patch
    updates KASAN hooks signatures and their usage in SLAB and SLUB code to
    reflect that.

    Link: http://lkml.kernel.org/r/aec2b5e3973781ff8a6bb6760f8543643202c451.1544099024.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Reviewed-by: Andrey Ryabinin
    Reviewed-by: Dmitry Vyukov
    Cc: Christoph Lameter
    Cc: Mark Rutland
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     

28 Nov, 2018

1 commit

  • Now that synchronize_rcu() waits for preempt-disable regions of code
    as well as RCU read-side critical sections, synchronize_sched() can be
    replaced by synchronize_rcu(). This commit therefore makes this change.

    Signed-off-by: Paul E. McKenney
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrew Morton
    Cc:

    Paul E. McKenney
     

27 Oct, 2018

4 commits

  • Kmalloc cache names can get quite long for large object sizes, when the
    sizes are expressed in bytes. Use 'k' and 'M' prefixes to make the names
    as short as possible e.g. in /proc/slabinfo. This works, as we mostly
    use power-of-two sizes, with exceptions only below 1k.

    Example: 'kmalloc-4194304' becomes 'kmalloc-4M'
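
    A sketch of how such names could be produced (the helper name is
    assumed):

    /* sketch: shorten power-of-two kmalloc sizes with k/M suffixes */
    static void kmalloc_cache_name(char *buf, size_t len, unsigned int size)
    {
            if (size >= 1024 * 1024)
                    snprintf(buf, len, "kmalloc-%uM", size / (1024 * 1024));
            else if (size >= 1024)
                    snprintf(buf, len, "kmalloc-%uk", size / 1024);
            else
                    snprintf(buf, len, "kmalloc-%u", size);
    }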

    Link: http://lkml.kernel.org/r/20180731090649.16028-7-vbabka@suse.cz
    Suggested-by: Matthew Wilcox
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Christoph Lameter
    Acked-by: Roman Gushchin
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Laura Abbott
    Cc: Michal Hocko
    Cc: Sumit Semwal
    Cc: Vijayanand Jitta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Kmem caches can be created with a SLAB_RECLAIM_ACCOUNT flag, which
    indicates they contain objects which can be reclaimed under memory
    pressure (typically through a shrinker). This makes the slab pages
    accounted as NR_SLAB_RECLAIMABLE in vmstat, which is also reflected in
    the MemAvailable meminfo counter and in overcommit decisions. The slab
    pages are also allocated with __GFP_RECLAIMABLE, which is good for
    anti-fragmentation through grouping pages by mobility.

    The generic kmalloc-X caches are created without this flag, but are
    sometimes also used for objects that can be reclaimed, which due to their
    varying size cannot have a dedicated kmem cache with the
    SLAB_RECLAIM_ACCOUNT flag. A
    prominent example are dcache external names, which prompted the creation
    of a new, manually managed vmstat counter NR_INDIRECTLY_RECLAIMABLE_BYTES
    in commit f1782c9bc547 ("dcache: account external names as indirectly
    reclaimable memory").

    To better handle this and any other similar cases, this patch introduces
    SLAB_RECLAIM_ACCOUNT variants of kmalloc caches, named kmalloc-rcl-X.
    They are used whenever the kmalloc() call passes __GFP_RECLAIMABLE among
    gfp flags. They are added to the kmalloc_caches array as a new type.
    Allocations with both __GFP_DMA and __GFP_RECLAIMABLE will use a dma type
    cache.
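
    A sketch of the type selection described above; the enum shape is an
    assumption:

    enum kmalloc_cache_type {
            KMALLOC_NORMAL = 0,
            KMALLOC_RECLAIM,        /* the new kmalloc-rcl-* caches */
            KMALLOC_DMA,
    };

    static inline enum kmalloc_cache_type kmalloc_type(gfp_t flags)
    {
            if (IS_ENABLED(CONFIG_ZONE_DMA) && (flags & __GFP_DMA))
                    return KMALLOC_DMA;     /* DMA wins over reclaimable */
            if (flags & __GFP_RECLAIMABLE)
                    return KMALLOC_RECLAIM;
            return KMALLOC_NORMAL;
    }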

    This change only applies to SLAB and SLUB, not SLOB. This is fine, since
    SLOB targets tiny systems and this patch would add some overhead of kmem
    management objects.

    Link: http://lkml.kernel.org/r/20180731090649.16028-3-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Christoph Lameter
    Acked-by: Roman Gushchin
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Laura Abbott
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Sumit Semwal
    Cc: Vijayanand Jitta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Patch series "kmalloc-reclaimable caches", v4.

    As discussed at LSF/MM [1] here's a patchset that introduces
    kmalloc-reclaimable caches (more details in the second patch) and uses
    them for dcache external names. That allows us to repurpose the
    NR_INDIRECTLY_RECLAIMABLE_BYTES counter later in the series.

    With patch 3/6, dcache external names are allocated from kmalloc-rcl-*
    caches, eliminating the need for manual accounting. More importantly, it
    also ensures the reclaimable kmalloc allocations are grouped in pages
    separate from the regular kmalloc allocations. The need for proper
    accounting of dcache external names has shown that it's easy for a
    misbehaving process to allocate lots of them, causing premature OOMs.
    Without the added grouping, it's likely that a similar workload can
    interleave the dcache external name allocations with regular kmalloc
    allocations (note: I haven't searched myself for an example of such a
    regular kmalloc allocation, but I would be very surprised if there
    wasn't some). A pathological case would be e.g. one 64-byte regular
    allocation with 63 external dcache names in a page (64x64=4096), which
    means the page is not freed even after all the dcache names are
    reclaimed, and the process can thus "steal" the whole page with a single
    64-byte allocation.

    If other kmalloc users similar to dcache external names become identified,
    they can also benefit from the new functionality simply by adding
    __GFP_RECLAIMABLE to the kmalloc calls.

    Side benefits of the patchset (that could be also merged separately)
    include removed branch for detecting __GFP_DMA kmalloc(), and shortening
    kmalloc cache names in /proc/slabinfo output. The latter is potentially
    an ABI break in case there are tools parsing the names and expecting the
    values to be in bytes.

    This is how /proc/slabinfo looks after booting in virtme:

    ...
    kmalloc-rcl-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
    ...
    kmalloc-rcl-96 7 32 128 32 1 : tunables 120 60 8 : slabdata 1 1 0
    kmalloc-rcl-64 25 128 64 64 1 : tunables 120 60 8 : slabdata 2 2 0
    kmalloc-rcl-32 0 0 32 124 1 : tunables 120 60 8 : slabdata 0 0 0
    kmalloc-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
    kmalloc-2M 0 0 2097152 1 512 : tunables 1 1 0 : slabdata 0 0 0
    kmalloc-1M 0 0 1048576 1 256 : tunables 1 1 0 : slabdata 0 0 0
    ...

    /proc/vmstat with renamed nr_indirectly_reclaimable_bytes counter:

    ...
    nr_slab_reclaimable 2817
    nr_slab_unreclaimable 1781
    ...
    nr_kernel_misc_reclaimable 0
    ...

    /proc/meminfo with new KReclaimable counter:

    ...
    Shmem: 564 kB
    KReclaimable: 11260 kB
    Slab: 18368 kB
    SReclaimable: 11260 kB
    SUnreclaim: 7108 kB
    KernelStack: 1248 kB
    ...

    This patch (of 6):

    The kmalloc caches currently maintain a separate (optional) array,
    kmalloc_dma_caches, for __GFP_DMA allocations. There are tests for
    __GFP_DMA in the allocation hotpaths. We can avoid the branches by
    combining kmalloc_caches and kmalloc_dma_caches into a single
    two-dimensional array where the outer dimension is the cache "type".
    This will also allow adding kmalloc-reclaimable caches as a third type,
    as sketched below.
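
    A sketch of the combined array and its lookup; NR_KMALLOC_TYPES and the
    simplified kmalloc_slab() body are assumptions:

    /* sketch: one 2D array instead of kmalloc_caches + kmalloc_dma_caches */
    struct kmem_cache *kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1];

    struct kmem_cache *kmalloc_slab(size_t size, gfp_t flags)
    {
            unsigned int index = kmalloc_index(size);

            return kmalloc_caches[kmalloc_type(flags)][index];
    }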

    Link: http://lkml.kernel.org/r/20180731090649.16028-2-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Christoph Lameter
    Acked-by: Roman Gushchin
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Matthew Wilcox
    Cc: Laura Abbott
    Cc: Sumit Semwal
    Cc: Vijayanand Jitta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Slub does not call kmalloc_slab() for sizes > KMALLOC_MAX_CACHE_SIZE;
    instead it falls back to kmalloc_large().

    For slab, KMALLOC_MAX_CACHE_SIZE == KMALLOC_MAX_SIZE, and it calls
    kmalloc_slab() for all allocations, relying on the NULL return value for
    over-sized allocations.

    This inconsistency leads to unwanted warnings from kmalloc_slab() for
    over-sized allocations for slab. Returning NULL for failed allocations is
    the expected behavior.

    Make slub and slab code consistent by checking size >
    KMALLOC_MAX_CACHE_SIZE in slab before calling kmalloc_slab().
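
    A sketch of the slab-side check; the surrounding function shape is an
    assumption:

    static void *do_kmalloc(size_t size, gfp_t flags)
    {
            struct kmem_cache *cachep;

            if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
                    return NULL;    /* over-sized: fail, matching slub */
            cachep = kmalloc_slab(size, flags);
            if (unlikely(ZERO_OR_NULL_PTR(cachep)))
                    return cachep;
            return kmem_cache_alloc(cachep, flags);
    }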

    While we are here, also fix the check in kmalloc_slab(). We should check
    against KMALLOC_MAX_CACHE_SIZE rather than KMALLOC_MAX_SIZE. It all kinda
    worked because for slab the constants are the same, and slub always
    checks the size against KMALLOC_MAX_CACHE_SIZE before calling
    kmalloc_slab(). But if we get there with size > KMALLOC_MAX_CACHE_SIZE
    anyhow, bad things will happen. For example, in case of a newly
    introduced bug in slub code.

    Also move the check in kmalloc_slab() from function entry to the size >
    192 case. This partially compensates for the additional check in slab
    code and makes slub code a bit faster (at least theoretically).

    Also drop __GFP_NOWARN in the warning check. This warning means a bug in
    slab code itself, user-passed flags have nothing to do with it.

    Nothing of this affects slob.

    Link: http://lkml.kernel.org/r/20180927171502.226522-1-dvyukov@gmail.com
    Signed-off-by: Dmitry Vyukov
    Reported-by: syzbot+87829a10073277282ad1@syzkaller.appspotmail.com
    Reported-by: syzbot+ef4e8fc3a06e9019bb40@syzkaller.appspotmail.com
    Reported-by: syzbot+6e438f4036df52cbb863@syzkaller.appspotmail.com
    Reported-by: syzbot+8574471d8734457d98aa@syzkaller.appspotmail.com
    Reported-by: syzbot+af1504df0807a083dbd9@syzkaller.appspotmail.com
    Acked-by: Christoph Lameter
    Acked-by: Vlastimil Babka
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
     

18 Aug, 2018

1 commit

  • Introduce a new config option to replace the repeating
    CONFIG_MEMCG && !CONFIG_SLOB pattern. The next patches add a little more
    memcg+kmem related code, so let's keep the defines clearer.
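
    A sketch of the resulting pattern, using one kmem-related declaration of
    that era for illustration:

    #ifdef CONFIG_MEMCG_KMEM        /* was: CONFIG_MEMCG && !CONFIG_SLOB */
    int memcg_kmem_charge(struct page *page, gfp_t gfp, int order);
    #else
    static inline int memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
    {
            return 0;
    }
    #endif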

    Link: http://lkml.kernel.org/r/153063053670.1818.15013136946600481138.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     

29 Jun, 2018

1 commit

  • In kernel 4.17 I removed some code from dm-bufio that did slab cache
    merging (commit 21bb13276768: "dm bufio: remove code that merges slab
    caches") - both slab and slub support merging caches with identical
    attributes, so dm-bufio now just calls kmem_cache_create and relies on
    implicit merging.

    This uncovered a bug in the slub subsystem - if we delete a cache and
    immediately create another cache with the same attributes, it fails
    because of a duplicate filename in /sys/kernel/slab/. The slub subsystem
    offloads freeing the cache to a workqueue - and if we create the new
    cache before the workqueue runs, it complains because of duplicate
    filename in sysfs.

    This patch fixes the bug by moving the call of kobject_del from
    sysfs_slab_remove_workfn to shutdown_cache. kobject_del must be called
    while we hold slab_mutex - so that the sysfs entry is deleted before a
    cache with the same attributes could be created.
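
    A sketch of the fix's shape (helper names assumed); the point is that the
    sysfs entry is removed while slab_mutex is still held, and only the final
    kobject_put() stays deferred:

    static int shutdown_cache(struct kmem_cache *s)
    {
            lockdep_assert_held(&slab_mutex);

            if (__kmem_cache_shutdown(s) != 0)
                    return -EBUSY;

            list_del(&s->list);
            sysfs_slab_unlink(s);   /* kobject_del(): frees the sysfs name now */
            sysfs_slab_release(s);  /* deferred kobject_put() via workqueue */
            return 0;
    }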

    Running device-mapper-test-suite with:

    dmtest run --suite thin-provisioning -n /commit_failure_causes_fallback/

    triggered:

    Buffer I/O error on dev dm-0, logical block 1572848, async page read
    device-mapper: thin: 253:1: metadata operation 'dm_pool_alloc_data_block' failed: error = -5
    device-mapper: thin: 253:1: aborting current metadata transaction
    sysfs: cannot create duplicate filename '/kernel/slab/:a-0000144'
    CPU: 2 PID: 1037 Comm: kworker/u48:1 Not tainted 4.17.0.snitm+ #25
    Hardware name: Supermicro SYS-1029P-WTR/X11DDW-L, BIOS 2.0a 12/06/2017
    Workqueue: dm-thin do_worker [dm_thin_pool]
    Call Trace:
    dump_stack+0x5a/0x73
    sysfs_warn_dup+0x58/0x70
    sysfs_create_dir_ns+0x77/0x80
    kobject_add_internal+0xba/0x2e0
    kobject_init_and_add+0x70/0xb0
    sysfs_slab_add+0xb1/0x250
    __kmem_cache_create+0x116/0x150
    create_cache+0xd9/0x1f0
    kmem_cache_create_usercopy+0x1c1/0x250
    kmem_cache_create+0x18/0x20
    dm_bufio_client_create+0x1ae/0x410 [dm_bufio]
    dm_block_manager_create+0x5e/0x90 [dm_persistent_data]
    __create_persistent_data_objects+0x38/0x940 [dm_thin_pool]
    dm_pool_abort_metadata+0x64/0x90 [dm_thin_pool]
    metadata_operation_failed+0x59/0x100 [dm_thin_pool]
    alloc_data_block.isra.53+0x86/0x180 [dm_thin_pool]
    process_cell+0x2a3/0x550 [dm_thin_pool]
    do_worker+0x28d/0x8f0 [dm_thin_pool]
    process_one_work+0x171/0x370
    worker_thread+0x49/0x3f0
    kthread+0xf8/0x130
    ret_from_fork+0x35/0x40
    kobject_add_internal failed for :a-0000144 with -EEXIST, don't try to register things with the same name in the same directory.
    kmem_cache_create(dm_bufio_buffer-16) failed with error -17

    Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1806151817130.6333@file01.intranet.prod.int.rdu2.redhat.com
    Signed-off-by: Mikulas Patocka
    Reported-by: Mike Snitzer
    Tested-by: Mike Snitzer
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
     

15 Jun, 2018

2 commits

  • mm/*.c files use symbolic and octal styles for permissions.

    Using octal and not symbolic permissions is preferred by many as more
    readable.

    https://lkml.org/lkml/2016/8/2/1945

    Prefer the direct use of octal for permissions.

    Done using
    $ scripts/checkpatch.pl -f --types=SYMBOLIC_PERMS --fix-inplace mm/*.c
    and some typing.

    Before: $ git grep -P -w "0[0-7]{3,3}" mm | wc -l
    44
    After: $ git grep -P -w "0[0-7]{3,3}" mm | wc -l
    86

    Miscellanea:

    o Whitespace neatening around these conversions.

    Link: http://lkml.kernel.org/r/2e032ef111eebcd4c5952bae86763b541d373469.1522102887.git.joe@perches.com
    Signed-off-by: Joe Perches
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • The memcg kmem cache creation and deactivation (SLUB only) is
    asynchronous. If a root kmem cache is destroyed whose memcg cache is in
    the process of creation or deactivation, the kernel may crash.

    Example of one such crash:
    general protection fault: 0000 [#1] SMP PTI
    CPU: 1 PID: 1721 Comm: kworker/14:1 Not tainted 4.17.0-smp
    ...
    Workqueue: memcg_kmem_cache kmemcg_deactivate_workfn
    RIP: 0010:has_cpu_slab
    ...
    Call Trace:
    ? on_each_cpu_cond
    __kmem_cache_shrink
    kmemcg_cache_deact_after_rcu
    kmemcg_deactivate_workfn
    process_one_work
    worker_thread
    kthread
    ret_from_fork+0x35/0x40

    To fix this race, on root kmem cache destruction, mark the cache as
    dying and flush the workqueue used for memcg kmem cache creation and
    deactivation. SLUB's memcg kmem cache deactivation also includes an RCU
    callback, so we must also make sure all previously registered RCU
    callbacks have completed.

    [shakeelb@google.com: handle the RCU callbacks for SLUB deactivation]
    Link: http://lkml.kernel.org/r/20180611192951.195727-1-shakeelb@google.com
    [shakeelb@google.com: add more documentation, rename fields for readability]
    Link: http://lkml.kernel.org/r/20180522201336.196994-1-shakeelb@google.com
    [akpm@linux-foundation.org: fix build, per Shakeel]
    [shakeelb@google.com: v3. Instead of refcount, flush the workqueue]
    Link: http://lkml.kernel.org/r/20180530001204.183758-1-shakeelb@google.com
    Link: http://lkml.kernel.org/r/20180521174116.171846-1-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     

06 Apr, 2018

14 commits

  • should_failslab() is a convenient function to hook into for directed
    error injection into kmalloc(). However, it is only available if a
    config flag is set.

    The following BCC script, for example, fails kmalloc() calls after a
    btrfs umount:

    from bcc import BPF

    prog = r"""
    BPF_HASH(flag);

    #include <linux/mm.h>

    int kprobe__btrfs_close_devices(void *ctx) {
        u64 key = 1;
        flag.update(&key, &key);
        return 0;
    }

    int kprobe__should_failslab(struct pt_regs *ctx) {
        u64 key = 1;
        u64 *res;
        res = flag.lookup(&key);
        if (res != 0) {
            bpf_override_return(ctx, -ENOMEM);
        }
        return 0;
    }
    """
    b = BPF(text=prog)

    while 1:
        b.kprobe_poll()

    This patch refactors the should_failslab implementation so that the
    function is always available for error injection, independent of flags.

    This change would be similar in nature to commit f5490d3ec921 ("block:
    Add should_fail_bio() for bpf error injection").
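
    A sketch of the refactored hook; the shape follows the description above,
    and the error-injection annotation is spelled as in later kernels:

    /* built unconditionally, so the kprobe/BPF override always has a target */
    int should_failslab(struct kmem_cache *s, gfp_t gfpflags)
    {
            if (__should_failslab(s, gfpflags)) /* policy only w/ CONFIG_FAILSLAB */
                    return -ENOMEM;
            return 0;
    }
    ALLOW_ERROR_INJECTION(should_failslab, ERRNO);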

    Link: http://lkml.kernel.org/r/20180222020320.6944-1-hmclauchlan@fb.com
    Signed-off-by: Howard McLauchlan
    Reviewed-by: Andrew Morton
    Cc: Akinobu Mita
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Josef Bacik
    Cc: Johannes Weiner
    Cc: Alexei Starovoitov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Howard McLauchlan
     
  • Since commit db265eca7700 ("mm/sl[aou]b: Move duping of slab name to
    slab_common.c"), the kernel always duplicates the slab cache name when
    creating a slab cache, so the check that the slab name is accessible is
    useless.

    Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1803231133310.22626@file01.intranet.prod.int.rdu2.redhat.com
    Signed-off-by: Mikulas Patocka
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
     
  • I have noticed that on a debug kernel with SLAB, the sizes of some
    non-root slabs were larger than their corresponding root slabs.

    e.g. for radix_tree_node:
    $cat /proc/slabinfo | grep radix
    name ...
    radix_tree_node 15052 15075 4096 1 1 ...

    $cat /cgroup/memory/temp/memory.kmem.slabinfo | grep radix
    name ...
    radix_tree_node 1581 158 4120 1 2 ...

    However for SLUB in the debug kernel, the sizes were the same. On further
    inspection it was found that SLUB always uses kmem_cache.object_size to
    measure the kmem_cache.size, while SLAB uses the given kmem_cache.size.
    In the debug kernel the slab's size can be larger than its object_size.
    Thus when creating a non-root slab, SLAB uses the root's size as the base
    to calculate the non-root slab's size, so the non-root slab's size can be
    larger than the root slab's size. For SLUB, the non-root slab's size is
    measured based on the root's object_size and thus the size remains the
    same for root and non-root slabs.

    This patch makes the slab's object_size the default base to measure the
    slab's size.

    Link: http://lkml.kernel.org/r/20180313165428.58699-1-shakeelb@google.com
    Fixes: 794b1248be4e ("memcg, slab: separate memcg vs root cache creation paths")
    Signed-off-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     
  • SLAB doesn't support 4GB+ of objects per slab, therefore randomization
    doesn't need size_t.

    Link: http://lkml.kernel.org/r/20180305200730.15812-25-adobriyan@gmail.com
    Signed-off-by: Alexey Dobriyan
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • If kmem cache sizes are 32-bit, then the usercopy region should be too.

    Link: http://lkml.kernel.org/r/20180305200730.15812-21-adobriyan@gmail.com
    Signed-off-by: Alexey Dobriyan
    Cc: David Miller
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Linux doesn't support negative length objects.

    Link: http://lkml.kernel.org/r/20180305200730.15812-17-adobriyan@gmail.com
    Signed-off-by: Alexey Dobriyan
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • size_index_elem() always works with small sizes (kmalloc caches are
    32-bit) and returns small indexes.
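
    The helper in question is tiny (a sketch):

    static inline unsigned int size_index_elem(unsigned int bytes)
    {
            return (bytes - 1) / 8;
    }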

    Link: http://lkml.kernel.org/r/20180305200730.15812-8-adobriyan@gmail.com
    Signed-off-by: Alexey Dobriyan
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • All those small numbers are reverse indexes into kmalloc caches array
    and can't be negative.

    On x86_64 "unsigned int = fls()" can drop CDQE instruction:

    add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-2 (-2)
    Function old new delta
    kmalloc_slab 101 99 -2

    Link: http://lkml.kernel.org/r/20180305200730.15812-7-adobriyan@gmail.com
    Signed-off-by: Alexey Dobriyan
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • struct kmem_cache::size and ::align were always 32-bit.

    Out of curiosity I created a 4GB kmem_cache; it oopsed with division by 0
    in kmem_cache_create(1UL << 32, ...).
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • struct kmem_cache::size has always been "int", all those
    "size_t size" are fake.

    Link: http://lkml.kernel.org/r/20180305200730.15812-5-adobriyan@gmail.com
    Signed-off-by: Alexey Dobriyan
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • KMALLOC_MAX_CACHE_SIZE is 32-bit, and so is the largest kmalloc cache
    size.

    Christoph said:
    :
    : Ok SLABs maximum allocation size is limited to 32M (see
    : include/linux/slab.h:
    :
    : #define KMALLOC_SHIFT_HIGH ((MAX_ORDER + PAGE_SHIFT - 1) <= 25 ? \
    :                             (MAX_ORDER + PAGE_SHIFT - 1) : 25)
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • kmalloc_size() derives size of kmalloc cache from internal index, which
    can't be negative.

    Propagate unsignedness a bit.
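
    A sketch of the resulting helper, modeled on the slab.h kmalloc_size()
    of that era:

    static __always_inline unsigned int kmalloc_size(unsigned int n)
    {
            if (n > 2)
                    return 1U << n; /* power-of-two caches */
            if (n == 1 && KMALLOC_MIN_SIZE <= 32)
                    return 96;
            if (n == 2 && KMALLOC_MIN_SIZE <= 64)
                    return 192;
            return 0;
    }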

    Link: http://lkml.kernel.org/r/20180305200730.15812-3-adobriyan@gmail.com
    Signed-off-by: Alexey Dobriyan
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Link: http://lkml.kernel.org/r/20180305200730.15812-1-adobriyan@gmail.com
    Signed-off-by: Alexey Dobriyan
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • kmalloc caches aren't relocated after being set up, and neither is the
    "size_index" array.

    Link: http://lkml.kernel.org/r/20180226203519.GA6886@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

04 Feb, 2018

1 commit

  • Pull hardened usercopy whitelisting from Kees Cook:
    "Currently, hardened usercopy performs dynamic bounds checking on slab
    cache objects. This is good, but still leaves a lot of kernel memory
    available to be copied to/from userspace in the face of bugs.

    To further restrict what memory is available for copying, this creates
    a way to whitelist specific areas of a given slab cache object for
    copying to/from userspace, allowing much finer granularity of access
    control.

    Slab caches that are never exposed to userspace can declare no
    whitelist for their objects, thereby keeping them unavailable to
    userspace via dynamic copy operations. (Note, an implicit form of
    whitelisting is the use of constant sizes in usercopy operations and
    get_user()/put_user(); these bypass all hardened usercopy checks since
    these sizes cannot change at runtime.)

    This new check is WARN-by-default, so any mistakes can be found over
    the next several releases without breaking anyone's system.

    The series has roughly the following sections:
    - remove %p and improve reporting with offset
    - prepare infrastructure and whitelist kmalloc
    - update VFS subsystem with whitelists
    - update SCSI subsystem with whitelists
    - update network subsystem with whitelists
    - update process memory with whitelists
    - update per-architecture thread_struct with whitelists
    - update KVM with whitelists and fix ioctl bug
    - mark all other allocations as not whitelisted
    - update lkdtm for more sensible test overage"

    * tag 'usercopy-v4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (38 commits)
    lkdtm: Update usercopy tests for whitelisting
    usercopy: Restrict non-usercopy caches to size 0
    kvm: x86: fix KVM_XEN_HVM_CONFIG ioctl
    kvm: whitelist struct kvm_vcpu_arch
    arm: Implement thread_struct whitelist for hardened usercopy
    arm64: Implement thread_struct whitelist for hardened usercopy
    x86: Implement thread_struct whitelist for hardened usercopy
    fork: Provide usercopy whitelisting for task_struct
    fork: Define usercopy region in thread_stack slab caches
    fork: Define usercopy region in mm_struct slab caches
    net: Restrict unwhitelisted proto caches to size 0
    sctp: Copy struct sctp_sock.autoclose to userspace using put_user()
    sctp: Define usercopy region in SCTP proto slab cache
    caif: Define usercopy region in caif proto slab cache
    ip: Define usercopy region in IP proto slab cache
    net: Define usercopy region in struct proto slab cache
    scsi: Define usercopy region in scsi_sense_cache slab cache
    cifs: Define usercopy region in cifs_request slab cache
    vxfs: Define usercopy region in vxfs_inode slab cache
    ufs: Define usercopy region in ufs_inode_cache slab cache
    ...

    Linus Torvalds
     

01 Feb, 2018

1 commit

  • The calculate_alignment() function is only used inside slab_common.c, so
    make it static and let the compiler do more optimizations.
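
    For reference, the function being made static looks roughly like this
    (a sketch based on mm/slab_common.c of that era):

    static unsigned long calculate_alignment(slab_flags_t flags,
                    unsigned long align, unsigned long size)
    {
            /*
             * If the user wants hardware cache aligned objects then follow
             * that suggestion if the object is sufficiently large.
             */
            if (flags & SLAB_HWCACHE_ALIGN) {
                    unsigned long ralign = cache_line_size();

                    while (size <= ralign / 2)
                            ralign /= 2;
                    align = max(align, ralign);
            }

            if (align < ARCH_SLAB_MINALIGN)
                    align = ARCH_SLAB_MINALIGN;

            return ALIGN(align, sizeof(void *));
    }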

    After this patch there's a small improvement in text and data size.

    $ gcc --version
    gcc (GCC) 7.2.1 20171128

    Before:
    text data bss dec hex filename
    9890457 3828702 1212364 14931523 e3d643 vmlinux

    After:
    text data bss dec hex filename
    9890437 3828670 1212364 14931471 e3d60f vmlinux

    Also I fixed a style problem reported by checkpatch.

    WARNING: Missing a blank line after declarations
    #53: FILE: mm/slab_common.c:286:
    + unsigned long ralign = cache_line_size();
    + while (size <= ralign / 2)
    Acked-by: Michal Hocko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Byongho Lee
     

16 Jan, 2018

4 commits

  • With all known usercopied cache whitelists now defined in the
    kernel, switch the default usercopy region of kmem_cache_create()
    to size 0. Any new caches with usercopy regions will now need to use
    kmem_cache_create_usercopy() instead of kmem_cache_create().

    This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
    whitelisting code in the last public patch of grsecurity/PaX based on my
    understanding of the code. Changes or omissions from the original code are
    mine and don't reflect the original grsecurity/PaX code.

    Cc: David Windsor
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrew Morton
    Cc: linux-mm@kvack.org
    Signed-off-by: Kees Cook

    Kees Cook
     
  • Mark the kmalloc slab caches as entirely whitelisted. These caches
    are frequently used to fulfill kernel allocations that contain data
    to be copied to/from userspace. Internal-only uses are also common,
    but are scattered in the kernel. For now, mark all the kmalloc caches
    as whitelisted.

    This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
    whitelisting code in the last public patch of grsecurity/PaX based on my
    understanding of the code. Changes or omissions from the original code are
    mine and don't reflect the original grsecurity/PaX code.

    Signed-off-by: David Windsor
    [kees: merged in moved kmalloc hunks, adjust commit log]
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrew Morton
    Cc: linux-mm@kvack.org
    Cc: linux-xfs@vger.kernel.org
    Signed-off-by: Kees Cook
    Acked-by: Christoph Lameter

    David Windsor
     
  • This introduces CONFIG_HARDENED_USERCOPY_FALLBACK to control the
    behavior of hardened usercopy whitelist violations. By default, whitelist
    violations will continue to WARN() so that any bad or missing usercopy
    whitelists can be discovered without being too disruptive.

    If this config is disabled at build time or a system is booted with
    "slab_common.usercopy_fallback=0", usercopy whitelists will BUG() instead
    of WARN(). This is useful for admins that want to use usercopy whitelists
    immediately.

    Suggested-by: Matthew Garrett
    Signed-off-by: Kees Cook

    Kees Cook
     
  • This patch prepares the slab allocator to handle caches having annotations
    (useroffset and usersize) defining usercopy regions.

    This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
    whitelisting code in the last public patch of grsecurity/PaX based on
    my understanding of the code. Changes or omissions from the original
    code are mine and don't reflect the original grsecurity/PaX code.

    Currently, hardened usercopy performs dynamic bounds checking on slab
    cache objects. This is good, but still leaves a lot of kernel memory
    available to be copied to/from userspace in the face of bugs. To further
    restrict what memory is available for copying, this creates a way to
    whitelist specific areas of a given slab cache object for copying to/from
    userspace, allowing much finer granularity of access control. Slab caches
    that are never exposed to userspace can declare no whitelist for their
    objects, thereby keeping them unavailable to userspace via dynamic copy
    operations. (Note, an implicit form of whitelisting is the use of constant
    sizes in usercopy operations and get_user()/put_user(); these bypass
    hardened usercopy checks since these sizes cannot change at runtime.)

    To support this whitelist annotation, usercopy region offset and size
    members are added to struct kmem_cache. The slab allocator receives a
    new function, kmem_cache_create_usercopy(), that creates a new cache
    with a usercopy region defined, suitable for declaring spans of fields
    within the objects that get copied to/from userspace.
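
    A usage sketch with a made-up struct, whitelisting a single field
    (sizeof_field() is the modern spelling; older kernels used FIELD_SIZEOF):

    /* hypothetical cache whose objects expose only one field to usercopy */
    struct foo {
            u32 internal_state;     /* never copied to userspace */
            char user_buf[64];      /* copied to/from userspace */
    };

    static struct kmem_cache *foo_cachep;

    static void foo_cache_init(void)
    {
            foo_cachep = kmem_cache_create_usercopy("foo_cache",
                            sizeof(struct foo), 0, SLAB_HWCACHE_ALIGN,
                            offsetof(struct foo, user_buf),     /* useroffset */
                            sizeof_field(struct foo, user_buf), /* usersize */
                            NULL);
    }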

    In this patch, the default kmem_cache_create() marks the entire allocation
    as whitelisted, leaving it semantically unchanged. Once all fine-grained
    whitelists have been added (in subsequent patches), this will be changed
    to a usersize of 0, making caches created with kmem_cache_create() not
    copyable to/from userspace.

    After the entire usercopy whitelist series is applied, less than 15%
    of the slab cache memory remains exposed to potential usercopy bugs
    after a fresh boot:

    Total Slab Memory: 48074720
    Usercopyable Memory: 6367532 13.2%
    task_struct 0.2% 4480/1630720
    RAW 0.3% 300/96000
    RAWv6 2.1% 1408/64768
    ext4_inode_cache 3.0% 269760/8740224
    dentry 11.1% 585984/5273856
    mm_struct 29.1% 54912/188448
    kmalloc-8 100.0% 24576/24576
    kmalloc-16 100.0% 28672/28672
    kmalloc-32 100.0% 81920/81920
    kmalloc-192 100.0% 96768/96768
    kmalloc-128 100.0% 143360/143360
    names_cache 100.0% 163840/163840
    kmalloc-64 100.0% 167936/167936
    kmalloc-256 100.0% 339968/339968
    kmalloc-512 100.0% 350720/350720
    kmalloc-96 100.0% 455616/455616
    kmalloc-8192 100.0% 655360/655360
    kmalloc-1024 100.0% 812032/812032
    kmalloc-4096 100.0% 819200/819200
    kmalloc-2048 100.0% 1310720/1310720

    After some kernel build workloads, the percentage (mainly driven by
    dentry and inode caches expanding) drops under 10%:

    Total Slab Memory: 95516184
    Usercopyable Memory: 8497452 8.8%
    task_struct 0.2% 4000/1456000
    RAW 0.3% 300/96000
    RAWv6 2.1% 1408/64768
    ext4_inode_cache 3.0% 1217280/39439872
    dentry 11.1% 1623200/14608800
    mm_struct 29.1% 73216/251264
    kmalloc-8 100.0% 24576/24576
    kmalloc-16 100.0% 28672/28672
    kmalloc-32 100.0% 94208/94208
    kmalloc-192 100.0% 96768/96768
    kmalloc-128 100.0% 143360/143360
    names_cache 100.0% 163840/163840
    kmalloc-64 100.0% 245760/245760
    kmalloc-256 100.0% 339968/339968
    kmalloc-512 100.0% 350720/350720
    kmalloc-96 100.0% 563520/563520
    kmalloc-8192 100.0% 655360/655360
    kmalloc-1024 100.0% 794624/794624
    kmalloc-4096 100.0% 819200/819200
    kmalloc-2048 100.0% 1257472/1257472

    Signed-off-by: David Windsor
    [kees: adjust commit log, split out a few extra kmalloc hunks]
    [kees: add field names to function declarations]
    [kees: convert BUGs to WARNs and fail closed]
    [kees: add attack surface reduction analysis to commit log]
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrew Morton
    Cc: linux-mm@kvack.org
    Cc: linux-xfs@vger.kernel.org
    Signed-off-by: Kees Cook
    Acked-by: Christoph Lameter

    David Windsor
     

16 Nov, 2017

2 commits

  • Convert all allocations that used a NOTRACK flag to stop using it.

    Link: http://lkml.kernel.org/r/20171007030159.22241-3-alexander.levin@verizon.com
    Signed-off-by: Sasha Levin
    Cc: Alexander Potapenko
    Cc: Eric W. Biederman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Steven Rostedt
    Cc: Tim Hansen
    Cc: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Levin, Alexander (Sasha Levin)
     
  • Add sparse-checked slab_flags_t for struct kmem_cache::flags (SLAB_POISON,
    etc).

    SLAB is bloated temporarily by switching to "unsigned long", but only
    temporarily.
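
    A sketch of the bitwise type and one flag, following the sparse
    __bitwise convention:

    typedef unsigned int __bitwise slab_flags_t;

    #define SLAB_POISON ((slab_flags_t __force)0x00000800U)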

    Link: http://lkml.kernel.org/r/20171021100225.GA22428@avx2
    Signed-off-by: Alexey Dobriyan
    Acked-by: Pekka Enberg
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan