26 Oct, 2020

1 commit

  • Commit 453431a54934 ("mm, treewide: rename kzfree() to
    kfree_sensitive()") renamed kzfree() to kfree_sensitive(),
    but it left a compatibility definition of kzfree() to avoid
    being too disruptive.

    Since then a few more instances of kzfree() have slipped in.

    Just get rid of them and remove the compatibility definition
    once and for all.

    Signed-off-by: Eric Biggers
    Signed-off-by: Linus Torvalds

    Eric Biggers
     


08 Aug, 2020

3 commits

  • Instead of having two sets of kmem_caches: one for system-wide and
    non-accounted allocations and the second one shared by all accounted
    allocations, we can use just one.

    The idea is simple: space for obj_cgroup metadata can be allocated on
    demand and filled only for accounted allocations.

    It allows us to remove a bunch of code which is required to handle kmem_cache
    clones for accounted allocations. There is no more need to create them,
    accumulate statistics, propagate attributes, etc. It's quite a
    significant simplification.

    Also, because the total number of slab_caches is roughly halved (not
    all kmem_caches have a memcg clone), some additional memory savings are
    expected. On my devvm it additionally saves about 3.5% of slab memory.

    [guro@fb.com: fix build on MIPS]
    Link: http://lkml.kernel.org/r/20200717214810.3733082-1-guro@fb.com

    Suggested-by: Johannes Weiner
    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Naresh Kamboju
    Link: http://lkml.kernel.org/r/20200623174037.3951353-18-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • This is fairly big but mostly red patch, which makes all accounted slab
    allocations use a single set of kmem_caches instead of creating a separate
    set for each memory cgroup.

    Because the number of non-root kmem_caches is now capped by the number of
    root kmem_caches, there is no need to shrink or destroy them prematurely.
    They can be perfectly destroyed together with their root counterparts.
    This allows us to dramatically simplify the management of non-root
    kmem_caches and delete a ton of code.

    This patch performs the following changes:
    1) introduces memcg_params.memcg_cache pointer to represent the
    kmem_cache which will be used for all non-root allocations
    2) reuses the existing memcg kmem_cache creation mechanism
    to create memcg kmem_cache on the first allocation attempt
    3) memcg kmem_caches are named "<name>-memcg",
    e.g. dentry-memcg
    4) simplifies memcg_kmem_get_cache() to just return the memcg kmem_cache
    or schedule its creation and return the root cache
    5) removes almost all non-root kmem_cache management code
    (separate refcounter, reparenting, shrinking, etc)
    6) makes slab debugfs display the root_mem_cgroup css id and never
    show the :dead and :deact flags in the memcg_slabinfo attribute.

    Following patches in the series will simplify the kmem_cache creation.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-13-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • As said by Linus:

    A symmetric naming is only helpful if it implies symmetries in use.
    Otherwise it's actively misleading.

    In "kzalloc()", the z is meaningful and an important part of what the
    caller wants.

    In "kzfree()", the z is actively detrimental, because maybe in the
    future we really _might_ want to use that "memfill(0xdeadbeef)" or
    something. The "zero" part of the interface isn't even _relevant_.

    The main reason that kzfree() exists is to clear sensitive information
    that should not be leaked to other future users of the same memory
    objects.

    Rename kzfree() to kfree_sensitive() to follow the example of the recently
    added kvfree_sensitive() and make the intention of the API more explicit.
    In addition, memzero_explicit() is used to clear the memory to make sure
    that it won't get optimized away by the compiler.

    The renaming is done by using the command sequence:

    git grep -w --name-only kzfree |\
    xargs sed -i 's/kzfree/kfree_sensitive/'

    followed by some editing of the kfree_sensitive() kerneldoc and adding
    a kzfree backward compatibility macro in slab.h.
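
    As a rough illustration of the intended usage (the structure and function
    below are hypothetical, not from the patch): kfree_sensitive() is meant
    for objects that held secrets, and it zeroes them with memzero_explicit()
    before freeing, so the data cannot linger in freed slab memory.

    #include <linux/slab.h>

    struct my_key {                         /* hypothetical key-holding object */
            u8 secret[32];
    };

    static void my_key_destroy(struct my_key *key)
    {
            /* was kzfree(key) before the rename */
            kfree_sensitive(key);
    }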

    [akpm@linux-foundation.org: fs/crypto/inline_crypt.c needs linux/slab.h]
    [akpm@linux-foundation.org: fix fs/crypto/inline_crypt.c some more]

    Suggested-by: Joe Perches
    Signed-off-by: Waiman Long
    Signed-off-by: Andrew Morton
    Acked-by: David Howells
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Jarkko Sakkinen
    Cc: James Morris
    Cc: "Serge E. Hallyn"
    Cc: Joe Perches
    Cc: Matthew Wilcox
    Cc: David Rientjes
    Cc: Dan Carpenter
    Cc: "Jason A . Donenfeld"
    Link: http://lkml.kernel.org/r/20200616154311.12314-3-longman@redhat.com
    Signed-off-by: Linus Torvalds

    Waiman Long
     

11 Apr, 2020

1 commit

    There is a typo in the cross-reference link, causing this warning:

    include/linux/slab.h:11: WARNING: undefined label: memory-allocation (if the link has no caption the label must precede a section header)

    Signed-off-by: Mauro Carvalho Chehab
    Signed-off-by: Andrew Morton
    Cc: Jonathan Corbet
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/0aeac24235d356ebd935d11e147dcc6edbb6465c.1586359676.git.mchehab+huawei@kernel.org
    Signed-off-by: Linus Torvalds

    Mauro Carvalho Chehab
     

04 Feb, 2020

1 commit

  • Since 5.5-rc1 the last user of this function is gone, so remove the
    functionality.

    See commit
    2ad9d7747c10 ("netfilter: conntrack: free extension area immediately")
    for details.

    Link: http://lkml.kernel.org/r/20191212223442.22141-1-fw@strlen.de
    Signed-off-by: Florian Westphal
    Acked-by: Andrew Morton
    Acked-by: David Rientjes
    Reviewed-by: David Hildenbrand
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Florian Westphal
     

01 Dec, 2019

1 commit

    The size of a kmalloc cache can be obtained from kmalloc_info[], so remove
    kmalloc_size(), which will not be used anymore.

    Link: http://lkml.kernel.org/r/1569241648-26908-3-git-send-email-lpf.vector@gmail.com
    Signed-off-by: Pengfei Li
    Acked-by: Vlastimil Babka
    Acked-by: Roman Gushchin
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pengfei Li
     

08 Oct, 2019

1 commit

  • In most configurations, kmalloc() happens to return naturally aligned
    (i.e. aligned to the block size itself) blocks for power of two sizes.

    That means some kmalloc() users might unknowingly rely on that
    alignment, until stuff breaks when the kernel is built with e.g.
    CONFIG_SLUB_DEBUG or CONFIG_SLOB, and blocks stop being aligned. Then
    developers have to devise workarounds such as their own kmem caches with
    specified alignment [1], which is not always practical, as recently
    evidenced in [2].

    The topic has been discussed at LSF/MM 2019 [3]. Adding a
    'kmalloc_aligned()' variant would not help with code unknowingly relying
    on the implicit alignment. For slab implementations it would either
    require creating more kmalloc caches, or allocate a larger size and only
    give back part of it. That would be wasteful, especially with a generic
    alignment parameter (in contrast with a fixed alignment to size).

    Ideally we should provide mm users what they need without difficult
    workarounds or their own reimplementations, so let's make the kmalloc()
    alignment to size explicitly guaranteed for power-of-two sizes under all
    configurations. What does this mean for the three available allocators?

    * SLAB object layout happens to be mostly unchanged by the patch. The
    implicitly provided alignment could be compromised with
    CONFIG_DEBUG_SLAB due to redzoning, however SLAB disables redzoning for
    caches with alignment larger than unsigned long long. Practically on at
    least x86 this includes kmalloc caches as they use cache line alignment,
    which is larger than that. Still, this patch ensures alignment on all
    arches and cache sizes.

    * SLUB layout is also unchanged unless redzoning is enabled through
    CONFIG_SLUB_DEBUG and boot parameter for the particular kmalloc cache.
    With this patch, explicit alignment is guaranteed with redzoning as
    well. This will result in more memory being wasted, but that should be
    acceptable in a debugging scenario.

    * SLOB has no implicit alignment so this patch adds it explicitly for
    kmalloc(). The potential downside is increased fragmentation. While
    pathological allocation scenarios are certainly possible, in my testing,
    after booting an x86_64 kernel+userspace with virtme, around 16MB of memory
    was consumed by slab pages both before and after the patch, with
    difference in the noise.

    [1] https://lore.kernel.org/linux-btrfs/c3157c8e8e0e7588312b40c853f65c02fe6c957a.1566399731.git.christophe.leroy@c-s.fr/
    [2] https://lore.kernel.org/linux-fsdevel/20190225040904.5557-1-ming.lei@redhat.com/
    [3] https://lwn.net/Articles/787740/
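
    For illustration, a caller relying on the new guarantee might look like
    the sketch below (the function name and the 512-byte DMA requirement are
    assumptions, not taken from the patch):

    #include <linux/kernel.h>
    #include <linux/slab.h>

    static void *alloc_dma_buf(void)
    {
            /* power-of-two size: the returned pointer is now guaranteed to be
             * 512-byte aligned under SLAB, SLUB and SLOB alike */
            void *buf = kmalloc(512, GFP_KERNEL);

            WARN_ON(buf && !IS_ALIGNED((unsigned long)buf, 512));
            return buf;
    }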

    [akpm@linux-foundation.org: documentation fixlet, per Matthew]
    Link: http://lkml.kernel.org/r/20190826111627.7505-3-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Matthew Wilcox (Oracle)
    Acked-by: Michal Hocko
    Acked-by: Kirill A. Shutemov
    Acked-by: Christoph Hellwig
    Cc: David Sterba
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Ming Lei
    Cc: Dave Chinner
    Cc: "Darrick J . Wong"
    Cc: Christoph Hellwig
    Cc: James Bottomley
    Cc: Vlastimil Babka
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

25 Sep, 2019

1 commit

  • The memcg_cache_params structure is only embedded into the kmem_cache of
    slab and slub allocators as defined in slab_def.h and slub_def.h and used
    internally by mm code. There is no need to expose it in a public
    header. So move it from include/linux/slab.h to mm/slab.h. It is just a
    refactoring patch with no code change.

    In fact both the slub_def.h and slab_def.h should be moved into the mm
    directory as well, but that will probably cause many merge conflicts.

    Link: http://lkml.kernel.org/r/20190718180827.18758-1-longman@redhat.com
    Signed-off-by: Waiman Long
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc: Roman Gushchin
    Cc: Johannes Weiner
    Cc: Shakeel Butt
    Cc: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Waiman Long
     

13 Jul, 2019

5 commits

    There are concerns about memory leaks from extensive use of memory cgroups
    as each memory cgroup creates its own set of kmem caches. There is a
    possibility that the memcg kmem caches may remain even after the memory
    cgroups have been offlined. Therefore, it will be useful to show the
    status of each memcg kmem cache.

    This patch introduces a new <debugfs>/memcg_slabinfo file which is
    somewhat similar to /proc/slabinfo in format, but lists only information
    about kmem caches that have child memcg kmem caches. Information
    available in /proc/slabinfo is not repeated in memcg_slabinfo.

    A portion of a sample output of the file was:

    # <name> <css_id[:dead]> <active_objs> <num_objs> <active_slabs> <num_slabs>
    rpc_inode_cache   root       13    51    1    1
    rpc_inode_cache     48        0     0    0    0
    fat_inode_cache   root        1    45    1    1
    fat_inode_cache     41        2    45    1    1
    xfs_inode         root      770   816   24   24
    xfs_inode           92       22    34    1    1
    xfs_inode           88:dead   1    34    1    1
    xfs_inode           89:dead  23    34    1    1
    xfs_inode           85        4    34    1    1
    xfs_inode           84        9    34    1    1

    The css id of the memcg is also listed. If a memcg is not online,
    the tag ":dead" will be attached as shown above.

    [longman@redhat.com: memcg: add ":deact" tag for reparented kmem caches in memcg_slabinfo]
    Link: http://lkml.kernel.org/r/20190621173005.31514-1-longman@redhat.com
    [longman@redhat.com: set the flag in the common code as suggested by Roman]
    Link: http://lkml.kernel.org/r/20190627184324.5875-1-longman@redhat.com
    Link: http://lkml.kernel.org/r/20190619171621.26209-1-longman@redhat.com
    Signed-off-by: Waiman Long
    Suggested-by: Shakeel Butt
    Reviewed-by: Shakeel Butt
    Acked-by: Roman Gushchin
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Waiman Long
     
  • Let's reparent non-root kmem_caches on memcg offlining. This allows us to
    release the memory cgroup without waiting for the last outstanding kernel
    object (e.g. dentry used by another application).

    Since the parent cgroup is already charged, everything we need to do is to
    splice the list of kmem_caches to the parent's kmem_caches list, swap the
    memcg pointer, drop the css refcounter for each kmem_cache and adjust the
    parent's css refcounter.

    Please, note that kmem_cache->memcg_params.memcg isn't a stable pointer
    anymore. It's safe to read it under rcu_read_lock(), cgroup_mutex held,
    or any other way that protects the memory cgroup from being released.
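
    A minimal sketch of that locking rule (the helper and the field access are
    illustrative, not code from the patch):

    #include <linux/memcontrol.h>
    #include <linux/slab.h>

    static bool cache_memcg_is_online(struct kmem_cache *s)
    {
            struct mem_cgroup *memcg;
            bool online;

            rcu_read_lock();
            /* may be switched to the parent by reparenting at any time */
            memcg = READ_ONCE(s->memcg_params.memcg);
            online = memcg && mem_cgroup_online(memcg);
            rcu_read_unlock();

            return online;
    }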

    We can race with the slab allocation and deallocation paths. It's not a
    big problem: parent's charge and slab global stats are always correct, and
    we don't care anymore about the child usage and global stats. The child
    cgroup is already offline, so we don't use or show it anywhere.

    Local slab stats (NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE) aren't
    used anywhere except count_shadow_nodes(). But even there it won't break
    anything: after reparenting "nodes" will be 0 on child level (because
    we're already reparenting shrinker lists), and on parent level page stats
    always were 0, and this patch won't change anything.

    [guro@fb.com: properly handle kmem_caches reparented to root_mem_cgroup]
    Link: http://lkml.kernel.org/r/20190620213427.1691847-1-guro@fb.com
    Link: http://lkml.kernel.org/r/20190611231813.3148843-11-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Vladimir Davydov
    Reviewed-by: Shakeel Butt
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Waiman Long
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: Andrei Vagin
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Currently each charged slab page holds a reference to the cgroup to which
    it's charged. Kmem_caches are held by the memcg and are released all
    together with the memory cgroup. It means that none of kmem_caches are
    released unless at least one reference to the memcg exists, which is very
    far from optimal.

    Let's rework it in a way that allows releasing individual kmem_caches as
    soon as the cgroup is offline, the kmem_cache is empty and there are no
    pending allocations.

    To make it possible, let's introduce a new percpu refcounter for non-root
    kmem caches. The counter is initialized to the percpu mode, and is
    switched to the atomic mode during kmem_cache deactivation. The counter
    is bumped for every charged page and also for every running allocation.
    So the kmem_cache can't be released unless all allocations complete.

    To shutdown non-active empty kmem_caches, let's reuse the work queue,
    previously used for the kmem_cache deactivation. Once the reference
    counter reaches 0, let's schedule an asynchronous kmem_cache release.
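
    The lifecycle described above maps onto the generic percpu_ref API roughly
    as follows (a sketch with illustrative names, not the exact code from the
    patch):

    #include <linux/gfp.h>
    #include <linux/percpu-refcount.h>

    struct memcg_cache {                    /* illustrative stand-in */
            struct percpu_ref refcnt;
            /* ... */
    };

    static void memcg_cache_release(struct percpu_ref *ref)
    {
            /* last reference dropped: schedule the asynchronous kmem_cache
             * shutdown on the (reused) deactivation work queue */
    }

    static int memcg_cache_ref_init(struct memcg_cache *c)
    {
            /* starts in percpu mode; charged pages and running allocations
             * take percpu_ref_get()/percpu_ref_put() on it */
            return percpu_ref_init(&c->refcnt, memcg_cache_release, 0,
                                   GFP_KERNEL);
    }

    static void memcg_cache_deactivate(struct memcg_cache *c)
    {
            /* switch to atomic mode and drop the initial reference; the
             * release callback fires once all outstanding users are done */
            percpu_ref_kill(&c->refcnt);
    }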

    * I used the following simple approach to test the performance
    (stolen from another patchset by T. Harding):

    time find / -name fname-no-exist
    echo 2 > /proc/sys/vm/drop_caches
    repeat 10 times

    Results:

    orig patched

    real 0m1.455s real 0m1.355s
    user 0m0.206s user 0m0.219s
    sys 0m0.855s sys 0m0.807s

    real 0m1.487s real 0m1.699s
    user 0m0.221s user 0m0.256s
    sys 0m0.806s sys 0m0.948s

    real 0m1.515s real 0m1.505s
    user 0m0.183s user 0m0.215s
    sys 0m0.876s sys 0m0.858s

    real 0m1.291s real 0m1.380s
    user 0m0.193s user 0m0.198s
    sys 0m0.843s sys 0m0.786s

    real 0m1.364s real 0m1.374s
    user 0m0.180s user 0m0.182s
    sys 0m0.868s sys 0m0.806s

    real 0m1.352s real 0m1.312s
    user 0m0.201s user 0m0.212s
    sys 0m0.820s sys 0m0.761s

    real 0m1.302s real 0m1.349s
    user 0m0.205s user 0m0.203s
    sys 0m0.803s sys 0m0.792s

    real 0m1.334s real 0m1.301s
    user 0m0.194s user 0m0.201s
    sys 0m0.806s sys 0m0.779s

    real 0m1.426s real 0m1.434s
    user 0m0.216s user 0m0.181s
    sys 0m0.824s sys 0m0.864s

    real 0m1.350s real 0m1.295s
    user 0m0.200s user 0m0.190s
    sys 0m0.842s sys 0m0.811s

    So it looks like the difference is not noticeable in this test.

    [cai@lca.pw: fix an use-after-free in kmemcg_workfn()]
    Link: http://lkml.kernel.org/r/1560977573-10715-1-git-send-email-cai@lca.pw
    Link: http://lkml.kernel.org/r/20190611231813.3148843-9-guro@fb.com
    Signed-off-by: Roman Gushchin
    Signed-off-by: Qian Cai
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Shakeel Butt
    Cc: Waiman Long
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: Andrei Vagin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • The delayed work/rcu deactivation infrastructure of non-root kmem_caches
    can also be used for asynchronous release of these objects. Let's get rid
    of the word "deactivation" in corresponding names to make the code look
    better after generalization.

    It's easier to make the renaming first, so that the generalized code will
    look consistent from scratch.

    Let's rename struct memcg_cache_params fields:
    deact_fn -> work_fn
    deact_rcu_head -> rcu_head
    deact_work -> work

    And RCU/delayed work callbacks in slab common code:
    kmemcg_deactivate_rcufn -> kmemcg_rcufn
    kmemcg_deactivate_workfn -> kmemcg_workfn

    This patch contains no functional changes, only renamings.

    Link: http://lkml.kernel.org/r/20190611231813.3148843-3-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Vladimir Davydov
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Waiman Long
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: Andrei Vagin
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • This refactors common code of ksize() between the various allocators into
    slab_common.c: __ksize() is the allocator-specific implementation without
    instrumentation, whereas ksize() includes the required KASAN logic.
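
    The resulting split has roughly this shape (a simplified sketch of the
    structure described above, not the verbatim mainline code):

    /* mm/slab_common.c: common wrapper carrying the KASAN handling */
    size_t ksize(const void *objp)
    {
            size_t size;

            if (WARN_ON_ONCE(!objp))
                    return 0;

            size = __ksize(objp);   /* allocator-specific, uninstrumented */
            /* tell KASAN the caller may legitimately use the whole area */
            kasan_unpoison_shadow(objp, size);
            return size;
    }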

    Link: http://lkml.kernel.org/r/20190626142014.141844-5-elver@google.com
    Signed-off-by: Marco Elver
    Acked-by: Christoph Lameter
    Reviewed-by: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Alexander Potapenko
    Cc: Andrey Konovalov
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Mark Rutland
    Cc: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marco Elver
     

30 Mar, 2019

1 commit

  • Patch series "iommu/io-pgtable-arm-v7s: Use DMA32 zone for page tables",
    v6.

    This is a followup to the discussion in [1], [2].

    IOMMUs using ARMv7 short-descriptor format require page tables (level 1
    and 2) to be allocated within the first 4GB of RAM, even on 64-bit
    systems.

    For L1 tables that are bigger than a page, we can just use
    __get_free_pages with GFP_DMA32 (on arm64 systems only, arm would still
    use GFP_DMA).

    For L2 tables that only take 1KB, it would be a waste to allocate a full
    page, so we considered 3 approaches:
    1. This series, adding support for GFP_DMA32 slab caches.
    2. genalloc, which requires pre-allocating the maximum number of L2 page
    tables (4096, so 4MB of memory).
    3. page_frag, which is not very memory-efficient as it is unable to reuse
    freed fragments until the whole page is freed. [3]

    This series is the most memory-efficient approach.

    stable@ note:
    We confirmed that this is a regression, and IOMMU errors happen on 4.19
    and linux-next/master on MT8173 (elm, Acer Chromebook R13). The issue
    most likely starts from commit ad67f5a6545f ("arm64: replace ZONE_DMA
    with ZONE_DMA32"), i.e. 4.15, and presumably breaks a number of Mediatek
    platforms (and maybe others?).

    [1] https://lists.linuxfoundation.org/pipermail/iommu/2018-November/030876.html
    [2] https://lists.linuxfoundation.org/pipermail/iommu/2018-December/031696.html
    [3] https://patchwork.codeaurora.org/patch/671639/

    This patch (of 3):

    IOMMUs using ARMv7 short-descriptor format require page tables to be
    allocated within the first 4GB of RAM, even on 64-bit systems. On arm64,
    this is done by passing GFP_DMA32 flag to memory allocation functions.

    For IOMMU L2 tables that only take 1KB, it would be a waste to allocate
    a full page using get_free_pages, so we considered 3 approaches:
    1. This patch, adding support for GFP_DMA32 slab caches.
    2. genalloc, which requires pre-allocating the maximum number of L2
    page tables (4096, so 4MB of memory).
    3. page_frag, which is not very memory-efficient as it is unable
    to reuse freed fragments until the whole page is freed.

    This change makes it possible to create a custom cache in DMA32 zone using
    kmem_cache_create, then allocate memory using kmem_cache_alloc.

    We do not create a DMA32 kmalloc cache array, as there are currently no
    users of kmalloc(..., GFP_DMA32). These calls will continue to trigger a
    warning, as we keep GFP_DMA32 in GFP_SLAB_BUG_MASK.

    This implies that calls to kmem_cache_*alloc on a SLAB_CACHE_DMA32
    kmem_cache must _not_ use GFP_DMA32 (it is anyway redundant and
    unnecessary).
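
    A minimal sketch of such a user (the cache name, object size and function
    names are illustrative, not taken from the IOMMU patches):

    #include <linux/slab.h>

    static struct kmem_cache *l2_table_cache;

    static int l2_tables_init(void)
    {
            /* 1KB objects whose backing pages come from ZONE_DMA32 */
            l2_table_cache = kmem_cache_create("iommu-l2-table", 1024, 1024,
                                               SLAB_CACHE_DMA32, NULL);
            return l2_table_cache ? 0 : -ENOMEM;
    }

    static void *l2_table_alloc(void)
    {
            /* note: plain GFP_KERNEL; passing GFP_DMA32 here would trigger
             * the GFP_SLAB_BUG_MASK warning mentioned above */
            return kmem_cache_zalloc(l2_table_cache, GFP_KERNEL);
    }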

    Link: http://lkml.kernel.org/r/20181210011504.122604-2-drinkcat@chromium.org
    Signed-off-by: Nicolas Boichat
    Acked-by: Vlastimil Babka
    Acked-by: Will Deacon
    Cc: Robin Murphy
    Cc: Joerg Roedel
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Sasha Levin
    Cc: Huaisheng Ye
    Cc: Mike Rapoport
    Cc: Yong Wu
    Cc: Matthias Brugger
    Cc: Tomasz Figa
    Cc: Yingjoe Chen
    Cc: Christoph Hellwig
    Cc: Matthew Wilcox
    Cc: Hsin-Yi Wang
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicolas Boichat
     

30 Dec, 2018

1 commit

  • Pull documentation update from Jonathan Corbet:
    "A fairly normal cycle for documentation stuff. We have a new document
    on perf security, more Italian translations, more improvements to the
    memory-management docs, improvements to the pathname lookup
    documentation, and the usual array of smaller fixes.

    As is often the case, there are a few reaches outside of
    Documentation/ to adjust kerneldoc comments"

    * tag 'docs-5.0' of git://git.lwn.net/linux: (38 commits)
    docs: improve pathname-lookup document structure
    configfs: fix wrong name of struct in documentation
    docs/mm-api: link slab_common.c to "The Slab Cache" section
    slab: make kmem_cache_create{_usercopy} description proper kernel-doc
    doc:process: add links where missing
    docs/core-api: make mm-api.rst more structured
    x86, boot: documentation whitespace fixup
    Documentation: devres: note checking needs when converting
    doc:it: add some process/* translations
    doc:it: fixes in process/1.Intro
    Documentation: convert path-lookup from markdown to resturctured text
    Documentation/admin-guide: update admin-guide index.rst
    Documentation/admin-guide: introduce perf-security.rst file
    scripts/kernel-doc: Fix struct and struct field attribute processing
    Documentation: dev-tools: Fix typos in index.rst
    Correct gen_init_cpio tool's documentation
    Document /proc/pid PID reuse behavior
    Documentation: update path-lookup.md for parallel lookups
    Documentation: Use "while" instead of "whilst"
    dmaengine: Add mailing list address to the documentation
    ...

    Linus Torvalds
     

29 Dec, 2018

2 commits

  • Multiple people have reported the following sparse warning:

    ./include/linux/slab.h:332:43: warning: dubious: x & !y

    The minimal fix would be to change the logical & to boolean &&, which
    emits the same code, but Andrew has suggested that the branch-avoiding
    tricks are maybe not worthwhile. David Laight provided a nice comparison
    of disassembly of multiple variants, which shows that the current version
    produces a 4 deep dependency chain, and fixing the sparse warning by
    changing logical and to multiplication emits an IMUL, making it even more
    expensive.

    The code as rewritten by this patch yielded the best disassembly, with a
    single predictable branch for the most common case, and a ternary operator
    for the rest, which gcc seems to compile without a branch or cmov by
    itself.

    The result should be more readable, without a sparse warning and probably
    also faster for the common case.
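
    The rewritten helper has roughly the following shape (a sketch of the
    structure described above; see the commit for the exact code and
    comments):

    static __always_inline enum kmalloc_cache_type kmalloc_type(gfp_t flags)
    {
    #ifdef CONFIG_ZONE_DMA
            /* the most common case, neither DMA nor reclaimable:
             * a single predictable branch */
            if (likely((flags & (__GFP_DMA | __GFP_RECLAIMABLE)) == 0))
                    return KMALLOC_NORMAL;

            /* at least one of the flags is set; if both, __GFP_DMA wins */
            return flags & __GFP_DMA ? KMALLOC_DMA : KMALLOC_RECLAIM;
    #else
            return flags & __GFP_RECLAIMABLE ? KMALLOC_RECLAIM : KMALLOC_NORMAL;
    #endif
    }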

    Link: http://lkml.kernel.org/r/80340595-d7c5-97b9-4f6c-23fa893a91e9@suse.cz
    Fixes: 1291523f2c1d ("mm, slab/slub: introduce kmalloc-reclaimable caches")
    Reviewed-by: Andrew Morton
    Signed-off-by: Vlastimil Babka
    Reported-by: Bart Van Assche
    Reported-by: Darryl T. Agostinelli
    Reported-by: Masahiro Yamada
    Suggested-by: Andrew Morton
    Suggested-by: David Laight
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Patch series "kasan: add software tag-based mode for arm64", v13.

    This patchset adds a new software tag-based mode to KASAN [1]. (Initially
    this mode was called KHWASAN, but it got renamed, see the naming rationale
    at the end of this section).

    The plan is to implement HWASan [2] for the kernel with the incentive
    that it's going to have performance comparable to KASAN, but at the same
    time consume much less memory, trading that off for somewhat imprecise bug
    detection and support only for arm64.

    The underlying ideas of the approach used by software tag-based KASAN are:

    1. By using the Top Byte Ignore (TBI) arm64 CPU feature, we can store
    pointer tags in the top byte of each kernel pointer.

    2. Using shadow memory, we can store memory tags for each chunk of kernel
    memory.

    3. On each memory allocation, we can generate a random tag, embed it into
    the returned pointer and set the memory tags that correspond to this
    chunk of memory to the same value.

    4. By using compiler instrumentation, before each memory access we can add
    a check that the pointer tag matches the tag of the memory that is being
    accessed.

    5. On a tag mismatch we report an error.

    With this patchset the existing KASAN mode gets renamed to generic KASAN,
    with the word "generic" meaning that the implementation can be supported
    by any architecture as it is purely software.

    The new mode this patchset adds is called software tag-based KASAN. The
    word "tag-based" refers to the fact that this mode uses tags embedded into
    the top byte of kernel pointers and the TBI arm64 CPU feature that allows
    such pointers to be dereferenced. The word "software" here means that shadow
    memory manipulation and tag checking on pointer dereference is done in
    software. As it is the only tag-based implementation right now, "software
    tag-based" KASAN is sometimes referred to as simply "tag-based" in this
    patchset.

    A potential expansion of this mode is a hardware tag-based mode, which
    would use hardware memory tagging support (announced by Arm [3]) instead
    of compiler instrumentation and manual shadow memory manipulation.

    Same as generic KASAN, software tag-based KASAN is strictly a debugging
    feature.

    [1] https://www.kernel.org/doc/html/latest/dev-tools/kasan.html

    [2] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html

    [3] https://community.arm.com/processors/b/blog/posts/arm-a-profile-architecture-2018-developments-armv85a

    ====== Rationale

    On mobile devices generic KASAN's memory usage is a significant problem.
    One of the main reasons to have tag-based KASAN is to be able to perform a
    similar set of checks as the generic one does, but with lower memory
    requirements.

    Comment from Vishwath Mohan :

    I don't have data on-hand, but anecdotally both ASAN and KASAN have proven
    problematic to enable for environments that don't tolerate the increased
    memory pressure well. This includes

    (a) Low-memory form factors - Wear, TV, Things, lower-tier phones like Go,
    (c) Connected components like Pixel's visual core [1].

    These are both places I'd love to have a low(er) memory footprint option at
    my disposal.

    Comment from Evgenii Stepanov :

    Looking at a live Android device under load, slab (according to
    /proc/meminfo) + kernel stack take 8-10% available RAM (~350MB). KASAN's
    overhead of 2x - 3x on top of it is not insignificant.

    Not having this overhead enables near-production use - ex. running
    KASAN/KHWASAN kernel on a personal, daily-use device to catch bugs that do
    not reproduce in test configuration. These are the ones that often cost
    the most engineering time to track down.

    CPU overhead is bad, but generally tolerable. RAM is critical, in our
    experience. Once it gets low enough, OOM-killer makes your life
    miserable.

    [1] https://www.blog.google/products/pixel/pixel-visual-core-image-processing-and-machine-learning-pixel-2/

    ====== Technical details

    Software tag-based KASAN mode is implemented in a very similar way to the
    generic one. This patchset essentially does the following:

    1. TCR_TBI1 is set to enable Top Byte Ignore.

    2. Shadow memory is used (with a different scale, 1:16, so each shadow
    byte corresponds to 16 bytes of kernel memory) to store memory tags.

    3. All slab objects are aligned to shadow scale, which is 16 bytes.

    4. All pointers returned from the slab allocator are tagged with a random
    tag and the corresponding shadow memory is poisoned with the same value.

    5. Compiler instrumentation is used to insert tag checks. Either by
    calling callbacks or by inlining them (CONFIG_KASAN_OUTLINE and
    CONFIG_KASAN_INLINE flags are reused).

    6. When a tag mismatch is detected in callback instrumentation mode
    KASAN simply prints a bug report. In case of inline instrumentation,
    clang inserts a brk instruction, and KASAN has its own brk handler,
    which reports the bug.

    7. The memory in between slab objects is marked with a reserved tag, and
    acts as a redzone.

    8. When a slab object is freed it's marked with a reserved tag.

    Bug detection is imprecise for two reasons:

    1. We won't catch some small out-of-bounds accesses that fall into the
    same shadow cell as the last byte of a slab object.

    2. We only have 1 byte to store tags, which means we have a 1/256
    probability of a tag match for an incorrect access (actually even
    slightly less due to reserved tag values).

    Despite that, there's a particular type of bug that tag-based KASAN can
    detect that generic KASAN cannot: a use-after-free after the object has
    been allocated by someone else.

    ====== Testing

    Some kernel developers voiced a concern that changing the top byte of
    kernel pointers may lead to subtle bugs that are difficult to discover.
    To address this concern deliberate testing has been performed.

    It doesn't seem feasible to do some kind of static checking to find
    potential issues with pointer tagging, so a dynamic approach was taken.
    All pointer comparisons/subtractions have been instrumented in an LLVM
    compiler pass, and a kernel module has been used that prints a bug report
    whenever two pointers with different tags are compared/subtracted
    (ignoring comparisons with NULL pointers and with pointers obtained by
    casting an error code to a pointer type). Then the kernel has been
    booted in QEMU and on an Odroid C2 board, and syzkaller has been run.

    This yielded the following results.

    The two places that look interesting are:

    is_vmalloc_addr in include/linux/mm.h
    is_kernel_rodata in mm/util.c

    Here we compare a pointer with some fixed untagged values to make sure
    that the pointer lies in a particular part of the kernel address space.
    Since tag-based KASAN doesn't add tags to pointers that belong to rodata
    or vmalloc regions, this should work as is. To make sure, debug checks
    have been added to those two functions to verify that the result doesn't
    change whether we operate on tagged or untagged pointers.

    A few other cases that don't look that interesting:

    Comparing pointers to achieve unique sorting order of pointee objects
    (e.g. sorting lock addresses before performing a double lock):

    tty_ldisc_lock_pair_timeout in drivers/tty/tty_ldisc.c
    pipe_double_lock in fs/pipe.c
    unix_state_double_lock in net/unix/af_unix.c
    lock_two_nondirectories in fs/inode.c
    mutex_lock_double in kernel/events/core.c

    ep_cmp_ffd in fs/eventpoll.c
    fsnotify_compare_groups fs/notify/mark.c

    Nothing needs to be done here, since the tags embedded into pointers
    don't change, so the sorting order would still be unique.

    Checks that a pointer belongs to some particular allocation:

    is_sibling_entry in lib/radix-tree.c
    object_is_on_stack in include/linux/sched/task_stack.h

    Nothing needs to be done here either, since two pointers can only belong
    to the same allocation if they have the same tag.

    Overall, since the kernel boots and works, there are no critical bugs.
    As for the rest, the traditional kernel testing way (use until it fails) is
    the only one that looks feasible.

    Another point here is that tag-based KASAN is available under a separate
    config option that needs to be deliberately enabled. Even though it might
    be used in a "near-production" environment to find bugs that are not found
    during fuzzing or running tests, it is still a debug tool.

    ====== Benchmarks

    The following numbers were collected on Odroid C2 board. Both generic and
    tag-based KASAN were used in inline instrumentation mode.

    Boot time [1]:
    * ~1.7 sec for clean kernel
    * ~5.0 sec for generic KASAN
    * ~5.0 sec for tag-based KASAN

    Network performance [2]:
    * 8.33 Gbits/sec for clean kernel
    * 3.17 Gbits/sec for generic KASAN
    * 2.85 Gbits/sec for tag-based KASAN

    Slab memory usage after boot [3]:
    * ~40 kb for clean kernel
    * ~105 kb (~260% overhead) for generic KASAN
    * ~47 kb (~20% overhead) for tag-based KASAN

    KASAN memory overhead consists of three main parts:
    1. Increased slab memory usage due to redzones.
    2. Shadow memory (the whole of it reserved once during boot).
    3. Quarantine (grows gradually until some preset limit; the higher the limit,
    the better the chance of detecting a use-after-free).

    Comparing tag-based vs generic KASAN for each of these points:
    1. 20% vs 260% overhead.
    2. 1/16th vs 1/8th of physical memory.
    3. Tag-based KASAN doesn't require quarantine.

    [1] Time before the ext4 driver is initialized.
    [2] Measured as `iperf -s & iperf -c 127.0.0.1 -t 30`.
    [3] Measured as `cat /proc/meminfo | grep Slab`.

    ====== Some notes

    A few notes:

    1. The patchset can be found here:
    https://github.com/xairy/kasan-prototype/tree/khwasan

    2. Building requires a recent Clang version (7.0.0 or later).

    3. Stack instrumentation is not supported yet and will be added later.

    This patch (of 25):

    Tag-based KASAN changes the value of the top byte of pointers returned
    from the kernel allocation functions (such as kmalloc). This patch
    updates KASAN hooks signatures and their usage in SLAB and SLUB code to
    reflect that.
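
    For illustration only (not code from the patchset), embedding and
    extracting a tag in the top byte of a pointer looks roughly like this:

    #include <linux/types.h>

    #define KASAN_TAG_SHIFT         56
    #define KASAN_TAG_MASK          (0xffUL << KASAN_TAG_SHIFT)

    static inline void *set_tag(const void *addr, u8 tag)
    {
            return (void *)(((u64)addr & ~KASAN_TAG_MASK) |
                            ((u64)tag << KASAN_TAG_SHIFT));
    }

    static inline u8 get_tag(const void *addr)
    {
            return (u8)((u64)addr >> KASAN_TAG_SHIFT);
    }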

    Link: http://lkml.kernel.org/r/aec2b5e3973781ff8a6bb6760f8543643202c451.1544099024.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Reviewed-by: Andrey Ryabinin
    Reviewed-by: Dmitry Vyukov
    Cc: Christoph Lameter
    Cc: Mark Rutland
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     


27 Oct, 2018

2 commits

  • Kmem caches can be created with a SLAB_RECLAIM_ACCOUNT flag, which
    indicates they contain objects which can be reclaimed under memory
    pressure (typically through a shrinker). This makes the slab pages
    accounted as NR_SLAB_RECLAIMABLE in vmstat, which is also reflected in the
    MemAvailable meminfo counter and in overcommit decisions. The slab pages
    are also allocated with __GFP_RECLAIMABLE, which is good for
    anti-fragmentation through grouping pages by mobility.

    The generic kmalloc-X caches are created without this flag, but are
    sometimes also used for objects that can be reclaimed, which due to their
    varying size cannot have a dedicated kmem cache with the
    SLAB_RECLAIM_ACCOUNT flag. A
    prominent example are dcache external names, which prompted the creation
    of a new, manually managed vmstat counter NR_INDIRECTLY_RECLAIMABLE_BYTES
    in commit f1782c9bc547 ("dcache: account external names as indirectly
    reclaimable memory").

    To better handle this and any other similar cases, this patch introduces
    SLAB_RECLAIM_ACCOUNT variants of kmalloc caches, named kmalloc-rcl-X.
    They are used whenever the kmalloc() call passes __GFP_RECLAIMABLE among
    gfp flags. They are added to the kmalloc_caches array as a new type.
    Allocations with both __GFP_DMA and __GFP_RECLAIMABLE will use a dma type
    cache.

    This change only applies to SLAB and SLUB, not SLOB. This is fine, since
    SLOB's targets are tiny systems and this patch does add some overhead of
    kmem management objects.
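
    Usage is then as simple as adding the flag to the kmalloc() call (a
    sketch, not taken from the patch; the dcache conversion itself happens
    later in the series):

    #include <linux/slab.h>

    static char *alloc_external_name(size_t len)
    {
            /* variable-sized, shrinker-reclaimable object: served from the
             * new kmalloc-rcl-* caches instead of the regular ones */
            return kmalloc(len, GFP_KERNEL | __GFP_RECLAIMABLE);
    }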

    Link: http://lkml.kernel.org/r/20180731090649.16028-3-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Christoph Lameter
    Acked-by: Roman Gushchin
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Laura Abbott
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Sumit Semwal
    Cc: Vijayanand Jitta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Patch series "kmalloc-reclaimable caches", v4.

    As discussed at LSF/MM [1] here's a patchset that introduces
    kmalloc-reclaimable caches (more details in the second patch) and uses
    them for dcache external names. That allows us to repurpose the
    NR_INDIRECTLY_RECLAIMABLE_BYTES counter later in the series.

    With patch 3/6, dcache external names are allocated from kmalloc-rcl-*
    caches, eliminating the need for manual accounting. More importantly, it
    also ensures the reclaimable kmalloc allocations are grouped in pages
    separate from the regular kmalloc allocations. The need for proper
    accounting of dcache external names has shown it's easy for a misbehaving
    process to allocate lots of them, causing premature OOMs. Without the
    added grouping, it's likely that a similar workload can interleave the
    dcache external names allocations with regular kmalloc allocations (note:
    I haven't searched myself for an example of such a regular kmalloc
    allocation, but I would be very surprised if there wasn't some). A
    pathological case would be e.g. one 64-byte regular allocation with 63
    external dcache names in a page (64x64=4096), which means the page is not
    freed even after all the dcache names are reclaimed, and the process can
    thus "steal" the whole page with a single 64-byte allocation.

    If other kmalloc users similar to dcache external names become identified,
    they can also benefit from the new functionality simply by adding
    __GFP_RECLAIMABLE to the kmalloc calls.

    Side benefits of the patchset (that could be also merged separately)
    include removed branch for detecting __GFP_DMA kmalloc(), and shortening
    kmalloc cache names in /proc/slabinfo output. The latter is potentially
    an ABI break in case there are tools parsing the names and expecting the
    values to be in bytes.

    This is how /proc/slabinfo looks like after booting in virtme:

    ...
    kmalloc-rcl-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
    ...
    kmalloc-rcl-96 7 32 128 32 1 : tunables 120 60 8 : slabdata 1 1 0
    kmalloc-rcl-64 25 128 64 64 1 : tunables 120 60 8 : slabdata 2 2 0
    kmalloc-rcl-32 0 0 32 124 1 : tunables 120 60 8 : slabdata 0 0 0
    kmalloc-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
    kmalloc-2M 0 0 2097152 1 512 : tunables 1 1 0 : slabdata 0 0 0
    kmalloc-1M 0 0 1048576 1 256 : tunables 1 1 0 : slabdata 0 0 0
    ...

    /proc/vmstat with renamed nr_indirectly_reclaimable_bytes counter:

    ...
    nr_slab_reclaimable 2817
    nr_slab_unreclaimable 1781
    ...
    nr_kernel_misc_reclaimable 0
    ...

    /proc/meminfo with new KReclaimable counter:

    ...
    Shmem: 564 kB
    KReclaimable: 11260 kB
    Slab: 18368 kB
    SReclaimable: 11260 kB
    SUnreclaim: 7108 kB
    KernelStack: 1248 kB
    ...

    This patch (of 6):

    The kmalloc caches currently maintain a separate (optional) array,
    kmalloc_dma_caches, for __GFP_DMA allocations. There are tests for
    __GFP_DMA in the allocation hotpaths. We can avoid the branches by
    combining kmalloc_caches and kmalloc_dma_caches into a single
    two-dimensional array where the outer dimension is the cache "type". This
    will also allow adding kmalloc-reclaimable caches as a third type.
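
    The combined array then has roughly this shape (a sketch following the
    description above; the reclaimable type is only added later in the
    series):

    enum kmalloc_cache_type {
            KMALLOC_NORMAL = 0,
    #ifdef CONFIG_ZONE_DMA
            KMALLOC_DMA,
    #endif
            NR_KMALLOC_TYPES
    };

    extern struct kmem_cache *
    kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1];

    /* kmalloc() picks the row from the gfp flags and the column from the
     * size, e.g. kmalloc_caches[kmalloc_type(flags)][kmalloc_index(size)] */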

    Link: http://lkml.kernel.org/r/20180731090649.16028-2-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Christoph Lameter
    Acked-by: Roman Gushchin
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Matthew Wilcox
    Cc: Laura Abbott
    Cc: Sumit Semwal
    Cc: Vijayanand Jitta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

18 Aug, 2018

1 commit

    Introduce a new config option, which is used to replace the repeating
    CONFIG_MEMCG && !CONFIG_SLOB pattern. The next patches add a little more
    memcg+kmem related code, so let's make the defines clearer.
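
    Assuming the new option is CONFIG_MEMCG_KMEM (as it is in mainline), the
    repeated pattern collapses like this:

    /* before */
    #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
    /* ... memcg kmem accounting code ... */
    #endif

    /* after */
    #ifdef CONFIG_MEMCG_KMEM
    /* ... memcg kmem accounting code ... */
    #endif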

    Link: http://lkml.kernel.org/r/153063053670.1818.15013136946600481138.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     

15 Jun, 2018

1 commit

  • The memcg kmem cache creation and deactivation (SLUB only) is
    asynchronous. If a root kmem cache is destroyed whose memcg cache is in
    the process of creation or deactivation, the kernel may crash.

    Example of one such crash:
    general protection fault: 0000 [#1] SMP PTI
    CPU: 1 PID: 1721 Comm: kworker/14:1 Not tainted 4.17.0-smp
    ...
    Workqueue: memcg_kmem_cache kmemcg_deactivate_workfn
    RIP: 0010:has_cpu_slab
    ...
    Call Trace:
    ? on_each_cpu_cond
    __kmem_cache_shrink
    kmemcg_cache_deact_after_rcu
    kmemcg_deactivate_workfn
    process_one_work
    worker_thread
    kthread
    ret_from_fork+0x35/0x40

    To fix this race, on root kmem cache destruction, mark the cache as
    dying and flush the workqueue used for memcg kmem cache creation and
    deactivation. SLUB's memcg kmem cache deactivation also includes an RCU
    callback, so also make sure that all previously registered RCU callbacks
    have completed.

    [shakeelb@google.com: handle the RCU callbacks for SLUB deactivation]
    Link: http://lkml.kernel.org/r/20180611192951.195727-1-shakeelb@google.com
    [shakeelb@google.com: add more documentation, rename fields for readability]
    Link: http://lkml.kernel.org/r/20180522201336.196994-1-shakeelb@google.com
    [akpm@linux-foundation.org: fix build, per Shakeel]
    [shakeelb@google.com: v3. Instead of refcount, flush the workqueue]
    Link: http://lkml.kernel.org/r/20180530001204.183758-1-shakeelb@google.com
    Link: http://lkml.kernel.org/r/20180521174116.171846-1-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     


06 Apr, 2018

5 commits

    Currently <linux/slab.h> #includes <linux/kmemleak.h> for no obvious
    reason. It looks like it's only a convenience, so remove kmemleak.h
    from slab.h and add <linux/kmemleak.h> to any users of kmemleak_* that
    don't already #include it. Also remove <linux/kmemleak.h> from source
    files that do not use it.

    This is tested on i386 allmodconfig and x86_64 allmodconfig. It would
    be good to run it through the 0day bot for other $ARCHes. I have
    neither the horsepower nor the storage space for the other $ARCHes.

    Update: This patch has been extensively build-tested by both the 0day
    bot & kisskb/ozlabs build farms. Both of them reported 2 build failures
    for which patches are included here (in v2).

    [ slab.h is the second most used header file after module.h; kernel.h is
    right there with slab.h. There could be some minor error in the
    counting due to some #includes having comments after them and I didn't
    combine all of those. ]

    [akpm@linux-foundation.org: security/keys/big_key.c needs vmalloc.h, per sfr]
    Link: http://lkml.kernel.org/r/e4309f98-3749-93e1-4bb7-d9501a39d015@infradead.org
    Link: http://kisskb.ellerman.id.au/kisskb/head/13396/
    Signed-off-by: Randy Dunlap
    Reviewed-by: Ingo Molnar
    Reported-by: Michael Ellerman [2 build failures]
    Reported-by: Fengguang Wu [2 build failures]
    Reviewed-by: Andrew Morton
    Cc: Wei Yongjun
    Cc: Luis R. Rodriguez
    Cc: Greg Kroah-Hartman
    Cc: Mimi Zohar
    Cc: John Johansen
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
    If kmem cache sizes are 32-bit, then the usercopy region should be too.

    Link: http://lkml.kernel.org/r/20180305200730.15812-21-adobriyan@gmail.com
    Signed-off-by: Alexey Dobriyan
    Cc: David Miller
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • struct kmem_cache::size and ::align were always 32-bit.

    Out of curiosity I created a 4GB kmem_cache; it oopsed with division by 0.
    kmem_cache_create(1UL<
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
    kmalloc_size() derives the size of a kmalloc cache from an internal index,
    which can't be negative.

    Propagate unsignedness a bit.

    Link: http://lkml.kernel.org/r/20180305200730.15812-3-adobriyan@gmail.com
    Signed-off-by: Alexey Dobriyan
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
    kmalloc_index() returns an index into an array of kmalloc kmem caches,
    and therefore should be unsigned.

    Space savings with SLUB on trimmed down .config:

    add/remove: 0/1 grow/shrink: 6/56 up/down: 85/-557 (-472)
    Function old new delta
    calculate_sizes 924 983 +59
    on_freelist 589 604 +15
    init_cache_random_seq 122 127 +5
    ext4_mb_init 1206 1210 +4
    slab_pad_check.part 270 271 +1
    cpu_partial_store 112 113 +1
    usersize_show 28 27 -1
    ...
    new_slab 1871 1837 -34
    slab_order 204 - -204

    This patch start a series of converting SLUB (mostly) to "unsigned int".

    1) Most integers in the code are in fact unsigned entities: array
    indexes, lengths, buffer sizes, allocation orders. It is therefore
    better to use unsigned variables

    2) Some integers in the code are either "size_t" or "unsigned long" for
    no reason.

    size_t usually comes from people trying to maintain type correctness
    and figuring out that the "sizeof" operator returns size_t, or that
    memset/memcpy take size_t, so everything passed to them should too.

    However the number of 4GB+ objects in the kernel is very small. Most,
    if not all, dynamically allocated objects with kmalloc() or
    kmem_cache_create() aren't actually big. Maintaining wide types
    doesn't do anything.

    64-bit ops are bigger than 32-bit on our beloved x86_64,
    so try to not use 64-bit where it isn't necessary
    (read: everywhere where integers are integers not pointers)

    3) in case of SLAB allocators, there are additional limitations
    *) page->inuse, page->objects are only 16-/15-bit,
    *) cache size was always 32-bit
    *) slab orders are small, order 20 is needed to go 64-bit on x86_64
    (PAGE_SIZE << order)

    Basically everything is 32-bit except kmalloc(1ULL<
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

16 Jan, 2018

3 commits

  • This introduces CONFIG_HARDENED_USERCOPY_FALLBACK to control the
    behavior of hardened usercopy whitelist violations. By default, whitelist
    violations will continue to WARN() so that any bad or missing usercopy
    whitelists can be discovered without being too disruptive.

    If this config is disabled at build time or a system is booted with
    "slab_common.usercopy_fallback=0", usercopy whitelists will BUG() instead
    of WARN(). This is useful for admins that want to use usercopy whitelists
    immediately.

    Suggested-by: Matthew Garrett
    Signed-off-by: Kees Cook

    Kees Cook
     
  • This patch prepares the slab allocator to handle caches having annotations
    (useroffset and usersize) defining usercopy regions.

    This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
    whitelisting code in the last public patch of grsecurity/PaX based on
    my understanding of the code. Changes or omissions from the original
    code are mine and don't reflect the original grsecurity/PaX code.

    Currently, hardened usercopy performs dynamic bounds checking on slab
    cache objects. This is good, but still leaves a lot of kernel memory
    available to be copied to/from userspace in the face of bugs. To further
    restrict what memory is available for copying, this creates a way to
    whitelist specific areas of a given slab cache object for copying to/from
    userspace, allowing much finer granularity of access control. Slab caches
    that are never exposed to userspace can declare no whitelist for their
    objects, thereby keeping them unavailable to userspace via dynamic copy
    operations. (Note, an implicit form of whitelisting is the use of constant
    sizes in usercopy operations and get_user()/put_user(); these bypass
    hardened usercopy checks since these sizes cannot change at runtime.)

    To support this whitelist annotation, usercopy region offset and size
    members are added to struct kmem_cache. The slab allocator receives a
    new function, kmem_cache_create_usercopy(), that creates a new cache
    with a usercopy region defined, suitable for declaring spans of fields
    within the objects that get copied to/from userspace.

    In this patch, the default kmem_cache_create() marks the entire allocation
    as whitelisted, leaving it semantically unchanged. Once all fine-grained
    whitelists have been added (in subsequent patches), this will be changed
    to a usersize of 0, making caches created with kmem_cache_create() not
    copyable to/from userspace.
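
    A minimal sketch of declaring such a whitelist (the structure and cache
    name are hypothetical, not from the patch):

    #include <linux/slab.h>
    #include <linux/spinlock.h>

    struct session {
            spinlock_t lock;                /* never copied to userspace */
            char user_visible[64];          /* the only usercopy-exposed part */
    };

    static struct kmem_cache *session_cache;

    static int __init session_cache_init(void)
    {
            session_cache = kmem_cache_create_usercopy("session",
                            sizeof(struct session), 0, SLAB_HWCACHE_ALIGN,
                            offsetof(struct session, user_visible),
                            sizeof(((struct session *)0)->user_visible),
                            NULL);
            return session_cache ? 0 : -ENOMEM;
    }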

    After the entire usercopy whitelist series is applied, less than 15%
    of the slab cache memory remains exposed to potential usercopy bugs
    after a fresh boot:

    Total Slab Memory: 48074720
    Usercopyable Memory: 6367532 13.2%
    task_struct 0.2% 4480/1630720
    RAW 0.3% 300/96000
    RAWv6 2.1% 1408/64768
    ext4_inode_cache 3.0% 269760/8740224
    dentry 11.1% 585984/5273856
    mm_struct 29.1% 54912/188448
    kmalloc-8 100.0% 24576/24576
    kmalloc-16 100.0% 28672/28672
    kmalloc-32 100.0% 81920/81920
    kmalloc-192 100.0% 96768/96768
    kmalloc-128 100.0% 143360/143360
    names_cache 100.0% 163840/163840
    kmalloc-64 100.0% 167936/167936
    kmalloc-256 100.0% 339968/339968
    kmalloc-512 100.0% 350720/350720
    kmalloc-96 100.0% 455616/455616
    kmalloc-8192 100.0% 655360/655360
    kmalloc-1024 100.0% 812032/812032
    kmalloc-4096 100.0% 819200/819200
    kmalloc-2048 100.0% 1310720/1310720

    After some kernel build workloads, the percentage (mainly driven by
    dentry and inode caches expanding) drops under 10%:

    Total Slab Memory: 95516184
    Usercopyable Memory: 8497452 8.8%
    task_struct 0.2% 4000/1456000
    RAW 0.3% 300/96000
    RAWv6 2.1% 1408/64768
    ext4_inode_cache 3.0% 1217280/39439872
    dentry 11.1% 1623200/14608800
    mm_struct 29.1% 73216/251264
    kmalloc-8 100.0% 24576/24576
    kmalloc-16 100.0% 28672/28672
    kmalloc-32 100.0% 94208/94208
    kmalloc-192 100.0% 96768/96768
    kmalloc-128 100.0% 143360/143360
    names_cache 100.0% 163840/163840
    kmalloc-64 100.0% 245760/245760
    kmalloc-256 100.0% 339968/339968
    kmalloc-512 100.0% 350720/350720
    kmalloc-96 100.0% 563520/563520
    kmalloc-8192 100.0% 655360/655360
    kmalloc-1024 100.0% 794624/794624
    kmalloc-4096 100.0% 819200/819200
    kmalloc-2048 100.0% 1257472/1257472

    Signed-off-by: David Windsor
    [kees: adjust commit log, split out a few extra kmalloc hunks]
    [kees: add field names to function declarations]
    [kees: convert BUGs to WARNs and fail closed]
    [kees: add attack surface reduction analysis to commit log]
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrew Morton
    Cc: linux-mm@kvack.org
    Cc: linux-xfs@vger.kernel.org
    Signed-off-by: Kees Cook
    Acked-by: Christoph Lameter

    David Windsor
     
  • This refactors the hardened usercopy code so that failure reporting can
    happen within the checking functions instead of at the top level. This
    simplifies the return value handling and allows more details and offsets
    to be included in the report. Having the offset can be much more helpful
    in understanding hardened usercopy bugs.

    Signed-off-by: Kees Cook

    Kees Cook
     

16 Nov, 2017

5 commits

  • As the page free path makes no distinction between cache hot and cold
    pages, there is no real useful ordering of pages in the free list that
    allocation requests can take advantage of. Judging from the users of
    __GFP_COLD, it is likely that a number of them are the result of copying
    other sites instead of actually measuring the impact. Remove the
    __GFP_COLD parameter which simplifies a number of paths in the page
    allocator.

    This is potentially controversial but bear in mind that the size of the
    per-cpu pagelists versus modern cache sizes means that the whole per-cpu
    list can often fit in the L3 cache. Hence, there is only a potential
    benefit for microbenchmarks that alloc/free pages in a tight loop. It's
    even worse when THP is taken into account which has little or no chance
    of getting a cache-hot page as the per-cpu list is bypassed and the
    zeroing of multiple pages will thrash the cache anyway.

    The truncate microbenchmarks are not shown as this patch affects the
    allocation path and not the free path. A page fault microbenchmark was
    tested but it showed no significant difference, which is not surprising
    given that the __GFP_COLD branches are a minuscule percentage of the
    fault path.

    Link: http://lkml.kernel.org/r/20171018075952.10627-9-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Jan Kara
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Now that kmemcheck is gone, we don't need the NOTRACK flags.

    Link: http://lkml.kernel.org/r/20171007030159.22241-5-alexander.levin@verizon.com
    Signed-off-by: Sasha Levin
    Cc: Alexander Potapenko
    Cc: Eric W. Biederman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Steven Rostedt
    Cc: Tim Hansen
    Cc: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Levin, Alexander (Sasha Levin)
     
  • Patch series "Add kmalloc_array_node() and kcalloc_node()".

    Our current memory allocation routines suffer from an API imbalance:
    on one hand we have kmalloc_array() and kcalloc(), which check for overflows
    in size multiplication, and on the other we have kmalloc_node() and
    kzalloc_node(), which allow for memory allocation on a certain NUMA node
    but don't check for eventual overflows.

    This patch (of 6):

    We have kmalloc_array() and kcalloc() wrappers on top of kmalloc() which
    ensure us overflow free multiplication for the size of a memory
    allocation but these implementations are not NUMA-aware.

    Likewise we have kmalloc_node() which is a NUMA-aware version of
    kmalloc() but the implementation is not aware of any possible overflows
    in eventual size calculations.

    Introduce a combination of the two above cases to have a NUMA-node aware
    version of kmalloc_array() and kcalloc().
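
    Usage then looks like this (a minimal sketch; the structure and helper are
    illustrative):

    #include <linux/slab.h>

    struct item {
            u64 key;
            u64 value;
    };

    static struct item *alloc_items(size_t nr, int node)
    {
            /* overflow-checked nr * sizeof(struct item), allocated on 'node'
             * and zeroed, i.e. a NUMA-aware kcalloc() */
            return kcalloc_node(nr, sizeof(struct item), GFP_KERNEL, node);
    }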

    Link: http://lkml.kernel.org/r/20170927082038.3782-2-jthumshirn@suse.de
    Signed-off-by: Johannes Thumshirn
    Acked-by: Vlastimil Babka
    Cc: Christoph Hellwig
    Cc: Christoph Lameter
    Cc: Damien Le Moal
    Cc: David Rientjes
    Cc: "David S. Miller"
    Cc: Doug Ledford
    Cc: Hal Rosenstock
    Cc: Jens Axboe
    Cc: Joonsoo Kim
    Cc: Mike Marciniszyn
    Cc: Pekka Enberg
    Cc: Santosh Shilimkar
    Cc: Sean Hefty
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Thumshirn
     
  • struct kmem_cache::flags is "unsigned long" which is unnecessary on
    64-bit as no flags are defined in the higher bits.

    Switch the field to 32-bit and save some space on x86_64 until such
    flags appear:

    add/remove: 0/0 grow/shrink: 0/107 up/down: 0/-657 (-657)
    function old new delta
    sysfs_slab_add 720 719 -1
    ...
    check_object 699 676 -23

    [akpm@linux-foundation.org: fix printk warning]
    Link: http://lkml.kernel.org/r/20171021100635.GA8287@avx2
    Signed-off-by: Alexey Dobriyan
    Acked-by: Pekka Enberg
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Add sparse-checked slab_flags_t for struct kmem_cache::flags (SLAB_POISON,
    etc).

    SLAB is bloated temporarily by switching to "unsigned long", but only
    temporarily.

    Link: http://lkml.kernel.org/r/20171021100225.GA22428@avx2
    Signed-off-by: Alexey Dobriyan
    Acked-by: Pekka Enberg
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if <5 lines).
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman