07 Nov, 2019

1 commit

  • page_cgroup_ino() doesn't return a valid memcg pointer for non-compound
    slab pages, because it depends on PgHead AND PgSlab flags to be set to
    determine the memory cgroup from the kmem_cache. It's correct for
    compound pages, but not for generic small pages. Those don't have PgHead
    set, so it ends up returning zero.

    Fix this by replacing the condition with PageSlab() && !PageTail().
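    For illustration, a rough sketch of the corrected check inside
    page_cgroup_ino() (simplified, not a verbatim copy of the fix;
    memcg_from_slab_page() is the helper introduced by the commit named in
    the Fixes: tag below):

        /*
         * Slab pages, compound or not, resolve the memcg through their
         * kmem_cache; everything else keeps using page->mem_cgroup.
         */
        if (PageSlab(page) && !PageTail(page))
                memcg = memcg_from_slab_page(page);
        else
                memcg = READ_ONCE(page->mem_cgroup);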

    Before this patch:
    [root@localhost ~]# ./page-types -c /sys/fs/cgroup/user.slice/user-0.slice/user@0.service/ | grep slab
    0x0000000000000080 38 0 _______S___________________________________ slab

    After this patch:
    [root@localhost ~]# ./page-types -c /sys/fs/cgroup/user.slice/user-0.slice/user@0.service/ | grep slab
    0x0000000000000080 147 0 _______S___________________________________ slab

    Also, hwpoison_filter_task() uses output of page_cgroup_ino() in order
    to filter error injection events based on memcg. So if
    page_cgroup_ino() fails to return a memcg pointer, we just fail to inject a
    memory error. Considering that the hwpoison filter is for testing, affected
    users are limited and the impact should be marginal.

    [n-horiguchi@ah.jp.nec.com: changelog additions]
    Link: http://lkml.kernel.org/r/20191031012151.2722280-1-guro@fb.com
    Fixes: 4d96ba353075 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
    Signed-off-by: Roman Gushchin
    Reviewed-by: Shakeel Butt
    Acked-by: David Rientjes
    Cc: Vladimir Davydov
    Cc: Daniel Jordan
    Cc: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

25 Sep, 2019

2 commits

  • The memcg_cache_params structure is only embedded into the kmem_cache of
    slab and slub allocators as defined in slab_def.h and slub_def.h and used
    internally by mm code. There is no need to expose it in a public
    header. So move it from include/linux/slab.h to mm/slab.h. It is just a
    refactoring patch with no code change.

    In fact both the slub_def.h and slab_def.h should be moved into the mm
    directory as well, but that will probably cause many merge conflicts.

    Link: http://lkml.kernel.org/r/20190718180827.18758-1-longman@redhat.com
    Signed-off-by: Waiman Long
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc: Roman Gushchin
    Cc: Johannes Weiner
    Cc: Shakeel Butt
    Cc: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Waiman Long
     
  • Currently, a value of '1' is written to the /sys/kernel/slab/<slab_name>/shrink
    file to shrink the slab by flushing out all the per-cpu slabs and free
    slabs in partial lists. This can be useful to squeeze out a bit more
    memory under extreme condition as well as making the active object counts
    in /proc/slabinfo more accurate.

    This usually applies only to the root caches, as the SLUB_MEMCG_SYSFS_ON
    option is usually not enabled and "slub_memcg_sysfs=1" not set. Even if
    memcg sysfs is turned on, it is too cumbersome and impractical to manage
    all those per-memcg sysfs files in a real production system.

    So there is no practical way to shrink memcg caches. Fix this by enabling
    a proper write to the shrink sysfs file of the root cache to scan all the
    available memcg caches and shrink them as well. For a non-root memcg
    cache (when SLUB_MEMCG_SYSFS_ON or slub_memcg_sysfs is on), only that
    cache will be shrunk when written.
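    For reference, a hedged sketch of the SLUB sysfs handler with this change
    applied (kmem_cache_shrink_all() is the helper this patch introduces; the
    snippet is illustrative rather than a verbatim copy):

        static ssize_t shrink_store(struct kmem_cache *s,
                                    const char *buf, size_t length)
        {
                if (buf[0] == '1')
                        /* shrink the root cache and all its memcg children */
                        kmem_cache_shrink_all(s);
                else
                        return -EINVAL;
                return length;
        }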

    On a 2-socket 64-core 256-thread arm64 system with 64k pages, after a
    parallel kernel build, the amount of memory occupied by slabs before
    shrinking was:

    # grep task_struct /proc/slabinfo
    task_struct 53137 53192 4288 61 4 : tunables 0 0 0 : slabdata 872 872 0
    # grep "^S[lRU]" /proc/meminfo
    Slab: 3936832 kB
    SReclaimable: 399104 kB
    SUnreclaim: 3537728 kB

    After shrinking slabs (by echoing "1" to all shrink files):

    # grep "^S[lRU]" /proc/meminfo
    Slab: 1356288 kB
    SReclaimable: 263296 kB
    SUnreclaim: 1092992 kB
    # grep task_struct /proc/slabinfo
    task_struct 2764 6832 4288 61 4 : tunables 0 0 0 : slabdata 112 112 0

    Link: http://lkml.kernel.org/r/20190723151445.7385-1-longman@redhat.com
    Signed-off-by: Waiman Long
    Acked-by: Roman Gushchin
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shakeel Butt
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Waiman Long
     

13 Jul, 2019

10 commits

  • Patch series "add init_on_alloc/init_on_free boot options", v10.

    Provide init_on_alloc and init_on_free boot options.

    These are aimed at preventing possible information leaks and making the
    control-flow bugs that depend on uninitialized values more deterministic.

    Enabling either of the options guarantees that the memory returned by the
    page allocator and SL[AU]B is initialized with zeroes. SLOB allocator
    isn't supported at the moment, as its emulation of kmem caches complicates
    handling of SLAB_TYPESAFE_BY_RCU caches correctly.

    Enabling init_on_free also guarantees that pages and heap objects are
    initialized right after they're freed, so it won't be possible to access
    stale data by using a dangling pointer.

    As suggested by Michal Hocko, right now we don't let heap users
    disable initialization for certain allocations. There's not enough
    evidence that doing so can speed up real-life cases, and introducing ways
    to opt-out may result in things going out of control.

    This patch (of 2):

    The new options are needed to prevent possible information leaks and make
    control-flow bugs that depend on uninitialized values more deterministic.

    This is expected to be on-by-default on Android and Chrome OS. And it
    gives the opportunity for anyone else to use it under distros too via the
    boot args. (The init_on_free feature is regularly requested by folks
    where memory forensics is included in their threat models.)

    init_on_alloc=1 makes the kernel initialize newly allocated pages and heap
    objects with zeroes. Initialization is done at allocation time at the
    places where checks for __GFP_ZERO are performed.

    init_on_free=1 makes the kernel initialize freed pages and heap objects
    with zeroes upon their deletion. This helps to ensure sensitive data
    doesn't leak via use-after-free accesses.

    Both init_on_alloc=1 and init_on_free=1 guarantee that the allocator
    returns zeroed memory. The two exceptions are slab caches with
    constructors and SLAB_TYPESAFE_BY_RCU flag. Those are never
    zero-initialized to preserve their semantics.

    Both init_on_alloc and init_on_free default to zero, but those defaults
    can be overridden with CONFIG_INIT_ON_ALLOC_DEFAULT_ON and
    CONFIG_INIT_ON_FREE_DEFAULT_ON.

    If either SLUB poisoning or page poisoning is enabled, those options take
    precedence over init_on_alloc and init_on_free: initialization is only
    applied to unpoisoned allocations.
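    As a rough illustration only (assuming the want_init_on_alloc() /
    want_init_on_free() helper pattern and plain booleans instead of static
    keys; not the exact mainline code), the options conceptually gate zeroing
    like this:

        static bool init_on_alloc;  /* "init_on_alloc=" or CONFIG_INIT_ON_ALLOC_DEFAULT_ON */
        static bool init_on_free;   /* "init_on_free=" or CONFIG_INIT_ON_FREE_DEFAULT_ON */

        static inline bool want_init_on_alloc(gfp_t flags)
        {
                if (init_on_alloc)
                        return true;
                /* callers explicitly asking for zeroed memory */
                return flags & __GFP_ZERO;
        }

        static inline bool want_init_on_free(void)
        {
                return init_on_free;
        }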

    Slowdown for the new features compared to init_on_free=0, init_on_alloc=0:

    hackbench, init_on_free=1: +7.62% sys time (st.err 0.74%)
    hackbench, init_on_alloc=1: +7.75% sys time (st.err 2.14%)

    Linux build with -j12, init_on_free=1: +8.38% wall time (st.err 0.39%)
    Linux build with -j12, init_on_free=1: +24.42% sys time (st.err 0.52%)
    Linux build with -j12, init_on_alloc=1: -0.13% wall time (st.err 0.42%)
    Linux build with -j12, init_on_alloc=1: +0.57% sys time (st.err 0.40%)

    The slowdown for init_on_free=0, init_on_alloc=0 compared to the baseline
    is within the standard error.

    The new features are also going to pave the way for hardware memory
    tagging (e.g. arm64's MTE), which will require both on_alloc and on_free
    hooks to set the tags for heap objects. With MTE, tagging will have the
    same cost as memory initialization.

    Although init_on_free is rather costly, there are paranoid use-cases where
    in-memory data lifetime is desired to be minimized. There are various
    arguments for/against the realism of the associated threat models, but
    given that we'll need the infrastructure for MTE anyway, and there are
    people who want wipe-on-free behavior no matter what the performance cost,
    it seems reasonable to include it in this series.

    [glider@google.com: v8]
    Link: http://lkml.kernel.org/r/20190626121943.131390-2-glider@google.com
    [glider@google.com: v9]
    Link: http://lkml.kernel.org/r/20190627130316.254309-2-glider@google.com
    [glider@google.com: v10]
    Link: http://lkml.kernel.org/r/20190628093131.199499-2-glider@google.com
    Link: http://lkml.kernel.org/r/20190617151050.92663-2-glider@google.com
    Signed-off-by: Alexander Potapenko
    Acked-by: Kees Cook
    Acked-by: Michal Hocko [page and dmapool parts]
    Acked-by: James Morris
    Cc: Christoph Lameter
    Cc: Masahiro Yamada
    Cc: "Serge E. Hallyn"
    Cc: Nick Desaulniers
    Cc: Kostya Serebryany
    Cc: Dmitry Vyukov
    Cc: Sandeep Patil
    Cc: Laura Abbott
    Cc: Randy Dunlap
    Cc: Jann Horn
    Cc: Mark Rutland
    Cc: Marco Elver
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • Let's reparent non-root kmem_caches on memcg offlining. This allows us to
    release the memory cgroup without waiting for the last outstanding kernel
    object (e.g. dentry used by another application).

    Since the parent cgroup is already charged, everything we need to do is to
    splice the list of kmem_caches to the parent's kmem_caches list, swap the
    memcg pointer, drop the css refcounter for each kmem_cache and adjust the
    parent's css refcounter.
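    Conceptually, the reparenting step looks like the following hedged sketch
    (field names such as memcg_params.kmem_caches_node follow the description
    above rather than being a verbatim copy of the patch):

        struct kmem_cache *s;

        list_for_each_entry(s, &memcg->kmem_caches,
                            memcg_params.kmem_caches_node) {
                /* the parent is already charged, so just repoint the cache */
                WRITE_ONCE(s->memcg_params.memcg, parent);
                /* move the css reference from the child to the parent */
                css_get(&parent->css);
                css_put(&memcg->css);
        }
        /* splice the whole list onto the parent's kmem_caches list */
        list_splice_init(&memcg->kmem_caches, &parent->kmem_caches);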

    Please note that kmem_cache->memcg_params.memcg isn't a stable pointer
    anymore. It's safe to read it under rcu_read_lock(), cgroup_mutex held,
    or any other way that protects the memory cgroup from being released.

    We can race with the slab allocation and deallocation paths. It's not a
    big problem: parent's charge and slab global stats are always correct, and
    we don't care anymore about the child usage and global stats. The child
    cgroup is already offline, so we don't use or show it anywhere.

    Local slab stats (NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE) aren't
    used anywhere except count_shadow_nodes(). But even there it won't break
    anything: after reparenting "nodes" will be 0 on child level (because
    we're already reparenting shrinker lists), and on parent level page stats
    always were 0, and this patch won't change anything.

    [guro@fb.com: properly handle kmem_caches reparented to root_mem_cgroup]
    Link: http://lkml.kernel.org/r/20190620213427.1691847-1-guro@fb.com
    Link: http://lkml.kernel.org/r/20190611231813.3148843-11-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Vladimir Davydov
    Reviewed-by: Shakeel Butt
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Waiman Long
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: Andrei Vagin
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Every slab page charged to a non-root memory cgroup has a pointer to the
    memory cgroup and holds a reference to it, which protects a non-empty
    memory cgroup from being released. At the same time the page has a
    pointer to the corresponding kmem_cache, and also holds a reference to the
    kmem_cache. And kmem_cache by itself holds a reference to the cgroup.

    So there is clearly some redundancy, which allows us to stop setting the
    page->mem_cgroup pointer and rely on getting the memcg pointer indirectly
    via the kmem_cache. Further, it will make it easier to change this pointer,
    without a need to go over all charged pages.

    So let's stop setting page->mem_cgroup pointer for slab pages, and stop
    using the css refcounter directly for protecting the memory cgroup from
    going away. Instead rely on kmem_cache as an intermediate object.
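    The resulting indirection is roughly the following helper (a sketch of
    memcg_from_slab_page() as described above; simplified):

        static inline struct mem_cgroup *memcg_from_slab_page(struct page *page)
        {
                struct kmem_cache *s;

                /* page->mem_cgroup is no longer set for slab pages,
                 * so go through the kmem_cache instead */
                s = READ_ONCE(page->slab_cache);
                if (s && !is_root_cache(s))
                        return s->memcg_params.memcg;

                return NULL;
        }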

    Make sure that vmstats and shrinker lists are working as previously, as
    well as /proc/kpagecgroup interface.

    Link: http://lkml.kernel.org/r/20190611231813.3148843-10-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Vladimir Davydov
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Waiman Long
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: Andrei Vagin
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Currently each charged slab page holds a reference to the cgroup to which
    it's charged. Kmem_caches are held by the memcg and are released all
    together with the memory cgroup. It means that none of kmem_caches are
    released unless at least one reference to the memcg exists, which is very
    far from optimal.

    Let's rework it in a way that allows releasing individual kmem_caches as
    soon as the cgroup is offline, the kmem_cache is empty and there are no
    pending allocations.

    To make it possible, let's introduce a new percpu refcounter for non-root
    kmem caches. The counter is initialized to the percpu mode, and is
    switched to the atomic mode during kmem_cache deactivation. The counter
    is bumped for every charged page and also for every running allocation.
    So the kmem_cache can't be released unless all allocations complete.

    To shutdown non-active empty kmem_caches, let's reuse the work queue,
    previously used for the kmem_cache deactivation. Once the reference
    counter reaches 0, let's schedule an asynchronous kmem_cache release.
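    A hedged sketch of the charge/uncharge side of this scheme (the helper
    names below are made up for illustration; the refcnt field follows the
    description above):

        static int memcg_kmem_get_cache_ref(struct kmem_cache *s)
        {
                /* every charged page and every running allocation pins the cache */
                if (!percpu_ref_tryget(&s->memcg_params.refcnt))
                        return -EBUSY;  /* cache is already being released */
                return 0;
        }

        static void memcg_kmem_put_cache_ref(struct kmem_cache *s)
        {
                /* once the counter (switched to atomic mode at deactivation)
                 * drops to zero, an asynchronous kmem_cache release is
                 * scheduled from the percpu_ref release callback */
                percpu_ref_put(&s->memcg_params.refcnt);
        }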

    * I used the following simple approach to test the performance
    (stolen from another patchset by T. Harding):

    time find / -name fname-no-exist
    echo 2 > /proc/sys/vm/drop_caches
    repeat 10 times

    Results:

    orig patched

    real 0m1.455s real 0m1.355s
    user 0m0.206s user 0m0.219s
    sys 0m0.855s sys 0m0.807s

    real 0m1.487s real 0m1.699s
    user 0m0.221s user 0m0.256s
    sys 0m0.806s sys 0m0.948s

    real 0m1.515s real 0m1.505s
    user 0m0.183s user 0m0.215s
    sys 0m0.876s sys 0m0.858s

    real 0m1.291s real 0m1.380s
    user 0m0.193s user 0m0.198s
    sys 0m0.843s sys 0m0.786s

    real 0m1.364s real 0m1.374s
    user 0m0.180s user 0m0.182s
    sys 0m0.868s sys 0m0.806s

    real 0m1.352s real 0m1.312s
    user 0m0.201s user 0m0.212s
    sys 0m0.820s sys 0m0.761s

    real 0m1.302s real 0m1.349s
    user 0m0.205s user 0m0.203s
    sys 0m0.803s sys 0m0.792s

    real 0m1.334s real 0m1.301s
    user 0m0.194s user 0m0.201s
    sys 0m0.806s sys 0m0.779s

    real 0m1.426s real 0m1.434s
    user 0m0.216s user 0m0.181s
    sys 0m0.824s sys 0m0.864s

    real 0m1.350s real 0m1.295s
    user 0m0.200s user 0m0.190s
    sys 0m0.842s sys 0m0.811s

    So it looks like the difference is not noticeable in this test.

    [cai@lca.pw: fix an use-after-free in kmemcg_workfn()]
    Link: http://lkml.kernel.org/r/1560977573-10715-1-git-send-email-cai@lca.pw
    Link: http://lkml.kernel.org/r/20190611231813.3148843-9-guro@fb.com
    Signed-off-by: Roman Gushchin
    Signed-off-by: Qian Cai
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Shakeel Butt
    Cc: Waiman Long
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: Andrei Vagin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Currently the page accounting code is duplicated in SLAB and SLUB
    internals. Let's move it into new (un)charge_slab_page helpers in the
    slab_common.c file. These helpers will be responsible for statistics
    (global and memcg-aware) and memcg charging. So they are replacing direct
    memcg_(un)charge_slab() calls.
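    Roughly, the new helpers combine the memcg charge with the vmstat update
    (a sketch based on the description above, not a verbatim copy;
    cache_vmstat_idx() is assumed to pick NR_SLAB_RECLAIMABLE or
    NR_SLAB_UNRECLAIMABLE for the cache):

        static __always_inline int charge_slab_page(struct page *page,
                                                    gfp_t gfp, int order,
                                                    struct kmem_cache *s)
        {
                int ret = memcg_charge_slab(page, gfp, order, s);

                if (!ret)
                        mod_lruvec_page_state(page, cache_vmstat_idx(s),
                                              1 << order);
                return ret;
        }

        static __always_inline void uncharge_slab_page(struct page *page,
                                                       int order,
                                                       struct kmem_cache *s)
        {
                mod_lruvec_page_state(page, cache_vmstat_idx(s),
                                      -(1 << order));
                memcg_uncharge_slab(page, order, s);
        }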

    Link: http://lkml.kernel.org/r/20190611231813.3148843-6-guro@fb.com
    Signed-off-by: Roman Gushchin
    Reviewed-by: Shakeel Butt
    Acked-by: Christoph Lameter
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Waiman Long
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: Andrei Vagin
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Currently SLUB uses a work scheduled after an RCU grace period to
    deactivate a non-root kmem_cache. This mechanism can be reused for
    kmem_caches release, but requires generalization for SLAB case.

    Introduce kmemcg_cache_deactivate() function, which calls
    allocator-specific __kmem_cache_deactivate() and schedules execution of
    __kmem_cache_deactivate_after_rcu() with all necessary locks in a worker
    context after an rcu grace period.

    Here is the new calling scheme:
      kmemcg_cache_deactivate()
        __kmemcg_cache_deactivate()                 SLAB/SLUB-specific
        kmemcg_rcufn()                              rcu
          kmemcg_workfn()                           work
            __kmemcg_cache_deactivate_after_rcu()   SLAB/SLUB-specific

    instead of:
      __kmemcg_cache_deactivate()                   SLAB/SLUB-specific
      slab_deactivate_memcg_cache_rcu_sched()       SLUB-only
        kmemcg_rcufn()                              rcu
          kmemcg_workfn()                           work
            kmemcg_cache_deact_after_rcu()          SLUB-only

    For consistency, all allocator-specific functions start with "__".
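    A hedged sketch of the new entry point implied by the scheme above
    (locking and slab_state checks omitted; field names assume the
    work_fn/rcu_head naming used by this series):

        static void kmemcg_cache_deactivate(struct kmem_cache *s)
        {
                /* allocator-specific (SLAB/SLUB) part runs immediately */
                __kmemcg_cache_deactivate(s);

                /* the rest runs in a worker after an RCU grace period:
                 * kmemcg_rcufn() -> kmemcg_workfn() ->
                 * __kmemcg_cache_deactivate_after_rcu() */
                s->memcg_params.work_fn = __kmemcg_cache_deactivate_after_rcu;
                call_rcu(&s->memcg_params.rcu_head, kmemcg_rcufn);
        }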

    Link: http://lkml.kernel.org/r/20190611231813.3148843-4-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Vladimir Davydov
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Waiman Long
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: Andrei Vagin
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • The delayed work/rcu deactivation infrastructure of non-root kmem_caches
    can also be used for asynchronous release of these objects. Let's get rid
    of the word "deactivation" in corresponding names to make the code look
    better after generalization.

    It's easier to make the renaming first, so that the generalized code will
    look consistent from scratch.

    Let's rename struct memcg_cache_params fields:
    deact_fn -> work_fn
    deact_rcu_head -> rcu_head
    deact_work -> work

    And RCU/delayed work callbacks in slab common code:
    kmemcg_deactivate_rcufn -> kmemcg_rcufn
    kmemcg_deactivate_workfn -> kmemcg_workfn

    This patch contains no functional changes, only renamings.

    Link: http://lkml.kernel.org/r/20190611231813.3148843-3-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Vladimir Davydov
    Reviewed-by: Shakeel Butt
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Waiman Long
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: Andrei Vagin
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Patch series "mm: reparent slab memory on cgroup removal", v7.

    # Why do we need this?

    We've noticed that the number of dying cgroups is steadily growing on most
    of our hosts in production. The following investigation revealed an issue
    in the userspace memory reclaim code [1], accounting of kernel stacks [2],
    and also the main reason: slab objects.

    The underlying problem is quite simple: any page charged to a cgroup holds
    a reference to it, so the cgroup can't be reclaimed unless all charged
    pages are gone. If a slab object is actively used by other cgroups, it
    won't be reclaimed, and will prevent the origin cgroup from being
    reclaimed.

    Slab objects, and first of all the vfs cache, are shared between cgroups
    that use the same underlying fs and, what's even more important, between
    multiple generations of the same workload. So if something runs
    periodically in a new cgroup each time (like how systemd works), we do
    accumulate multiple dying cgroups.

    Strictly speaking pagecache isn't different here, but there is a key
    difference: we disable protection and apply some extra pressure on LRUs of
    dying cgroups, and these LRUs contain all charged pages. My experiments
    show that with the disabled kernel memory accounting the number of dying
    cgroups stabilizes at a relatively small number (~100, depends on memory
    pressure and cgroup creation rate), and with kernel memory accounting it
    grows pretty steadily up to several thousands.

    Memory cgroups are quite complex and big objects (mostly due to percpu
    stats), so it leads to noticeable memory losses. Memory occupied by dying
    cgroups is measured in hundreds of megabytes. I've even seen a host with
    more than 100Gb of memory wasted for dying cgroups. It leads to a
    degradation of performance with the uptime, and generally limits the usage
    of cgroups.

    My previous attempt [3] to fix the problem by applying extra pressure on
    slab shrinker lists caused regressions with xfs and ext4, and has been
    reverted [4]. The following attempts to find the right balance [5, 6]
    were not successful.

    So instead of trying to find a maybe non-existing balance, let's
    reparent accounted slab caches to the parent cgroup on cgroup removal.

    # Implementation approach

    There is however a significant problem with reparenting of slab memory:
    there is no list of charged pages. Some of them are in shrinker lists,
    but not all. Introducing a new list is really not an option.

    But fortunately there is a way forward: every slab page has a stable
    pointer to the corresponding kmem_cache. So the idea is to reparent
    kmem_caches instead of slab pages.

    It's actually simpler and cheaper, but requires some underlying changes:
    1) Make kmem_caches hold a single reference to the memory cgroup,
    instead of a separate reference per every slab page.
    2) Stop setting page->mem_cgroup pointer for memcg slab pages and use
    page->kmem_cache->memcg indirection instead. It's used only on
    slab page release, so performance overhead shouldn't be a big issue.
    3) Introduce a refcounter for non-root slab caches. It's required to
    be able to destroy kmem_caches when they become empty and release
    the associated memory cgroup.

    There is a bonus: currently we release all memcg kmem_caches all together
    with the memory cgroup itself. This patchset allows individual
    kmem_caches to be released as soon as they become inactive and free.

    Some additional implementation details are provided in corresponding
    commit messages.

    # Results

    Below is the average number of dying cgroups on two groups of our
    production hosts. They run some sort of web frontend workload, and the
    memory pressure is moderate. As we can see, with kernel memory
    reparenting the number stabilizes in the 60s range; however, with the
    original version it grows almost linearly and doesn't show any signs of
    plateauing.
    The difference in slab and percpu usage between patched and unpatched
    versions also grows linearly. In 7 days it exceeded 200Mb.

    day            0    1    2    3    4    5    6    7
    original      56  362  628  752 1070 1250 1490 1560
    patched       23   46   51   55   60   57   67   69
    mem diff(Mb)  22   74  123  152  164  182  214  241

    # Links

    [1]: commit 68600f623d69 ("mm: don't miss the last page because of round-off error")
    [2]: commit 9b6f7e163cd0 ("mm: rework memcg kernel stack accounting")
    [3]: commit 172b06c32b94 ("mm: slowly shrink slabs with a relatively small number of objects")
    [4]: commit a9a238e83fbb ("Revert "mm: slowly shrink slabs with a relatively small number of objects")
    [5]: https://lkml.org/lkml/2019/1/28/1865
    [6]: https://marc.info/?l=linux-mm&m=155064763626437&w=2

    This patch (of 10):

    Initialize kmem_cache->memcg_params.memcg pointer in memcg_link_cache()
    rather than in init_memcg_params().

    Once the kmem_cache holds a reference to the memory cgroup, this will
    simplify the refcounting.

    For non-root kmem_caches memcg_link_cache() is always called before the
    kmem_cache becomes visible to a user, so it's safe.
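    A sketch of memcg_link_cache() after this change (simplified from the
    description above; the list field names reflect that era's struct
    memcg_cache_params and may not be exact):

        void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg)
        {
                if (is_root_cache(s)) {
                        list_add(&s->root_caches_node, &slab_root_caches);
                } else {
                        /* set here rather than in init_memcg_params() */
                        s->memcg_params.memcg = memcg;
                        list_add(&s->memcg_params.children_node,
                                 &s->memcg_params.root_cache->memcg_params.children);
                        list_add(&s->memcg_params.kmem_caches_node,
                                 &s->memcg_params.memcg->kmem_caches);
                }
        }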

    Link: http://lkml.kernel.org/r/20190611231813.3148843-2-guro@fb.com
    Signed-off-by: Roman Gushchin
    Reviewed-by: Shakeel Butt
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Waiman Long
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrei Vagin
    Cc: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • This avoids any possible type confusion when looking up an object. For
    example, if a non-slab were to be passed to kfree(), the invalid
    slab_cache pointer (i.e. overlapped with some other value from the
    struct page union) would be used for subsequent slab manipulations that
    could lead to further memory corruption.

    Since the page is already in cache, adding the PageSlab() check will
    have nearly zero cost, so add a check and WARN() to virt_to_cache().
    Additionally, replace an open-coded virt_to_cache(). To support the
    failure mode this also updates all callers of virt_to_cache() and
    cache_from_obj() to handle a NULL cache pointer return value (though
    note that several already handle this case gracefully).
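    The resulting check looks roughly like this (a sketch of virt_to_cache()
    with the new WARN; close to, but not necessarily identical to, the final
    code):

        static inline struct kmem_cache *virt_to_cache(const void *obj)
        {
                struct page *page;

                page = virt_to_head_page(obj);
                if (WARN_ONCE(!PageSlab(page),
                              "%s: Object is not a Slab page!\n", __func__))
                        return NULL;    /* callers must handle this */
                return page->slab_cache;
        }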

    [dan.carpenter@oracle.com: restore IRQs in kfree()]
    Link: http://lkml.kernel.org/r/20190613065637.GE16334@mwanda
    Link: http://lkml.kernel.org/r/20190530045017.15252-3-keescook@chromium.org
    Signed-off-by: Kees Cook
    Signed-off-by: Dan Carpenter
    Cc: Alexander Popov
    Cc: Alexander Potapenko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Greg Kroah-Hartman
    Cc: Joonsoo Kim
    Cc: Matthew Wilcox
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Patch series "mm/slab: Improved sanity checking".

    This adds defenses against slab cache confusion (as seen in real-world
    exploits[1]) and gracefully handles type confusions when trying to look
    up slab caches from an arbitrary page. (Also included, as patch 3, are new
    LKDTM tests for these defenses as well as for the existing double-free
    detection.)

    This patch (of 3):

    When building under CONFIG_SLAB_FREELIST_HARDENING, it makes sense to
    perform sanity-checking on the assumed slab cache during
    kmem_cache_free() to make sure the kernel doesn't mix freelists across
    slab caches and corrupt memory (as seen in the exploitation of flaws
    like CVE-2018-9568[1]). Note that the prior code might WARN() but still
    corrupt memory (i.e. return the assumed cache instead of the owned
    cache).
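    A hedged sketch of the hardened lookup described above (a simplified
    cache_from_obj(); the exact guard conditions in mainline may differ):

        static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s,
                                                        void *x)
        {
                struct kmem_cache *cachep;

                if (!IS_ENABLED(CONFIG_SLAB_FREELIST_HARDENED) &&
                    !memcg_kmem_enabled() &&
                    !unlikely(s->flags & SLAB_CONSISTENCY_CHECKS))
                        return s;       /* fast path: trust the caller */

                cachep = virt_to_cache(x);
                WARN_ONCE(cachep && !slab_equal_or_root(cachep, s),
                          "%s: Wrong slab cache. %s but object is from %s\n",
                          __func__, s->name, cachep->name);
                /* return the owning cache, not the assumed one */
                return cachep;
        }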

    There is no noticeable performance impact (changes are within noise).
    Measuring parallel kernel builds, I saw the following with
    CONFIG_SLAB_FREELIST_HARDENED, before and after this patch:

    before:

    Run times: 288.85 286.53 287.09 287.07 287.21
    Min: 286.53 Max: 288.85 Mean: 287.35 Std Dev: 0.79

    after:

    Run times: 289.58 287.40 286.97 287.20 287.01
    Min: 286.97 Max: 289.58 Mean: 287.63 Std Dev: 0.99

    Delta: 0.1% which is well below the standard deviation

    [1] https://github.com/ThomasKing2014/slides/raw/master/Building%20universal%20Android%20rooting%20with%20a%20type%20confusion%20vulnerability.pdf

    Link: http://lkml.kernel.org/r/20190530045017.15252-2-keescook@chromium.org
    Signed-off-by: Kees Cook
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Matthew Wilcox
    Cc: Alexander Popov
    Cc: Alexander Potapenko
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

30 Mar, 2019

1 commit

  • Patch series "iommu/io-pgtable-arm-v7s: Use DMA32 zone for page tables",
    v6.

    This is a followup to the discussion in [1], [2].

    IOMMUs using ARMv7 short-descriptor format require page tables (level 1
    and 2) to be allocated within the first 4GB of RAM, even on 64-bit
    systems.

    For L1 tables that are bigger than a page, we can just use
    __get_free_pages with GFP_DMA32 (on arm64 systems only, arm would still
    use GFP_DMA).

    For L2 tables that only take 1KB, it would be a waste to allocate a full
    page, so we considered 3 approaches:
    1. This series, adding support for GFP_DMA32 slab caches.
    2. genalloc, which requires pre-allocating the maximum number of L2 page
    tables (4096, so 4MB of memory).
    3. page_frag, which is not very memory-efficient as it is unable to reuse
    freed fragments until the whole page is freed. [3]

    This series is the most memory-efficient approach.

    stable@ note:
    We confirmed that this is a regression, and IOMMU errors happen on 4.19
    and linux-next/master on MT8173 (elm, Acer Chromebook R13). The issue
    most likely starts from commit ad67f5a6545f ("arm64: replace ZONE_DMA
    with ZONE_DMA32"), i.e. 4.15, and presumably breaks a number of Mediatek
    platforms (and maybe others?).

    [1] https://lists.linuxfoundation.org/pipermail/iommu/2018-November/030876.html
    [2] https://lists.linuxfoundation.org/pipermail/iommu/2018-December/031696.html
    [3] https://patchwork.codeaurora.org/patch/671639/

    This patch (of 3):

    IOMMUs using ARMv7 short-descriptor format require page tables to be
    allocated within the first 4GB of RAM, even on 64-bit systems. On arm64,
    this is done by passing GFP_DMA32 flag to memory allocation functions.

    For IOMMU L2 tables that only take 1KB, it would be a waste to allocate
    a full page using get_free_pages, so we considered 3 approaches:
    1. This patch, adding support for GFP_DMA32 slab caches.
    2. genalloc, which requires pre-allocating the maximum number of L2
    page tables (4096, so 4MB of memory).
    3. page_frag, which is not very memory-efficient as it is unable
    to reuse freed fragments until the whole page is freed.

    This change makes it possible to create a custom cache in DMA32 zone using
    kmem_cache_create, then allocate memory using kmem_cache_alloc.

    We do not create a DMA32 kmalloc cache array, as there are currently no
    users of kmalloc(..., GFP_DMA32). These calls will continue to trigger a
    warning, as we keep GFP_DMA32 in GFP_SLAB_BUG_MASK.

    This implies that calls to kmem_cache_*alloc on a SLAB_CACHE_DMA32
    kmem_cache must _not_ use GFP_DMA32 (it is anyway redundant and
    unnecessary).
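    For example (a usage sketch; the cache name, sizes, and function names
    here are purely illustrative):

        static struct kmem_cache *l2_table_cache;

        static int __init l2_table_cache_init(void)
        {
                /* pages backing this cache come from ZONE_DMA32 */
                l2_table_cache = kmem_cache_create("io-pgtable-l2",
                                                   SZ_1K, SZ_1K,
                                                   SLAB_CACHE_DMA32, NULL);
                return l2_table_cache ? 0 : -ENOMEM;
        }

        static void *l2_table_alloc(void)
        {
                /* note: no GFP_DMA32 here -- the zone comes from the cache flag */
                return kmem_cache_zalloc(l2_table_cache, GFP_KERNEL);
        }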

    Link: http://lkml.kernel.org/r/20181210011504.122604-2-drinkcat@chromium.org
    Signed-off-by: Nicolas Boichat
    Acked-by: Vlastimil Babka
    Acked-by: Will Deacon
    Cc: Robin Murphy
    Cc: Joerg Roedel
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Sasha Levin
    Cc: Huaisheng Ye
    Cc: Mike Rapoport
    Cc: Yong Wu
    Cc: Matthias Brugger
    Cc: Tomasz Figa
    Cc: Yingjoe Chen
    Cc: Christoph Hellwig
    Cc: Matthew Wilcox
    Cc: Hsin-Yi Wang
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicolas Boichat
     

06 Mar, 2019

1 commit

  • Move the memcg_kmem_enabled() checks into memcg kmem charge/uncharge
    functions, so, the users don't have to explicitly check that condition.

    This is purely a code cleanup patch without any functional change. Only
    the order of checks in memcg_charge_slab() can potentially be changed,
    but functionally it will be the same. This should not matter, as
    memcg_charge_slab() is not in the hot path.
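    A hedged sketch of the idea, using memcg_charge_slab() as the example
    (not a verbatim copy of the patch):

        static __always_inline int memcg_charge_slab(struct page *page,
                                                     gfp_t gfp, int order,
                                                     struct kmem_cache *s)
        {
                /* callers no longer need to test memcg_kmem_enabled() */
                if (!memcg_kmem_enabled() || is_root_cache(s))
                        return 0;
                return memcg_kmem_charge_memcg(page, gfp, order,
                                               s->memcg_params.memcg);
        }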

    Link: http://lkml.kernel.org/r/20190103161203.162375-1-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     

22 Feb, 2019

2 commits

  • kmemleak keeps two global variables, min_addr and max_addr, which store
    the range of valid (encountered by kmemleak) pointer values, which it
    later uses to speed up pointer lookup when scanning blocks.

    With tagged pointers this range will get bigger than it needs to be. This
    patch makes kmemleak untag pointers before saving them to min_addr and
    max_addr and when performing a lookup.
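    Conceptually (a sketch only; the helper name below is made up and
    kmemleak's real bookkeeping is more involved), the pointer is passed
    through kasan_reset_tag() before the bounds are updated or consulted:

        static void kmemleak_update_bounds(unsigned long ptr)
        {
                unsigned long untagged =
                        (unsigned long)kasan_reset_tag((void *)ptr);

                /* keep the valid-pointer window tight despite random tags */
                min_addr = min(min_addr, untagged);
                max_addr = max(max_addr, untagged);
        }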

    Link: http://lkml.kernel.org/r/16e887d442986ab87fe87a755815ad92fa431a5f.1550066133.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Tested-by: Qian Cai
    Acked-by: Catalin Marinas
    Cc: Alexander Potapenko
    Cc: Andrey Ryabinin
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Dmitry Vyukov
    Cc: Evgeniy Stepanov
    Cc: Joonsoo Kim
    Cc: Kostya Serebryany
    Cc: Pekka Enberg
    Cc: Vincenzo Frascino
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     
  • Right now we call kmemleak hooks before assigning tags to pointers in
    KASAN hooks. As a result, when an object gets allocated, kmemleak sees a
    differently tagged pointer, compared to the one it sees when the object
    gets freed. Fix it by calling KASAN hooks before kmemleak's ones.

    Link: http://lkml.kernel.org/r/cd825aa4897b0fc37d3316838993881daccbe9f5.1549921721.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Reported-by: Qian Cai
    Cc: Alexander Potapenko
    Cc: Andrey Ryabinin
    Cc: Catalin Marinas
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Dmitry Vyukov
    Cc: Evgeniy Stepanov
    Cc: Joonsoo Kim
    Cc: Kostya Serebryany
    Cc: Pekka Enberg
    Cc: Vincenzo Frascino
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     

29 Dec, 2018

1 commit

  • Patch series "kasan: add software tag-based mode for arm64", v13.

    This patchset adds a new software tag-based mode to KASAN [1]. (Initially
    this mode was called KHWASAN, but it got renamed, see the naming rationale
    at the end of this section).

    The plan is to implement HWASan [2] for the kernel with the incentive
    that it's going to have performance comparable to KASAN, but at the same
    time consume much less memory, trading that off for somewhat imprecise
    bug detection and being supported only on arm64.

    The underlying ideas of the approach used by software tag-based KASAN are:

    1. By using the Top Byte Ignore (TBI) arm64 CPU feature, we can store
    pointer tags in the top byte of each kernel pointer.

    2. Using shadow memory, we can store memory tags for each chunk of kernel
    memory.

    3. On each memory allocation, we can generate a random tag, embed it into
    the returned pointer and set the memory tags that correspond to this
    chunk of memory to the same value.

    4. By using compiler instrumentation, before each memory access we can add
    a check that the pointer tag matches the tag of the memory that is being
    accessed.

    5. On a tag mismatch we report an error.
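    A small sketch of ideas 1 and 3 above, mirroring the set_tag()/get_tag()
    style of helpers (the shift of 56 assumes the arm64 top byte; simplified,
    the real helpers go through arch-specific macros):

        #define KASAN_TAG_SHIFT 56
        #define KASAN_TAG_MASK  0xffUL

        static inline void *set_tag(const void *addr, u8 tag)
        {
                /* embed the tag into the top (TBI-ignored) byte */
                return (void *)(((u64)addr &
                                 ~(KASAN_TAG_MASK << KASAN_TAG_SHIFT)) |
                                ((u64)tag << KASAN_TAG_SHIFT));
        }

        static inline u8 get_tag(const void *addr)
        {
                return (u8)((u64)addr >> KASAN_TAG_SHIFT);
        }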

    With this patchset the existing KASAN mode gets renamed to generic KASAN,
    with the word "generic" meaning that the implementation can be supported
    by any architecture as it is purely software.

    The new mode this patchset adds is called software tag-based KASAN. The
    word "tag-based" refers to the fact that this mode uses tags embedded into
    the top byte of kernel pointers and the TBI arm64 CPU feature that allows
    dereferencing such pointers. The word "software" here means that shadow
    memory manipulation and tag checking on pointer dereference is done in
    software. As it is the only tag-based implementation right now, "software
    tag-based" KASAN is sometimes referred to as simply "tag-based" in this
    patchset.

    A potential expansion of this mode is a hardware tag-based mode, which
    would use hardware memory tagging support (announced by Arm [3]) instead
    of compiler instrumentation and manual shadow memory manipulation.

    Same as generic KASAN, software tag-based KASAN is strictly a debugging
    feature.

    [1] https://www.kernel.org/doc/html/latest/dev-tools/kasan.html

    [2] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html

    [3] https://community.arm.com/processors/b/blog/posts/arm-a-profile-architecture-2018-developments-armv85a

    ====== Rationale

    On mobile devices, generic KASAN's memory usage is a significant problem.
    One of the main reasons to have tag-based KASAN is to be able to perform a
    similar set of checks as the generic one does, but with lower memory
    requirements.

    Comment from Vishwath Mohan :

    I don't have data on-hand, but anecdotally both ASAN and KASAN have proven
    problematic to enable for environments that don't tolerate the increased
    memory pressure well. This includes

    (a) Low-memory form factors - Wear, TV, Things, lower-tier phones like Go,
    (c) Connected components like Pixel's visual core [1].

    These are both places I'd love to have a low(er) memory footprint option at
    my disposal.

    Comment from Evgenii Stepanov :

    Looking at a live Android device under load, slab (according to
    /proc/meminfo) + kernel stack take 8-10% available RAM (~350MB). KASAN's
    overhead of 2x - 3x on top of it is not insignificant.

    Not having this overhead enables near-production use - ex. running
    KASAN/KHWASAN kernel on a personal, daily-use device to catch bugs that do
    not reproduce in test configuration. These are the ones that often cost
    the most engineering time to track down.

    CPU overhead is bad, but generally tolerable. RAM is critical, in our
    experience. Once it gets low enough, OOM-killer makes your life
    miserable.

    [1] https://www.blog.google/products/pixel/pixel-visual-core-image-processing-and-machine-learning-pixel-2/

    ====== Technical details

    Software tag-based KASAN mode is implemented in a very similar way to the
    generic one. This patchset essentially does the following:

    1. TCR_TBI1 is set to enable Top Byte Ignore.

    2. Shadow memory is used (with a different scale, 1:16, so each shadow
    byte corresponds to 16 bytes of kernel memory) to store memory tags.

    3. All slab objects are aligned to shadow scale, which is 16 bytes.

    4. All pointers returned from the slab allocator are tagged with a random
    tag and the corresponding shadow memory is poisoned with the same value.

    5. Compiler instrumentation is used to insert tag checks. Either by
    calling callbacks or by inlining them (CONFIG_KASAN_OUTLINE and
    CONFIG_KASAN_INLINE flags are reused).

    6. When a tag mismatch is detected in callback instrumentation mode
    KASAN simply prints a bug report. In case of inline instrumentation,
    clang inserts a brk instruction, and KASAN has its own brk handler,
    which reports the bug.

    7. The memory in between slab objects is marked with a reserved tag, and
    acts as a redzone.

    8. When a slab object is freed it's marked with a reserved tag.

    Bug detection is imprecise for two reasons:

    1. We won't catch some small out-of-bounds accesses that fall into the
    same shadow cell as the last byte of a slab object.

    2. We only have 1 byte to store tags, which means we have a 1/256
    probability of a tag match for an incorrect access (actually even
    slightly less due to reserved tag values).

    Despite that, there's a particular type of bug that tag-based KASAN can
    detect compared to generic KASAN: a use-after-free after the object has
    been allocated by someone else.

    ====== Testing

    Some kernel developers voiced a concern that changing the top byte of
    kernel pointers may lead to subtle bugs that are difficult to discover.
    To address this concern deliberate testing has been performed.

    It doesn't seem feasible to do some kind of static checking to find
    potential issues with pointer tagging, so a dynamic approach was taken.
    All pointer comparisons/subtractions have been instrumented in an LLVM
    compiler pass and a kernel module that would print a bug report whenever
    two pointers with different tags are being compared/subtracted (ignoring
    comparisons with NULL pointers and with pointers obtained by casting an
    error code to a pointer type) has been used. Then the kernel has been
    booted in QEMU and on an Odroid C2 board and syzkaller has been run.

    This yielded the following results.

    The two places that look interesting are:

    is_vmalloc_addr in include/linux/mm.h
    is_kernel_rodata in mm/util.c

    Here we compare a pointer with some fixed untagged values to make sure
    that the pointer lies in a particular part of the kernel address space.
    Since tag-based KASAN doesn't add tags to pointers that belong to rodata
    or vmalloc regions, this should work as is. To make sure, debug checks
    have been added to those two functions to verify that the result doesn't
    change whether we operate on pointers with or without untagging.

    A few other cases that don't look that interesting:

    Comparing pointers to achieve unique sorting order of pointee objects
    (e.g. sorting locks addresses before performing a double lock):

    tty_ldisc_lock_pair_timeout in drivers/tty/tty_ldisc.c
    pipe_double_lock in fs/pipe.c
    unix_state_double_lock in net/unix/af_unix.c
    lock_two_nondirectories in fs/inode.c
    mutex_lock_double in kernel/events/core.c

    ep_cmp_ffd in fs/eventpoll.c
    fsnotify_compare_groups fs/notify/mark.c

    Nothing needs to be done here, since the tags embedded into pointers
    don't change, so the sorting order would still be unique.

    Checks that a pointer belongs to some particular allocation:

    is_sibling_entry in lib/radix-tree.c
    object_is_on_stack in include/linux/sched/task_stack.h

    Nothing needs to be done here either, since two pointers can only belong
    to the same allocation if they have the same tag.

    Overall, since the kernel boots and works, there are no critical bugs.
    As for the rest, the traditional kernel testing way (use until fails) is
    the only one that looks feasible.

    Another point here is that tag-based KASAN is available under a separate
    config option that needs to be deliberately enabled. Even though it might
    be used in a "near-production" environment to find bugs that are not found
    during fuzzing or running tests, it is still a debug tool.

    ====== Benchmarks

    The following numbers were collected on Odroid C2 board. Both generic and
    tag-based KASAN were used in inline instrumentation mode.

    Boot time [1]:
    * ~1.7 sec for clean kernel
    * ~5.0 sec for generic KASAN
    * ~5.0 sec for tag-based KASAN

    Network performance [2]:
    * 8.33 Gbits/sec for clean kernel
    * 3.17 Gbits/sec for generic KASAN
    * 2.85 Gbits/sec for tag-based KASAN

    Slab memory usage after boot [3]:
    * ~40 kb for clean kernel
    * ~105 kb (~260% overhead) for generic KASAN
    * ~47 kb (~20% overhead) for tag-based KASAN

    KASAN memory overhead consists of three main parts:
    1. Increased slab memory usage due to redzones.
    2. Shadow memory (the whole of it is reserved once during boot).
    3. Quarantine (grows gradually until some preset limit; the larger the
    limit, the higher the chance to detect a use-after-free).

    Comparing tag-based vs generic KASAN for each of these points:
    1. 20% vs 260% overhead.
    2. 1/16th vs 1/8th of physical memory.
    3. Tag-based KASAN doesn't require quarantine.

    [1] Time before the ext4 driver is initialized.
    [2] Measured as `iperf -s & iperf -c 127.0.0.1 -t 30`.
    [3] Measured as `cat /proc/meminfo | grep Slab`.

    ====== Some notes

    A few notes:

    1. The patchset can be found here:
    https://github.com/xairy/kasan-prototype/tree/khwasan

    2. Building requires a recent Clang version (7.0.0 or later).

    3. Stack instrumentation is not supported yet and will be added later.

    This patch (of 25):

    Tag-based KASAN changes the value of the top byte of pointers returned
    from the kernel allocation functions (such as kmalloc). This patch
    updates KASAN hooks signatures and their usage in SLAB and SLUB code to
    reflect that.

    Link: http://lkml.kernel.org/r/aec2b5e3973781ff8a6bb6760f8543643202c451.1544099024.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Reviewed-by: Andrey Ryabinin
    Reviewed-by: Dmitry Vyukov
    Cc: Christoph Lameter
    Cc: Mark Rutland
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     

18 Aug, 2018

1 commit

  • Introduce a new config option, which is used to replace the repeating
    CONFIG_MEMCG && !CONFIG_SLOB pattern. The next patches add a little more
    memcg+kmem related code, so let's keep the defines cleaner.

    Link: http://lkml.kernel.org/r/153063053670.1818.15013136946600481138.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     

06 Apr, 2018

6 commits

  • The kasan quarantine is designed to delay freeing slab objects to catch
    use-after-free. The quarantine can be large (several percent of machine
    memory size). When kmem_caches are deleted, related objects are flushed
    from the quarantine, but this requires scanning the entire quarantine,
    which can be very slow. We have seen the kernel busily working on this
    while holding slab_mutex and badly affecting cache_reaper, slabinfo
    readers and memcg kmem cache creations.

    It can easily be reproduced by the following script:

    yes . | head -1000000 | xargs stat > /dev/null
    for i in `seq 1 10`; do
    seq 500 | (cd /cg/memory && xargs mkdir)
    seq 500 | xargs -I{} sh -c 'echo $BASHPID > \
    /cg/memory/{}/tasks && exec stat .' > /dev/null
    seq 500 | (cd /cg/memory && xargs rmdir)
    done

    The busy stack:
    kasan_cache_shutdown
    shutdown_cache
    memcg_destroy_kmem_caches
    mem_cgroup_css_free
    css_free_rwork_fn
    process_one_work
    worker_thread
    kthread
    ret_from_fork

    This patch is based on the observation that if the kmem_cache to be
    destroyed is empty then there should not be any objects of this cache in
    the quarantine.
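    The fix boils down to roughly this (a sketch; __kmem_cache_empty() is the
    allocator-side helper this change relies on):

        void kasan_cache_shutdown(struct kmem_cache *cache)
        {
                /* an empty cache cannot have objects in the quarantine,
                 * so skip the expensive full quarantine scan */
                if (!__kmem_cache_empty(cache))
                        quarantine_remove_cache(cache);
        }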

    Without the patch the script got stuck for a couple of hours. With the
    patch the script completed within a second.

    Link: http://lkml.kernel.org/r/20180327230603.54721-1-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Reviewed-by: Andrew Morton
    Acked-by: Andrey Ryabinin
    Acked-by: Christoph Lameter
    Cc: Vladimir Davydov
    Cc: Alexander Potapenko
    Cc: Greg Thelen
    Cc: Dmitry Vyukov
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     
  • If kmem cache sizes are 32-bit, then the usercopy region should be too.

    Link: http://lkml.kernel.org/r/20180305200730.15812-21-adobriyan@gmail.com
    Signed-off-by: Alexey Dobriyan
    Cc: David Miller
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Now that all sizes are properly typed, propagate "unsigned int" down the
    callgraph.

    Link: http://lkml.kernel.org/r/20180305200730.15812-19-adobriyan@gmail.com
    Signed-off-by: Alexey Dobriyan
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • struct kmem_cache::size and ::align were always 32-bit.

    Out of curiosity I created 4GB kmem_cache, it oopsed with division by 0.
    kmem_cache_create(1UL<
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • struct kmem_cache::size has always been "int", all those
    "size_t size" are fake.

    Link: http://lkml.kernel.org/r/20180305200730.15812-5-adobriyan@gmail.com
    Signed-off-by: Alexey Dobriyan
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • KMALLOC_MAX_CACHE_SIZE is 32-bit so is the largest kmalloc cache size.

    Christoph said:
    :
    : Ok SLABs maximum allocation size is limited to 32M (see
    : include/linux/slab.h:
    :
    : #define KMALLOC_SHIFT_HIGH ((MAX_ORDER + PAGE_SHIFT - 1) <= 25 ? \
    :                             (MAX_ORDER + PAGE_SHIFT - 1) : 25)
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

04 Feb, 2018

1 commit

  • Pull hardened usercopy whitelisting from Kees Cook:
    "Currently, hardened usercopy performs dynamic bounds checking on slab
    cache objects. This is good, but still leaves a lot of kernel memory
    available to be copied to/from userspace in the face of bugs.

    To further restrict what memory is available for copying, this creates
    a way to whitelist specific areas of a given slab cache object for
    copying to/from userspace, allowing much finer granularity of access
    control.

    Slab caches that are never exposed to userspace can declare no
    whitelist for their objects, thereby keeping them unavailable to
    userspace via dynamic copy operations. (Note, an implicit form of
    whitelisting is the use of constant sizes in usercopy operations and
    get_user()/put_user(); these bypass all hardened usercopy checks since
    these sizes cannot change at runtime.)

    This new check is WARN-by-default, so any mistakes can be found over
    the next several releases without breaking anyone's system.

    The series has roughly the following sections:
    - remove %p and improve reporting with offset
    - prepare infrastructure and whitelist kmalloc
    - update VFS subsystem with whitelists
    - update SCSI subsystem with whitelists
    - update network subsystem with whitelists
    - update process memory with whitelists
    - update per-architecture thread_struct with whitelists
    - update KVM with whitelists and fix ioctl bug
    - mark all other allocations as not whitelisted
    - update lkdtm for more sensible test coverage"

    * tag 'usercopy-v4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (38 commits)
    lkdtm: Update usercopy tests for whitelisting
    usercopy: Restrict non-usercopy caches to size 0
    kvm: x86: fix KVM_XEN_HVM_CONFIG ioctl
    kvm: whitelist struct kvm_vcpu_arch
    arm: Implement thread_struct whitelist for hardened usercopy
    arm64: Implement thread_struct whitelist for hardened usercopy
    x86: Implement thread_struct whitelist for hardened usercopy
    fork: Provide usercopy whitelisting for task_struct
    fork: Define usercopy region in thread_stack slab caches
    fork: Define usercopy region in mm_struct slab caches
    net: Restrict unwhitelisted proto caches to size 0
    sctp: Copy struct sctp_sock.autoclose to userspace using put_user()
    sctp: Define usercopy region in SCTP proto slab cache
    caif: Define usercopy region in caif proto slab cache
    ip: Define usercopy region in IP proto slab cache
    net: Define usercopy region in struct proto slab cache
    scsi: Define usercopy region in scsi_sense_cache slab cache
    cifs: Define usercopy region in cifs_request slab cache
    vxfs: Define usercopy region in vxfs_inode slab cache
    ufs: Define usercopy region in ufs_inode_cache slab cache
    ...

    Linus Torvalds
     

01 Feb, 2018

1 commit

  • The calculate_alignment() function is only used inside slab_common.c. So
    make it static and let the compiler do more optimizations.

    After this patch there's a small improvement in text and data size.

    $ gcc --version
    gcc (GCC) 7.2.1 20171128

    Before:
    text data bss dec hex filename
    9890457 3828702 1212364 14931523 e3d643 vmlinux

    After:
    text data bss dec hex filename
    9890437 3828670 1212364 14931471 e3d60f vmlinux

    Also I fixed a style problem reported by checkpatch.

    WARNING: Missing a blank line after declarations
    #53: FILE: mm/slab_common.c:286:
    + unsigned long ralign = cache_line_size();
    + while (size <= ralign / 2)
    Acked-by: Michal Hocko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Byongho Lee
     

16 Jan, 2018

2 commits

  • Mark the kmalloc slab caches as entirely whitelisted. These caches
    are frequently used to fulfill kernel allocations that contain data
    to be copied to/from userspace. Internal-only uses are also common,
    but are scattered in the kernel. For now, mark all the kmalloc caches
    as whitelisted.

    This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
    whitelisting code in the last public patch of grsecurity/PaX based on my
    understanding of the code. Changes or omissions from the original code are
    mine and don't reflect the original grsecurity/PaX code.

    Signed-off-by: David Windsor
    [kees: merged in moved kmalloc hunks, adjust commit log]
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrew Morton
    Cc: linux-mm@kvack.org
    Cc: linux-xfs@vger.kernel.org
    Signed-off-by: Kees Cook
    Acked-by: Christoph Lameter

    David Windsor
     
  • This patch prepares the slab allocator to handle caches having annotations
    (useroffset and usersize) defining usercopy regions.

    This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
    whitelisting code in the last public patch of grsecurity/PaX based on
    my understanding of the code. Changes or omissions from the original
    code are mine and don't reflect the original grsecurity/PaX code.

    Currently, hardened usercopy performs dynamic bounds checking on slab
    cache objects. This is good, but still leaves a lot of kernel memory
    available to be copied to/from userspace in the face of bugs. To further
    restrict what memory is available for copying, this creates a way to
    whitelist specific areas of a given slab cache object for copying to/from
    userspace, allowing much finer granularity of access control. Slab caches
    that are never exposed to userspace can declare no whitelist for their
    objects, thereby keeping them unavailable to userspace via dynamic copy
    operations. (Note, an implicit form of whitelisting is the use of constant
    sizes in usercopy operations and get_user()/put_user(); these bypass
    hardened usercopy checks since these sizes cannot change at runtime.)

    To support this whitelist annotation, usercopy region offset and size
    members are added to struct kmem_cache. The slab allocator receives a
    new function, kmem_cache_create_usercopy(), that creates a new cache
    with a usercopy region defined, suitable for declaring spans of fields
    within the objects that get copied to/from userspace.

    In this patch, the default kmem_cache_create() marks the entire allocation
    as whitelisted, leaving it semantically unchanged. Once all fine-grained
    whitelists have been added (in subsequent patches), this will be changed
    to a usersize of 0, making caches created with kmem_cache_create() not
    copyable to/from userspace.
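    As a usage sketch (the structure, field layout, and cache name here are
    hypothetical), only the region that genuinely needs to reach userspace is
    whitelisted:

        struct foo {
                spinlock_t lock;        /* never copied to/from userspace */
                char data[128];         /* the only usercopy-able span */
        };

        static struct kmem_cache *foo_cachep;

        static int __init foo_cache_init(void)
        {
                foo_cachep = kmem_cache_create_usercopy("foo_cache",
                                sizeof(struct foo), 0, SLAB_PANIC,
                                offsetof(struct foo, data),
                                sizeof_field(struct foo, data), NULL);
                return foo_cachep ? 0 : -ENOMEM;
        }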

    After the entire usercopy whitelist series is applied, less than 15%
    of the slab cache memory remains exposed to potential usercopy bugs
    after a fresh boot:

    Total Slab Memory: 48074720
    Usercopyable Memory: 6367532 13.2%
    task_struct 0.2% 4480/1630720
    RAW 0.3% 300/96000
    RAWv6 2.1% 1408/64768
    ext4_inode_cache 3.0% 269760/8740224
    dentry 11.1% 585984/5273856
    mm_struct 29.1% 54912/188448
    kmalloc-8 100.0% 24576/24576
    kmalloc-16 100.0% 28672/28672
    kmalloc-32 100.0% 81920/81920
    kmalloc-192 100.0% 96768/96768
    kmalloc-128 100.0% 143360/143360
    names_cache 100.0% 163840/163840
    kmalloc-64 100.0% 167936/167936
    kmalloc-256 100.0% 339968/339968
    kmalloc-512 100.0% 350720/350720
    kmalloc-96 100.0% 455616/455616
    kmalloc-8192 100.0% 655360/655360
    kmalloc-1024 100.0% 812032/812032
    kmalloc-4096 100.0% 819200/819200
    kmalloc-2048 100.0% 1310720/1310720

    After some kernel build workloads, the percentage (mainly driven by
    dentry and inode caches expanding) drops under 10%:

    Total Slab Memory: 95516184
    Usercopyable Memory: 8497452 8.8%
    task_struct 0.2% 4000/1456000
    RAW 0.3% 300/96000
    RAWv6 2.1% 1408/64768
    ext4_inode_cache 3.0% 1217280/39439872
    dentry 11.1% 1623200/14608800
    mm_struct 29.1% 73216/251264
    kmalloc-8 100.0% 24576/24576
    kmalloc-16 100.0% 28672/28672
    kmalloc-32 100.0% 94208/94208
    kmalloc-192 100.0% 96768/96768
    kmalloc-128 100.0% 143360/143360
    names_cache 100.0% 163840/163840
    kmalloc-64 100.0% 245760/245760
    kmalloc-256 100.0% 339968/339968
    kmalloc-512 100.0% 350720/350720
    kmalloc-96 100.0% 563520/563520
    kmalloc-8192 100.0% 655360/655360
    kmalloc-1024 100.0% 794624/794624
    kmalloc-4096 100.0% 819200/819200
    kmalloc-2048 100.0% 1257472/1257472

    Signed-off-by: David Windsor
    [kees: adjust commit log, split out a few extra kmalloc hunks]
    [kees: add field names to function declarations]
    [kees: convert BUGs to WARNs and fail closed]
    [kees: add attack surface reduction analysis to commit log]
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrew Morton
    Cc: linux-mm@kvack.org
    Cc: linux-xfs@vger.kernel.org
    Signed-off-by: Kees Cook
    Acked-by: Christoph Lameter

    David Windsor
     

16 Nov, 2017

4 commits

  • Convert all allocations that used a NOTRACK flag to stop using it.
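
    The conversion is mechanical; a hedged sketch of what a typical caller looks
    like afterwards (the function is invented for the example):

        #include <linux/slab.h>

        static void *alloc_buffer(size_t len)
        {
                /* Was: kmalloc(len, GFP_KERNEL | __GFP_NOTRACK); the kmemcheck
                 * opt-out flag is simply dropped now that nothing consumes it. */
                return kmalloc(len, GFP_KERNEL);
        }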

    Link: http://lkml.kernel.org/r/20171007030159.22241-3-alexander.levin@verizon.com
    Signed-off-by: Sasha Levin
    Cc: Alexander Potapenko
    Cc: Eric W. Biederman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Steven Rostedt
    Cc: Tim Hansen
    Cc: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Levin, Alexander (Sasha Levin)
     
  • Patch series "kmemcheck: kill kmemcheck", v2.

    As discussed at LSF/MM, kill kmemcheck.

    KASan is a replacement that is able to work without the limitation of
    kmemcheck (single CPU, slow). KASan is already upstream.

    We are also not aware of any users of kmemcheck (or users who don't
    consider KASan as a suitable replacement).

    The only objection was that since KASAN wasn't supported by all GCC
    versions provided by distros at that time we should hold off for 2
    years, and try again.

    Now that 2 years have passed, and all distros provide gcc that supports
    KASAN, kill kmemcheck again for the very same reasons.

    This patch (of 4):

    Remove kmemcheck annotations, and calls to kmemcheck from the kernel.

    [alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs]
    Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com
    Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com
    Signed-off-by: Sasha Levin
    Cc: Alexander Potapenko
    Cc: Eric W. Biederman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Steven Rostedt
    Cc: Tim Hansen
    Cc: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Levin, Alexander (Sasha Levin)
     
  • Add sparse-checked slab_flags_t for struct kmem_cache::flags (SLAB_POISON,
    etc).

    SLAB is bloated by switching to "unsigned long", but only temporarily.
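
    A hedged sketch of what the sparse-checked type looks like (the flag values
    shown are illustrative rather than the kernel's exact constants):

        /* __bitwise lets sparse flag any mixing of slab_flags_t with plain ints. */
        typedef unsigned int __bitwise slab_flags_t;

        #define SLAB_RED_ZONE       ((__force slab_flags_t)0x00000400U)
        #define SLAB_POISON         ((__force slab_flags_t)0x00000800U)
        #define SLAB_HWCACHE_ALIGN  ((__force slab_flags_t)0x00002000U)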

    Link: http://lkml.kernel.org/r/20171021100225.GA22428@avx2
    Signed-off-by: Alexey Dobriyan
    Acked-by: Pekka Enberg
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • The kernel may panic when an OOM happens without a killable process.
    Sometimes this is caused by huge unreclaimable slabs used by the kernel.

    Although kdump could help debug such a problem, it is not available on
    all architectures and may itself malfunction. And since the kernel
    already panics, it is worth capturing this information in dmesg to aid
    troubleshooting.

    Print out unreclaimable slab info (used size and total size) for caches
    whose actual memory usage is not zero (num_objs * size != 0) when the
    amount of unreclaimable slab memory is greater than total user memory
    (LRU pages); a sketch of this trigger condition follows the sample
    output below.

    The output looks like:

    Unreclaimable slab info:
    Name Used Total
    rpc_buffers 31KB 31KB
    rpc_tasks 7KB 7KB
    ebitmap_node 1964KB 1964KB
    avtab_node 5024KB 5024KB
    xfs_buf 1402KB 1402KB
    xfs_ili 134KB 134KB
    xfs_efi_item 115KB 115KB
    xfs_efd_item 115KB 115KB
    xfs_buf_item 134KB 134KB
    xfs_log_item_desc 342KB 342KB
    xfs_trans 1412KB 1412KB
    xfs_ifork 212KB 212KB
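
    A hedged sketch of the trigger condition described above (counter and helper
    names approximate what mm already provides, not necessarily what the patch
    adds verbatim; the printer itself then skips caches whose num_objs * size is
    zero):

        /* Dump unreclaimable slab caches only when they outweigh all user
         * (LRU) pages. */
        static bool should_dump_unreclaim_slab(void)
        {
                unsigned long nr_lru;

                nr_lru = global_node_page_state(NR_ACTIVE_ANON) +
                         global_node_page_state(NR_INACTIVE_ANON) +
                         global_node_page_state(NR_ACTIVE_FILE) +
                         global_node_page_state(NR_INACTIVE_FILE) +
                         global_node_page_state(NR_UNEVICTABLE);

                return global_node_page_state(NR_SLAB_UNRECLAIMABLE) > nr_lru;
        }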

    [yang.s@alibaba-inc.com: v11]
    Link: http://lkml.kernel.org/r/1507656303-103845-4-git-send-email-yang.s@alibaba-inc.com
    Link: http://lkml.kernel.org/r/1507152550-46205-4-git-send-email-yang.s@alibaba-inc.com
    Signed-off-by: Yang Shi
    Acked-by: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.
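
    For reference, the tag is a single comment on the first line of each file;
    for a C source file it looks like the following (headers and uapi files use a
    /* */ comment instead):

        // SPDX-License-Identifier: GPL-2.0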

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if <5
    lines).

    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

24 Oct, 2017

1 commit

  • READ_ONCE() now has an implicit smp_read_barrier_depends() call, so it
    can be used instead of lockless_dereference() without any change in
    semantics.
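
    A hedged sketch of the substitution (types and names are invented for the
    example):

        struct cfg { int threshold; };
        static struct cfg *shared_cfg;    /* published by another CPU, sketch only */

        static int read_threshold(void)
        {
                /* Was: struct cfg *p = lockless_dereference(shared_cfg); */
                struct cfg *p = READ_ONCE(shared_cfg);

                return p ? p->threshold : 0;
        }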

    Signed-off-by: Will Deacon
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1508840570-22169-4-git-send-email-will.deacon@arm.com
    Signed-off-by: Ingo Molnar

    Will Deacon
     

10 Aug, 2017

1 commit

  • A while ago someone, and I cannot find the email just now, asked if we
    could not implement the RECLAIM_FS inversion stuff with a 'fake' lock
    like we use for other things like workqueues etc. I think this should
    be possible, which allows reducing the 'irq' states and will reduce the
    number of __bfs() lookups we do.

    Removing the 1 IRQ state results in 4 fewer __bfs() walks per
    dependency, improving lockdep performance. And by moving this
    annotation out of the lockdep code it becomes easier for the mm people
    to extend.
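
    The 'fake' lock boils down to a static lockdep map acquired around anything
    that may enter FS reclaim; a hedged sketch (names approximate mm's fs_reclaim
    annotation, not guaranteed verbatim):

        static struct lockdep_map __fs_reclaim_map =
                STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map);

        /* Called from the allocator when an allocation may recurse into
         * FS reclaim. */
        static void fs_reclaim_acquire(gfp_t gfp_mask)
        {
                if ((gfp_mask & (__GFP_FS | __GFP_DIRECT_RECLAIM)) ==
                    (__GFP_FS | __GFP_DIRECT_RECLAIM))
                        lock_map_acquire(&__fs_reclaim_map);
        }

        static void fs_reclaim_release(gfp_t gfp_mask)
        {
                if ((gfp_mask & (__GFP_FS | __GFP_DIRECT_RECLAIM)) ==
                    (__GFP_FS | __GFP_DIRECT_RECLAIM))
                        lock_map_release(&__fs_reclaim_map);
        }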

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Byungchul Park
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Nikolay Borisov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: akpm@linux-foundation.org
    Cc: boqun.feng@gmail.com
    Cc: iamjoonsoo.kim@lge.com
    Cc: kernel-team@lge.com
    Cc: kirill@shutemov.name
    Cc: npiggin@gmail.com
    Cc: walken@google.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

07 Jul, 2017

3 commits

  • Josef's redesign of the balancing between slab caches and the page cache
    requires slab cache statistics at the lruvec level.
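
    A hedged sketch of what accounting a slab page at the lruvec level could look
    like (mod_lruvec_page_state() and the helper name are assumptions made for
    the example, not necessarily the calls the series adds):

        /* Sketch: charge a newly allocated slab page against lruvec counters. */
        static void account_slab_page(struct page *page, int order, bool reclaimable)
        {
                enum node_stat_item idx = reclaimable ? NR_SLAB_RECLAIMABLE
                                                      : NR_SLAB_UNRECLAIMABLE;

                mod_lruvec_page_state(page, idx, 1 << order);
        }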

    Link: http://lkml.kernel.org/r/20170530181724.27197-7-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Josef Bacik
    Cc: Michal Hocko
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The kmem-specific functions do the same thing. Switch and drop.

    Link: http://lkml.kernel.org/r/20170530181724.27197-5-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Josef Bacik
    Cc: Michal Hocko
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Now that the slab counters are moved from the zone to the node level we
    can drop the private memcg node stats and use the official ones.

    Link: http://lkml.kernel.org/r/20170530181724.27197-4-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Josef Bacik
    Cc: Michal Hocko
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner