24 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version of the license

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 3 file(s).
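
    For illustration, the identifier ends up as a one-line comment tag near
    the top of each file. The following is a minimal sketch in C of
    recognizing such a line. The helper name spdx_id_from_line() is
    hypothetical (it is not part of the kernel or of scancode), and it only
    handles a single identifier, not compound SPDX expressions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy the identifier that follows the
 * "SPDX-License-Identifier:" tag in `line` into `out` (at most
 * outlen - 1 characters). Returns 1 when a tag with a non-empty
 * identifier was found, 0 otherwise.
 */
int spdx_id_from_line(const char *line, char *out, size_t outlen)
{
	static const char tag[] = "SPDX-License-Identifier:";
	const char *p;
	size_t n = 0;

	assert(line != NULL && out != NULL);
	p = strstr(line, tag);
	if (!p || outlen == 0)
		return 0;

	/* Skip the tag itself and any blanks after the colon. */
	p += sizeof(tag) - 1;
	while (*p == ' ' || *p == '\t')
		p++;

	/* The identifier ends at whitespace, e.g. before a closing comment. */
	while (p[n] && p[n] != ' ' && p[n] != '\t' && p[n] != '\n' &&
	       n + 1 < outlen) {
		out[n] = p[n];
		n++;
	}
	out[n] = '\0';
	return n > 0;
}
```

    Passing the line "// SPDX-License-Identifier: GPL-2.0-or-later" fills
    `out` with "GPL-2.0-or-later".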

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Richard Fontana
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190520075212.713472955@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

21 May, 2019

3 commits

  • Add SPDX license identifiers to all Make/Kconfig files which:

    - Have no license information of any form

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have MODULE_LICENSE("GPL*") inside, which was used in the initial
    scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

20 May, 2019

2 commits

  • Merge yet more updates from Andrew Morton:
    "A few final bits:

    - large changes to vmalloc, yielding large performance benefits

    - tweak the console-flush-on-panic code

    - a few fixes"

    * emailed patches from Andrew Morton:
    panic: add an option to replay all the printk message in buffer
    initramfs: don't free a non-existent initrd
    fs/writeback.c: use rcu_barrier() to wait for inflight wb switches going into workqueue when umount
    mm/compaction.c: correct zone boundary handling when isolating pages from a pageblock
    mm/vmap: add DEBUG_AUGMENT_LOWEST_MATCH_CHECK macro
    mm/vmap: add DEBUG_AUGMENT_PROPAGATE_CHECK macro
    mm/vmalloc.c: keep track of free blocks for vmap allocation

    Linus Torvalds
     
  • Pull core fixes from Ingo Molnar:
    "This fixes a particularly thorny munmap() bug with MPX, plus fixes a
    host build environment assumption in objtool"

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    objtool: Allow AR to be overridden with HOSTAR
    x86/mpx, mm/core: Fix recursive munmap() corruption

    Linus Torvalds
     

19 May, 2019

4 commits

  • syzbot reported the following error from a tree with a head commit of
    baf76f0c58ae ("slip: make slhc_free() silently accept an error pointer")

    BUG: unable to handle kernel paging request at ffffea0003348000
    #PF error: [normal kernel read fault]
    PGD 12c3f9067 P4D 12c3f9067 PUD 12c3f8067 PMD 0
    Oops: 0000 [#1] PREEMPT SMP KASAN
    CPU: 1 PID: 28916 Comm: syz-executor.2 Not tainted 5.1.0-rc6+ #89
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:constant_test_bit arch/x86/include/asm/bitops.h:314 [inline]
    RIP: 0010:PageCompound include/linux/page-flags.h:186 [inline]
    RIP: 0010:isolate_freepages_block+0x1c0/0xd40 mm/compaction.c:579
    Code: 01 d8 ff 4d 85 ed 0f 84 ef 07 00 00 e8 29 00 d8 ff 4c 89 e0 83 85 38 ff
    ff ff 01 48 c1 e8 03 42 80 3c 38 00 0f 85 31 0a 00 00 8b 2c 24 31 ff 49
    c1 ed 10 41 83 e5 01 44 89 ee e8 3a 01 d8 ff
    RSP: 0018:ffff88802b31eab8 EFLAGS: 00010246
    RAX: 1ffffd4000669000 RBX: 00000000000cd200 RCX: ffffc9000a235000
    RDX: 000000000001ca5e RSI: ffffffff81988cc7 RDI: 0000000000000001
    RBP: ffff88802b31ebd8 R08: ffff88805af700c0 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: ffffea0003348000
    R13: 0000000000000000 R14: ffff88802b31f030 R15: dffffc0000000000
    FS: 00007f61648dc700(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffea0003348000 CR3: 0000000037c64000 CR4: 00000000001426e0
    Call Trace:
    fast_isolate_around mm/compaction.c:1243 [inline]
    fast_isolate_freepages mm/compaction.c:1418 [inline]
    isolate_freepages mm/compaction.c:1438 [inline]
    compaction_alloc+0x1aee/0x22e0 mm/compaction.c:1550

    There is no reproducer and it is difficult to hit -- 1 crash every few
    days. The issue is very similar to the fix in commit 6b0868c820ff
    ("mm/compaction.c: correct zone boundary handling when resetting pageblock
    skip hints"). When isolating free pages around a target pageblock, the
    boundary handling is off by one and can stray into the next pageblock.
    Triggering the syzbot error requires that the end of pageblock is section
    or zone aligned, and that the next section is unpopulated.
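
    The kind of boundary handling described above can be sketched as a
    half-open range that is clamped to both the pageblock and the zone. This
    is an illustration only, not the code of the fix: the function name, the
    struct and the pageblock size of 512 pages are assumptions.

```c
#include <assert.h>

/* Assumed pageblock size in pages; the real value depends on the config. */
#define PAGEBLOCK_NR_PAGES 512UL

/* Half-open PFN range [start, end): `end` itself is never scanned. */
struct pfn_range {
	unsigned long start;
	unsigned long end;
};

/*
 * Hypothetical helper: return the part of the pageblock containing `pfn`
 * that also lies inside the zone [zone_start, zone_end).
 */
struct pfn_range pageblock_scan_range(unsigned long pfn,
				      unsigned long zone_start,
				      unsigned long zone_end)
{
	struct pfn_range r;

	assert(zone_start < zone_end);
	r.start = pfn & ~(PAGEBLOCK_NR_PAGES - 1);	/* round down */
	r.end = r.start + PAGEBLOCK_NR_PAGES;		/* one past the block */

	/* Never stray outside the zone on either side. */
	if (r.start < zone_start)
		r.start = zone_start;
	if (r.end > zone_end)
		r.end = zone_end;
	return r;
}
```

    Because the range is half-open, a scan loop of the form
    "for (p = r.start; p < r.end; p++)" cannot touch the first page of the
    next pageblock.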

    A more subtle consequence of the bug is that pageblocks were being
    improperly used as migration targets which potentially hurts fragmentation
    avoidance in the long-term one page at a time.

    A debugging patch revealed that it's definitely possible to stray outside
    of a pageblock which is not intended. While syzbot cannot be used to
    verify this patch, it was confirmed that the debugging warning no longer
    triggers with this patch applied. It has also been confirmed that the THP
    allocation stress tests are not degraded by this patch.

    Link: http://lkml.kernel.org/r/20190510182124.GI18914@techsingularity.net
    Fixes: e332f741a8dd ("mm, compaction: be selective about what pageblocks to clear skip hints")
    Signed-off-by: Mel Gorman
    Reported-by: syzbot+d84c80f9fe26a0f7a734@syzkaller.appspotmail.com
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Qian Cai
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: # v5.1+
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This macro adds some debug code to check that vmap allocations
    happen in ascending order.

    By default this option is set to 0 and is not active. Activating it
    requires recompiling the kernel: set the macro to 1 and rebuild the
    kernel.
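
    The "set to 1 and recompile" pattern can be sketched as follows. The
    macro and function names here are hypothetical and only illustrate a
    compile-time debug switch; they are not the code added by this commit.

```c
#include <stddef.h>

/* Hypothetical switch: off by default, set to 1 and rebuild to enable. */
#ifndef DEBUG_ASCENDING_CHECK
#define DEBUG_ASCENDING_CHECK 0
#endif

/* Returns 1 when addrs[0..n) never decreases, 0 otherwise. */
int addrs_ascending(const unsigned long *addrs, size_t n)
{
	for (size_t i = 1; i < n; i++)
		if (addrs[i] < addrs[i - 1])
			return 0;
	return 1;
}

/*
 * Returns 1 when a violation was detected. With the switch off the
 * check is compiled out and the function always returns 0.
 */
int debug_check_order(const unsigned long *addrs, size_t n)
{
#if DEBUG_ASCENDING_CHECK
	return addrs_ascending(addrs, n) ? 0 : 1;
#else
	(void)addrs;
	(void)n;
	return 0;
#endif
}
```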

    [urezki@gmail.com: v4]
    Link: http://lkml.kernel.org/r/20190406183508.25273-4-urezki@gmail.com
    Link: http://lkml.kernel.org/r/20190402162531.10888-4-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Reviewed-by: Roman Gushchin
    Cc: Ingo Molnar
    Cc: Joel Fernandes
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Oleksiy Avramchenko
    Cc: Steven Rostedt
    Cc: Tejun Heo
    Cc: Thomas Garnier
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     
  • This macro adds some debug code to check that the augment tree is
    maintained correctly, meaning that every node contains a valid
    subtree_max_size value.

    By default this option is set to 0 and is not active. Activating it
    requires recompiling the kernel: set the macro to 1 and rebuild the
    kernel.
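
    A check of this kind can be sketched on a plain binary tree. The struct
    layout and function names below are assumptions made for illustration;
    the kernel uses its own rb-tree and vmap_area types.

```c
#include <stddef.h>

/* Simplified stand-in for a free area node in an augmented tree. */
struct free_node {
	unsigned long start, end;		/* free range [start, end) */
	unsigned long subtree_max_size;		/* cached maximum free size */
	struct free_node *left, *right;
};

/* Recompute the largest free size under `n` without trusting the cache. */
unsigned long compute_subtree_max(const struct free_node *n)
{
	unsigned long m, l, r;

	if (!n)
		return 0;
	m = n->end - n->start;
	l = compute_subtree_max(n->left);
	r = compute_subtree_max(n->right);
	if (l > m)
		m = l;
	if (r > m)
		m = r;
	return m;
}

/* Returns 1 when every cached value matches the recomputed one. */
int augment_tree_valid(const struct free_node *n)
{
	if (!n)
		return 1;
	if (n->subtree_max_size != compute_subtree_max(n))
		return 0;
	return augment_tree_valid(n->left) && augment_tree_valid(n->right);
}
```

    The recomputation makes this check quadratic in the worst case, which is
    acceptable for a debug-only option but not for normal operation.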

    [urezki@gmail.com: v4]
    Link: http://lkml.kernel.org/r/20190406183508.25273-3-urezki@gmail.com
    Link: http://lkml.kernel.org/r/20190402162531.10888-3-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Reviewed-by: Roman Gushchin
    Cc: Ingo Molnar
    Cc: Joel Fernandes
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Oleksiy Avramchenko
    Cc: Steven Rostedt
    Cc: Tejun Heo
    Cc: Thomas Garnier
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     
  • Patch series "improve vmap allocation", v3.

    Objective
    ---------

    Please have a look for the description at:

    https://lkml.org/lkml/2018/10/19/786

    but let me also summarize it a bit here as well.

    The current implementation has O(N) complexity. Requests with different
    permissive parameters can lead to long allocation time. When I say
    "long" I mean milliseconds.

    Description
    -----------

    This approach organizes the KVA memory layout into free areas of the
    1-ULONG_MAX range, i.e. an allocation is done by looking up free areas
    instead of finding a hole between two busy blocks. Fewer objects are
    needed to represent the free space, so the allocator is less
    fragmented, because free blocks are always as large as possible.

    It uses the augment tree where all free areas are sorted in ascending
    order of va->va_start address in pair with linked list that provides
    O(1) access to prev/next elements.

    Since the tree is augmented, we also maintain the "subtree_max_size" of
    each VA, which reflects the maximum available free block in its left or
    right sub-tree. Knowing that, we can easily traverse toward the lowest
    (left-most path) free area.

    Allocation: ~O(log(N)) complexity. It is sequential allocation method
    therefore tends to maximize locality. The search is done until a first
    suitable block is large enough to encompass the requested parameters.
    Bigger areas are split.
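
    The lookup described above can be sketched on a plain binary search
    tree keyed by start address. This is a simplified illustration: the
    struct and the function name are assumptions, and alignment and the
    vstart constraint are ignored, so no backtracking is needed here.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a free area node in an augmented tree. */
struct free_node {
	unsigned long start, end;		/* free range [start, end) */
	unsigned long subtree_max_size;		/* largest free size below */
	struct free_node *left, *right;
};

static unsigned long subtree_max(const struct free_node *n)
{
	return n ? n->subtree_max_size : 0;
}

/*
 * Return the lowest-addressed free node whose size is at least `size`,
 * or NULL when no node fits. Each step descends one level, so the cost
 * is bounded by the height of the tree.
 */
struct free_node *find_lowest_fit(struct free_node *n, unsigned long size)
{
	assert(size > 0);
	while (n) {
		if (subtree_max(n->left) >= size)
			n = n->left;		/* a lower fit exists */
		else if (n->end - n->start >= size)
			return n;		/* this node is the lowest fit */
		else if (subtree_max(n->right) >= size)
			n = n->right;		/* only higher addresses fit */
		else
			return NULL;
	}
	return NULL;
}
```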

    I copy and paste here the description of how the area is split, since I
    described it in https://lkml.org/lkml/2018/10/19/786

    A free block can be split in three different ways. Their names are
    FL_FIT_TYPE, LE_FIT_TYPE/RE_FIT_TYPE and NE_FIT_TYPE, i.e. they
    correspond to how the requested size and alignment fit into a free
    block.

    FL_FIT_TYPE - in this case a free block is just removed from the free
    list/tree because it fully fits. Compared with the current design there
    is extra work for updating the rb-tree.

    LE_FIT_TYPE/RE_FIT_TYPE - the left/right edge fits. In this case we
    just cut the free block. It is as fast as the current design. Most of
    the vmalloc allocations end up in this case, because the edge is
    always aligned to 1.

    NE_FIT_TYPE - a much less common case. It happens when the requested
    size and alignment fit neither the left nor the right edge, i.e. the
    request lies between them. In this case, during splitting, we have to
    build the remaining left free area and place it back into the free
    list/tree.

    Compared with the current design there are two extra steps. First, we
    have to allocate a new vmap_area structure. Second, we have to insert
    that remaining free block into the address-sorted list/tree.

    To optimize the first step there is a cache with free_vmap objects.
    Instead of allocating from slab we just take an object from the cache
    and reuse it.

    The second step is already well optimized. Since we know the starting
    point in the tree we do not search from the top. Instead, the traversal
    begins from the rb-tree node being split.
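
    The fit types described above can be sketched as a small classifier.
    The FL/LE/RE/NE names come from the description; NOTHING_FIT, the
    function name and the plain integer arguments are assumptions made for
    this illustration.

```c
#include <assert.h>

enum fit_type {
	NOTHING_FIT,	/* assumed name: request lies outside the block */
	FL_FIT_TYPE,	/* request covers the whole free block */
	LE_FIT_TYPE,	/* request starts at the left edge */
	RE_FIT_TYPE,	/* request ends at the right edge */
	NE_FIT_TYPE,	/* request lies strictly inside the block */
};

/*
 * Classify how the request [req_start, req_start + size) fits into the
 * free block [va_start, va_end).
 */
enum fit_type classify_fit(unsigned long va_start, unsigned long va_end,
			   unsigned long req_start, unsigned long size)
{
	unsigned long req_end = req_start + size;

	assert(size > 0);
	if (req_start < va_start || req_end > va_end)
		return NOTHING_FIT;
	if (req_start == va_start)
		return req_end == va_end ? FL_FIT_TYPE : LE_FIT_TYPE;
	if (req_end == va_end)
		return RE_FIT_TYPE;
	return NE_FIT_TYPE;
}
```

    Only NE_FIT_TYPE leaves two remainders, which is why it is the one case
    that needs an extra vmap_area object.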

    De-allocation: ~O(log(N)) complexity. An area is not inserted straight
    away into the tree/list; instead we identify the spot first, checking
    whether it can be merged with its neighbors. The list provides O(1)
    access to prev/next, so the check is fast. To summarize: if the area is
    merged, a large coalesced area is created; if not, the area is just
    linked in, making more fragments.
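
    The merge-or-link decision can be sketched with an address-sorted array
    of free ranges. Note that this is a swapped-in, simpler data structure:
    the array and its linear search stand in for the tree plus linked list,
    so the sketch shows the merging rule only, not the O(log(N)) lookup.
    All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct range {
	unsigned long start, end;	/* free range [start, end) */
};

/*
 * Return [start, end) to the sorted array `r` holding `*n` disjoint free
 * ranges. The freed range is merged with the previous and/or next range
 * when they touch; otherwise it is inserted as a new element. The caller
 * guarantees room for one more element and no overlap.
 */
void free_range(struct range *r, size_t *n,
		unsigned long start, unsigned long end)
{
	size_t i = 0;
	int merge_prev, merge_next;

	assert(start < end);

	/* Find the spot: r[i] is the next neighbour, r[i - 1] the previous. */
	while (i < *n && r[i].start < start)
		i++;

	merge_prev = i > 0 && r[i - 1].end == start;
	merge_next = i < *n && r[i].start == end;

	if (merge_prev && merge_next) {
		/* Bridges two ranges: extend the previous, drop the next. */
		r[i - 1].end = r[i].end;
		memmove(&r[i], &r[i + 1], (*n - i - 1) * sizeof(*r));
		(*n)--;
	} else if (merge_prev) {
		r[i - 1].end = end;
	} else if (merge_next) {
		r[i].start = start;
	} else {
		/* No neighbour touches: link in a new fragment. */
		memmove(&r[i + 1], &r[i], (*n - i) * sizeof(*r));
		r[i].start = start;
		r[i].end = end;
		(*n)++;
	}
}
```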

    There is one more thing that I should mention here. After a VA node is
    modified, its subtree_max_size is updated if it was/is the biggest area
    in its left or right sub-tree. Apart from that, the update can also be
    propagated back to upper levels to fix the tree. For more details
    please have a look at the __augment_tree_propagate_from() function and
    its description.
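
    The upward propagation can be sketched on a plain binary tree with
    parent pointers. The struct and the function name propagate_from() are
    assumptions for illustration and are not the kernel function named
    above.

```c
#include <stddef.h>

/* Simplified stand-in for a free area node with a parent pointer. */
struct free_node {
	unsigned long start, end;		/* free range [start, end) */
	unsigned long subtree_max_size;		/* cached maximum free size */
	struct free_node *parent, *left, *right;
};

/* Maximum of the node's own size and its children's cached values. */
static unsigned long recompute(const struct free_node *n)
{
	unsigned long m = n->end - n->start;

	if (n->left && n->left->subtree_max_size > m)
		m = n->left->subtree_max_size;
	if (n->right && n->right->subtree_max_size > m)
		m = n->right->subtree_max_size;
	return m;
}

/*
 * Walk from the modified node toward the root and refresh the cached
 * values. Once a node's value does not change, the levels above it are
 * already correct, so the walk stops early.
 */
void propagate_from(struct free_node *n)
{
	while (n) {
		unsigned long m = recompute(n);

		if (n->subtree_max_size == m)
			break;
		n->subtree_max_size = m;
		n = n->parent;
	}
}
```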

    Tests and stressing
    -------------------

    I use the "test_vmalloc.sh" test driver available under
    "tools/testing/selftests/vm/" since 5.1-rc1 kernel. Just trigger "sudo
    ./test_vmalloc.sh" to find out how to deal with it.

    Tested on different platforms including x86_64/i686/ARM64/x86_64_NUMA.
    Regarding the last one, I do not have physical access to a NUMA system,
    therefore I emulated it. The stress testing ran for days.

    If you run the test driver in "stress mode", you also need the patch that
    is in Andrew's tree but not in Linux 5.1-rc1. So, please apply it:

    http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/commit/?id=e0cf7749bade6da318e98e934a24d8b62fab512c

    After massive testing, I have not identified any problems like memory
    leaks, crashes or kernel panics. I find it stable, but more testing would
    be good.

    Performance analysis
    --------------------

    I have used two systems to test. One is an i5-3320M CPU @ 2.60GHz and
    the other is a HiKey960 (arm64) board. The i5-3320M runs a 4.20 kernel,
    whereas the Hikey960 uses a 4.15 kernel. Both systems could run 5.1-rc1
    as well, but those results were not ready at the time of writing.

    The test driver currently consists of 8 tests. Three of them correspond
    to the three types of splitting described above (to compare with the
    default). The other 5 do allocations under different conditions.

    a) sudo ./test_vmalloc.sh performance

    When the test driver is run in "performance" mode, it runs all available
    tests pinned to first online CPU with sequential execution test order. We
    do it in order to get stable and repeatable results. Take a look at time
    difference in "long_busy_list_alloc_test". It is not surprising because
    the worst case is O(N).

    # i5-3320M
    How many cycles all tests took:
    CPU0=646919905370(default) cycles vs CPU0=193290498550(patched) cycles

    # See detailed table with results here:
    ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_default.txt
    ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_patched.txt

    # Hikey960 8x CPUs
    How many cycles all tests took:
    CPU0=3478683207 cycles vs CPU0=463767978 cycles

    # See detailed table with results here:
    ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_default.txt
    ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_patched.txt
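
    As a worked example, the cycle counts quoted above can be turned into
    speedup ratios. The helper below is a trivial sketch and its name is
    hypothetical.

```c
#include <assert.h>

/* Ratio of cycle counts: values above 1.0 mean the patched run was faster. */
double speedup(unsigned long long before, unsigned long long after)
{
	assert(after != 0);
	return (double)before / (double)after;
}
```

    With the numbers above, speedup(646919905370ULL, 193290498550ULL) is
    about 3.35 for the i5-3320M and speedup(3478683207ULL, 463767978ULL) is
    about 7.5 for the Hikey960.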

    b) time sudo ./test_vmalloc.sh test_repeat_count=1

    With this configuration, all tests are run on all available online CPUs.
    Before running, each CPU shuffles the execution order of its tests,
    which gives random allocation behaviour. So it is a rough comparison,
    but it gives a clear picture.

    # i5-3320M
    default                   vs    patched
    real    101m22.813s             real    0m56.805s
    user    0m0.011s                user    0m0.015s
    sys     0m5.076s                sys     0m0.023s

    # See detailed table with results here:
    ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_default.txt
    ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_patched.txt

    # Hikey960 8x CPUs
    default                   vs    patched
    real    unknown                 real    4m25.214s
    user    unknown                 user    0m0.011s
    sys     unknown                 sys     0m0.670s

    I did not manage to complete this test on the "default Hikey960" kernel
    version. After 24 hours it was still running, therefore I had to cancel
    it. That is why real/user/sys are "unknown".

    This patch (of 3):

    Currently an allocation of a new vmap area is done by iterating over
    the busy list (complexity O(n)) until a suitable hole is found between
    two busy areas. Therefore each new allocation causes the list to grow.
    Due to an over-fragmented list and different permissive parameters, an
    allocation can take a long time. For example, on embedded devices it
    takes milliseconds.

    This patch organizes the KVA memory layout into free areas of the
    1-ULONG_MAX range. It uses an augment red-black tree that keeps blocks
    sorted by their offsets in pair with linked list keeping the free space in
    order of increasing addresses.

    Nodes are augmented with the size of the maximum available free block in
    their left or right sub-tree. That allows a decision to be made and the
    traversal to proceed toward the block that will fit and will have the
    lowest start address, i.e. it is sequential allocation.

    Allocation: to allocate a new block a search is done over the tree until a
    suitable lowest(left most) block is large enough to encompass: the
    requested size, alignment and vstart point. If the block is bigger than
    requested size - it is split.

    De-allocation: when a busy vmap area is freed it can either be merged or
    inserted into the tree. The red-black tree allows a spot to be found
    efficiently, whereas the linked list provides constant-time access to
    the previous and next blocks to check if merging can be done. When a
    de-allocated memory chunk is merged, a large coalesced area is created.

    Complexity: ~O(log(N))

    [urezki@gmail.com: v3]
    Link: http://lkml.kernel.org/r/20190402162531.10888-2-urezki@gmail.com
    [urezki@gmail.com: v4]
    Link: http://lkml.kernel.org/r/20190406183508.25273-2-urezki@gmail.com
    Link: http://lkml.kernel.org/r/20190321190327.11813-2-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Cc: Matthew Wilcox
    Cc: Thomas Garnier
    Cc: Oleksiy Avramchenko
    Cc: Steven Rostedt
    Cc: Joel Fernandes
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     

17 May, 2019

1 commit

  • It turned out that DEBUG_SLAB_LEAK is still broken even after recent
    rescue efforts: when there is a large number of objects, such as
    kmemleak_object, which is normal on a debug kernel,

    # grep kmemleak /proc/slabinfo
    kmemleak_object 2243606 3436210 ...

    reading /proc/slab_allocators could easily loop forever while processing
    the kmemleak_object cache, and any additional freeing or allocating of
    objects will trigger a reprocessing. To make the situation worse,
    soft-lockups could easily happen in this situation, which will call
    printk() to allocate more kmemleak objects and guarantee an infinite
    loop.

    Also, since it seems no one noticed when it was totally broken
    more than 2 years ago - see the commit fcf88917dd43 ("slab: fix a crash
    by reading /proc/slab_allocators"), probably nobody cares about it
    anymore due to the decline of SLAB. Just remove it entirely.

    Suggested-by: Vlastimil Babka
    Suggested-by: Linus Torvalds
    Signed-off-by: Qian Cai
    Signed-off-by: Linus Torvalds

    Qian Cai
     

15 May, 2019

29 commits

  • When a cgroup is reclaimed on behalf of a configured limit, reclaim
    needs to round-robin through all NUMA nodes that hold pages of the memcg
    in question. However, when assembling the mask of candidate NUMA nodes,
    the code only consults the *local* cgroup LRU counters, not the
    recursive counters for the entire subtree. Cgroup limits are frequently
    configured against intermediate cgroups that do not have memory on their
    own LRUs. In this case, the node mask will always come up empty and
    reclaim falls back to scanning only the current node.

    If a cgroup subtree has some memory on one node but the processes are
    bound to another node afterwards, the limit reclaim will never age or
    reclaim that memory anymore.

    To fix this, use the recursive LRU counts for a cgroup subtree to
    determine which nodes hold memory of that cgroup.
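
    The difference between the local and the recursive view can be sketched
    with a toy cgroup tree. Everything below (struct layout, names, the
    fixed array sizes) is hypothetical and only illustrates why the local
    counters yield an empty mask for an intermediate cgroup.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 4
#define MAX_CHILDREN 4

/* Toy cgroup: per-node LRU page counts for this group only, plus children. */
struct cgroup {
	unsigned long lru_pages[MAX_NODES];
	struct cgroup *children[MAX_CHILDREN];
	size_t nr_children;
};

/* Pages on `node` held by `cg` and all of its descendants. */
unsigned long subtree_pages(const struct cgroup *cg, int node)
{
	unsigned long sum;

	assert(node >= 0 && node < MAX_NODES);
	sum = cg->lru_pages[node];
	for (size_t i = 0; i < cg->nr_children; i++)
		sum += subtree_pages(cg->children[i], node);
	return sum;
}

/* Bit N is set when node N holds pages, per the local or recursive view. */
unsigned int node_mask(const struct cgroup *cg, int recursive)
{
	unsigned int mask = 0;

	for (int node = 0; node < MAX_NODES; node++) {
		unsigned long pages = recursive ? subtree_pages(cg, node)
						: cg->lru_pages[node];
		if (pages)
			mask |= 1u << node;
	}
	return mask;
}
```

    For a parent with no pages of its own and a child holding pages on node
    2, the local mask is empty while the recursive mask has bit 2 set. The
    recursive walk here is for illustration only; it says nothing about how
    the kernel obtains the recursive counts.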

    The code has been broken like this forever, so it doesn't seem to be a
    problem in practice. I just noticed it while reviewing the way the LRU
    counters are used in general.

    Link: http://lkml.kernel.org/r/20190412151507.2769-5-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Right now, when somebody needs to know the recursive memory statistics
    and events of a cgroup subtree, they need to walk the entire subtree and
    sum up the counters manually.

    There are two issues with this:

    1. When a cgroup gets deleted, its stats are lost. The state counters
    should all be 0 at that point, of course, but the events are not.
    When this happens, the event counters, which are supposed to be
    monotonic, can go backwards in the parent cgroups.

    2. During regular operation, we always have a certain number of lazily
    freed cgroups sitting around that have been deleted, have no tasks,
    but have a few cache pages remaining. These groups' statistics do not
    change until we eventually hit memory pressure, but somebody
    watching, say, memory.stat on an ancestor has to iterate those every
    time.

    This patch addresses both issues by introducing recursive counters at
    each level that are propagated from the write side when stats change.

    Upward propagation happens when the per-cpu caches spill over into the
    local atomic counter. This is the same thing we do during charge and
    uncharge, except that the latter uses atomic RMWs, which are more
    expensive; stat changes happen at around the same rate. In a sparse
    file test (page faults and reclaim at maximum CPU speed) with 5 cgroup
    nesting levels, perf shows __mod_memcg_page state at ~1%.
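
    The batching scheme described above can be sketched in a single-threaded
    form. This is an illustration under stated assumptions: one plain `long`
    stands in for the per-cpu cache, another for the shared atomic counter,
    and the threshold STAT_BATCH and all names are hypothetical.

```c
#include <stddef.h>

#define STAT_BATCH 32	/* assumed spill threshold */

struct memcg {
	struct memcg *parent;
	long cpu_delta;		/* stands in for one CPU's cached delta */
	long recursive_count;	/* stands in for the shared atomic counter */
};

/*
 * Apply `val` to the cached delta. Only when the delta grows beyond the
 * batch size is it flushed to this group and to every ancestor, so the
 * cost of the upward walk is paid once per batch rather than per update.
 */
void mod_state(struct memcg *cg, long val)
{
	long x = cg->cpu_delta + val;

	if (x > STAT_BATCH || x < -STAT_BATCH) {
		for (struct memcg *m = cg; m; m = m->parent)
			m->recursive_count += x;
		x = 0;
	}
	cg->cpu_delta = x;
}
```

    A reader of recursive_count on any ancestor then gets the subtree total
    without walking the subtree, at the price of an error bounded by the
    batch size per CPU and group.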

    Link: http://lkml.kernel.org/r/20190412151507.2769-4-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • These are getting too big to be inlined in every callsite. They were
    stolen from vmstat.c, which already out-of-lines them, and they have
    only been growing since. The callsites aren't that hot, either.

    Move __mod_memcg_state(), __mod_lruvec_state() and
    __count_memcg_events() out of line and add kerneldoc comments.

    Link: http://lkml.kernel.org/r/20190412151507.2769-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Patch series "mm: memcontrol: memory.stat cost & correctness".

    The cgroup memory.stat file holds recursive statistics for the entire
    subtree. The current implementation does this tree walk on-demand
    whenever the file is read. This is giving us problems in production.

    1. The cost of aggregating the statistics on-demand is high. A lot of
    system service cgroups are mostly idle and their stats don't change
    between reads, yet we always have to check them. There are also always
    some lazily-dying cgroups sitting around that are pinned by a handful
    of remaining page cache; the same applies to them.

    In an application that periodically monitors memory.stat in our
    fleet, we have seen the aggregation consume up to 5% CPU time.

    2. When cgroups die and disappear from the cgroup tree, so do their
    accumulated vm events. The result is that the event counters at
    higher-level cgroups can go backwards and confuse some of our
    automation, let alone people looking at the graphs over time.

    To address both issues, this patch series changes the stat
    implementation to spill counts upwards when the counters change.

    The upward spilling is batched using the existing per-cpu cache. In a
    sparse file stress test with 5 level cgroup nesting, the additional cost
    of the flushing was negligible (a little under 1% of CPU at 100% CPU
    utilization, compared to the 5% of reading memory.stat during regular
    operation).

    This patch (of 4):

    memcg_page_state(), lruvec_page_state(), memcg_sum_events() are
    currently returning the state of the local memcg or lruvec, not the
    recursive state.

    In practice there is a demand for both versions, although the callers
    that want the recursive counts currently sum them up by hand.

    Per default, cgroups are considered recursive entities and generally we
    expect more users of the recursive counters, with the local counts being
    special cases. To reflect that in the name, add a _local suffix to the
    current implementations.

    The following patch will re-incarnate these functions with recursive
    semantics, but with an O(1) implementation.

    [hannes@cmpxchg.org: fix bisection hole]
    Link: http://lkml.kernel.org/r/20190417160347.GC23013@cmpxchg.org
    Link: http://lkml.kernel.org/r/20190412151507.2769-2-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • I spent literally an hour trying to work out why an earlier version of
    my memory.events aggregation code doesn't work properly, only to find
    out I was calling memcg->events instead of memcg->memory_events, which
    is fairly confusing.

    This naming seems in need of reworking, so make it harder to do the
    wrong thing by using vmevents instead of events, which makes it more
    clear that these are vm counters rather than memcg-specific counters.

    There are also a few other inconsistent names in both the percpu and
    aggregated structs, so these are all cleaned up to be more coherent and
    easy to understand.

    This commit contains code cleanup only: there are no logic changes.

    [akpm@linux-foundation.org: fix it for preceding changes]
    Link: http://lkml.kernel.org/r/20190208224319.GA23801@chrisdown.name
    Signed-off-by: Chris Down
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Roman Gushchin
    Cc: Dennis Zhou
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Down
     
  • The semantics of what mincore() considers to be resident is not
    completely clear, but Linux has always (since 2.3.52, which is when
    mincore() was initially done) treated it as "page is available in page
    cache".

    That's potentially a problem, as that [in]directly exposes
    meta-information about pagecache / memory mapping state even about
    memory not strictly belonging to the process executing the syscall,
    opening possibilities for sidechannel attacks.

    Change the semantics of mincore() so that it only reveals pagecache
    information for non-anonymous mappings that belong to files that the
    calling process could (if it tried to) successfully open for writing;
    otherwise we'd be including shared non-exclusive mappings, which

    - is the sidechannel

    - is not the usecase for mincore(), as that's primarily used for data,
    not (shared) text
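
    The rule above can be sketched as a small predicate. The struct, its
    fields and the function name are hypothetical, and treating anonymous
    mappings as always answerable is an assumption of this sketch rather
    than something stated above.

```c
#include <stdbool.h>

/* Hypothetical summary of the mapping that mincore() is asked about. */
struct mapping_info {
	bool anonymous;			/* not backed by a file */
	bool can_open_for_write;	/* caller could open the file for writing */
};

/* Returns true when residency information may be reported to the caller. */
bool may_reveal_residency(const struct mapping_info *m)
{
	if (m->anonymous)
		return true;	/* assumption: the memory is the caller's own */
	return m->can_open_for_write;
}
```

    A read-only mapping of a shared library that the caller cannot write to
    is the case that is no longer answered, which is the side channel the
    change closes.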

    [jkosina@suse.cz: v2]
    Link: http://lkml.kernel.org/r/20190312141708.6652-2-vbabka@suse.cz
    [mhocko@suse.com: restructure can_do_mincore() conditions]
    Link: http://lkml.kernel.org/r/nycvar.YFH.7.76.1903062342020.19912@cbobk.fhfr.pm
    Signed-off-by: Jiri Kosina
    Signed-off-by: Vlastimil Babka
    Acked-by: Josh Snyder
    Acked-by: Michal Hocko
    Originally-by: Linus Torvalds
    Originally-by: Dominique Martinet
    Cc: Andy Lutomirski
    Cc: Dave Chinner
    Cc: Kevin Easton
    Cc: Matthew Wilcox
    Cc: Cyril Hrubis
    Cc: Tejun Heo
    Cc: Kirill A. Shutemov
    Cc: Daniel Gruss
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Kosina
     
  • When freeing a page with an order >= shuffle_page_order randomly select
    the front or back of the list for insertion.

    While the mm tries to defragment physical pages into huge pages this can
    tend to make the page allocator more predictable over time. Inject the
    front-back randomness to preserve the initial randomness established by
    shuffle_free_memory() when the kernel was booted.

    The overhead of this manipulation is constrained by only being applied
    for MAX_ORDER sized pages by default.
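
    The front/back selection can be sketched with a small array-backed
    list. The names, the fixed capacity and passing the random bit in as a
    parameter are assumptions made to keep the example deterministic; in
    this sketch, orders below the threshold always go to the front.

```c
#include <stddef.h>
#include <string.h>

#define LIST_CAP 16

struct free_list {
	int pages[LIST_CAP];	/* page ids, index 0 is the head */
	size_t len;
};

/*
 * Add `page` to the list. Orders below `shuffle_order` go to the front;
 * at or above it, `coin` (one random bit) picks the back (1) or the
 * front (0). Returns 0 on success, -1 when the list is full.
 */
int free_list_add(struct free_list *l, int page, unsigned int order,
		  unsigned int shuffle_order, int coin)
{
	if (l->len >= LIST_CAP)
		return -1;

	if (order >= shuffle_order && coin) {
		l->pages[l->len++] = page;		/* back */
	} else {
		memmove(&l->pages[1], &l->pages[0],
			l->len * sizeof(l->pages[0]));
		l->pages[0] = page;			/* front */
		l->len++;
	}
	return 0;
}
```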

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/154899812788.3165233.9066631950746578517.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: Kees Cook
    Cc: Michal Hocko
    Cc: Dave Hansen
    Cc: Keith Busch
    Cc: Robert Elliott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • In preparation for runtime randomization of the zone lists, take all
    (well, most of) the list_*() functions in the buddy allocator and put
    them in helper functions. Provide a common control point for injecting
    additional behavior when freeing pages.

    [dan.j.williams@intel.com: fix buddy list helpers]
    Link: http://lkml.kernel.org/r/155033679702.1773410.13041474192173212653.stgit@dwillia2-desk3.amr.corp.intel.com
    [vbabka@suse.cz: remove del_page_from_free_area() migratetype parameter]
    Link: http://lkml.kernel.org/r/4672701b-6775-6efd-0797-b6242591419e@suse.cz
    Link: http://lkml.kernel.org/r/154899812264.3165233.5219320056406926223.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Signed-off-by: Vlastimil Babka
    Tested-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Dave Hansen
    Cc: Kees Cook
    Cc: Keith Busch
    Cc: Robert Elliott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Patch series "mm: Randomize free memory", v10.

    This patch (of 3):

    Randomization of the page allocator improves the average utilization of
    a direct-mapped memory-side-cache. Memory side caching is a platform
    capability that Linux has been previously exposed to in HPC
    (high-performance computing) environments on specialty platforms. In
    that instance it was a smaller pool of high-bandwidth-memory relative to
    higher-capacity / lower-bandwidth DRAM. Now, this capability is going
    to be found on general purpose server platforms where DRAM is a cache in
    front of higher latency persistent memory [1].

    Robert offered an explanation of the state of the art of Linux
    interactions with memory-side-caches [2], and I copy it here:

    It's been a problem in the HPC space:
    http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/

    A kernel module called zonesort is available to try to help:
    https://software.intel.com/en-us/articles/xeon-phi-software

    and this abandoned patch series proposed that for the kernel:
    https://lkml.kernel.org/r/20170823100205.17311-1-lukasz.daniluk@intel.com

    Dan's patch series doesn't attempt to ensure buffers won't conflict, but
    it does reduce the chance that the buffers will. This will make performance
    more consistent, albeit slower than "optimal" (which is near impossible
    to attain in a general-purpose kernel). That's better than forcing
    users to deploy remedies like:
    "To eliminate this gradual degradation, we have added a Stream
    measurement to the Node Health Check that follows each job;
    nodes are rebooted whenever their measured memory bandwidth
    falls below 300 GB/s."

    A replacement for zonesort was merged upstream in commit cc9aec03e58f
    ("x86/numa_emulation: Introduce uniform split capability"). With this
    numa_emulation capability, memory can be split into cache sized
    ("near-memory" sized) numa nodes. A bind operation to such a node, and
    disabling workloads on other nodes, enables full cache performance.
    However, once the workload exceeds the cache size then cache conflicts
    are unavoidable. While HPC environments might be able to tolerate
    time-scheduling of cache sized workloads, for general purpose server
    platforms, the oversubscribed cache case will be the common case.

    The worst case scenario is that a server system owner benchmarks a
    workload at boot with an un-contended cache only to see that performance
    degrade over time, even below the average cache performance due to
    excessive conflicts. Randomization clips the peaks and fills in the
    valleys of cache utilization to yield steady average performance.

    Here are some performance impact details of the patches:

    1/ An Intel internal synthetic memory bandwidth measurement tool saw a
    3X speedup in a contrived case that tries to force cache conflicts.
    The contrived case used the numa_emulation capability to force an
    instance of the benchmark to be run in two of the near-memory sized
    numa nodes. If both instances were placed on the same emulated node they
    would fit and cause zero conflicts. While on separate emulated nodes
    without randomization they underutilized the cache and conflicted
    unnecessarily due to the in-order allocation per node.

    2/ A well known Java server application benchmark was run with a heap
    size that exceeded cache size by 3X. The cache conflict rate was 8%
    for the first run and degraded to 21% after page allocator aging. With
    randomization enabled the rate levelled out at 11%.

    3/ A MongoDB workload did not observe measurable difference in
    cache-conflict rates, but the overall throughput dropped by 7% with
    randomization in one case.

    4/ Mel Gorman ran his suite of performance workloads with randomization
    enabled on platforms without a memory-side-cache and saw a mix of some
    improvements and some losses [3].

    While there is potentially significant improvement for applications that
    depend on low latency access across a wide working-set, the performance
    may be negligible to negative for other workloads. For this reason the
    shuffle capability defaults to off unless a direct-mapped
    memory-side-cache is detected. Even then, the page_alloc.shuffle=0
    parameter can be specified to disable the randomization on those systems.

    Outside of memory-side-cache utilization concerns there is a potential
    security benefit from randomization. Some data exfiltration and
    return-oriented-programming attacks rely on the ability to infer the
    location of sensitive data objects. The kernel page allocator, especially
    early in system boot, has predictable first-in-first-out behavior for
    physical pages. Pages are freed in physical address order when first
    onlined.

    Quoting Kees:
    "While we already have a base-address randomization
    (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
    memory layouts would certainly be using the predictability of
    allocation ordering (i.e. for attacks where the base address isn't
    important: only the relative positions between allocated memory).
    This is common in lots of heap-style attacks. They try to gain
    control over ordering by spraying allocations, etc.

    I'd really like to see this because it gives us something similar
    to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."

    While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
    caches, it leaves the vast bulk of memory to be predictably allocated in order.
    However, it should be noted, the concrete security benefits are hard to
    quantify, and no known CVE is mitigated by this randomization.

    Introduce shuffle_free_memory(), and its helper shuffle_zone(), to perform
    a Fisher-Yates shuffle of the page allocator 'free_area' lists when they
    are initially populated with free memory at boot and at hotplug time. Do
    this based on either the presence of a page_alloc.shuffle=Y command line
    parameter, or autodetection of a memory-side-cache (to be added in a
    follow-on patch).
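
    As a rough user-space illustration of the Fisher-Yates shuffle named
    above (not the kernel's shuffle_zone() code), the sketch below permutes
    an array of block start frame numbers. The names sketch_rand() and
    sketch_shuffle(), the xorshift generator, and the use of a plain array
    instead of the 'free_area' lists are all assumptions made for the
    example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical PRNG for this sketch (xorshift64). The kernel uses its own
 * random source; this one only keeps the example self-contained.
 * The state must be non-zero.
 */
static uint64_t sketch_rand(uint64_t *state)
{
	uint64_t x = *state;

	x ^= x << 13;
	x ^= x >> 7;
	x ^= x << 17;
	*state = x;
	return x;
}

/*
 * Fisher-Yates shuffle: walk from the last slot down, swapping each slot
 * with a randomly chosen slot at or below it. Every element ends up in
 * exactly one position, so the result is a permutation of the input.
 * (The modulo introduces a small bias, which is ignored in this sketch.)
 */
static void sketch_shuffle(unsigned long *pfn, size_t n, uint64_t seed)
{
	uint64_t state = seed ? seed : 1;

	for (size_t i = n; i > 1; i--) {
		size_t j = (size_t)(sketch_rand(&state) % i); /* 0 <= j < i */
		unsigned long tmp = pfn[i - 1];

		pfn[i - 1] = pfn[j];
		pfn[j] = tmp;
	}
}
```

    Calling sketch_shuffle() on an array whose entries are spaced 1 << 10
    apart mimics shuffling at an order-10 granularity: whole blocks change
    position while the pages inside each block stay contiguous.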

    The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
    pages where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1, i.e. 10
    (4MB). This trades off randomization granularity for time spent shuffling.
    MAX_ORDER-1 was chosen to be minimally invasive to the page allocator
    while still showing memory-side cache behavior improvements, and on the
    expectation that the security implications of finer granularity
    randomization are mitigated by CONFIG_SLAB_FREELIST_RANDOM. The
    performance impact of the shuffling appears to be in the noise compared to
    other memory initialization work.

    This initial randomization can be undone over time so a follow-on patch is
    introduced to inject entropy on page free decisions. It is reasonable to
    ask if the page free entropy is sufficient, but it is not enough due to
    the in-order initial freeing of pages. At the start of that process
    putting page1 in front or behind page0 still keeps them close together,
    page2 is still near page1 and has a high chance of being adjacent. As
    more pages are added ordering diversity improves, but there is still high
    page locality for the low address pages and this leads to no significant
    impact to the cache conflict rate.

    [1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
    [2]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM
    [3]: https://lkml.org/lkml/2018/10/12/309

    [dan.j.williams@intel.com: fix shuffle enable]
    Link: http://lkml.kernel.org/r/154943713038.3858443.4125180191382062871.stgit@dwillia2-desk3.amr.corp.intel.com
    [cai@lca.pw: fix SHUFFLE_PAGE_ALLOCATOR help texts]
    Link: http://lkml.kernel.org/r/20190425201300.75650-1-cai@lca.pw
    Link: http://lkml.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Signed-off-by: Qian Cai
    Reviewed-by: Kees Cook
    Acked-by: Michal Hocko
    Cc: Dave Hansen
    Cc: Keith Busch
    Cc: Robert Elliott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • The vmap_lazy_nr variable has type atomic_t, which is a 4-byte integer on
    both 32-bit and 64-bit systems. lazy_max_pages() deals with "unsigned long",
    which is 8 bytes on a 64-bit system, thus vmap_lazy_nr should be 8 bytes on
    64-bit as well.
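
    A user-space analogue of the width argument, assuming C11 atomics in
    place of the kernel's atomic types: atomic_ulong stands in for a
    long-sized counter, and lazy_add() and over_threshold() are hypothetical
    helpers, not kernel functions.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

/* Long-sized counter: it has the same range as the "unsigned long"
 * threshold it is compared against. */
static atomic_ulong lazy_nr;

/* Add nr_pages to the counter and return the new total. */
static unsigned long lazy_add(unsigned long nr_pages)
{
	return atomic_fetch_add(&lazy_nr, nr_pages) + nr_pages;
}

/* Compare the counter with an unsigned long threshold without narrowing. */
static int over_threshold(unsigned long max_pages)
{
	return atomic_load(&lazy_nr) > max_pages;
}
```

    With an int-sized counter the same comparison would mix a 4-byte value
    with an 8-byte threshold on 64-bit systems; the long-sized counter keeps
    both sides of the comparison in the same range.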

    Link: http://lkml.kernel.org/r/20190131162452.25879-1-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Reviewed-by: Andrew Morton
    Reviewed-by: William Kucharski
    Cc: Michal Hocko
    Cc: Matthew Wilcox
    Cc: Thomas Garnier
    Cc: Oleksiy Avramchenko
    Cc: Joel Fernandes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     
  • Commit 763b218ddfaf ("mm: add preempt points into __purge_vmap_area_lazy()")
    introduced some preempt points, one of which prioritizes an allocation
    over the lazy freeing of vmap areas.

    Prioritizing an allocation over freeing does not work well all the time,
    i.e. it should rather be a compromise.

    1) The number of lazy pages directly influences the busy list length, and
    thus operations like allocation, lookup, unmap, remove, etc.

    2) Under heavy stress of the vmalloc subsystem I ran into a situation
    where memory usage keeps increasing until it hits out_of_memory -> panic,
    because the logic that frees vmap areas in the __purge_vmap_area_lazy()
    function is completely blocked.

    Establish a threshold passing which the freeing is prioritized back over
    allocation creating a balance between each other.
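
    A minimal sketch of such a threshold check, assuming the purge loop
    decides at each step whether it may yield to allocators.
    purge_may_yield() is a hypothetical helper, and the factor of two
    applied to the lazy_max_pages() value is an assumption made for
    illustration rather than something stated in this message.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide whether the purge loop may yield to pending allocations.
 * Below the threshold, allocation keeps priority and the loop may
 * reschedule; at or above it, freeing takes priority and the loop keeps
 * going. The "twice lazy_max" threshold is an assumed value.
 */
static bool purge_may_yield(unsigned long lazy_nr, unsigned long lazy_max)
{
	unsigned long resched_threshold = lazy_max << 1;

	return lazy_nr < resched_threshold;
}
```

    With lazy_max = 100, a backlog of 10 lazy pages still lets the loop
    yield, while a backlog of 200 or more keeps it freeing until the
    backlog drops back under the threshold.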

    Using the vmalloc test driver in "stress mode", i.e. when all available
    test cases are run simultaneously on all online CPUs applying a
    pressure on the vmalloc subsystem, my HiKey 960 board runs out of
    memory due to the fact that __purge_vmap_area_lazy() logic simply is
    not able to free pages in time.

    How I run it:

    1) You should build your kernel with CONFIG_TEST_VMALLOC=m
    2) ./tools/testing/selftests/vm/test_vmalloc.sh stress

    During this test "vmap_lazy_nr" pages will go far beyond the acceptable
    lazy_max_pages() threshold, which leads to an enormous busy list size
    and other problems, including increased allocation time.

    Link: http://lkml.kernel.org/r/20190124115648.9433-3-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Reviewed-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Matthew Wilcox
    Cc: Thomas Garnier
    Cc: Oleksiy Avramchenko
    Cc: Steven Rostedt
    Cc: Joel Fernandes
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Tejun Heo
    Cc: Joel Fernandes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     
  • Commit 0139aa7b7fa ("mm: rename _count, field of the struct page, to
    _refcount") left out a couple of references to the old field name. Fix
    that.

    Link: http://lkml.kernel.org/r/cedf87b02eb8a6b3eac57e8e91da53fb15c3c44c.1556537475.git.baruch@tkos.co.il
    Fixes: 0139aa7b7fa ("mm: rename _count, field of the struct page, to _refcount")
    Signed-off-by: Baruch Siach
    Reviewed-by: Andrew Morton
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Baruch Siach
     
  • Merge misc updates from Andrew Morton:

    - a few misc things and hotfixes

    - ocfs2

    - almost all of MM

    * emailed patches from Andrew Morton: (139 commits)
    kernel/memremap.c: remove the unused device_private_entry_fault() export
    mm: delete find_get_entries_tag
    mm/huge_memory.c: make __thp_get_unmapped_area static
    mm/mprotect.c: fix compilation warning because of unused 'mm' variable
    mm/page-writeback: introduce tracepoint for wait_on_page_writeback()
    mm/vmscan: simplify trace_reclaim_flags and trace_shrink_flags
    mm/Kconfig: update "Memory Model" help text
    mm/vmscan.c: don't disable irq again when count pgrefill for memcg
    mm: memblock: make keeping memblock memory opt-in rather than opt-out
    hugetlbfs: always use address space in inode for resv_map pointer
    mm/z3fold.c: support page migration
    mm/z3fold.c: add structure for buddy handles
    mm/z3fold.c: improve compression by extending search
    mm/z3fold.c: introduce helper functions
    mm/page_alloc.c: remove unnecessary parameter in rmqueue_pcplist
    mm/hmm: add ARCH_HAS_HMM_MIRROR ARCH_HAS_HMM_DEVICE Kconfig
    mm/vmscan.c: simplify shrink_inactive_list()
    fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback
    xen/privcmd-buf.c: convert to use vm_map_pages_zero()
    xen/gntdev.c: convert to use vm_map_pages()
    ...

    Linus Torvalds
     
  • I removed the only user of this and hadn't noticed it was now unused.

    Link: http://lkml.kernel.org/r/20190430152929.21813-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • __thp_get_unmapped_area is only used in mm/huge_memory.c. Make it static.
    Tested by building and booting the kernel.

    Link: http://lkml.kernel.org/r/20190504102353.GA22525@bharath12345-Inspiron-5559
    Signed-off-by: Bharath Vedartham
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bharath Vedartham
     
  • Since 0cbe3e26abe0 ("mm: update ptep_modify_prot_start/commit to take
    vm_area_struct as arg") the only place that uses the local 'mm' variable
    in change_pte_range() is the call to set_pte_at().

    Many architectures define set_pte_at() as macro that does not use the 'mm'
    parameter, which generates the following compilation warning:

    CC mm/mprotect.o
    mm/mprotect.c: In function 'change_pte_range':
    mm/mprotect.c:42:20: warning: unused variable 'mm' [-Wunused-variable]
    struct mm_struct *mm = vma->vm_mm;
    ^~

    Fix it by passing vma->vm_mm to set_pte_at() and dropping the local 'mm'
    variable in change_pte_range().
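
    The effect can be reproduced in a small stand-alone sketch. The
    structures, the set_pte_at() macro and change_one_pte() below are
    simplified stand-ins written for this example, not the kernel
    definitions; the macro deliberately discards its 'mm' argument the way
    some architectures do.

```c
#include <assert.h>

/* Simplified stand-ins for the kernel structures. */
struct mm_struct { int id; };
struct vm_area_struct { struct mm_struct *vm_mm; };

/*
 * Hypothetical arch-style macro that never expands its 'mm' argument.
 * A local variable used only as that argument would trigger
 * -Wunused-variable after preprocessing.
 */
#define set_pte_at(mm, addr, ptep, pte) ((void)(addr), *(ptep) = (pte))

/*
 * No local 'mm' is declared: the expression vma->vm_mm is passed directly,
 * so nothing is left unused when the macro drops the argument.
 */
static void change_one_pte(struct vm_area_struct *vma, unsigned long addr,
			   unsigned long *ptep, unsigned long newpte)
{
	set_pte_at(vma->vm_mm, addr, ptep, newpte);
}
```

    Because the argument is an expression rather than a named local, the
    same source compiles without the warning whether set_pte_at() is a
    macro that ignores 'mm' or a function that uses it.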

    [liu.song.a23@gmail.com: fix missed conversions]
    Link: http://lkml.kernel.org/r/CAPhsuW6wcQgYLHNdBdw6m0YiR4RWsS4XzfpSKU7wBLLeOCTbpw@mail.gmail.com
    Link: http://lkml.kernel.org/r/1557305432-4940-1-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Cc: Song Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Recently there have been some hung tasks on our server due to
    wait_on_page_writeback(), and we want to know the details of this
    PG_writeback, i.e. which device this page is being written back to. But
    it is not convenient to get those details.

    I think it would be better to introduce a tracepoint for diagnosing the
    writeback details.

    Link: http://lkml.kernel.org/r/1556274402-19018-1-git-send-email-laoar.shao@gmail.com
    Signed-off-by: Yafang Shao
    Reviewed-by: Andrew Morton
    Cc: Jan Kara
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yafang Shao
     
  • The help describing the memory model selection is outdated. It still says
    that SPARSEMEM is experimental and that DISCONTIGMEM is preferred over
    SPARSEMEM.

    Update the help text for the relevant options:
    * add a generic help for the "Memory Model" prompt
    * add description for FLATMEM
    * reduce the description of DISCONTIGMEM and add a deprecation note
    * prefer SPARSEMEM over DISCONTIGMEM

    Link: http://lkml.kernel.org/r/1556188531-20728-1-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • We can use __count_memcg_events() directly because this callsite is already
    protected by spin_lock_irq().

    Link: http://lkml.kernel.org/r/1556093494-30798-1-git-send-email-laoar.shao@gmail.com
    Signed-off-by: Yafang Shao
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yafang Shao
     
  • Most architectures do not need the memblock memory after the page
    allocator is initialized, but only a few enable ARCH_DISCARD_MEMBLOCK in the
    arch Kconfig.

    Replacing ARCH_DISCARD_MEMBLOCK with ARCH_KEEP_MEMBLOCK and inverting the
    logic makes it clear which architectures actually use memblock after
    system initialization and skips the necessity to add ARCH_DISCARD_MEMBLOCK
    to the architectures that are still missing that option.

    Link: http://lkml.kernel.org/r/1556102150-32517-1-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Michael Ellerman (powerpc)
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Richard Kuo
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Geert Uytterhoeven
    Cc: Ralf Baechle
    Cc: Paul Burton
    Cc: James Hogan
    Cc: Ley Foon Tan
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Yoshinori Sato
    Cc: Rich Felker
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: "H. Peter Anvin"
    Cc: Eric Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Continuing discussion about 58b6e5e8f1ad ("hugetlbfs: fix memory leak for
    resv_map") brought up the issue that inode->i_mapping may not point to the
    address space embedded within the inode at inode eviction time. The
    hugetlbfs truncate routine handles this by explicitly using inode->i_data.
    However, code cleaning up the resv_map will still use the address space
    pointed to by inode->i_mapping. Luckily, private_data is NULL for address
    spaces in all such cases today, but there is no guarantee this will
    continue.

    Change all hugetlbfs code getting a resv_map pointer to explicitly get it
    from the address space embedded within the inode. In addition, add more
    comments in the code to indicate why this is being done.

    Link: http://lkml.kernel.org/r/20190419204435.16984-1-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reported-by: Yufen Yu
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: "Kirill A . Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Now that we are not using page address in handles directly, we can make
    z3fold pages movable to decrease the memory fragmentation z3fold may
    create over time.

    This patch starts advertising non-headless z3fold pages as movable and
    uses the existing kernel infrastructure to implement moving of such pages
    per memory management subsystem's request. It thus implements 3 required
    callbacks for page migration:

    * isolation callback: z3fold_page_isolate(): try to isolate the page by
    removing it from all lists. Pages scheduled for some activity and
    mapped pages will not be isolated. Return true if isolation was
    successful or false otherwise

    * migration callback: z3fold_page_migrate(): re-check critical
    conditions and migrate page contents to the new page provided by the
    memory subsystem. Returns 0 on success or negative error code otherwise

    * putback callback: z3fold_page_putback(): put back the page if
    z3fold_page_migrate() for it failed permanently (i.e. not with
    -EAGAIN code).

    [lkp@intel.com: z3fold_page_isolate() can be static]
    Link: http://lkml.kernel.org/r/20190419130924.GA161478@ivb42
    Link: http://lkml.kernel.org/r/20190417103922.31253da5c366c4ebe0419cfc@gmail.com
    Signed-off-by: Vitaly Wool
    Signed-off-by: kbuild test robot
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Dan Streetman
    Cc: Krzysztof Kozlowski
    Cc: Oleksiy Avramchenko
    Cc: Uladzislau Rezki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool
     
  • For z3fold to be able to move its pages per request of the memory
    subsystem, it should not use direct object addresses in handles. Instead,
    it will create abstract handles (3 per page) which will contain pointers
    to z3fold objects. Thus, it will be possible to change these pointers
    when z3fold page is moved.
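
    The indirection can be pictured with a small user-space sketch. The
    structures and helpers below (sketch_slots, sketch_page,
    sketch_handle(), sketch_migrate()) are hypothetical and much simpler
    than the z3fold code; they only show why a handle that points at a slot,
    rather than at the object, stays valid when the page contents move.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_BUDDIES 3	/* up to 3 objects per page, as described above */

/* Small per-page structure holding the current object addresses. */
struct sketch_slots {
	char *addr[SKETCH_BUDDIES];
};

/* Simplified "page": a pointer to its slots plus storage for 3 objects. */
struct sketch_page {
	struct sketch_slots *slots;
	char data[SKETCH_BUDDIES][16];
};

/* Hand out a handle: the address of the slot, not of the object itself. */
static char **sketch_handle(struct sketch_page *page, int buddy)
{
	page->slots->addr[buddy] = page->data[buddy];
	return &page->slots->addr[buddy];
}

/*
 * Move the contents to a new page and repoint the slots that are in use.
 * Handles given out earlier keep pointing at the same slots, so they now
 * resolve to the new location.
 */
static void sketch_migrate(struct sketch_page *dst, struct sketch_page *src)
{
	memcpy(dst->data, src->data, sizeof(dst->data));
	dst->slots = src->slots;
	for (int i = 0; i < SKETCH_BUDDIES; i++)
		if (dst->slots->addr[i])
			dst->slots->addr[i] = dst->data[i];
	src->slots = NULL;
}
```

    A caller dereferences the handle each time it needs the object, so it
    never caches an address that migration could invalidate.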

    Link: http://lkml.kernel.org/r/20190417103826.484eaf18c1294d682769880f@gmail.com
    Signed-off-by: Vitaly Wool
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Dan Streetman
    Cc: Krzysztof Kozlowski
    Cc: Oleksiy Avramchenko
    Cc: Uladzislau Rezki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool
     
  • The current z3fold implementation only searches this CPU's page lists for
    a fitting page to put a new object into. This patch adds a quick search for
    very well fitting pages (i.e. those having exactly the required amount
    of free space) on other CPUs too, before allocating a new page for that
    object.

    Link: http://lkml.kernel.org/r/20190417103733.72ae81abe1552397c95a008e@gmail.com
    Signed-off-by: Vitaly Wool
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Dan Streetman
    Cc: Krzysztof Kozlowski
    Cc: Oleksiy Avramchenko
    Cc: Uladzislau Rezki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool
     
  • Patch series "z3fold: support page migration", v2.

    This patchset implements page migration support and slightly better buddy
    search. To implement page migration support, z3fold has to move away from
    the current scheme of handle encoding, i.e. stop encoding the page address
    in handles. Instead, a small per-page structure is created which will
    contain actual addresses for z3fold objects, while pointers to fields of
    that structure will be used as handles.

    Thus, it will be possible to change the underlying addresses to reflect
    page migration.

    To support migration itself, 3 callbacks will be implemented:

    1: isolation callback: z3fold_page_isolate(): try to isolate the page
    by removing it from all lists. Pages scheduled for some activity and
    mapped pages will not be isolated. Return true if isolation was
    successful or false otherwise

    2: migration callback: z3fold_page_migrate(): re-check critical
    conditions and migrate page contents to the new page provided by the
    system. Returns 0 on success or negative error code otherwise

    3: putback callback: z3fold_page_putback(): put back the page if
    z3fold_page_migrate() for it failed permanently (i.e. not with
    -EAGAIN code).

    To make sure an isolated page doesn't get freed, its kref is incremented
    in z3fold_page_isolate() and decremented during post-migration compaction,
    if migration was successful, or by z3fold_page_putback() in the other
    case.

    Since the new handle encoding scheme implies slight memory consumption
    increase, better buddy search (which decreases memory consumption) is
    included in this patchset.

    This patch (of 4):

    Introduce a separate helper function for object allocation, as well as 2
    smaller helpers to add a buddy to the list and to get a pointer to the
    pool from the z3fold header. No functional changes here.

    Link: http://lkml.kernel.org/r/20190417103633.a4bb770b5bf0fb7e43ce1666@gmail.com
    Signed-off-by: Vitaly Wool
    Cc: Dan Streetman
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Krzysztof Kozlowski
    Cc: Oleksiy Avramchenko
    Cc: Uladzislau Rezki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool
     
  • Because rmqueue_pcplist() is only called when order is 0, we don't need to
    use order as a parameter.

    Link: http://lkml.kernel.org/r/1555591709-11744-1-git-send-email-laoar.shao@gmail.com
    Signed-off-by: Yafang Shao
    Acked-by: Michal Hocko
    Acked-by: Pankaj Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yafang Shao
     
  • Add 2 new Kconfig variables that are not used by anyone. I check that
    various make ARCH=somearch allmodconfig do work and do not complain. This
    new Kconfig needs to be added first so that device drivers that depend on
    HMM can be updated.

    Once drivers are updated then I can update the HMM Kconfig to depend on
    this new Kconfig in a followup patch.

    This is about solving Kconfig for HMM: given that device drivers are
    going through their own trees, we want to avoid changing them from the mm
    tree. So the plan is:

    1 - Kernel release N add the new Kconfig to mm/Kconfig (this patch)
    2 - Kernel release N+1 update driver to depend on new Kconfig ie
    stop using ARCH_HAS_HMM and start using ARCH_HAS_HMM_MIRROR
    and ARCH_HAS_HMM_DEVICE (one or the other or both depending
    on the driver)
    3 - Kernel release N+2 remove ARCH_HAS_HMM and do final Kconfig
    update in mm/Kconfig

    Link: http://lkml.kernel.org/r/20190417211141.17580-1-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Guenter Roeck
    Cc: Leon Romanovsky
    Cc: Jason Gunthorpe
    Cc: Ralph Campbell
    Cc: John Hubbard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • This merges together duplicated patterns of code. Also, replace
    count_memcg_events() with its irq-careless namesake, because they are
    already called in interrupts disabled context.

    Link: http://lkml.kernel.org/r/2ece1df4-2989-bc9b-6172-61e9fdde5bfd@virtuozzo.com
    Signed-off-by: Kirill Tkhai
    Acked-by: Michal Hocko
    Reviewed-by: Daniel Jordan
    Acked-by: Johannes Weiner
    Cc: Baoquan He
    Cc: Davidlohr Bueso

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • Patch series "mm: Use vm_map_pages() and vm_map_pages_zero() API", v5.

    This patch (of 5):

    Previously drivers had their own way of mapping a range of kernel
    pages/memory into a user vma, and this was done by invoking vm_insert_page()
    within a loop.

    As this pattern is common across different drivers, it can be generalized
    by creating new functions and using them across the drivers.

    vm_map_pages() is the API which can be used to map kernel memory/pages in
    drivers which have considered vm_pgoff.

    vm_map_pages_zero() is the API which can be used to map a range of kernel
    memory/pages in drivers which have not considered vm_pgoff. vm_pgoff is
    passed as default 0 for those drivers.

    We _could_ then at a later point "fix" these drivers which are using
    vm_map_pages_zero() to behave according to the normal vm_pgoff offsetting
    simply by removing the _zero suffix on the function name and if that
    causes regressions, it gives us an easy way to revert.
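
    The difference between the two entry points can be modelled in user
    space. Everything below is a hypothetical model (integers stand in for
    pages, model_* names are invented, and the bounds checks are an
    assumption about reasonable behaviour); it is not the kernel
    implementation and only illustrates the vm_pgoff offsetting described
    above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Shared helper: "map" vma_pages entries of pages[] starting at index
 * pgoff into mapped[]. Integers stand in for struct page pointers.
 * Returns 0 on success or -ENXIO if the request does not fit.
 */
static int model_map(const int *pages, size_t num, size_t vma_pages,
		     size_t pgoff, int *mapped)
{
	if (pgoff >= num)
		return -ENXIO;
	if (vma_pages > num - pgoff)
		return -ENXIO;
	for (size_t i = 0; i < vma_pages; i++)
		mapped[i] = pages[pgoff + i];
	return 0;
}

/* Model of the variant that honours the vma's page offset. */
static int model_map_pages(const int *pages, size_t num, size_t vma_pages,
			   size_t vm_pgoff, int *mapped)
{
	return model_map(pages, num, vma_pages, vm_pgoff, mapped);
}

/* Model of the _zero variant: the offset is always treated as 0. */
static int model_map_pages_zero(const int *pages, size_t num,
				size_t vma_pages, int *mapped)
{
	return model_map(pages, num, vma_pages, 0, mapped);
}
```

    In this model, switching a caller from the _zero variant to the plain
    one changes only which index of the array the mapping starts from, which
    is why dropping the suffix is an easily reverted change.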

    Tested on Rockchip hardware and display is working, including talking to
    Lima via prime.

    Link: http://lkml.kernel.org/r/751cb8a0f4c3e67e95c58a3b072937617f338eea.1552921225.git.jrdr.linux@gmail.com
    Signed-off-by: Souptick Joarder
    Suggested-by: Russell King
    Suggested-by: Matthew Wilcox
    Reviewed-by: Mike Rapoport
    Tested-by: Heiko Stuebner
    Cc: Michal Hocko
    Cc: "Kirill A. Shutemov"
    Cc: Vlastimil Babka
    Cc: Rik van Riel
    Cc: Stephen Rothwell
    Cc: Peter Zijlstra
    Cc: Robin Murphy
    Cc: Joonsoo Kim
    Cc: Thierry Reding
    Cc: Kees Cook
    Cc: Marek Szyprowski
    Cc: Stefan Richter
    Cc: Sandy Huang
    Cc: David Airlie
    Cc: Oleksandr Andrushchenko
    Cc: Joerg Roedel
    Cc: Pawel Osciak
    Cc: Kyungmin Park
    Cc: Boris Ostrovsky
    Cc: Juergen Gross
    Cc: Mauro Carvalho Chehab
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Souptick Joarder