17 May, 2019

1 commit

  • It turned out that DEBUG_SLAB_LEAK is still broken even after the recent
    rescue efforts. When there is a large number of objects like
    kmemleak_object, which is normal on a debug kernel,

    # grep kmemleak /proc/slabinfo
    kmemleak_object 2243606 3436210 ...

    reading /proc/slab_allocators could easily loop forever while processing
    the kmemleak_object cache, and any additional freeing or allocating of
    objects will trigger a reprocessing. To make the situation worse, soft
    lockups could easily happen in this situation, which will call printk()
    to allocate more kmemleak objects and guarantee an infinite loop.

    Also, since it seems no one had noticed when it was totally broken more
    than two years ago - see commit fcf88917dd43 ("slab: fix a crash by
    reading /proc/slab_allocators"), probably nobody cares about it anymore
    due to the decline of SLAB. Just remove it entirely.

    Suggested-by: Vlastimil Babka
    Suggested-by: Linus Torvalds
    Signed-off-by: Qian Cai
    Signed-off-by: Linus Torvalds

    Qian Cai
     

15 May, 2019

39 commits

  • When a cgroup is reclaimed on behalf of a configured limit, reclaim
    needs to round-robin through all NUMA nodes that hold pages of the memcg
    in question. However, when assembling the mask of candidate NUMA nodes,
    the code only consults the *local* cgroup LRU counters, not the
    recursive counters for the entire subtree. Cgroup limits are frequently
    configured against intermediate cgroups that do not have memory on their
    own LRUs. In this case, the node mask will always come up empty and
    reclaim falls back to scanning only the current node.

    If a cgroup subtree has some memory on one node but the processes are
    bound to another node afterwards, the limit reclaim will never age or
    reclaim that memory anymore.

    To fix this, use the recursive LRU counts for a cgroup subtree to
    determine which nodes hold memory of that cgroup.
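
    A minimal sketch of the idea, with a hypothetical helper standing in for
    however the recursive count is read (the scan_nodes nodemask and the node
    iteration macros are real kernel constructs of that era):

        int nid;

        /* Rebuild the candidate nodemask from subtree-wide LRU counts. */
        nodes_clear(memcg->scan_nodes);
        for_each_node_state(nid, N_MEMORY) {
                /* memcg_subtree_lru_pages() is hypothetical: recursive
                 * (subtree) LRU pages of memcg on node nid. */
                if (memcg_subtree_lru_pages(memcg, nid, LRU_ALL))
                        node_set(nid, memcg->scan_nodes);
        }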

    The code has been broken like this forever, so it doesn't seem to be a
    problem in practice. I just noticed it while reviewing the way the LRU
    counters are used in general.

    Link: http://lkml.kernel.org/r/20190412151507.2769-5-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Right now, when somebody needs to know the recursive memory statistics
    and events of a cgroup subtree, they need to walk the entire subtree and
    sum up the counters manually.

    There are two issues with this:

    1. When a cgroup gets deleted, its stats are lost. The state counters
    should all be 0 at that point, of course, but the events are not.
    When this happens, the event counters, which are supposed to be
    monotonic, can go backwards in the parent cgroups.

    2. During regular operation, we always have a certain number of lazily
    freed cgroups sitting around that have been deleted, have no tasks,
    but have a few cache pages remaining. These groups' statistics do not
    change until we eventually hit memory pressure, but somebody
    watching, say, memory.stat on an ancestor has to iterate those every
    time.

    This patch addresses both issues by introducing recursive counters at
    each level that are propagated from the write side when stats change.

    Upward propagation happens when the per-cpu caches spill over into the
    local atomic counter. This is the same thing we do during charge and
    uncharge, except that the latter uses atomic RMWs, which are more
    expensive; stat changes happen at around the same rate. In a sparse
    file test (page faults and reclaim at maximum CPU speed) with 5 cgroup
    nesting levels, perf shows __mod_memcg_page_state at ~1%.
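
    The write-side propagation can be pictured with a minimal sketch (the
    structure, field and helper names here are hypothetical; the per-cpu
    primitives and atomic_long_add() are real kernel APIs):

        #define STAT_BATCH 32   /* spill threshold, value illustrative */

        static void mod_stat_recursive(struct my_memcg *memcg, int idx, int val)
        {
                long x = val + __this_cpu_read(memcg->stat_cpu->count[idx]);

                if (unlikely(abs(x) > STAT_BATCH)) {
                        struct my_memcg *mi;

                        /* Fold the batched delta into this cgroup and every
                         * ancestor so readers see recursive totals in O(1). */
                        for (mi = memcg; mi; mi = parent_of(mi))
                                atomic_long_add(x, &mi->stat[idx]);
                        x = 0;
                }
                __this_cpu_write(memcg->stat_cpu->count[idx], x);
        }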

    Link: http://lkml.kernel.org/r/20190412151507.2769-4-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • These are getting too big to be inlined in every callsite. They were
    stolen from vmstat.c, which already out-of-lines them, and they have
    only been growing since. The callsites aren't that hot, either.

    Move __mod_memcg_state(), __mod_lruvec_state() and __count_memcg_events()
    out of line and add kerneldoc comments.

    Link: http://lkml.kernel.org/r/20190412151507.2769-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Patch series "mm: memcontrol: memory.stat cost & correctness".

    The cgroup memory.stat file holds recursive statistics for the entire
    subtree. The current implementation does this tree walk on-demand
    whenever the file is read. This is giving us problems in production.

    1. The cost of aggregating the statistics on-demand is high. A lot of
    system service cgroups are mostly idle and their stats don't change
    between reads, yet we always have to check them. There are also always
    some lazily-dying cgroups sitting around that are pinned by a handful
    of remaining page cache; the same applies to them.

    In an application that periodically monitors memory.stat in our
    fleet, we have seen the aggregation consume up to 5% CPU time.

    2. When cgroups die and disappear from the cgroup tree, so do their
    accumulated vm events. The result is that the event counters at
    higher-level cgroups can go backwards and confuse some of our
    automation, let alone people looking at the graphs over time.

    To address both issues, this patch series changes the stat
    implementation to spill counts upwards when the counters change.

    The upward spilling is batched using the existing per-cpu cache. In a
    sparse file stress test with 5 level cgroup nesting, the additional cost
    of the flushing was negligible (a little under 1% of CPU at 100% CPU
    utilization, compared to the 5% of reading memory.stat during regular
    operation).

    This patch (of 4):

    memcg_page_state(), lruvec_page_state(), memcg_sum_events() are
    currently returning the state of the local memcg or lruvec, not the
    recursive state.

    In practice there is a demand for both versions, although the callers
    that want the recursive counts currently sum them up by hand.

    By default, cgroups are considered recursive entities, and generally we
    expect more users of the recursive counters, with the local counts being
    special cases. To reflect that in the name, add a _local suffix to the
    current implementations.

    The following patch will re-incarnate these functions with recursive
    semantics, but with an O(1) implementation.

    [hannes@cmpxchg.org: fix bisection hole]
    Link: http://lkml.kernel.org/r/20190417160347.GC23013@cmpxchg.org
    Link: http://lkml.kernel.org/r/20190412151507.2769-2-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • I spent literally an hour trying to work out why an earlier version of
    my memory.events aggregation code doesn't work properly, only to find
    out I was calling memcg->events instead of memcg->memory_events, which
    is fairly confusing.

    This naming seems in need of reworking, so make it harder to do the
    wrong thing by using vmevents instead of events, which makes it more
    clear that these are vm counters rather than memcg-specific counters.

    There are also a few other inconsistent names in both the percpu and
    aggregated structs, so these are all cleaned up to be more coherent and
    easy to understand.

    This commit contains code cleanup only: there are no logic changes.

    [akpm@linux-foundation.org: fix it for preceding changes]
    Link: http://lkml.kernel.org/r/20190208224319.GA23801@chrisdown.name
    Signed-off-by: Chris Down
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Roman Gushchin
    Cc: Dennis Zhou
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Down
     
  • The semantics of what mincore() considers to be resident is not
    completely clear, but Linux has always (since 2.3.52, which is when
    mincore() was initially done) treated it as "page is available in page
    cache".

    That's potentially a problem, as that [in]directly exposes
    meta-information about pagecache / memory mapping state even about
    memory not strictly belonging to the process executing the syscall,
    opening possibilities for sidechannel attacks.

    Change the semantics of mincore() so that it only reveals pagecache
    information for non-anonymous mappings that belong to files that the
    calling process could (if it tried to) successfully open for writing;
    otherwise we'd be including shared non-exclusive mappings, which

    - is the sidechannel

    - is not the usecase for mincore(), as that's primarily used for data,
    not (shared) text
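
    For reference, a self-contained userspace probe of the kind this change
    constrains (not part of the patch): it maps the first page of a file
    MAP_SHARED read-only and asks mincore() whether it is resident. Under the
    new semantics, a caller that cannot open the file for writing no longer
    gets meaningful residency information for it:

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <unistd.h>

        int main(int argc, char **argv)
        {
                long page = sysconf(_SC_PAGESIZE);
                unsigned char vec;
                void *p;
                int fd;

                if (argc < 2)
                        return 1;
                fd = open(argv[1], O_RDONLY);
                if (fd < 0)
                        return 1;
                p = mmap(NULL, page, PROT_READ, MAP_SHARED, fd, 0);
                if (p == MAP_FAILED)
                        return 1;
                if (mincore(p, page, &vec))
                        perror("mincore");
                else
                        printf("first page resident: %d\n", vec & 1);
                return 0;
        }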

    [jkosina@suse.cz: v2]
    Link: http://lkml.kernel.org/r/20190312141708.6652-2-vbabka@suse.cz
    [mhocko@suse.com: restructure can_do_mincore() conditions]
    Link: http://lkml.kernel.org/r/nycvar.YFH.7.76.1903062342020.19912@cbobk.fhfr.pm
    Signed-off-by: Jiri Kosina
    Signed-off-by: Vlastimil Babka
    Acked-by: Josh Snyder
    Acked-by: Michal Hocko
    Originally-by: Linus Torvalds
    Originally-by: Dominique Martinet
    Cc: Andy Lutomirski
    Cc: Dave Chinner
    Cc: Kevin Easton
    Cc: Matthew Wilcox
    Cc: Cyril Hrubis
    Cc: Tejun Heo
    Cc: Kirill A. Shutemov
    Cc: Daniel Gruss
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Kosina
     
  • When freeing a page with an order >= shuffle_page_order, randomly select
    the front or back of the list for insertion.

    While the mm tries to defragment physical pages into huge pages this can
    tend to make the page allocator more predictable over time. Inject the
    front-back randomness to preserve the initial randomness established by
    shuffle_free_memory() when the kernel was booted.

    The overhead of this manipulation is constrained by only being applied
    for MAX_ORDER sized pages by default.
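
    A hedged sketch of the free-path randomness (the helper names are
    approximations of what this series introduces; the real patch caches
    random bits per CPU rather than calling the RNG on every free):

        if (is_shuffle_order(order) && (get_random_long() & 1))
                add_to_free_area_tail(page, &zone->free_area[order],
                                      migratetype);
        else
                add_to_free_area(page, &zone->free_area[order], migratetype);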

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/154899812788.3165233.9066631950746578517.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: Kees Cook
    Cc: Michal Hocko
    Cc: Dave Hansen
    Cc: Keith Busch
    Cc: Robert Elliott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • In preparation for runtime randomization of the zone lists, take all
    (well, most of) the list_*() functions in the buddy allocator and put
    them in helper functions. Provide a common control point for injecting
    additional behavior when freeing pages.
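
    A hedged sketch of the kind of helpers this introduces (the actual set in
    the series is larger; del_page_from_free_area() is one of the real names
    mentioned below):

        static inline void add_to_free_area(struct page *page,
                                            struct free_area *area,
                                            int migratetype)
        {
                list_add(&page->lru, &area->free_list[migratetype]);
                area->nr_free++;
        }

        static inline void del_page_from_free_area(struct page *page,
                                                   struct free_area *area)
        {
                list_del(&page->lru);
                __ClearPageBuddy(page);
                set_page_private(page, 0);
                area->nr_free--;
        }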

    [dan.j.williams@intel.com: fix buddy list helpers]
    Link: http://lkml.kernel.org/r/155033679702.1773410.13041474192173212653.stgit@dwillia2-desk3.amr.corp.intel.com
    [vbabka@suse.cz: remove del_page_from_free_area() migratetype parameter]
    Link: http://lkml.kernel.org/r/4672701b-6775-6efd-0797-b6242591419e@suse.cz
    Link: http://lkml.kernel.org/r/154899812264.3165233.5219320056406926223.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Signed-off-by: Vlastimil Babka
    Tested-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Dave Hansen
    Cc: Kees Cook
    Cc: Keith Busch
    Cc: Robert Elliott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Patch series "mm: Randomize free memory", v10.

    This patch (of 3):

    Randomization of the page allocator improves the average utilization of
    a direct-mapped memory-side-cache. Memory side caching is a platform
    capability that Linux has been previously exposed to in HPC
    (high-performance computing) environments on specialty platforms. In
    that instance it was a smaller pool of high-bandwidth-memory relative to
    higher-capacity / lower-bandwidth DRAM. Now, this capability is going
    to be found on general purpose server platforms where DRAM is a cache in
    front of higher latency persistent memory [1].

    Robert offered an explanation of the state of the art of Linux
    interactions with memory-side-caches [2], and I copy it here:

    It's been a problem in the HPC space:
    http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/

    A kernel module called zonesort is available to try to help:
    https://software.intel.com/en-us/articles/xeon-phi-software

    and this abandoned patch series proposed that for the kernel:
    https://lkml.kernel.org/r/20170823100205.17311-1-lukasz.daniluk@intel.com

    Dan's patch series doesn't attempt to ensure buffers won't conflict, but
    it does reduce the chance that they will. This will make performance
    more consistent, albeit slower than "optimal" (which is near impossible
    to attain in a general-purpose kernel). That's better than forcing
    users to deploy remedies like:
    "To eliminate this gradual degradation, we have added a Stream
    measurement to the Node Health Check that follows each job;
    nodes are rebooted whenever their measured memory bandwidth
    falls below 300 GB/s."

    A replacement for zonesort was merged upstream in commit cc9aec03e58f
    ("x86/numa_emulation: Introduce uniform split capability"). With this
    numa_emulation capability, memory can be split into cache sized
    ("near-memory" sized) numa nodes. A bind operation to such a node, and
    disabling workloads on other nodes, enables full cache performance.
    However, once the workload exceeds the cache size then cache conflicts
    are unavoidable. While HPC environments might be able to tolerate
    time-scheduling of cache sized workloads, for general purpose server
    platforms, the oversubscribed cache case will be the common case.

    The worst case scenario is that a server system owner benchmarks a
    workload at boot with an un-contended cache only to see that performance
    degrade over time, even below the average cache performance due to
    excessive conflicts. Randomization clips the peaks and fills in the
    valleys of cache utilization to yield steady average performance.

    Here are some performance impact details of the patches:

    1/ An Intel internal synthetic memory bandwidth measurement tool, saw a
    3X speedup in a contrived case that tries to force cache conflicts.
    The contrived case used the numa_emulation capability to force an
    instance of the benchmark to be run in two of the near-memory sized
    numa nodes. If both instances were placed on the same emulated node,
    they would fit and cause zero conflicts. While on separate emulated nodes
    without randomization they underutilized the cache and conflicted
    unnecessarily due to the in-order allocation per node.

    2/ A well known Java server application benchmark was run with a heap
    size that exceeded cache size by 3X. The cache conflict rate was 8%
    for the first run and degraded to 21% after page allocator aging. With
    randomization enabled the rate levelled out at 11%.

    3/ A MongoDB workload did not observe measurable difference in
    cache-conflict rates, but the overall throughput dropped by 7% with
    randomization in one case.

    4/ Mel Gorman ran his suite of performance workloads with randomization
    enabled on platforms without a memory-side-cache and saw a mix of some
    improvements and some losses [3].

    While there is potentially significant improvement for applications that
    depend on low latency access across a wide working-set, the performance
    may be negligible to negative for other workloads. For this reason the
    shuffle capability defaults to off unless a direct-mapped
    memory-side-cache is detected. Even then, the page_alloc.shuffle=0
    parameter can be specified to disable the randomization on those systems.

    Outside of memory-side-cache utilization concerns there is potentially
    security benefit from randomization. Some data exfiltration and
    return-oriented-programming attacks rely on the ability to infer the
    location of sensitive data objects. The kernel page allocator, especially
    early in system boot, has predictable first-in-first out behavior for
    physical pages. Pages are freed in physical address order when first
    onlined.

    Quoting Kees:
    "While we already have a base-address randomization
    (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
    memory layouts would certainly be using the predictability of
    allocation ordering (i.e. for attacks where the base address isn't
    important: only the relative positions between allocated memory).
    This is common in lots of heap-style attacks. They try to gain
    control over ordering by spraying allocations, etc.

    I'd really like to see this because it gives us something similar
    to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."

    While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
    caches, it leaves the vast bulk of memory to be allocated predictably, in
    order.
    However, it should be noted, the concrete security benefits are hard to
    quantify, and no known CVE is mitigated by this randomization.

    Introduce shuffle_free_memory(), and its helper shuffle_zone(), to perform
    a Fisher-Yates shuffle of the page allocator 'free_area' lists when they
    are initially populated with free memory at boot and at hotplug time. Do
    this based on either the presence of a page_alloc.shuffle=Y command line
    parameter, or autodetection of a memory-side-cache (to be added in a
    follow-on patch).
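
    For reference, the classic Fisher-Yates shuffle being applied here, shown
    over a plain array as a self-contained illustration (the kernel version
    instead swaps MAX_ORDER-1 sized pages picked at random pfns within a
    zone):

        #include <stdlib.h>

        /* Uniformly shuffle a[0..n-1] in place; seed with srand() as needed. */
        static void fisher_yates(int *a, int n)
        {
                for (int i = n - 1; i > 0; i--) {
                        int j = rand() % (i + 1);   /* 0 <= j <= i */
                        int tmp = a[i];

                        a[i] = a[j];
                        a[j] = tmp;
                }
        }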

    The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
    pages, where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1, i.e.
    10 (4MB); this trades off randomization granularity for time spent
    shuffling. MAX_ORDER-1 was chosen to be minimally invasive to the page
    allocator while still showing memory-side cache behavior improvements,
    with the expectation that the security implications of finer-granularity
    randomization are mitigated by CONFIG_SLAB_FREELIST_RANDOM. The
    performance impact of the shuffling appears to be in the noise compared
    to other memory initialization work.

    This initial randomization can be undone over time, so a follow-on patch
    is introduced to inject entropy on page free decisions. It is reasonable
    to ask whether the page free entropy alone would be sufficient, but it is
    not, due to the in-order initial freeing of pages. At the start of that
    process, putting page1 in front of or behind page0 still keeps them close
    together; page2 is still near page1 and has a high chance of being
    adjacent. As more pages are added, ordering diversity improves, but there
    is still high page locality for the low address pages, and this leads to
    no significant impact on the cache conflict rate.

    [1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
    [2]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM
    [3]: https://lkml.org/lkml/2018/10/12/309

    [dan.j.williams@intel.com: fix shuffle enable]
    Link: http://lkml.kernel.org/r/154943713038.3858443.4125180191382062871.stgit@dwillia2-desk3.amr.corp.intel.com
    [cai@lca.pw: fix SHUFFLE_PAGE_ALLOCATOR help texts]
    Link: http://lkml.kernel.org/r/20190425201300.75650-1-cai@lca.pw
    Link: http://lkml.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Signed-off-by: Qian Cai
    Reviewed-by: Kees Cook
    Acked-by: Michal Hocko
    Cc: Dave Hansen
    Cc: Keith Busch
    Cc: Robert Elliott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • The vmap_lazy_nr variable has type atomic_t, which is a 4-byte integer
    on both 32-bit and 64-bit systems. lazy_max_pages() deals with an
    "unsigned long", which is 8 bytes on 64-bit systems, so vmap_lazy_nr
    should be 8 bytes on 64-bit as well.
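
    The resulting declaration is roughly the following (to the best of my
    knowledge of mm/vmalloc.c in that era):

        /* was: static atomic_t vmap_lazy_nr = ATOMIC_INIT(0); */
        static atomic_long_t vmap_lazy_nr = ATOMIC_LONG_INIT(0);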

    Link: http://lkml.kernel.org/r/20190131162452.25879-1-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Reviewed-by: Andrew Morton
    Reviewed-by: William Kucharski
    Cc: Michal Hocko
    Cc: Matthew Wilcox
    Cc: Thomas Garnier
    Cc: Oleksiy Avramchenko
    Cc: Joel Fernandes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     
  • Commit 763b218ddfaf ("mm: add preempt points into __purge_vmap_area_lazy()")
    introduced some preempt points, one of which prioritizes an allocation
    over the lazy freeing of vmap areas.

    Prioritizing an allocation over freeing does not work well all the time;
    it should rather be a compromise:

    1) The number of lazy pages directly influences the busy list length and
    thus operations like allocation, lookup, unmap, remove, etc.

    2) Under heavy stress of the vmalloc subsystem I ran into a situation
    where memory usage kept increasing until hitting the out_of_memory ->
    panic state, because the logic that frees vmap areas in the
    __purge_vmap_area_lazy() function was completely blocked.

    Establish a threshold beyond which freeing is prioritized back over
    allocation, creating a balance between the two.
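
    A hedged sketch of where such a threshold acts (identifier names are an
    assumption; cond_resched_lock() is the existing preempt point being
    gated):

        unsigned long resched_threshold = lazy_max_pages() << 1;

        /* ... inside the purge loop over lazily freed vmap areas ... */
        if (atomic_long_read(&vmap_lazy_nr) < resched_threshold)
                cond_resched_lock(&vmap_area_lock);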

    Using the vmalloc test driver in "stress mode", i.e. when all available
    test cases are run simultaneously on all online CPUs, applying pressure
    on the vmalloc subsystem, my HiKey 960 board runs out of memory because
    __purge_vmap_area_lazy() simply is not able to free pages in time.

    How I run it:

    1) You should build your kernel with CONFIG_TEST_VMALLOC=m
    2) ./tools/testing/selftests/vm/test_vmalloc.sh stress

    During this test "vmap_lazy_nr" pages will go far beyond the acceptable
    lazy_max_pages() threshold, which will lead to an enormous busy list size
    and other problems, including increased allocation time and so on.

    Link: http://lkml.kernel.org/r/20190124115648.9433-3-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Reviewed-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Matthew Wilcox
    Cc: Thomas Garnier
    Cc: Oleksiy Avramchenko
    Cc: Steven Rostedt
    Cc: Joel Fernandes
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Tejun Heo
    Cc: Joel Fernandes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     
  • Commit 0139aa7b7fa ("mm: rename _count, field of the struct page, to
    _refcount") left out a couple of references to the old field name. Fix
    that.

    Link: http://lkml.kernel.org/r/cedf87b02eb8a6b3eac57e8e91da53fb15c3c44c.1556537475.git.baruch@tkos.co.il
    Fixes: 0139aa7b7fa ("mm: rename _count, field of the struct page, to _refcount")
    Signed-off-by: Baruch Siach
    Reviewed-by: Andrew Morton
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Baruch Siach
     
  • Merge misc updates from Andrew Morton:

    - a few misc things and hotfixes

    - ocfs2

    - almost all of MM

    * emailed patches from Andrew Morton: (139 commits)
    kernel/memremap.c: remove the unused device_private_entry_fault() export
    mm: delete find_get_entries_tag
    mm/huge_memory.c: make __thp_get_unmapped_area static
    mm/mprotect.c: fix compilation warning because of unused 'mm' variable
    mm/page-writeback: introduce tracepoint for wait_on_page_writeback()
    mm/vmscan: simplify trace_reclaim_flags and trace_shrink_flags
    mm/Kconfig: update "Memory Model" help text
    mm/vmscan.c: don't disable irq again when count pgrefill for memcg
    mm: memblock: make keeping memblock memory opt-in rather than opt-out
    hugetlbfs: always use address space in inode for resv_map pointer
    mm/z3fold.c: support page migration
    mm/z3fold.c: add structure for buddy handles
    mm/z3fold.c: improve compression by extending search
    mm/z3fold.c: introduce helper functions
    mm/page_alloc.c: remove unnecessary parameter in rmqueue_pcplist
    mm/hmm: add ARCH_HAS_HMM_MIRROR ARCH_HAS_HMM_DEVICE Kconfig
    mm/vmscan.c: simplify shrink_inactive_list()
    fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback
    xen/privcmd-buf.c: convert to use vm_map_pages_zero()
    xen/gntdev.c: convert to use vm_map_pages()
    ...

    Linus Torvalds
     
  • I removed the only user of this and hadn't noticed it was now unused.

    Link: http://lkml.kernel.org/r/20190430152929.21813-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • __thp_get_unmapped_area is only used in mm/huge_memory.c. Make it static.
    Tested by building and booting the kernel.

    Link: http://lkml.kernel.org/r/20190504102353.GA22525@bharath12345-Inspiron-5559
    Signed-off-by: Bharath Vedartham
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bharath Vedartham
     
  • Since 0cbe3e26abe0 ("mm: update ptep_modify_prot_start/commit to take
    vm_area_struct as arg") the only place that uses the local 'mm' variable
    in change_pte_range() is the call to set_pte_at().

    Many architectures define set_pte_at() as a macro that does not use the
    'mm' parameter, which generates the following compilation warning:

    CC mm/mprotect.o
    mm/mprotect.c: In function 'change_pte_range':
    mm/mprotect.c:42:20: warning: unused variable 'mm' [-Wunused-variable]
    struct mm_struct *mm = vma->vm_mm;
    ^~

    Fix it by passing vma->vm_mm to set_pte_at() and dropping the local 'mm'
    variable in change_pte_range().
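
    The call site then looks roughly like this (surrounding variable names
    are an assumption):

        /* No local 'mm' variable; pass the mm via the vma directly. */
        set_pte_at(vma->vm_mm, addr, pte, newpte);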

    [liu.song.a23@gmail.com: fix missed conversions]
    Link: http://lkml.kernel.org/r/CAPhsuW6wcQgYLHNdBdw6m0YiR4RWsS4XzfpSKU7wBLLeOCTbpw@mail.gmail.com
    Link: http://lkml.kernel.org/r/1557305432-4940-1-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Cc: Song Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Recently there have been some hung tasks on our server due to
    wait_on_page_writeback(), and we want to know the details of this
    PG_writeback, i.e. which device this page is being written back to. But
    it is not so convenient to get those details.

    I think it would be better to introduce a tracepoint for diagnosing the
    writeback details.

    Link: http://lkml.kernel.org/r/1556274402-19018-1-git-send-email-laoar.shao@gmail.com
    Signed-off-by: Yafang Shao
    Reviewed-by: Andrew Morton
    Cc: Jan Kara
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yafang Shao
     
  • The help text describing the memory model selection is outdated. It
    still says that SPARSEMEM is experimental and that DISCONTIGMEM is
    preferred over SPARSEMEM.

    Update the help text for the relevant options:
    * add a generic help for the "Memory Model" prompt
    * add description for FLATMEM
    * reduce the description of DISCONTIGMEM and add a deprecation note
    * prefer SPARSEMEM over DISCONTIGMEM

    Link: http://lkml.kernel.org/r/1556188531-20728-1-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • We can use __count_memcg_events() directly because this callsite is
    already protected by spin_lock_irq().

    Link: http://lkml.kernel.org/r/1556093494-30798-1-git-send-email-laoar.shao@gmail.com
    Signed-off-by: Yafang Shao
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yafang Shao
     
  • Most architectures do not need the memblock memory after the page
    allocator is initialized, but only a few enable ARCH_DISCARD_MEMBLOCK in
    the arch Kconfig.

    Replacing ARCH_DISCARD_MEMBLOCK with ARCH_KEEP_MEMBLOCK and inverting the
    logic makes it clear which architectures actually use memblock after
    system initialization, and avoids the need to add ARCH_DISCARD_MEMBLOCK
    to the architectures that are still missing that option.

    Link: http://lkml.kernel.org/r/1556102150-32517-1-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Michael Ellerman (powerpc)
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Richard Kuo
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Geert Uytterhoeven
    Cc: Ralf Baechle
    Cc: Paul Burton
    Cc: James Hogan
    Cc: Ley Foon Tan
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Yoshinori Sato
    Cc: Rich Felker
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: "H. Peter Anvin"
    Cc: Eric Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Continuing discussion about 58b6e5e8f1ad ("hugetlbfs: fix memory leak for
    resv_map") brought up the issue that inode->i_mapping may not point to the
    address space embedded within the inode at inode eviction time. The
    hugetlbfs truncate routine handles this by explicitly using inode->i_data.
    However, code cleaning up the resv_map will still use the address space
    pointed to by inode->i_mapping. Luckily, private_data is NULL for address
    spaces in all such cases today, but there is no guarantee this will
    continue.

    Change all hugetlbfs code getting a resv_map pointer to explicitly get it
    from the address space embedded within the inode. In addition, add more
    comments in the code to indicate why this is being done.
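
    A hedged sketch of the resulting accessor (the inode/address_space fields
    are real; whether the helper is spelled exactly like this is an
    assumption):

        static struct resv_map *inode_resv_map(struct inode *inode)
        {
                /* Use the address space embedded in the inode (i_data),
                 * never the possibly-redirected inode->i_mapping. */
                return (struct resv_map *)(&inode->i_data)->private_data;
        }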

    Link: http://lkml.kernel.org/r/20190419204435.16984-1-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reported-by: Yufen Yu
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: "Kirill A . Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Now that we are not using page address in handles directly, we can make
    z3fold pages movable to decrease the memory fragmentation z3fold may
    create over time.

    This patch starts advertising non-headless z3fold pages as movable and
    uses the existing kernel infrastructure to implement moving of such pages
    at the memory management subsystem's request. It thus implements the 3
    required callbacks for page migration (a wiring sketch follows the list):

    * isolation callback: z3fold_page_isolate(): try to isolate the page by
    removing it from all lists. Pages scheduled for some activity and
    mapped pages will not be isolated. Return true if isolation was
    successful or false otherwise

    * migration callback: z3fold_page_migrate(): re-check critical
    conditions and migrate page contents to the new page provided by the
    memory subsystem. Returns 0 on success or negative error code otherwise

    * putback callback: z3fold_page_putback(): put back the page if
    z3fold_page_migrate() for it failed permanently (i. e. not with
    -EAGAIN code).
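
    A hedged sketch of how such callbacks were typically wired into the
    movable-page infrastructure of that era (the aops wiring shown here is an
    assumption; the three callback names come from the description above):

        static const struct address_space_operations z3fold_aops = {
                .isolate_page   = z3fold_page_isolate,
                .migratepage    = z3fold_page_migrate,
                .putback_page   = z3fold_page_putback,
        };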

    [lkp@intel.com: z3fold_page_isolate() can be static]
    Link: http://lkml.kernel.org/r/20190419130924.GA161478@ivb42
    Link: http://lkml.kernel.org/r/20190417103922.31253da5c366c4ebe0419cfc@gmail.com
    Signed-off-by: Vitaly Wool
    Signed-off-by: kbuild test robot
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Dan Streetman
    Cc: Krzysztof Kozlowski
    Cc: Oleksiy Avramchenko
    Cc: Uladzislau Rezki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool
     
  • For z3fold to be able to move its pages per request of the memory
    subsystem, it should not use direct object addresses in handles. Instead,
    it will create abstract handles (3 per page) which will contain pointers
    to z3fold objects. Thus, it will be possible to change these pointers
    when z3fold page is moved.

    Link: http://lkml.kernel.org/r/20190417103826.484eaf18c1294d682769880f@gmail.com
    Signed-off-by: Vitaly Wool
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Dan Streetman
    Cc: Krzysztof Kozlowski
    Cc: Oleksiy Avramchenko
    Cc: Uladzislau Rezki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool
     
  • The current z3fold implementation only searches this CPU's page lists for
    a fitting page to put a new object into. This patch adds quick search for
    very well fitting pages (i.e. those having exactly the required amount
    of free space) on other CPUs too, before allocating a new page for that
    object.

    Link: http://lkml.kernel.org/r/20190417103733.72ae81abe1552397c95a008e@gmail.com
    Signed-off-by: Vitaly Wool
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Dan Streetman
    Cc: Krzysztof Kozlowski
    Cc: Oleksiy Avramchenko
    Cc: Uladzislau Rezki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool
     
  • Patch series "z3fold: support page migration", v2.

    This patchset implements page migration support and slightly better buddy
    search. To implement page migration support, z3fold has to move away from
    the current scheme of handle encoding, i.e. stop encoding the page address
    in handles. Instead, a small per-page structure is created which will
    contain actual addresses for z3fold objects, while pointers to fields of
    that structure will be used as handles.

    Thus, it will be possible to change the underlying addresses to reflect
    page migration.

    To support migration itself, 3 callbacks will be implemented:

    1: isolation callback: z3fold_page_isolate(): try to isolate the page
    by removing it from all lists. Pages scheduled for some activity and
    mapped pages will not be isolated. Return true if isolation was
    successful or false otherwise

    2: migration callback: z3fold_page_migrate(): re-check critical
    conditions and migrate page contents to the new page provided by the
    system. Returns 0 on success or negative error code otherwise

    3: putback callback: z3fold_page_putback(): put back the page if
    z3fold_page_migrate() for it failed permanently (i. e. not with
    -EAGAIN code).

    To make sure an isolated page doesn't get freed, its kref is incremented
    in z3fold_page_isolate() and decremented during post-migration compaction,
    if migration was successful, or by z3fold_page_putback() in the other
    case.

    Since the new handle encoding scheme implies slight memory consumption
    increase, better buddy search (which decreases memory consumption) is
    included in this patchset.

    This patch (of 4):

    Introduce a separate helper function for object allocation, as well as 2
    smaller helpers to add a buddy to the list and to get a pointer to the
    pool from the z3fold header. No functional changes here.

    Link: http://lkml.kernel.org/r/20190417103633.a4bb770b5bf0fb7e43ce1666@gmail.com
    Signed-off-by: Vitaly Wool
    Cc: Dan Streetman
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Krzysztof Kozlowski
    Cc: Oleksiy Avramchenko
    Cc: Uladzislau Rezki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool
     
  • Because rmqueue_pcplist() is only called when order is 0, we don't need to
    use order as a parameter.

    Link: http://lkml.kernel.org/r/1555591709-11744-1-git-send-email-laoar.shao@gmail.com
    Signed-off-by: Yafang Shao
    Acked-by: Michal Hocko
    Acked-by: Pankaj Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yafang Shao
     
  • Add 2 new Kconfig variables that are not used by anyone. I checked that
    various "make ARCH=somearch allmodconfig" builds work and do not complain.
    These new Kconfig options need to be added first so that device drivers
    that depend on HMM can be updated.

    Once drivers are updated then I can update the HMM Kconfig to depend on
    this new Kconfig in a followup patch.

    This is about solving Kconfig for HMM: given that device drivers go
    through their own trees, we want to avoid changing them from the mm tree.
    So the plan is:

    1 - Kernel release N: add the new Kconfig to mm/Kconfig (this patch)
    2 - Kernel release N+1: update drivers to depend on the new Kconfig,
        i.e. stop using ARCH_HAS_HMM and start using ARCH_HAS_HMM_MIRROR
        and ARCH_HAS_HMM_DEVICE (one or the other or both, depending
        on the driver)
    3 - Kernel release N+2: remove ARCH_HAS_HMM and do the final Kconfig
        update in mm/Kconfig

    Link: http://lkml.kernel.org/r/20190417211141.17580-1-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Guenter Roeck
    Cc: Leon Romanovsky
    Cc: Jason Gunthorpe
    Cc: Ralph Campbell
    Cc: John Hubbard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • This merges together duplicated patterns of code. Also, replace
    count_memcg_events() with its irq-careless namesake __count_memcg_events(),
    because these call sites are already in an interrupts-disabled context.

    Link: http://lkml.kernel.org/r/2ece1df4-2989-bc9b-6172-61e9fdde5bfd@virtuozzo.com
    Signed-off-by: Kirill Tkhai
    Acked-by: Michal Hocko
    Reviewed-by: Daniel Jordan
    Acked-by: Johannes Weiner
    Cc: Baoquan He
    Cc: Davidlohr Bueso

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • Patch series "mm: Use vm_map_pages() and vm_map_pages_zero() API", v5.

    This patch (of 5):

    Previously, drivers had their own way of mapping a range of kernel
    pages/memory into a user vma, and this was done by invoking
    vm_insert_page() within a loop.

    As this pattern is common across different drivers, it can be generalized
    by creating new functions and using them across the drivers.

    vm_map_pages() is the API which can be used to map kernel memory/pages in
    drivers which have considered vm_pgoff.

    vm_map_pages_zero() is the API which can be used to map a range of kernel
    memory/pages in drivers which have not considered vm_pgoff. vm_pgoff is
    passed as default 0 for those drivers.
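
    A hedged usage sketch from a driver's mmap handler (the driver structure
    and its page array/count are hypothetical; vm_map_pages() takes the vma,
    the page array and the number of pages):

        static int mydrv_mmap(struct file *file, struct vm_area_struct *vma)
        {
                struct mydrv *drv = file->private_data;     /* hypothetical */

                /* Map drv->npages kernel pages into the user vma,
                 * honoring vma->vm_pgoff. */
                return vm_map_pages(vma, drv->pages, drv->npages);
        }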

    We could then, at a later point, "fix" the drivers which are using
    vm_map_pages_zero() to behave according to the normal vm_pgoff offsetting
    simply by removing the _zero suffix on the function name; and if that
    causes regressions, it gives us an easy way to revert.

    Tested on Rockchip hardware and display is working, including talking to
    Lima via prime.

    Link: http://lkml.kernel.org/r/751cb8a0f4c3e67e95c58a3b072937617f338eea.1552921225.git.jrdr.linux@gmail.com
    Signed-off-by: Souptick Joarder
    Suggested-by: Russell King
    Suggested-by: Matthew Wilcox
    Reviewed-by: Mike Rapoport
    Tested-by: Heiko Stuebner
    Cc: Michal Hocko
    Cc: "Kirill A. Shutemov"
    Cc: Vlastimil Babka
    Cc: Rik van Riel
    Cc: Stephen Rothwell
    Cc: Peter Zijlstra
    Cc: Robin Murphy
    Cc: Joonsoo Kim
    Cc: Thierry Reding
    Cc: Kees Cook
    Cc: Marek Szyprowski
    Cc: Stefan Richter
    Cc: Sandy Huang
    Cc: David Airlie
    Cc: Oleksandr Andrushchenko
    Cc: Joerg Roedel
    Cc: Pawel Osciak
    Cc: Kyungmin Park
    Cc: Boris Ostrovsky
    Cc: Juergen Gross
    Cc: Mauro Carvalho Chehab
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Souptick Joarder
     
  • 'default n' is the default value for any bool or tristate Kconfig
    setting so there is no need to write it explicitly.

    Also since commit f467c5640c29 ("kconfig: only write '# CONFIG_FOO
    is not set' for visible symbols") the Kconfig behavior is the same
    regardless of 'default n' being present or not:

    ...
    One side effect of (and the main motivation for) this change is making
    the following two definitions behave exactly the same:

    config FOO
    bool

    config FOO
    bool
    default n

    With this change, neither of these will generate a
    '# CONFIG_FOO is not set' line (assuming FOO isn't selected/implied).
    That might make it clearer to people that a bare 'default n' is
    redundant.
    ...

    Link: http://lkml.kernel.org/r/c3385916-e4d4-37d3-b330-e6b7dff83a52@samsung.com
    Signed-off-by: Bartlomiej Zolnierkiewicz
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bartlomiej Zolnierkiewicz
     
  • With the default overcommit==guess we occasionally run into mmap
    rejections despite plenty of memory that would get dropped under
    pressure but just isn't accounted reclaimable. One example of this is
    dying cgroups pinned by some page cache. A previous case was auxiliary
    path name memory associated with dentries; we have since annotated
    those allocations to avoid overcommit failures (see d79f7aa496fc ("mm:
    treat indirectly reclaimable memory as free in overcommit logic")).

    But trying to classify all allocated memory reliably as reclaimable
    and unreclaimable is a bit of a fool's errand. There could be a myriad
    of dependencies that constantly change with kernel versions.

    It becomes even more questionable of an effort when considering how
    this estimate of available memory is used: it's not compared to the
    system-wide allocated virtual memory in any way. It's not even
    compared to the allocating process's address space. It's compared to
    the single allocation request at hand!

    So we have an elaborate left-hand side of the equation that tries to
    assess the exact breathing room the system has available down to a
    page - and then compare it to an isolated allocation request with no
    additional context. We could fail an allocation of N bytes, but for
    two allocations of N/2 bytes we'd do this elaborate dance twice in a
    row and then still let N bytes of virtual memory through. This doesn't
    make a whole lot of sense.

    Let's take a step back and look at the actual goal of the
    heuristic. From the documentation:

    Heuristic overcommit handling. Obvious overcommits of address
    space are refused. Used for a typical system. It ensures a
    seriously wild allocation fails while allowing overcommit to
    reduce swap usage. root is allowed to allocate slightly more
    memory in this mode. This is the default.

    If all we want to do is catch clearly bogus allocation requests
    irrespective of the general virtual memory situation, the physical
    memory counter-part doesn't need to be that complicated, either.

    When in GUESS mode, catch wild allocations by comparing their request
    size to total amount of ram and swap in the system.
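
    A hedged sketch of what the GUESS branch then reduces to (helper names as
    I recall them from that kernel version):

        if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
                /* A single request larger than RAM + swap is clearly bogus. */
                if (pages > totalram_pages() + total_swap_pages)
                        return -ENOMEM;
                return 0;
        }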

    Link: http://lkml.kernel.org/r/20190412191418.26333-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • All callers of arch_remove_memory() ignore errors, and we should really
    try to remove any errors from the memory removal path. No more errors are
    reported from __remove_pages(). BUG() in the s390x code in case
    arch_remove_memory() is triggered; we may implement that properly later.
    WARN in case the powerpc code fails to remove the section mapping, which
    is better than ignoring the error completely as is done right now.

    Link: http://lkml.kernel.org/r/20190409100148.24703-5-david@redhat.com
    Signed-off-by: David Hildenbrand
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Yoshinori Sato
    Cc: Rich Felker
    Cc: Dave Hansen
    Cc: Andy Lutomirski
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: "H. Peter Anvin"
    Cc: Michal Hocko
    Cc: Mike Rapoport
    Cc: Oscar Salvador
    Cc: "Kirill A. Shutemov"
    Cc: Christophe Leroy
    Cc: Stefan Agner
    Cc: Nicholas Piggin
    Cc: Pavel Tatashin
    Cc: Vasily Gorbik
    Cc: Arun KS
    Cc: Geert Uytterhoeven
    Cc: Masahiro Yamada
    Cc: Rob Herring
    Cc: Joonsoo Kim
    Cc: Wei Yang
    Cc: Qian Cai
    Cc: Mathieu Malaterre
    Cc: Andrew Banman
    Cc: Greg Kroah-Hartman
    Cc: Ingo Molnar
    Cc: Mike Travis
    Cc: Oscar Salvador
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Let's just warn in case a section is not valid instead of failing to
    remove somewhere in the middle of the process, returning an error that
    will be mostly ignored by callers.

    Link: http://lkml.kernel.org/r/20190409100148.24703-4-david@redhat.com
    Signed-off-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: David Hildenbrand
    Cc: Pavel Tatashin
    Cc: Qian Cai
    Cc: Wei Yang
    Cc: Arun KS
    Cc: Mathieu Malaterre
    Cc: Andrew Banman
    Cc: Andy Lutomirski
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Christophe Leroy
    Cc: Dave Hansen
    Cc: Fenghua Yu
    Cc: Geert Uytterhoeven
    Cc: Greg Kroah-Hartman
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Ingo Molnar
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Martin Schwidefsky
    Cc: Masahiro Yamada
    Cc: Michael Ellerman
    Cc: Mike Rapoport
    Cc: Mike Travis
    Cc: Nicholas Piggin
    Cc: Oscar Salvador
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Rich Felker
    Cc: Rob Herring
    Cc: Stefan Agner
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vasily Gorbik
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Failing while removing memory is mostly ignored and cannot really be
    handled. Let's treat errors in unregister_memory_section() in a nice way,
    warning, but continuing.

    Link: http://lkml.kernel.org/r/20190409100148.24703-3-david@redhat.com
    Signed-off-by: David Hildenbrand
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J. Wysocki"
    Cc: Ingo Molnar
    Cc: Andrew Banman
    Cc: Mike Travis
    Cc: David Hildenbrand
    Cc: Oscar Salvador
    Cc: Michal Hocko
    Cc: Pavel Tatashin
    Cc: Qian Cai
    Cc: Wei Yang
    Cc: Arun KS
    Cc: Mathieu Malaterre
    Cc: Andy Lutomirski
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Christophe Leroy
    Cc: Dave Hansen
    Cc: Fenghua Yu
    Cc: Geert Uytterhoeven
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Martin Schwidefsky
    Cc: Masahiro Yamada
    Cc: Michael Ellerman
    Cc: Mike Rapoport
    Cc: Nicholas Piggin
    Cc: Oscar Salvador
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Rich Felker
    Cc: Rob Herring
    Cc: Stefan Agner
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vasily Gorbik
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Patch series "mm/memory_hotplug: Better error handling when removing
    memory", v1.

    Error handling when removing memory is somewhat messed up right now. Some
    errors result in warnings, others are completely ignored. Memory unplug
    code can essentially not deal with errors properly as of now.
    remove_memory() will never fail.

    We have basically two choices:

    1. Allow arch_remove_memory() and friends to fail, propagating errors via
    remove_memory(). Might be problematic (e.g. DIMMs consisting of multiple
    pieces added/removed separately).

    2. Don't allow the functions to fail, handling errors in a nicer way.

    It seems like most errors that can theoretically happen are really corner
    cases and mostly theoretical (e.g. "section not valid"). However e.g.
    aborting removal of sections while all callers simply continue in case of
    errors is not nice.

    If we can guarantee that removal of memory always works (and WARN/skip in
    case of theoretical errors so we can figure out what is going on), we can
    go ahead and implement better error handling when adding memory.

    E.g. via add_memory():

    arch_add_memory()
        ret = do_stuff()
        if (ret) {
            arch_remove_memory();
            goto error;
        }

    Handling here that arch_remove_memory() might fail is basically
    impossible. So I suggest, let's avoid reporting errors while removing
    memory, warning on theoretical errors instead and continuing instead of
    aborting.

    This patch (of 4):

    __add_pages() doesn't add the memory resource, so __remove_pages()
    shouldn't remove it. Let's factor it out. Especially as it is a special
    case for memory used as system memory, added via add_memory() and friends.

    We now remove the resource after removing the sections instead of doing it
    the other way around. I don't think this change is problematic.

    add_memory()
        register memory resource
        arch_add_memory()

    remove_memory()
        arch_remove_memory()
        release memory resource

    While at it, explain why we ignore errors and that this only happens if
    we remove memory at a different granularity than it was added with.

    [david@redhat.com: fix printk warning]
    Link: http://lkml.kernel.org/r/20190417120204.6997-1-david@redhat.com
    Link: http://lkml.kernel.org/r/20190409100148.24703-2-david@redhat.com
    Signed-off-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: David Hildenbrand
    Cc: Pavel Tatashin
    Cc: Wei Yang
    Cc: Qian Cai
    Cc: Arun KS
    Cc: Mathieu Malaterre
    Cc: Andrew Banman
    Cc: Andy Lutomirski
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Christophe Leroy
    Cc: Dave Hansen
    Cc: Fenghua Yu
    Cc: Geert Uytterhoeven
    Cc: Greg Kroah-Hartman
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Ingo Molnar
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Martin Schwidefsky
    Cc: Masahiro Yamada
    Cc: Michael Ellerman
    Cc: Mike Rapoport
    Cc: Mike Travis
    Cc: Nicholas Piggin
    Cc: Oscar Salvador
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Rich Felker
    Cc: Rob Herring
    Cc: Stefan Agner
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vasily Gorbik
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Link: http://lkml.kernel.org/r/20190304155240.19215-1-ldufour@linux.ibm.com
    Signed-off-by: Laurent Dufour
    Reviewed-by: William Kucharski
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laurent Dufour
     
  • arch_add_memory() and __add_pages() take a want_memblock parameter which
    controls whether the newly added memory should get the sysfs memblock
    user API (e.g. ZONE_DEVICE users do not want/need this interface). Some
    callers even want to control where we allocate the memmap from, by
    configuring an altmap.

    Add a more generic hotplug context for arch_add_memory() and __add_pages().
    struct mhp_restrictions contains flags for additional features to be
    enabled by the memory hotplug code (currently MHP_MEMBLOCK_API) and an
    altmap for the alternative memmap allocator.

    This patch shouldn't introduce any functional change.
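
    A hedged sketch of the new context structure (field names as I recall
    them from this series):

        struct mhp_restrictions {
                unsigned long flags;            /* MHP_ flags, e.g. MHP_MEMBLOCK_API */
                struct vmem_altmap *altmap;     /* alternative memmap allocator, if any */
        };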

    [akpm@linux-foundation.org: build fix]
    Link: http://lkml.kernel.org/r/20190408082633.2864-3-osalvador@suse.de
    Signed-off-by: Michal Hocko
    Signed-off-by: Oscar Salvador
    Cc: Dan Williams
    Cc: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • check_pages_isolated_cb currently accounts the whole pfn range as being
    offlined if test_pages_isolated succeeds on the range. This is based on
    the assumption that all pages in the range are freed, which is currently
    the case in most situations, but it won't be with later changes, as pages
    marked as vmemmap won't be isolated.

    Move the offlined pages counting to offline_isolated_pages_cb and rely on
    __offline_isolated_pages to return the correct value.
    check_pages_isolated_cb will still do its primary job and check the pfn
    range.

    While we are at it, remove check_pages_isolated and offline_isolated_pages
    and directly use walk_system_ram_range, as is done in online_pages.

    Link: http://lkml.kernel.org/r/20190408082633.2864-2-osalvador@suse.de
    Reviewed-by: David Hildenbrand
    Signed-off-by: Michal Hocko
    Signed-off-by: Oscar Salvador
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Add yet another iterator, for_each_free_mem_range_in_zone_from, and then
    use it to support initializing and freeing pages in groups no larger than
    MAX_ORDER_NR_PAGES. By doing this we can greatly improve the cache
    locality of the pages while we do several loops over them in the init and
    freeing process.

    We are able to tighten the loops further as a result of the "from"
    iterator as we can perform the initial checks for first_init_pfn in our
    first call to the iterator, and continue without the need for those checks
    via the "from" iterator. I have added this functionality in the function
    called deferred_init_mem_pfn_range_in_zone that primes the iterator and
    causes us to exit if we encounter any failure.

    On my x86_64 test system with 384GB of memory per node I saw a reduction
    in initialization time from 1.85s to 1.38s as a result of this patch.

    Link: http://lkml.kernel.org/r/20190405221231.12227.85836.stgit@localhost.localdomain
    Signed-off-by: Alexander Duyck
    Reviewed-by: Pavel Tatashin
    Cc: Mike Rapoport
    Cc: Michal Hocko
    Cc: Dave Jiang
    Cc: Matthew Wilcox
    Cc: Ingo Molnar
    Cc: Khalid Aziz
    Cc: Mike Rapoport
    Cc: Vlastimil Babka
    Cc: Dan Williams
    Cc: Laurent Dufour
    Cc: Mel Gorman
    Cc: David S. Miller
    Cc: "Kirill A. Shutemov"
    Cc: Pavel Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Duyck