17 May, 2019

1 commit

  • It turned out that DEBUG_SLAB_LEAK is still broken even after the recent
    rescue efforts. When there is a large number of objects like
    kmemleak_object, which is normal on a debug kernel,

    # grep kmemleak /proc/slabinfo
    kmemleak_object 2243606 3436210 ...

    reading /proc/slab_allocators could easily loop forever while processing
    the kmemleak_object cache, and any additional freeing or allocating of
    objects will trigger a reprocessing. To make the situation worse, soft
    lockups could easily happen in this situation, which will call printk()
    to allocate more kmemleak objects and guarantee an infinite loop.

    Also, since it seems no one had noticed when it was totally broken more
    than two years ago - see commit fcf88917dd43 ("slab: fix a crash by
    reading /proc/slab_allocators"), probably nobody cares about it anymore
    due to the decline of SLAB. Just remove it entirely.

    Suggested-by: Vlastimil Babka
    Suggested-by: Linus Torvalds
    Signed-off-by: Qian Cai
    Signed-off-by: Linus Torvalds

    Qian Cai
     

15 May, 2019

39 commits

  • When a cgroup is reclaimed on behalf of a configured limit, reclaim
    needs to round-robin through all NUMA nodes that hold pages of the memcg
    in question. However, when assembling the mask of candidate NUMA nodes,
    the code only consults the *local* cgroup LRU counters, not the
    recursive counters for the entire subtree. Cgroup limits are frequently
    configured against intermediate cgroups that do not have memory on their
    own LRUs. In this case, the node mask will always come up empty and
    reclaim falls back to scanning only the current node.

    If a cgroup subtree has some memory on one node but the processes are
    bound to another node afterwards, the limit reclaim will never age or
    reclaim that memory anymore.

    To fix this, use the recursive LRU counts for a cgroup subtree to
    determine which nodes hold memory of that cgroup.
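
    A minimal sketch of the idea, with a hypothetical helper standing in for
    however the recursive count is read (the scan_nodes nodemask and the node
    iteration macros are real kernel constructs of that era):

        int nid;

        /* Rebuild the candidate nodemask from subtree-wide LRU counts. */
        nodes_clear(memcg->scan_nodes);
        for_each_node_state(nid, N_MEMORY) {
                /* memcg_subtree_lru_pages() is hypothetical: recursive
                 * (subtree) LRU pages of memcg on node nid. */
                if (memcg_subtree_lru_pages(memcg, nid, LRU_ALL))
                        node_set(nid, memcg->scan_nodes);
        }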

    The code has been broken like this forever, so it doesn't seem to be a
    problem in practice. I just noticed it while reviewing the way the LRU
    counters are used in general.

    Link: http://lkml.kernel.org/r/20190412151507.2769-5-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Right now, when somebody needs to know the recursive memory statistics
    and events of a cgroup subtree, they need to walk the entire subtree and
    sum up the counters manually.

    There are two issues with this:

    1. When a cgroup gets deleted, its stats are lost. The state counters
    should all be 0 at that point, of course, but the events are not.
    When this happens, the event counters, which are supposed to be
    monotonic, can go backwards in the parent cgroups.

    2. During regular operation, we always have a certain number of lazily
    freed cgroups sitting around that have been deleted, have no tasks,
    but have a few cache pages remaining. These groups' statistics do not
    change until we eventually hit memory pressure, but somebody
    watching, say, memory.stat on an ancestor has to iterate those every
    time.

    This patch addresses both issues by introducing recursive counters at
    each level that are propagated from the write side when stats change.

    Upward propagation happens when the per-cpu caches spill over into the
    local atomic counter. This is the same thing we do during charge and
    uncharge, except that the latter uses atomic RMWs, which are more
    expensive; stat changes happen at around the same rate. In a sparse
    file test (page faults and reclaim at maximum CPU speed) with 5 cgroup
    nesting levels, perf shows __mod_memcg_page_state at ~1%.
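
    The write-side propagation can be pictured with a minimal sketch (the
    structure, field and helper names here are hypothetical; the per-cpu
    primitives and atomic_long_add() are real kernel APIs):

        #define STAT_BATCH 32   /* spill threshold, value illustrative */

        static void mod_stat_recursive(struct my_memcg *memcg, int idx, int val)
        {
                long x = val + __this_cpu_read(memcg->stat_cpu->count[idx]);

                if (unlikely(abs(x) > STAT_BATCH)) {
                        struct my_memcg *mi;

                        /* Fold the batched delta into this cgroup and every
                         * ancestor so readers see recursive totals in O(1). */
                        for (mi = memcg; mi; mi = parent_of(mi))
                                atomic_long_add(x, &mi->stat[idx]);
                        x = 0;
                }
                __this_cpu_write(memcg->stat_cpu->count[idx], x);
        }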

    Link: http://lkml.kernel.org/r/20190412151507.2769-4-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • These are getting too big to be inlined in every callsite. They were
    stolen from vmstat.c, which already out-of-lines them, and they have
    only been growing since. The callsites aren't that hot, either.

    Move __mod_memcg_state(), __mod_lruvec_state() and __count_memcg_events()
    out of line and add kerneldoc comments.

    Link: http://lkml.kernel.org/r/20190412151507.2769-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Patch series "mm: memcontrol: memory.stat cost & correctness".

    The cgroup memory.stat file holds recursive statistics for the entire
    subtree. The current implementation does this tree walk on-demand
    whenever the file is read. This is giving us problems in production.

    1. The cost of aggregating the statistics on-demand is high. A lot of
    system service cgroups are mostly idle and their stats don't change
    between reads, yet we always have to check them. There are also always
    some lazily-dying cgroups sitting around that are pinned by a handful
    of remaining page cache; the same applies to them.

    In an application that periodically monitors memory.stat in our
    fleet, we have seen the aggregation consume up to 5% CPU time.

    2. When cgroups die and disappear from the cgroup tree, so do their
    accumulated vm events. The result is that the event counters at
    higher-level cgroups can go backwards and confuse some of our
    automation, let alone people looking at the graphs over time.

    To address both issues, this patch series changes the stat
    implementation to spill counts upwards when the counters change.

    The upward spilling is batched using the existing per-cpu cache. In a
    sparse file stress test with 5 level cgroup nesting, the additional cost
    of the flushing was negligible (a little under 1% of CPU at 100% CPU
    utilization, compared to the 5% of reading memory.stat during regular
    operation).

    This patch (of 4):

    memcg_page_state(), lruvec_page_state(), memcg_sum_events() are
    currently returning the state of the local memcg or lruvec, not the
    recursive state.

    In practice there is a demand for both versions, although the callers
    that want the recursive counts currently sum them up by hand.

    By default, cgroups are considered recursive entities, and generally we
    expect more users of the recursive counters, with the local counts being
    special cases. To reflect that in the name, add a _local suffix to the
    current implementations.

    The following patch will re-incarnate these functions with recursive
    semantics, but with an O(1) implementation.

    [hannes@cmpxchg.org: fix bisection hole]
    Link: http://lkml.kernel.org/r/20190417160347.GC23013@cmpxchg.org
    Link: http://lkml.kernel.org/r/20190412151507.2769-2-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • I spent literally an hour trying to work out why an earlier version of
    my memory.events aggregation code doesn't work properly, only to find
    out I was calling memcg->events instead of memcg->memory_events, which
    is fairly confusing.

    This naming seems in need of reworking, so make it harder to do the
    wrong thing by using vmevents instead of events, which makes it more
    clear that these are vm counters rather than memcg-specific counters.

    There are also a few other inconsistent names in both the percpu and
    aggregated structs, so these are all cleaned up to be more coherent and
    easy to understand.

    This commit contains code cleanup only: there are no logic changes.

    [akpm@linux-foundation.org: fix it for preceding changes]
    Link: http://lkml.kernel.org/r/20190208224319.GA23801@chrisdown.name
    Signed-off-by: Chris Down
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Roman Gushchin
    Cc: Dennis Zhou
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Down
     
  • The semantics of what mincore() considers to be resident is not
    completely clear, but Linux has always (since 2.3.52, which is when
    mincore() was initially done) treated it as "page is available in page
    cache".

    That's potentially a problem, as that [in]directly exposes
    meta-information about pagecache / memory mapping state even about
    memory not strictly belonging to the process executing the syscall,
    opening possibilities for sidechannel attacks.

    Change the semantics of mincore() so that it only reveals pagecache
    information for non-anonymous mappings that belong to files that the
    calling process could (if it tried to) successfully open for writing;
    otherwise we'd be including shared non-exclusive mappings, which

    - is the sidechannel

    - is not the usecase for mincore(), as that's primarily used for data,
    not (shared) text
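
    For reference, a self-contained userspace probe of the kind this change
    constrains (not part of the patch): it maps the first page of a file
    MAP_SHARED read-only and asks mincore() whether it is resident. Under the
    new semantics, a caller that cannot open the file for writing no longer
    gets meaningful residency information for it:

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <unistd.h>

        int main(int argc, char **argv)
        {
                long page = sysconf(_SC_PAGESIZE);
                unsigned char vec;
                void *p;
                int fd;

                if (argc < 2)
                        return 1;
                fd = open(argv[1], O_RDONLY);
                if (fd < 0)
                        return 1;
                p = mmap(NULL, page, PROT_READ, MAP_SHARED, fd, 0);
                if (p == MAP_FAILED)
                        return 1;
                if (mincore(p, page, &vec))
                        perror("mincore");
                else
                        printf("first page resident: %d\n", vec & 1);
                return 0;
        }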

    [jkosina@suse.cz: v2]
    Link: http://lkml.kernel.org/r/20190312141708.6652-2-vbabka@suse.cz
    [mhocko@suse.com: restructure can_do_mincore() conditions]
    Link: http://lkml.kernel.org/r/nycvar.YFH.7.76.1903062342020.19912@cbobk.fhfr.pm
    Signed-off-by: Jiri Kosina
    Signed-off-by: Vlastimil Babka
    Acked-by: Josh Snyder
    Acked-by: Michal Hocko
    Originally-by: Linus Torvalds
    Originally-by: Dominique Martinet
    Cc: Andy Lutomirski
    Cc: Dave Chinner
    Cc: Kevin Easton
    Cc: Matthew Wilcox
    Cc: Cyril Hrubis
    Cc: Tejun Heo
    Cc: Kirill A. Shutemov
    Cc: Daniel Gruss
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Kosina
     
  • When freeing a page with an order >= shuffle_page_order, randomly select
    the front or back of the list for insertion.

    While the mm tries to defragment physical pages into huge pages this can
    tend to make the page allocator more predictable over time. Inject the
    front-back randomness to preserve the initial randomness established by
    shuffle_free_memory() when the kernel was booted.

    The overhead of this manipulation is constrained by only being applied
    for MAX_ORDER sized pages by default.
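
    A hedged sketch of the free-path randomness (the helper names are
    approximations of what this series introduces; the real patch caches
    random bits per CPU rather than calling the RNG on every free):

        if (is_shuffle_order(order) && (get_random_long() & 1))
                add_to_free_area_tail(page, &zone->free_area[order],
                                      migratetype);
        else
                add_to_free_area(page, &zone->free_area[order], migratetype);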

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/154899812788.3165233.9066631950746578517.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: Kees Cook
    Cc: Michal Hocko
    Cc: Dave Hansen
    Cc: Keith Busch
    Cc: Robert Elliott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • In preparation for runtime randomization of the zone lists, take all
    (well, most of) the list_*() functions in the buddy allocator and put
    them in helper functions. Provide a common control point for injecting
    additional behavior when freeing pages.
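
    A hedged sketch of the kind of helpers this introduces (the actual set in
    the series is larger; del_page_from_free_area() is one of the real names
    mentioned below):

        static inline void add_to_free_area(struct page *page,
                                            struct free_area *area,
                                            int migratetype)
        {
                list_add(&page->lru, &area->free_list[migratetype]);
                area->nr_free++;
        }

        static inline void del_page_from_free_area(struct page *page,
                                                   struct free_area *area)
        {
                list_del(&page->lru);
                __ClearPageBuddy(page);
                set_page_private(page, 0);
                area->nr_free--;
        }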

    [dan.j.williams@intel.com: fix buddy list helpers]
    Link: http://lkml.kernel.org/r/155033679702.1773410.13041474192173212653.stgit@dwillia2-desk3.amr.corp.intel.com
    [vbabka@suse.cz: remove del_page_from_free_area() migratetype parameter]
    Link: http://lkml.kernel.org/r/4672701b-6775-6efd-0797-b6242591419e@suse.cz
    Link: http://lkml.kernel.org/r/154899812264.3165233.5219320056406926223.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Signed-off-by: Vlastimil Babka
    Tested-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Dave Hansen
    Cc: Kees Cook
    Cc: Keith Busch
    Cc: Robert Elliott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Patch series "mm: Randomize free memory", v10.

    This patch (of 3):

    Randomization of the page allocator improves the average utilization of
    a direct-mapped memory-side-cache. Memory side caching is a platform
    capability that Linux has been previously exposed to in HPC
    (high-performance computing) environments on specialty platforms. In
    that instance it was a smaller pool of high-bandwidth-memory relative to
    higher-capacity / lower-bandwidth DRAM. Now, this capability is going
    to be found on general purpose server platforms where DRAM is a cache in
    front of higher latency persistent memory [1].

    Robert offered an explanation of the state of the art of Linux
    interactions with memory-side-caches [2], and I copy it here:

    It's been a problem in the HPC space:
    http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/

    A kernel module called zonesort is available to try to help:
    https://software.intel.com/en-us/articles/xeon-phi-software

    and this abandoned patch series proposed that for the kernel:
    https://lkml.kernel.org/r/20170823100205.17311-1-lukasz.daniluk@intel.com

    Dan's patch series doesn't attempt to ensure buffers won't conflict, but
    it does reduce the chance that they will. This will make performance
    more consistent, albeit slower than "optimal" (which is near impossible
    to attain in a general-purpose kernel). That's better than forcing
    users to deploy remedies like:
    "To eliminate this gradual degradation, we have added a Stream
    measurement to the Node Health Check that follows each job;
    nodes are rebooted whenever their measured memory bandwidth
    falls below 300 GB/s."

    A replacement for zonesort was merged upstream in commit cc9aec03e58f
    ("x86/numa_emulation: Introduce uniform split capability"). With this
    numa_emulation capability, memory can be split into cache sized
    ("near-memory" sized) numa nodes. A bind operation to such a node, and
    disabling workloads on other nodes, enables full cache performance.
    However, once the workload exceeds the cache size then cache conflicts
    are unavoidable. While HPC environments might be able to tolerate
    time-scheduling of cache sized workloads, for general purpose server
    platforms, the oversubscribed cache case will be the common case.

    The worst case scenario is that a server system owner benchmarks a
    workload at boot with an un-contended cache only to see that performance
    degrade over time, even below the average cache performance due to
    excessive conflicts. Randomization clips the peaks and fills in the
    valleys of cache utilization to yield steady average performance.

    Here are some performance impact details of the patches:

    1/ An Intel internal synthetic memory bandwidth measurement tool, saw a
    3X speedup in a contrived case that tries to force cache conflicts.
    The contrived case used the numa_emulation capability to force an
    instance of the benchmark to be run in two of the near-memory sized
    numa nodes. If both instances were placed on the same emulated node,
    they would fit and cause zero conflicts. While on separate emulated nodes
    without randomization they underutilized the cache and conflicted
    unnecessarily due to the in-order allocation per node.

    2/ A well known Java server application benchmark was run with a heap
    size that exceeded cache size by 3X. The cache conflict rate was 8%
    for the first run and degraded to 21% after page allocator aging. With
    randomization enabled the rate levelled out at 11%.

    3/ A MongoDB workload did not observe measurable difference in
    cache-conflict rates, but the overall throughput dropped by 7% with
    randomization in one case.

    4/ Mel Gorman ran his suite of performance workloads with randomization
    enabled on platforms without a memory-side-cache and saw a mix of some
    improvements and some losses [3].

    While there is potentially significant improvement for applications that
    depend on low latency access across a wide working-set, the performance
    may be negligible to negative for other workloads. For this reason the
    shuffle capability defaults to off unless a direct-mapped
    memory-side-cache is detected. Even then, the page_alloc.shuffle=0
    parameter can be specified to disable the randomization on those systems.

    Outside of memory-side-cache utilization concerns there is potentially
    security benefit from randomization. Some data exfiltration and
    return-oriented-programming attacks rely on the ability to infer the
    location of sensitive data objects. The kernel page allocator, especially
    early in system boot, has predictable first-in-first out behavior for
    physical pages. Pages are freed in physical address order when first
    onlined.

    Quoting Kees:
    "While we already have a base-address randomization
    (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
    memory layouts would certainly be using the predictability of
    allocation ordering (i.e. for attacks where the base address isn't
    important: only the relative positions between allocated memory).
    This is common in lots of heap-style attacks. They try to gain
    control over ordering by spraying allocations, etc.

    I'd really like to see this because it gives us something similar
    to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."

    While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
    caches, it leaves the vast bulk of memory to be allocated predictably, in
    order.
    However, it should be noted, the concrete security benefits are hard to
    quantify, and no known CVE is mitigated by this randomization.

    Introduce shuffle_free_memory(), and its helper shuffle_zone(), to perform
    a Fisher-Yates shuffle of the page allocator 'free_area' lists when they
    are initially populated with free memory at boot and at hotplug time. Do
    this based on either the presence of a page_alloc.shuffle=Y command line
    parameter, or autodetection of a memory-side-cache (to be added in a
    follow-on patch).
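
    For reference, the classic Fisher-Yates shuffle being applied here, shown
    over a plain array as a self-contained illustration (the kernel version
    instead swaps MAX_ORDER-1 sized pages picked at random pfns within a
    zone):

        #include <stdlib.h>

        /* Uniformly shuffle a[0..n-1] in place; seed with srand() as needed. */
        static void fisher_yates(int *a, int n)
        {
                for (int i = n - 1; i > 0; i--) {
                        int j = rand() % (i + 1);   /* 0 <= j <= i */
                        int tmp = a[i];

                        a[i] = a[j];
                        a[j] = tmp;
                }
        }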

    The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
    pages, where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1, i.e.
    10 (4MB); this trades off randomization granularity for time spent
    shuffling. MAX_ORDER-1 was chosen to be minimally invasive to the page
    allocator while still showing memory-side cache behavior improvements,
    with the expectation that the security implications of finer-granularity
    randomization are mitigated by CONFIG_SLAB_FREELIST_RANDOM. The
    performance impact of the shuffling appears to be in the noise compared
    to other memory initialization work.

    This initial randomization can be undone over time, so a follow-on patch
    is introduced to inject entropy on page free decisions. It is reasonable
    to ask whether the page free entropy alone would be sufficient, but it is
    not, due to the in-order initial freeing of pages. At the start of that
    process, putting page1 in front of or behind page0 still keeps them close
    together; page2 is still near page1 and has a high chance of being
    adjacent. As more pages are added, ordering diversity improves, but there
    is still high page locality for the low address pages, and this leads to
    no significant impact on the cache conflict rate.

    [1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
    [2]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM
    [3]: https://lkml.org/lkml/2018/10/12/309

    [dan.j.williams@intel.com: fix shuffle enable]
    Link: http://lkml.kernel.org/r/154943713038.3858443.4125180191382062871.stgit@dwillia2-desk3.amr.corp.intel.com
    [cai@lca.pw: fix SHUFFLE_PAGE_ALLOCATOR help texts]
    Link: http://lkml.kernel.org/r/20190425201300.75650-1-cai@lca.pw
    Link: http://lkml.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Signed-off-by: Qian Cai
    Reviewed-by: Kees Cook
    Acked-by: Michal Hocko
    Cc: Dave Hansen
    Cc: Keith Busch
    Cc: Robert Elliott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • The vmap_lazy_nr variable has type atomic_t, which is a 4-byte integer
    on both 32-bit and 64-bit systems. lazy_max_pages() deals with an
    "unsigned long", which is 8 bytes on 64-bit systems, so vmap_lazy_nr
    should be 8 bytes on 64-bit as well.
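
    The resulting declaration is roughly the following (to the best of my
    knowledge of mm/vmalloc.c in that era):

        /* was: static atomic_t vmap_lazy_nr = ATOMIC_INIT(0); */
        static atomic_long_t vmap_lazy_nr = ATOMIC_LONG_INIT(0);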

    Link: http://lkml.kernel.org/r/20190131162452.25879-1-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Reviewed-by: Andrew Morton
    Reviewed-by: William Kucharski
    Cc: Michal Hocko
    Cc: Matthew Wilcox
    Cc: Thomas Garnier
    Cc: Oleksiy Avramchenko
    Cc: Joel Fernandes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     
  • Commit 763b218ddfaf ("mm: add preempt points into __purge_vmap_area_lazy()")
    introduced some preempt points, one of which prioritizes an allocation
    over the lazy freeing of vmap areas.

    Prioritizing an allocation over freeing does not work well all the time;
    it should rather be a compromise:

    1) The number of lazy pages directly influences the busy list length and
    thus operations like allocation, lookup, unmap, remove, etc.

    2) Under heavy stress of the vmalloc subsystem I ran into a situation
    where memory usage kept increasing until hitting the out_of_memory ->
    panic state, because the logic that frees vmap areas in the
    __purge_vmap_area_lazy() function was completely blocked.

    Establish a threshold beyond which freeing is prioritized back over
    allocation, creating a balance between the two.
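
    A hedged sketch of where such a threshold acts (identifier names are an
    assumption; cond_resched_lock() is the existing preempt point being
    gated):

        unsigned long resched_threshold = lazy_max_pages() << 1;

        /* ... inside the purge loop over lazily freed vmap areas ... */
        if (atomic_long_read(&vmap_lazy_nr) < resched_threshold)
                cond_resched_lock(&vmap_area_lock);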

    Using the vmalloc test driver in "stress mode", i.e. when all available
    test cases are run simultaneously on all online CPUs, applying pressure
    on the vmalloc subsystem, my HiKey 960 board runs out of memory because
    __purge_vmap_area_lazy() simply is not able to free pages in time.

    How I run it:

    1) You should build your kernel with CONFIG_TEST_VMALLOC=m
    2) ./tools/testing/selftests/vm/test_vmalloc.sh stress

    During this test "vmap_lazy_nr" pages will go far beyond the acceptable
    lazy_max_pages() threshold, which will lead to an enormous busy list size
    and other problems, including increased allocation time and so on.

    Link: http://lkml.kernel.org/r/20190124115648.9433-3-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Reviewed-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Matthew Wilcox
    Cc: Thomas Garnier
    Cc: Oleksiy Avramchenko
    Cc: Steven Rostedt
    Cc: Joel Fernandes
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Tejun Heo
    Cc: Joel Fernandes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     
  • Commit 0139aa7b7fa ("mm: rename _count, field of the struct page, to
    _refcount") left out a couple of references to the old field name. Fix
    that.

    Link: http://lkml.kernel.org/r/cedf87b02eb8a6b3eac57e8e91da53fb15c3c44c.1556537475.git.baruch@tkos.co.il
    Fixes: 0139aa7b7fa ("mm: rename _count, field of the struct page, to _refcount")
    Signed-off-by: Baruch Siach
    Reviewed-by: Andrew Morton
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Baruch Siach
     
  • Merge misc updates from Andrew Morton:

    - a few misc things and hotfixes

    - ocfs2

    - almost all of MM

    * emailed patches from Andrew Morton: (139 commits)
    kernel/memremap.c: remove the unused device_private_entry_fault() export
    mm: delete find_get_entries_tag
    mm/huge_memory.c: make __thp_get_unmapped_area static
    mm/mprotect.c: fix compilation warning because of unused 'mm' variable
    mm/page-writeback: introduce tracepoint for wait_on_page_writeback()
    mm/vmscan: simplify trace_reclaim_flags and trace_shrink_flags
    mm/Kconfig: update "Memory Model" help text
    mm/vmscan.c: don't disable irq again when count pgrefill for memcg
    mm: memblock: make keeping memblock memory opt-in rather than opt-out
    hugetlbfs: always use address space in inode for resv_map pointer
    mm/z3fold.c: support page migration
    mm/z3fold.c: add structure for buddy handles
    mm/z3fold.c: improve compression by extending search
    mm/z3fold.c: introduce helper functions
    mm/page_alloc.c: remove unnecessary parameter in rmqueue_pcplist
    mm/hmm: add ARCH_HAS_HMM_MIRROR ARCH_HAS_HMM_DEVICE Kconfig
    mm/vmscan.c: simplify shrink_inactive_list()
    fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback
    xen/privcmd-buf.c: convert to use vm_map_pages_zero()
    xen/gntdev.c: convert to use vm_map_pages()
    ...

    Linus Torvalds
     
  • I removed the only user of this and hadn't noticed it was now unused.

    Link: http://lkml.kernel.org/r/20190430152929.21813-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • __thp_get_unmapped_area is only used in mm/huge_memory.c. Make it static.
    Tested by building and booting the kernel.

    Link: http://lkml.kernel.org/r/20190504102353.GA22525@bharath12345-Inspiron-5559
    Signed-off-by: Bharath Vedartham
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bharath Vedartham
     
  • Since 0cbe3e26abe0 ("mm: update ptep_modify_prot_start/commit to take
    vm_area_struct as arg") the only place that uses the local 'mm' variable
    in change_pte_range() is the call to set_pte_at().

    Many architectures define set_pte_at() as a macro that does not use the
    'mm' parameter, which generates the following compilation warning:

    CC mm/mprotect.o
    mm/mprotect.c: In function 'change_pte_range':
    mm/mprotect.c:42:20: warning: unused variable 'mm' [-Wunused-variable]
    struct mm_struct *mm = vma->vm_mm;
    ^~

    Fix it by passing vma->vm_mm to set_pte_at() and dropping the local 'mm'
    variable in change_pte_range().
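
    The call site then looks roughly like this (surrounding variable names
    are an assumption):

        /* No local 'mm' variable; pass the mm via the vma directly. */
        set_pte_at(vma->vm_mm, addr, pte, newpte);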

    [liu.song.a23@gmail.com: fix missed conversions]
    Link: http://lkml.kernel.org/r/CAPhsuW6wcQgYLHNdBdw6m0YiR4RWsS4XzfpSKU7wBLLeOCTbpw@mail.gmail.com
    Link: http://lkml.kernel.org/r/1557305432-4940-1-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Cc: Song Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Recently there have been some hung tasks on our server due to
    wait_on_page_writeback(), and we want to know the details of this
    PG_writeback, i.e. which device this page is being written back to. But
    it is not so convenient to get those details.

    I think it would be better to introduce a tracepoint for diagnosing the
    writeback details.

    Link: http://lkml.kernel.org/r/1556274402-19018-1-git-send-email-laoar.shao@gmail.com
    Signed-off-by: Yafang Shao
    Reviewed-by: Andrew Morton
    Cc: Jan Kara
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yafang Shao
     
  • The help text describing the memory model selection is outdated. It
    still says that SPARSEMEM is experimental and that DISCONTIGMEM is
    preferred over SPARSEMEM.

    Update the help text for the relevant options:
    * add a generic help for the "Memory Model" prompt
    * add description for FLATMEM
    * reduce the description of DISCONTIGMEM and add a deprecation note
    * prefer SPARSEMEM over DISCONTIGMEM

    Link: http://lkml.kernel.org/r/1556188531-20728-1-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • We can use __count_memcg_events() directly because this callsite is
    already protected by spin_lock_irq().

    Link: http://lkml.kernel.org/r/1556093494-30798-1-git-send-email-laoar.shao@gmail.com
    Signed-off-by: Yafang Shao
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yafang Shao
     
  • Most architectures do not need the memblock memory after the page
    allocator is initialized, but only a few enable ARCH_DISCARD_MEMBLOCK in
    the arch Kconfig.

    Replacing ARCH_DISCARD_MEMBLOCK with ARCH_KEEP_MEMBLOCK and inverting the
    logic makes it clear which architectures actually use memblock after
    system initialization, and avoids the need to add ARCH_DISCARD_MEMBLOCK
    to the architectures that are still missing that option.

    Link: http://lkml.kernel.org/r/1556102150-32517-1-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Michael Ellerman (powerpc)
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Richard Kuo
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Geert Uytterhoeven
    Cc: Ralf Baechle
    Cc: Paul Burton
    Cc: James Hogan
    Cc: Ley Foon Tan
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Yoshinori Sato
    Cc: Rich Felker
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: "H. Peter Anvin"
    Cc: Eric Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Continuing discussion about 58b6e5e8f1ad ("hugetlbfs: fix memory leak for
    resv_map") brought up the issue that inode->i_mapping may not point to the
    address space embedded within the inode at inode eviction time. The
    hugetlbfs truncate routine handles this by explicitly using inode->i_data.
    However, code cleaning up the resv_map will still use the address space
    pointed to by inode->i_mapping. Luckily, private_data is NULL for address
    spaces in all such cases today, but there is no guarantee this will
    continue.

    Change all hugetlbfs code getting a resv_map pointer to explicitly get it
    from the address space embedded within the inode. In addition, add more
    comments in the code to indicate why this is being done.
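
    A hedged sketch of the resulting accessor (the inode/address_space fields
    are real; whether the helper is spelled exactly like this is an
    assumption):

        static struct resv_map *inode_resv_map(struct inode *inode)
        {
                /* Use the address space embedded in the inode (i_data),
                 * never the possibly-redirected inode->i_mapping. */
                return (struct resv_map *)(&inode->i_data)->private_data;
        }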

    Link: http://lkml.kernel.org/r/20190419204435.16984-1-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reported-by: Yufen Yu
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: "Kirill A . Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Now that we are not using page address in handles directly, we can make
    z3fold pages movable to decrease the memory fragmentation z3fold may
    create over time.

    This patch starts advertising non-headless z3fold pages as movable and
    uses the existing kernel infrastructure to implement moving of such pages
    at the memory management subsystem's request. It thus implements the 3
    required callbacks for page migration (a wiring sketch follows the list):

    * isolation callback: z3fold_page_isolate(): try to isolate the page by
    removing it from all lists. Pages scheduled for some activity and
    mapped pages will not be isolated. Return true if isolation was
    successful or false otherwise

    * migration callback: z3fold_page_migrate(): re-check critical
    conditions and migrate page contents to the new page provided by the
    memory subsystem. Returns 0 on success or negative error code otherwise

    * putback callback: z3fold_page_putback(): put back the page if
    z3fold_page_migrate() for it failed permanently (i. e. not with
    -EAGAIN code).
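
    A hedged sketch of how such callbacks were typically wired into the
    movable-page infrastructure of that era (the aops wiring shown here is an
    assumption; the three callback names come from the description above):

        static const struct address_space_operations z3fold_aops = {
                .isolate_page   = z3fold_page_isolate,
                .migratepage    = z3fold_page_migrate,
                .putback_page   = z3fold_page_putback,
        };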

    [lkp@intel.com: z3fold_page_isolate() can be static]
    Link: http://lkml.kernel.org/r/20190419130924.GA161478@ivb42
    Link: http://lkml.kernel.org/r/20190417103922.31253da5c366c4ebe0419cfc@gmail.com
    Signed-off-by: Vitaly Wool
    Signed-off-by: kbuild test robot
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Dan Streetman
    Cc: Krzysztof Kozlowski
    Cc: Oleksiy Avramchenko
    Cc: Uladzislau Rezki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool
     
  • For z3fold to be able to move its pages per request of the memory
    subsystem, it should not use direct object addresses in handles. Instead,
    it will create abstract handles (3 per page) which will contain pointers
    to z3fold objects. Thus, it will be possible to change these pointers
    when z3fold page is moved.

    Link: http://lkml.kernel.org/r/20190417103826.484eaf18c1294d682769880f@gmail.com
    Signed-off-by: Vitaly Wool
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Dan Streetman
    Cc: Krzysztof Kozlowski
    Cc: Oleksiy Avramchenko
    Cc: Uladzislau Rezki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool
     
  • The current z3fold implementation only searches this CPU's page lists for
    a fitting page to put a new object into. This patch adds quick search for
    very well fitting pages (i.e. those having exactly the required amount
    of free space) on other CPUs too, before allocating a new page for that
    object.

    Link: http://lkml.kernel.org/r/20190417103733.72ae81abe1552397c95a008e@gmail.com
    Signed-off-by: Vitaly Wool
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Dan Streetman
    Cc: Krzysztof Kozlowski
    Cc: Oleksiy Avramchenko
    Cc: Uladzislau Rezki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool
     
  • Patch series "z3fold: support page migration", v2.

    This patchset implements page migration support and slightly better buddy
    search. To implement page migration support, z3fold has to move away from
    the current scheme of handle encoding, i.e. stop encoding the page address
    in handles. Instead, a small per-page structure is created which will
    contain actual addresses for z3fold objects, while pointers to fields of
    that structure will be used as handles.

    Thus, it will be possible to change the underlying addresses to reflect
    page migration.

    To support migration itself, 3 callbacks will be implemented:

    1: isolation callback: z3fold_page_isolate(): try to isolate the page
    by removing it from all lists. Pages scheduled for some activity and
    mapped pages will not be isolated. Return true if isolation was
    successful or false otherwise

    2: migration callback: z3fold_page_migrate(): re-check critical
    conditions and migrate page contents to the new page provided by the
    system. Returns 0 on success or negative error code otherwise

    3: putback callback: z3fold_page_putback(): put back the page if
    z3fold_page_migrate() for it failed permanently (i. e. not with
    -EAGAIN code).

    To make sure an isolated page doesn't get freed, its kref is incremented
    in z3fold_page_isolate() and decremented during post-migration compaction,
    if migration was successful, or by z3fold_page_putback() in the other
    case.

    Since the new handle encoding scheme implies slight memory consumption
    increase, better buddy search (which decreases memory consumption) is
    included in this patchset.

    This patch (of 4):

    Introduce a separate helper function for object allocation, as well as 2
    smaller helpers to add a buddy to the list and to get a pointer to the
    pool from the z3fold header. No functional changes here.

    Link: http://lkml.kernel.org/r/20190417103633.a4bb770b5bf0fb7e43ce1666@gmail.com
    Signed-off-by: Vitaly Wool
    Cc: Dan Streetman
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Krzysztof Kozlowski
    Cc: Oleksiy Avramchenko
    Cc: Uladzislau Rezki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool
     
  • Because rmqueue_pcplist() is only called when order is 0, we don't need to
    use order as a parameter.

    Link: http://lkml.kernel.org/r/1555591709-11744-1-git-send-email-laoar.shao@gmail.com
    Signed-off-by: Yafang Shao
    Acked-by: Michal Hocko
    Acked-by: Pankaj Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yafang Shao
     
  • Add 2 new Kconfig variables that are not used by anyone. I checked that
    various "make ARCH=somearch allmodconfig" builds work and do not complain.
    These new Kconfig options need to be added first so that device drivers
    that depend on HMM can be updated.

    Once drivers are updated then I can update the HMM Kconfig to depend on
    this new Kconfig in a followup patch.

    This is about solving Kconfig for HMM: given that device drivers go
    through their own trees, we want to avoid changing them from the mm tree.
    So the plan is:

    1 - Kernel release N: add the new Kconfig to mm/Kconfig (this patch)
    2 - Kernel release N+1: update drivers to depend on the new Kconfig,
        i.e. stop using ARCH_HAS_HMM and start using ARCH_HAS_HMM_MIRROR
        and ARCH_HAS_HMM_DEVICE (one or the other or both, depending
        on the driver)
    3 - Kernel release N+2: remove ARCH_HAS_HMM and do the final Kconfig
        update in mm/Kconfig

    Link: http://lkml.kernel.org/r/20190417211141.17580-1-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Guenter Roeck
    Cc: Leon Romanovsky
    Cc: Jason Gunthorpe
    Cc: Ralph Campbell
    Cc: John Hubbard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • This merges together duplicated patterns of code. Also, replace
    count_memcg_events() with its irq-careless namesake __count_memcg_events(),
    because these call sites are already in an interrupts-disabled context.

    Link: http://lkml.kernel.org/r/2ece1df4-2989-bc9b-6172-61e9fdde5bfd@virtuozzo.com
    Signed-off-by: Kirill Tkhai
    Acked-by: Michal Hocko
    Reviewed-by: Daniel Jordan
    Acked-by: Johannes Weiner
    Cc: Baoquan He
    Cc: Davidlohr Bueso

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • Patch series "mm: Use vm_map_pages() and vm_map_pages_zero() API", v5.

    This patch (of 5):

    Previously, drivers had their own way of mapping a range of kernel
    pages/memory into a user vma, and this was done by invoking
    vm_insert_page() within a loop.

    As this pattern is common across different drivers, it can be generalized
    by creating new functions and using them across the drivers.

    vm_map_pages() is the API which can be used to map kernel memory/pages in
    drivers which have considered vm_pgoff.

    vm_map_pages_zero() is the API which can be used to map a range of kernel
    memory/pages in drivers which have not considered vm_pgoff. vm_pgoff is
    passed as default 0 for those drivers.
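
    A hedged usage sketch from a driver's mmap handler (the driver structure
    and its page array/count are hypothetical; vm_map_pages() takes the vma,
    the page array and the number of pages):

        static int mydrv_mmap(struct file *file, struct vm_area_struct *vma)
        {
                struct mydrv *drv = file->private_data;     /* hypothetical */

                /* Map drv->npages kernel pages into the user vma,
                 * honoring vma->vm_pgoff. */
                return vm_map_pages(vma, drv->pages, drv->npages);
        }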

    We could then, at a later point, "fix" the drivers which are using
    vm_map_pages_zero() to behave according to the normal vm_pgoff offsetting
    simply by removing the _zero suffix on the function name; and if that
    causes regressions, it gives us an easy way to revert.

    Tested on Rockchip hardware and display is working, including talking to
    Lima via prime.

    Link: http://lkml.kernel.org/r/751cb8a0f4c3e67e95c58a3b072937617f338eea.1552921225.git.jrdr.linux@gmail.com
    Signed-off-by: Souptick Joarder
    Suggested-by: Russell King
    Suggested-by: Matthew Wilcox
    Reviewed-by: Mike Rapoport
    Tested-by: Heiko Stuebner
    Cc: Michal Hocko
    Cc: "Kirill A. Shutemov"
    Cc: Vlastimil Babka
    Cc: Rik van Riel
    Cc: Stephen Rothwell
    Cc: Peter Zijlstra
    Cc: Robin Murphy
    Cc: Joonsoo Kim
    Cc: Thierry Reding
    Cc: Kees Cook
    Cc: Marek Szyprowski
    Cc: Stefan Richter
    Cc: Sandy Huang
    Cc: David Airlie
    Cc: Oleksandr Andrushchenko
    Cc: Joerg Roedel
    Cc: Pawel Osciak
    Cc: Kyungmin Park
    Cc: Boris Ostrovsky
    Cc: Juergen Gross
    Cc: Mauro Carvalho Chehab
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Souptick Joarder
     
  • 'default n' is the default value for any bool or tristate Kconfig
    setting so there is no need to write it explicitly.

    Also since commit f467c5640c29 ("kconfig: only write '# CONFIG_FOO
    is not set' for visible symbols") the Kconfig behavior is the same
    regardless of 'default n' being present or not:

    ...
    One side effect of (and the main motivation for) this change is making
    the following two definitions behave exactly the same:

    config FOO
    bool

    config FOO
    bool
    default n

    With this change, neither of these will generate a
    '# CONFIG_FOO is not set' line (assuming FOO isn't selected/implied).
    That might make it clearer to people that a bare 'default n' is
    redundant.
    ...

    Link: http://lkml.kernel.org/r/c3385916-e4d4-37d3-b330-e6b7dff83a52@samsung.com
    Signed-off-by: Bartlomiej Zolnierkiewicz
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bartlomiej Zolnierkiewicz
     
  • With the default overcommit==guess we occasionally run into mmap
    rejections despite plenty of memory that would get dropped under
    pressure but just isn't accounted reclaimable. One example of this is
    dying cgroups pinned by some page cache. A previous case was auxiliary
    path name memory associated with dentries; we have since annotated
    those allocations to avoid overcommit failures (see d79f7aa496fc ("mm:
    treat indirectly reclaimable memory as free in overcommit logic")).

    But trying to classify all allocated memory reliably as reclaimable
    and unreclaimable is a bit of a fool's errand. There could be a myriad
    of dependencies that constantly change with kernel versions.

    It becomes even more questionable of an effort when considering how
    this estimate of available memory is used: it's not compared to the
    system-wide allocated virtual memory in any way. It's not even
    compared to the allocating process's address space. It's compared to
    the single allocation request at hand!

    So we have an elaborate left-hand side of the equation that tries to
    assess the exact breathing room the system has available down to a
    page - and then compare it to an isolated allocation request with no
    additional context. We could fail an allocation of N bytes, but for
    two allocations of N/2 bytes we'd do this elaborate dance twice in a
    row and then still let N bytes of virtual memory through. This doesn't
    make a whole lot of sense.

    Let's take a step back and look at the actual goal of the
    heuristic. From the documentation:

    Heuristic overcommit handling. Obvious overcommits of address
    space are refused. Used for a typical system. It ensures a
    seriously wild allocation fails while allowing overcommit to
    reduce swap usage. root is allowed to allocate slightly more
    memory in this mode. This is the default.

    If all we want to do is catch clearly bogus allocation requests
    irrespective of the general virtual memory situation, the physical
    memory counter-part doesn't need to be that complicated, either.

    When in GUESS mode, catch wild allocations by comparing their request
    size to total amount of ram and swap in the system.
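
    A hedged sketch of what the GUESS branch then reduces to (helper names as
    I recall them from that kernel version):

        if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
                /* A single request larger than RAM + swap is clearly bogus. */
                if (pages > totalram_pages() + total_swap_pages)
                        return -ENOMEM;
                return 0;
        }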

    Link: http://lkml.kernel.org/r/20190412191418.26333-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • All callers of arch_remove_memory() ignore errors, and we should really
    try to remove any errors from the memory removal path. No more errors are
    reported from __remove_pages(). BUG() in the s390x code in case
    arch_remove_memory() is triggered; we may implement that properly later.
    WARN in case the powerpc code fails to remove the section mapping, which
    is better than ignoring the error completely as is done right now.

    Link: http://lkml.kernel.org/r/20190409100148.24703-5-david@redhat.com
    Signed-off-by: David Hildenbrand
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Yoshinori Sato
    Cc: Rich Felker
    Cc: Dave Hansen
    Cc: Andy Lutomirski
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: "H. Peter Anvin"
    Cc: Michal Hocko
    Cc: Mike Rapoport
    Cc: Oscar Salvador
    Cc: "Kirill A. Shutemov"
    Cc: Christophe Leroy
    Cc: Stefan Agner
    Cc: Nicholas Piggin
    Cc: Pavel Tatashin
    Cc: Vasily Gorbik
    Cc: Arun KS
    Cc: Geert Uytterhoeven
    Cc: Masahiro Yamada
    Cc: Rob Herring
    Cc: Joonsoo Kim
    Cc: Wei Yang
    Cc: Qian Cai
    Cc: Mathieu Malaterre
    Cc: Andrew Banman
    Cc: Greg Kroah-Hartman
    Cc: Ingo Molnar
    Cc: Mike Travis
    Cc: Oscar Salvador
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Let's just warn in case a section is not valid instead of failing to
    remove somewhere in the middle of the process, returning an error that
    will be mostly ignored by callers.

    Link: http://lkml.kernel.org/r/20190409100148.24703-4-david@redhat.com
    Signed-off-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: David Hildenbrand
    Cc: Pavel Tatashin
    Cc: Qian Cai
    Cc: Wei Yang
    Cc: Arun KS
    Cc: Mathieu Malaterre
    Cc: Andrew Banman
    Cc: Andy Lutomirski
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Christophe Leroy
    Cc: Dave Hansen
    Cc: Fenghua Yu
    Cc: Geert Uytterhoeven
    Cc: Greg Kroah-Hartman
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Ingo Molnar
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Martin Schwidefsky
    Cc: Masahiro Yamada
    Cc: Michael Ellerman
    Cc: Mike Rapoport
    Cc: Mike Travis
    Cc: Nicholas Piggin
    Cc: Oscar Salvador
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Rich Felker
    Cc: Rob Herring
    Cc: Stefan Agner
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vasily Gorbik
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Failing while removing memory is mostly ignored and cannot really be
    handled. Let's treat errors in unregister_memory_section() in a nice way,
    warning, but continuing.

    Link: http://lkml.kernel.org/r/20190409100148.24703-3-david@redhat.com
    Signed-off-by: David Hildenbrand
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J. Wysocki"
    Cc: Ingo Molnar
    Cc: Andrew Banman
    Cc: Mike Travis
    Cc: David Hildenbrand
    Cc: Oscar Salvador
    Cc: Michal Hocko
    Cc: Pavel Tatashin
    Cc: Qian Cai
    Cc: Wei Yang
    Cc: Arun KS
    Cc: Mathieu Malaterre
    Cc: Andy Lutomirski
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Christophe Leroy
    Cc: Dave Hansen
    Cc: Fenghua Yu
    Cc: Geert Uytterhoeven
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Martin Schwidefsky
    Cc: Masahiro Yamada
    Cc: Michael Ellerman
    Cc: Mike Rapoport
    Cc: Nicholas Piggin
    Cc: Oscar Salvador
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Rich Felker
    Cc: Rob Herring
    Cc: Stefan Agner
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vasily Gorbik
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Patch series "mm/memory_hotplug: Better error handling when removing
    memory", v1.

    Error handling when removing memory is somewhat messed up right now. Some
    errors result in warnings, others are completely ignored. Memory unplug
    code can essentially not deal with errors properly as of now.
    remove_memory() will never fail.

    We have basically two choices:

    1. Allow arch_remove_memory() and friends to fail, propagating errors via
    remove_memory(). Might be problematic (e.g. DIMMs consisting of multiple
    pieces added/removed separately).

    2. Don't allow the functions to fail, handling errors in a nicer way.

    It seems like most errors that can theoretically happen are really corner
    cases and mostly theoretical (e.g. "section not valid"). However e.g.
    aborting removal of sections while all callers simply continue in case of
    errors is not nice.

    If we can guarantee that removal of memory always works (and WARN/skip in
    case of theoretical errors so we can figure out what is going on), we can
    go ahead and implement better error handling when adding memory.

    E.g. via add_memory():

    arch_add_memory()
        ret = do_stuff()
        if (ret) {
            arch_remove_memory();
            goto error;
        }

    Handling here that arch_remove_memory() might fail is basically
    impossible. So I suggest, let's avoid reporting errors while removing
    memory, warning on theoretical errors instead and continuing instead of
    aborting.

    This patch (of 4):

    __add_pages() doesn't add the memory resource, so __remove_pages()
    shouldn't remove it. Let's factor it out. Especially as it is a special
    case for memory used as system memory, added via add_memory() and friends.

    We now remove the resource after removing the sections instead of doing it
    the other way around. I don't think this change is problematic.

    add_memory()
        register memory resource
        arch_add_memory()

    remove_memory()
        arch_remove_memory()
        release memory resource

    While at it, explain why we ignore errors and that this only happens if
    we remove memory at a different granularity than it was added with.

    [david@redhat.com: fix printk warning]
    Link: http://lkml.kernel.org/r/20190417120204.6997-1-david@redhat.com
    Link: http://lkml.kernel.org/r/20190409100148.24703-2-david@redhat.com
    Signed-off-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: David Hildenbrand
    Cc: Pavel Tatashin
    Cc: Wei Yang
    Cc: Qian Cai
    Cc: Arun KS
    Cc: Mathieu Malaterre
    Cc: Andrew Banman
    Cc: Andy Lutomirski
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Christophe Leroy
    Cc: Dave Hansen
    Cc: Fenghua Yu
    Cc: Geert Uytterhoeven
    Cc: Greg Kroah-Hartman
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Ingo Molnar
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Martin Schwidefsky
    Cc: Masahiro Yamada
    Cc: Michael Ellerman
    Cc: Mike Rapoport
    Cc: Mike Travis
    Cc: Nicholas Piggin
    Cc: Oscar Salvador
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Rich Felker
    Cc: Rob Herring
    Cc: Stefan Agner
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vasily Gorbik
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Link: http://lkml.kernel.org/r/20190304155240.19215-1-ldufour@linux.ibm.com
    Signed-off-by: Laurent Dufour
    Reviewed-by: William Kucharski
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laurent Dufour
     
  • arch_add_memory() and __add_pages() take a want_memblock parameter which
    controls whether the newly added memory should get the sysfs memblock
    user API (e.g. ZONE_DEVICE users do not want/need this interface). Some
    callers even want to control where we allocate the memmap from, by
    configuring an altmap.

    Add a more generic hotplug context for arch_add_memory() and __add_pages().
    struct mhp_restrictions contains flags for additional features to be
    enabled by the memory hotplug code (currently MHP_MEMBLOCK_API) and an
    altmap for the alternative memmap allocator.

    This patch shouldn't introduce any functional change.
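
    A hedged sketch of the new context structure (field names as I recall
    them from this series):

        struct mhp_restrictions {
                unsigned long flags;            /* MHP_ flags, e.g. MHP_MEMBLOCK_API */
                struct vmem_altmap *altmap;     /* alternative memmap allocator, if any */
        };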

    [akpm@linux-foundation.org: build fix]
    Link: http://lkml.kernel.org/r/20190408082633.2864-3-osalvador@suse.de
    Signed-off-by: Michal Hocko
    Signed-off-by: Oscar Salvador
    Cc: Dan Williams
    Cc: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • check_pages_isolated_cb currently accounts the whole pfn range as being
    offlined if test_pages_isolated succeeds on the range. This is based on
    the assumption that all pages in the range are freed, which is currently
    the case in most situations, but it won't be with later changes, as pages
    marked as vmemmap won't be isolated.

    Move the offlined pages counting to offline_isolated_pages_cb and rely on
    __offline_isolated_pages to return the correct value.
    check_pages_isolated_cb will still do its primary job and check the pfn
    range.

    While we are at it, remove check_pages_isolated and offline_isolated_pages
    and directly use walk_system_ram_range, as is done in online_pages.

    Link: http://lkml.kernel.org/r/20190408082633.2864-2-osalvador@suse.de
    Reviewed-by: David Hildenbrand
    Signed-off-by: Michal Hocko
    Signed-off-by: Oscar Salvador
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Add yet another iterator, for_each_free_mem_range_in_zone_from, and then
    use it to support initializing and freeing pages in groups no larger than
    MAX_ORDER_NR_PAGES. By doing this we can greatly improve the cache
    locality of the pages while we do several loops over them in the init and
    freeing process.

    We are able to tighten the loops further as a result of the "from"
    iterator as we can perform the initial checks for first_init_pfn in our
    first call to the iterator, and continue without the need for those checks
    via the "from" iterator. I have added this functionality in the function
    called deferred_init_mem_pfn_range_in_zone that primes the iterator and
    causes us to exit if we encounter any failure.

    On my x86_64 test system with 384GB of memory per node I saw a reduction
    in initialization time from 1.85s to 1.38s as a result of this patch.

    Link: http://lkml.kernel.org/r/20190405221231.12227.85836.stgit@localhost.localdomain
    Signed-off-by: Alexander Duyck
    Reviewed-by: Pavel Tatashin
    Cc: Mike Rapoport
    Cc: Michal Hocko
    Cc: Dave Jiang
    Cc: Matthew Wilcox
    Cc: Ingo Molnar
    Cc: Khalid Aziz
    Cc: Mike Rapoport
    Cc: Vlastimil Babka
    Cc: Dan Williams
    Cc: Laurent Dufour
    Cc: Mel Gorman
    Cc: David S. Miller
    Cc: "Kirill A. Shutemov"
    Cc: Pavel Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Duyck