18 Aug, 2018
10 commits
-
Rename new_sparse_init() to sparse_init(), which enables it. Delete the old
sparse_init() and all the code that became obsolete with it.
[pasha.tatashin@oracle.com: remove unused sparse_mem_maps_populate_node()]
Link: http://lkml.kernel.org/r/20180716174447.14529-6-pasha.tatashin@oracle.com
Link: http://lkml.kernel.org/r/20180712203730.8703-6-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin
Tested-by: Michael Ellerman [powerpc]
Tested-by: Oscar Salvador
Reviewed-by: Oscar Salvador
Cc: Pasha Tatashin
Cc: Abdul Haleem
Cc: Baoquan He
Cc: Daniel Jordan
Cc: Dan Williams
Cc: Dave Hansen
Cc: David Rientjes
Cc: Greg Kroah-Hartman
Cc: Ingo Molnar
Cc: Jan Kara
Cc: Jérôme Glisse
Cc: "Kirill A. Shutemov"
Cc: Michal Hocko
Cc: Souptick Joarder
Cc: Steven Sistare
Cc: Vlastimil Babka
Cc: Wei Yang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
sparse_init() requires temporarily allocating two large buffers: usemap_map
and map_map. Baoquan He identified that these buffers are so large that
Linux is not bootable on small-memory machines, such as a kdump boot. The
buffers are especially large when CONFIG_X86_5LEVEL is set, as they are
scaled to the maximum physical memory size.
Baoquan provided a fix which reduces the size of these buffers, but it is
much better to get rid of them entirely.
Add a new way to initialize sparse memory: sparse_init_nid(), which operates
only within one memory node and thus allocates memory either in one large
contiguous block or section by section. This eliminates the need for the
temporary buffers.
For simplified bisecting and review, temporarily call sparse_init()
new_sparse_init(); the new interface will be enabled, and the old code
removed, in the next patch.
Link: http://lkml.kernel.org/r/20180712203730.8703-5-pasha.tatashin@oracle.com
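A rough sketch of the per-node flow described above (illustrative only; helper
names such as first_present_section_nr() and sparse_early_nid() are assumed
from this patch series, not quoted from it):

    unsigned long pnum_begin = first_present_section_nr();
    int nid_begin = sparse_early_nid(__nr_to_section(pnum_begin));
    unsigned long pnum_end, map_count = 1;

    for_each_present_section_nr(pnum_begin + 1, pnum_end) {
            int nid = sparse_early_nid(__nr_to_section(pnum_end));

            if (nid == nid_begin) {
                    map_count++;
                    continue;
            }
            /* sections [pnum_begin, pnum_end) all belong to nid_begin */
            sparse_init_nid(nid_begin, pnum_begin, pnum_end, map_count);
            nid_begin = nid;
            pnum_begin = pnum_end;
            map_count = 1;
    }
    /* cover the sections of the last node */
    sparse_init_nid(nid_begin, pnum_begin,
                    __highest_present_section_nr + 1, map_count);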
Signed-off-by: Pavel Tatashin
Reviewed-by: Oscar Salvador
Tested-by: Oscar Salvador
Tested-by: Michael Ellerman [powerpc]
Cc: Pasha Tatashin
Cc: Abdul Haleem
Cc: Baoquan He
Cc: Daniel Jordan
Cc: Dan Williams
Cc: Dave Hansen
Cc: David Rientjes
Cc: Greg Kroah-Hartman
Cc: Ingo Molnar
Cc: Jan Kara
Cc: Jérôme Glisse
Cc: "Kirill A. Shutemov"
Cc: Michal Hocko
Cc: Souptick Joarder
Cc: Steven Sistare
Cc: Vlastimil Babka
Cc: Wei Yang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Now that both variants of sparse memory use the same buffers to populate
the memory map, we can move sparse_buffer_init()/sparse_buffer_fini() to
the common place.
Link: http://lkml.kernel.org/r/20180712203730.8703-4-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin
Tested-by: Michael Ellerman [powerpc]
Tested-by: Oscar Salvador
Reviewed-by: Andrew Morton
Cc: Pasha Tatashin
Cc: Abdul Haleem
Cc: Baoquan He
Cc: Daniel Jordan
Cc: Dan Williams
Cc: Dave Hansen
Cc: David Rientjes
Cc: Greg Kroah-Hartman
Cc: Ingo Molnar
Cc: Jan Kara
Cc: Jérôme Glisse
Cc: "Kirill A. Shutemov"
Cc: Michal Hocko
Cc: Souptick Joarder
Cc: Steven Sistare
Cc: Vlastimil Babka
Cc: Wei Yang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Non-vmemmap sparse also allocates a large contiguous chunk of memory and,
if that fails, falls back to smaller allocations. Use the same functions
to allocate the buffer as vmemmap-sparse does.
Link: http://lkml.kernel.org/r/20180712203730.8703-3-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin
Tested-by: Michael Ellerman [powerpc]
Reviewed-by: Oscar Salvador
Tested-by: Oscar Salvador
Cc: Pasha Tatashin
Cc: Abdul Haleem
Cc: Baoquan He
Cc: Daniel Jordan
Cc: Dan Williams
Cc: Dave Hansen
Cc: David Rientjes
Cc: Greg Kroah-Hartman
Cc: Ingo Molnar
Cc: Jan Kara
Cc: Jérôme Glisse
Cc: "Kirill A. Shutemov"
Cc: Michal Hocko
Cc: Souptick Joarder
Cc: Steven Sistare
Cc: Vlastimil Babka
Cc: Wei Yang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Patch series "sparse_init rewrite", v6.
In sparse_init() we allocate two large buffers to temporarily hold the usemap
and memmap for the whole machine. However, we can avoid doing that if we
change sparse_init() to operate on a per-node basis instead of doing it for
the whole machine beforehand.
As shown by Baoquan
http://lkml.kernel.org/r/20180628062857.29658-1-bhe@redhat.com
the buffers are large enough to prevent the machine from booting on
small-memory systems.
Another benefit of these changes is that they also obsolete
CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER.
This patch (of 5):
When struct pages are allocated for the sparse-vmemmap VA layout, we first
try to allocate one large buffer and then, if that fails, allocate struct
pages for each section as we go.
The code that allocates the buffer uses global variables and is spread
across several call sites.
Clean up the code by introducing three functions to handle the global
buffer (see the sketch after the fixup notes below):
  sparse_buffer_init()  initialize the buffer
  sparse_buffer_fini()  free the remaining part of the buffer
  sparse_buffer_alloc() allocate from the buffer; if the buffer is empty,
                        return NULL
Define these functions in sparse.c instead of sparse-vmemmap.c because
later we will use them for non-vmemmap sparse allocations as well.
[akpm@linux-foundation.org: use PTR_ALIGN()]
[akpm@linux-foundation.org: s/BUG_ON/WARN_ON/]
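Roughly, the three helpers look like this (an illustrative sketch based on the
changelog above, not necessarily the exact mainline code):

    static void *sparsemap_buf;      /* current position in the buffer */
    static void *sparsemap_buf_end;  /* one past the end of the buffer */

    static void __init sparse_buffer_init(unsigned long size, int nid)
    {
            /* try one large allocation up front; callers fall back per section */
            sparsemap_buf = memblock_virt_alloc_try_nid_raw(size, PAGE_SIZE,
                                    __pa(MAX_DMA_ADDRESS),
                                    BOOTMEM_ALLOC_ACCESSIBLE, nid);
            sparsemap_buf_end = sparsemap_buf + size;
    }

    static void __init sparse_buffer_fini(void)
    {
            unsigned long size = sparsemap_buf_end - sparsemap_buf;

            /* return whatever part of the buffer was not handed out */
            if (sparsemap_buf && size > 0)
                    memblock_free_early(__pa(sparsemap_buf), size);
            sparsemap_buf = NULL;
    }

    void * __meminit sparse_buffer_alloc(unsigned long size)
    {
            void *ptr = NULL;

            if (sparsemap_buf) {
                    ptr = PTR_ALIGN(sparsemap_buf, size);
                    if (ptr + size > sparsemap_buf_end)
                            ptr = NULL;     /* buffer exhausted, caller allocates */
                    else
                            sparsemap_buf = ptr + size;
            }
            return ptr;
    }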
Link: http://lkml.kernel.org/r/20180712203730.8703-2-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin
Tested-by: Michael Ellerman [powerpc]
Reviewed-by: Oscar Salvador
Tested-by: Oscar Salvador
Cc: Pasha Tatashin
Cc: Steven Sistare
Cc: Daniel Jordan
Cc: "Kirill A. Shutemov"
Cc: Michal Hocko
Cc: Dan Williams
Cc: Jan Kara
Cc: Jérôme Glisse
Cc: Souptick Joarder
Cc: Baoquan He
Cc: Greg Kroah-Hartman
Cc: Vlastimil Babka
Cc: Wei Yang
Cc: Dave Hansen
Cc: David Rientjes
Cc: Ingo Molnar
Cc: Abdul Haleem
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
In sparse_init(), two temporary pointer arrays, usemap_map and map_map,
are allocated with the size of NR_MEM_SECTIONS. They are used to store
each memory section's usemap and mem map if it is marked as present. With
the help of these two arrays, a continuous memory chunk is allocated for
the usemap and memmap of the memory sections on one node. This avoids too
much memory fragmentation. In the diagram below, '1' indicates a present
memory section and '0' an absent one. The number 'n' can be much smaller
than NR_MEM_SECTIONS on most systems.

  |1|1|1|1|0|0|0|0|1|1|0|0|...|1|0||1|0|...|1||0|1|...|0|
  -------------------------------------------------------
   0 1 2 3 4 5          i i+1               n-1 n

If we fail to populate the page tables to map one section's memmap, its
->section_mem_map will be cleared finally to indicate that it's not
present. After use, these two arrays will be released at the end of
sparse_init().
In 4-level paging mode, each array costs 4M, which is negligible. In
5-level paging mode, they cost 256M each, 512M altogether. A kdump kernel
usually reserves very little memory, e.g. 256M. So, even though they are
only temporarily allocated, this is still not acceptable.
In fact, there's no need to allocate them with the size of
NR_MEM_SECTIONS. Since the ->section_mem_map clearing has been deferred
to the end, the number of present memory sections is kept the same during
sparse_init() until we finally clear out a memory section's
->section_mem_map if its usemap or memmap is not correctly handled. Thus,
whenever a for_each_present_section_nr() loop is taken in the middle, the
i-th present memory section is always the same one.
Here, only allocate usemap_map and map_map with the size of
'nr_present_sections'. For the i-th present memory section, install its
usemap and memmap to usemap_map[i] and map_map[i] during allocation.
Then, in the last for_each_present_section_nr() loop, which clears the
failed memory section's ->section_mem_map, fetch the usemap and memmap
from the usemap_map[] and map_map[] arrays and set them into mem_section[]
accordingly.
[akpm@linux-foundation.org: coding-style fixes]
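A simplified sketch of the idea (illustrative, not the literal patch;
error/warning handling is omitted):

    unsigned long **usemap_map;
    struct page **map_map;
    unsigned long pnum;
    int i = 0;

    /* size by present sections, not by NR_MEM_SECTIONS */
    usemap_map = memblock_virt_alloc(sizeof(usemap_map[0]) * nr_present_sections, 0);
    map_map    = memblock_virt_alloc(sizeof(map_map[0]) * nr_present_sections, 0);

    /* ... usemap_map[i]/map_map[i] are filled for the i-th present section ... */

    for_each_present_section_nr(0, pnum) {
            struct mem_section *ms = __nr_to_section(pnum);

            if (!usemap_map[i] || !map_map[i])
                    ms->section_mem_map = 0;  /* allocation failed: un-present it */
            else
                    sparse_init_one_section(ms, pnum, map_map[i], usemap_map[i]);
            i++;
    }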
Link: http://lkml.kernel.org/r/20180628062857.29658-5-bhe@redhat.com
Signed-off-by: Baoquan He
Reviewed-by: Pavel Tatashin
Cc: Pasha Tatashin
Cc: Dave Hansen
Cc: Kirill A. Shutemov
Cc: Oscar Salvador
Cc: Pankaj Gupta
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
It's used to pass the size of a map data unit into
alloc_usemap_and_memmap(), and is preparation for the next patch.
Link: http://lkml.kernel.org/r/20180228032657.32385-4-bhe@redhat.com
Signed-off-by: Baoquan He
Reviewed-by: Pavel Tatashin
Reviewed-by: Oscar Salvador
Cc: Pasha Tatashin
Cc: Kirill A. Shutemov
Cc: Pankaj Gupta
Cc: Dave Hansen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
In sparse_init(), if CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER=y, the system
will allocate one continuous memory chunk for the mem maps on one node and
populate the relevant page tables to map the memory sections one by one.
If populating a certain mem section fails, a warning is printed and its
->section_mem_map is cleared to cancel its marking as present. As a
result, the number of mem sections marked as present can shrink during
sparse_init() execution.
Here, just defer the ms->section_mem_map clearing, if populating its page
tables failed, until the last for_each_present_section_nr() loop. This is
in preparation for later optimizing the mem map allocation.
[akpm@linux-foundation.org: remove now-unused local `ms', per Oscar]
Link: http://lkml.kernel.org/r/20180228032657.32385-3-bhe@redhat.com
Signed-off-by: Baoquan He
Acked-by: Dave Hansen
Reviewed-by: Pavel Tatashin
Reviewed-by: Oscar Salvador
Cc: Pasha Tatashin
Cc: Kirill A. Shutemov
Cc: Pankaj Gupta
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Patch series "mm/sparse: Optimize memmap allocation during
sparse_init()", v6.In sparse_init(), two temporary pointer arrays, usemap_map and map_map
are allocated with the size of NR_MEM_SECTIONS. They are used to store
each memory section's usemap and mem map if marked as present. In
5-level paging mode, this will cost 512M memory though they will be
released at the end of sparse_init(). System with few memory, like
kdump kernel which usually only has about 256M, will fail to boot
because of allocation failure if CONFIG_X86_5LEVEL=y.In this patchset, optimize the memmap allocation code to only use
usemap_map and map_map with the size of nr_present_sections. This makes
kdump kernel boot up with normal crashkernel='' setting when
CONFIG_X86_5LEVEL=y.This patch (of 5):
nr_present_sections is used to record how many memory sections are
marked as present during system boot up, and will be used in the later
patch.Link: http://lkml.kernel.org/r/20180228032657.32385-2-bhe@redhat.com
Signed-off-by: Baoquan He
Acked-by: Dave Hansen
Reviewed-by: Andrew Morton
Reviewed-by: Pavel Tatashin
Reviewed-by: Oscar Salvador
Cc: Pasha Tatashin
Cc: Kirill A. Shutemov
Cc: Pankaj Gupta
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
sparse_init_one_section() is being called from two sites: sparse_init()
and sparse_add_one_section(). The former calls it from a
for_each_present_section_nr() loop, and the latter marks the section as
present before calling it. This means that when
sparse_init_one_section() gets called, we already know that the section
is present. So there is no point in double-checking that in the function.
This removes the check and makes the function void.
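A minimal sketch of the resulting function (illustrative; it mirrors
mm/sparse.c as described, but is not a verbatim quote):

    static void __meminit sparse_init_one_section(struct mem_section *ms,
                    unsigned long pnum, struct page *mem_map,
                    unsigned long *pageblock_bitmap)
    {
            /* callers guarantee the section is already marked present */
            ms->section_mem_map &= ~SECTION_MAP_MASK;
            ms->section_mem_map |= sparse_encode_mem_map(mem_map, pnum) |
                                   SECTION_HAS_MEM_MAP;
            ms->pageblock_flags = pageblock_bitmap;
    }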
[ross.zwisler@linux.intel.com: fix error path in sparse_add_one_section]
Link: http://lkml.kernel.org/r/20180706190658.6873-1-ross.zwisler@linux.intel.com
[ross.zwisler@linux.intel.com: simplification suggested by Oscar]
Link: http://lkml.kernel.org/r/20180706223358.742-1-ross.zwisler@linux.intel.com
Link: http://lkml.kernel.org/r/20180702154325.12196-1-osalvador@techadventures.net
Signed-off-by: Oscar Salvador
Acked-by: Michal Hocko
Reviewed-by: Pavel Tatashin
Reviewed-by: Andrew Morton
Cc: Pasha Tatashin
Cc: Oscar Salvador
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
08 Jun, 2018
2 commits
-
In commit c4e1be9ec113 ("mm, sparsemem: break out of loops early")
__highest_present_section_nr is introduced to reduce the loop count for
present sections. This is also helpful for usemap and memmap allocation.
This patch uses __highest_present_section_nr + 1 to optimize the loop.
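A before/after sketch of the kind of loop this changes (illustrative only):

    /* before: walk every possible section number */
    for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
            if (present_section_nr(pnum))
                    map_count++;
    }

    /* after: stop right past the highest section ever marked present */
    for (pnum = 0; pnum < __highest_present_section_nr + 1; pnum++) {
            if (present_section_nr(pnum))
                    map_count++;
    }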
Link: http://lkml.kernel.org/r/20180326081956.75275-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang
Reviewed-by: Andrew Morton
Cc: David Rientjes
Cc: Dave Hansen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
When searching for a present section, there are two boundaries:
  * __highest_present_section_nr
  * NR_MEM_SECTIONS
and it is known that __highest_present_section_nr is a stricter boundary
than NR_MEM_SECTIONS. This means it is only necessary to check
__highest_present_section_nr.
Link: http://lkml.kernel.org/r/20180326081956.75275-2-richard.weiyang@gmail.com
Signed-off-by: Wei Yang
Acked-by: David Rientjes
Reviewed-by: Andrew Morton
Cc: Dave Hansen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
12 May, 2018
1 commit
-
Memory hotplug and hotremove operate with per-block granularity. If the
machine has a large amount of memory (more than 64G), the size of a memory
block can span multiple sections. By mistake, during hotremove we set only
the first section to the offline state.
The bug was discovered because a kernel selftest started to fail after
commit "mm/memory_hotplug: optimize probe routine":
https://lkml.kernel.org/r/20180423011247.GK5563@yexl-desktop
But the bug is older than this commit. That optimization also added a
check for sections to be in a proper state during hotplug operations.
Link: http://lkml.kernel.org/r/20180427145257.15222-1-pasha.tatashin@oracle.com
Fixes: 2d070eab2e82 ("mm: consider zone which is not fully populated to have holes")
Signed-off-by: Pavel Tatashin
Acked-by: Michal Hocko
Reviewed-by: Andrew Morton
Cc: Vlastimil Babka
Cc: Steven Sistare
Cc: Daniel Jordan
Cc: "Kirill A. Shutemov"
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
06 Apr, 2018
1 commit
-
During memory hotplugging we traverse struct pages three times:
1. memset(0) in sparse_add_one_section()
2. loop in __add_section() to set set_page_node(page, nid) and
   SetPageReserved(page)
3. loop in memmap_init_zone() to call __init_single_pfn()
This patch removes the first two loops and leaves only loop 3. All struct
pages are initialized in one place, the same as is done during boot.
The benefits:
- We improve memory hotplug performance because we are not evicting the
  cache several times, and we also reduce loop branching overhead.
- Remove a condition from the hotpath in __init_single_pfn() that was
  added in order to fix the problem reported by Bharata in the above
  email thread, thus also improving performance during normal boot.
- Make memory hotplug more similar to the boot memory initialization
  path, because we zero and initialize struct pages in only one function.
- Simplify the memory hotplug struct page initialization code, and thus
  enable future improvements, such as multi-threading the initialization
  of struct pages in order to improve hotplug performance even further on
  larger machines.
[pasha.tatashin@oracle.com: v5]
Link: http://lkml.kernel.org/r/20180228030308.1116-7-pasha.tatashin@oracle.com
Link: http://lkml.kernel.org/r/20180215165920.8570-7-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin
Reviewed-by: Ingo Molnar
Cc: Michal Hocko
Cc: Baoquan He
Cc: Bharata B Rao
Cc: Daniel Jordan
Cc: Dan Williams
Cc: Greg Kroah-Hartman
Cc: "H. Peter Anvin"
Cc: Kirill A. Shutemov
Cc: Mel Gorman
Cc: Steven Sistare
Cc: Thomas Gleixner
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
03 Apr, 2018
1 commit
-
Pull removal of obsolete architecture ports from Arnd Bergmann:
"This removes the entire architecture code for blackfin, cris, frv,
m32r, metag, mn10300, score, and tile, including the associated device
drivers.
I have been working with the (former) maintainers for each one to
ensure that my interpretation was right and the code is definitely
unused in mainline kernels. Many had fond memories of working on the
respective ports to start with and getting them included in upstream,
but also saw no point in keeping the port alive without any users.
In the end, it seems that while the eight architectures are extremely
different, they all suffered the same fate: There was one company in
charge of an SoC line, a CPU microarchitecture and a software
ecosystem, which was more costly than licensing newer off-the-shelf
CPU cores from a third party (typically ARM, MIPS, or RISC-V). It
seems that all the SoC product lines are still around, but have not
used the custom CPU architectures for several years at this point. In
contrast, CPU instruction sets that remain popular and have actively
maintained kernel ports tend to all be used across multiple licensees.
[ See the new nds32 port merged in the previous commit for the next
generation of "one company in charge of an SoC line, a CPU
microarchitecture and a software ecosystem" - Linus ]
The removal came out of a discussion that is now documented at
https://lwn.net/Articles/748074/. Unlike the original plans, I'm not
marking any ports as deprecated but remove them all at once after I
made sure that they are all unused. Some architectures (notably tile,
mn10300, and blackfin) are still being shipped in products with old
kernels, but those products will never be updated to newer kernel
releases.
After this series, we still have a few architectures without mainline
gcc support:
- unicore32 and hexagon both have very outdated gcc releases, but the
maintainers promised to work on providing something newer. At least
in case of hexagon, this will only be llvm, not gcc.
- openrisc, risc-v and nds32 are still in the process of finishing
their support or getting it added to mainline gcc in the first
place. They all have patched gcc-7.3 ports that work to some
degree, but complete upstream support won't happen before gcc-8.1.
Csky posted their first kernel patch set last week, their situation
will be similar.
[ Palmer Dabbelt points out that RISC-V support is in mainline gcc
since gcc-7, although gcc-7.3.0 is the recommended minimum - Linus ]"
This really says it all:
2498 files changed, 95 insertions(+), 467668 deletions(-)
* tag 'arch-removal' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic: (74 commits)
MAINTAINERS: UNICORE32: Change email account
staging: iio: remove iio-trig-bfin-timer driver
tty: hvc: remove tile driver
tty: remove bfin_jtag_comm and hvc_bfin_jtag drivers
serial: remove tile uart driver
serial: remove m32r_sio driver
serial: remove blackfin drivers
serial: remove cris/etrax uart drivers
usb: Remove Blackfin references in USB support
usb: isp1362: remove blackfin arch glue
usb: musb: remove blackfin port
usb: host: remove tilegx platform glue
pwm: remove pwm-bfin driver
i2c: remove bfin-twi driver
spi: remove blackfin related host drivers
watchdog: remove bfin_wdt driver
can: remove bfin_can driver
mmc: remove bfin_sdh driver
input: misc: remove blackfin rotary driver
input: keyboard: remove bf54x driver
...
27 Mar, 2018
1 commit
-
node_memmap_size_bytes() has been unused since the v3.9 kernel, so remove it.
Signed-off-by: David Rientjes
Cc: Dave Hansen
Cc: Laura Abbott
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-mm@kvack.org
Fixes: f03574f2d5b2 ("x86-32, mm: Rip out x86_32 NUMA remapping code")
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803262325540.256524@chino.kir.corp.google.com
Signed-off-by: Ingo Molnar
16 Mar, 2018
1 commit
-
Tile was the only remaining architecture to implement alloc_remap(),
and since that is being removed, there is no point in keeping this
function.
Removing all callers simplifies the mem_map handling.
Reviewed-by: Pavel Tatashin
Signed-off-by: Arnd Bergmann
07 Feb, 2018
1 commit
-
Pull libnvdimm updates from Ross Zwisler:
- Require struct page by default for filesystem DAX to remove a number
  of surprising failure cases. This includes failures with direct I/O,
  gdb and fork(2).
- Add support for the new Platform Capabilities Structure added to the
  NFIT in ACPI 6.2a. This new table tells us whether the platform
  supports flushing of CPU and memory controller caches on unexpected
  power loss events.
- Revamp vmem_altmap and dev_pagemap handling to clean up code and
  better support future PCI P2P uses.
- Deprecate the ND_IOCTL_SMART_THRESHOLD command whose payload has
  become out-of-sync with recent versions of the NVDIMM_FAMILY_INTEL
  spec, and instead rely on the generic ND_CMD_CALL approach used by
  the two other IOCTL families, NVDIMM_FAMILY_{HPE,MSFT}.
- Enhance nfit_test so we can test some of the new things added in
  version 1.6 of the DSM specification. This includes testing firmware
  download and simulating the Last Shutdown State (LSS) status.
* tag 'libnvdimm-for-4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (37 commits)
libnvdimm, namespace: remove redundant initialization of 'nd_mapping'
acpi, nfit: fix register dimm error handling
libnvdimm, namespace: make min namespace size 4K
tools/testing/nvdimm: force nfit_test to depend on instrumented modules
libnvdimm/nfit_test: adding support for unit testing enable LSS status
libnvdimm/nfit_test: add firmware download emulation
nfit-test: Add platform cap support from ACPI 6.2a to test
libnvdimm: expose platform persistence attribute for nd_region
acpi: nfit: add persistent memory control flag for nd_region
acpi: nfit: Add support for detect platform CPU cache flush on power loss
device-dax: Fix trailing semicolon
libnvdimm, btt: fix uninitialized err_lock
dax: require 'struct page' by default for filesystem dax
ext2: auto disable dax instead of failing mount
ext4: auto disable dax instead of failing mount
mm, dax: introduce pfn_t_special()
mm: Fix devm_memremap_pages() collision handling
mm: Fix memory size alignment in devm_memremap_pages_release()
memremap: merge find_dev_pagemap into get_dev_pagemap
memremap: change devm_memremap_pages interface to use struct dev_pagemap
...
03 Feb, 2018
1 commit
01 Feb, 2018
1 commit
-
The comment is confusing. On the one hand, it refers to 32-bit alignment
(struct page alignment on 32-bit platforms), but this would only guarantee
that the 2 lowest bits must be zero. On the other hand, it claims that at
least 3 bits are available, and 3 bits are actually used.
This is not broken, because there is a stronger alignment guarantee, just
a less obvious one. Let's fix the comment to make it clear how many bits
are available and why.
Although memmap arrays are allocated in various places, the resulting
pointer is encoded eventually, so I am adding a BUG_ON() here to enforce
at runtime that all expected bits are indeed available.
I have also added a BUILD_BUG_ON to check that PFN_SECTION_SHIFT is
sufficient, because this part of the calculation can easily be checked at
build time.
[ptesarik@suse.com: v2]
Link: http://lkml.kernel.org/r/20180125100516.589ea6af@ezekiel.suse.cz
Link: http://lkml.kernel.org/r/20180119080908.3a662e6f@ezekiel.suse.cz
Signed-off-by: Petr Tesarik
Acked-by: Michal Hocko
Cc: Vlastimil Babka
Cc: Mel Gorman
Cc: Johannes Weiner
Cc: Kemi Wang
Cc: YASUAKI ISHIMATSU
Cc: Andrey Ryabinin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
09 Jan, 2018
2 commits
-
We can just pass this on instead of having to do a radix tree lookup
without proper locking a few levels into the callchain.
Signed-off-by: Christoph Hellwig
Signed-off-by: Dan Williams
-
We can just pass this on instead of having to do a radix tree lookup
without proper locking a few levels into the callchain.
Signed-off-by: Christoph Hellwig
Signed-off-by: Dan Williams
05 Jan, 2018
1 commit
-
In commit 83e3c48729d9 ("mm/sparsemem: Allocate mem_section at runtime
for CONFIG_SPARSEMEM_EXTREME=y"), mem_section is allocated at runtime to
save memory.
However, it allocates the first dimension of the array with
sizeof(struct mem_section), which costs extra memory; it should be
sizeof(struct mem_section *).
Fix it.
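A minimal sketch of the corrected allocation (illustrative; based on the
surrounding boot-time code, not a verbatim quote):

    #ifdef CONFIG_SPARSEMEM_EXTREME
            if (unlikely(!mem_section)) {
                    unsigned long size, align;

                    /* the first dimension holds pointers, one per section root */
                    size  = sizeof(struct mem_section *) * NR_SECTION_ROOTS;
                    align = 1 << (INTERNODE_CACHE_SHIFT);
                    mem_section = memblock_virt_alloc(size, align);
            }
    #endif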
Link: http://lkml.kernel.org/r/1513932498-20350-1-git-send-email-bhe@redhat.com
Fixes: 83e3c48729 ("mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y")
Signed-off-by: Baoquan He
Tested-by: Dave Young
Acked-by: Kirill A. Shutemov
Cc: Kirill A. Shutemov
Cc: Ingo Molnar
Cc: Andy Lutomirski
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Atsushi Kumagai
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
16 Nov, 2017
1 commit
-
vmemmap_alloc_block() will no longer zero the block, so zero memory at
its call sites for everything except struct pages. Struct page memory is
zeroed by struct page initialization.
Replace the allocators in sparse-vmemmap with the non-zeroing version.
This way, we get the performance improvement by zeroing the memory in
parallel when struct pages are zeroed.
Add struct page zeroing as part of the initialization of the other fields
in __init_single_page().
Single-thread performance collected on an Intel(R) Xeon(R) CPU E7-8895 v3
@ 2.60GHz with 1T of memory (268400646 pages in 8 nodes):

                    BASE            FIX
  sparse_init       11.244671836s   0.007199623s
  zone_sizes_init    4.879775891s   8.355182299s
                    --------------------------
  Total             16.124447727s   8.362381922s

sparse_init is where the memory for struct pages is zeroed; this zeroing
is moved later in this patch into __init_single_page(), which is called
from zone_sizes_init().
[akpm@linux-foundation.org: make vmemmap_alloc_block_zero() private to sparse-vmemmap.c]
Link: http://lkml.kernel.org/r/20171013173214.27300-10-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin
Reviewed-by: Steven Sistare
Reviewed-by: Daniel Jordan
Reviewed-by: Bob Picco
Tested-by: Bob Picco
Acked-by: Michal Hocko
Cc: Alexander Potapenko
Cc: Andrey Ryabinin
Cc: Ard Biesheuvel
Cc: Catalin Marinas
Cc: Christian Borntraeger
Cc: David S. Miller
Cc: Dmitry Vyukov
Cc: Heiko Carstens
Cc: "H. Peter Anvin"
Cc: Ingo Molnar
Cc: Mark Rutland
Cc: Matthew Wilcox
Cc: Mel Gorman
Cc: Michal Hocko
Cc: Sam Ravnborg
Cc: Thomas Gleixner
Cc: Will Deacon
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
10 Nov, 2017
1 commit
-
Most of x86/mm is already in x86/asm, so merge the rest too.
Signed-off-by: Ingo Molnar
07 Nov, 2017
2 commits
-
Since commit:
  83e3c48729d9 ("mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y")
we allocate the mem_section array dynamically in
sparse_memory_present_with_active_regions(), but some architectures, like
arm64, don't call that routine to initialize sparsemem.
Let's move the initialization into memory_present(); it should cover all
architectures.
Reported-and-tested-by: Sudeep Holla
Tested-by: Bjorn Andersson
Signed-off-by: Kirill A. Shutemov
Acked-by: Will Deacon
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-mm@kvack.org
Fixes: 83e3c48729d9 ("mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y")
Link: http://lkml.kernel.org/r/20171107083337.89952-1-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar
-
Conflicts:
  arch/x86/kernel/cpu/Makefile
Signed-off-by: Ingo Molnar
02 Nov, 2017
1 commit
-
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default, all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boilerplate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX license identifier to apply to a file
was done in a spreadsheet of side-by-side results from the output of two
independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if
Reviewed-by: Philippe Ombredanne
Reviewed-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman
20 Oct, 2017
1 commit
-
Size of the mem_section[] array depends on the size of the physical address space.
In preparation for boot-time switching between paging modes on x86-64
we need to make the allocation of mem_section[] dynamic, because otherwise
we waste a lot of RAM: with CONFIG_NODE_SHIFT=10, mem_section[] size is 32kB
for 4-level paging and 2MB for 5-level paging mode.
The patch allocates the array on the first call to sparse_memory_present_with_active_regions().
Signed-off-by: Kirill A. Shutemov
Cc: Andrew Morton
Cc: Andy Lutomirski
Cc: Borislav Petkov
Cc: Cyrill Gorcunov
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20170929140821.37654-2-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar
09 Sep, 2017
1 commit
-
online_mem_sections() accidentally marks online only the first section in
the given range. This is a typo which hasn't been noticed because I hadn't
previously tested large 2GB blocks. All users of pfn_to_online_page()
would get confused about the rest of the pfn range in the block.
All we need to fix this is to use the iterator (pfn) rather than start_pfn.
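A minimal sketch of the fixed loop (illustrative; the loop structure is
assumed from the description):

    void online_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
    {
            unsigned long pfn;

            for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
                    /* use the iterator, not start_pfn, to pick the section */
                    unsigned long section_nr = pfn_to_section_nr(pfn);
                    struct mem_section *ms = __nr_to_section(section_nr);

                    ms->section_mem_map |= SECTION_IS_ONLINE;
            }
    }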
Link: http://lkml.kernel.org/r/20170904112210.3401-1-mhocko@kernel.org
Fixes: 2d070eab2e82 ("mm: consider zone which is not fully populated to have holes")
Signed-off-by: Michal Hocko
Acked-by: Vlastimil Babka
Cc: Anshuman Khandual
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
07 Sep, 2017
1 commit
-
Commit f52407ce2dea ("memory hotplug: alloc page from other node in
memory online") introduced N_HIGH_MEMORY checks to only use NUMA-aware
allocations when there is some memory present, because the respective
node might not have any memory yet at the time and so the allocation
could fail or even OOM.
Things have changed since then, though. Zonelists are now always
initialized before we do any allocations, even for hotplug (see
959ecc48fc75 ("mm/memory_hotplug.c: fix building of node hotplug
zonelist")).
Therefore these checks are not really needed. In fact, the caller of the
allocator should never care about whether the node is populated, because
that might change at any time.
Link: http://lkml.kernel.org/r/20170721143915.14161-10-mhocko@kernel.org
Signed-off-by: Michal Hocko
Acked-by: Vlastimil Babka
Cc: Shaohua Li
Cc: Joonsoo Kim
Cc: Johannes Weiner
Cc: Mel Gorman
Cc: Toshi Kani
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
07 Jul, 2017
3 commits
-
The current memory hotplug implementation relies on having all the
struct pages associate with a zone/node during the physical hotplug
phase (arch_add_memory->__add_pages->__add_section->__add_zone). In the
vast majority of cases this means that they are added to ZONE_NORMAL.
This has been so since 9d99aaa31f59 ("[PATCH] x86_64: Support memory
hotadd without sparsemem") and it wasn't a big deal back then because
movable onlining didn't exist yet.
Much later memory hotplug wanted to (ab)use ZONE_MOVABLE for movable
onlining 511c2aba8f07 ("mm, memory-hotplug: dynamic configure movable
memory and portion memory") and then things got more complicated.
Rather than reconsidering the zone association which was no longer
needed (because the memory hotplug already depended on SPARSEMEM) a
convoluted semantic of zone shifting has been developed. Only the
currently last memblock or the one adjacent to the zone_movable can be
onlined movable. This essentially means that the online type changes as
the new memblocks are added.
Let's simulate memory hot online manually:

$ echo 0x100000000 > /sys/devices/system/memory/probe
$ grep . /sys/devices/system/memory/memory32/valid_zones
Normal Movable

$ echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe
$ grep . /sys/devices/system/memory/memory3?/valid_zones
/sys/devices/system/memory/memory32/valid_zones:Normal
/sys/devices/system/memory/memory33/valid_zones:Normal Movable

$ echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe
$ grep . /sys/devices/system/memory/memory3?/valid_zones
/sys/devices/system/memory/memory32/valid_zones:Normal
/sys/devices/system/memory/memory33/valid_zones:Normal
/sys/devices/system/memory/memory34/valid_zones:Normal Movable

$ echo online_movable > /sys/devices/system/memory/memory34/state
$ grep . /sys/devices/system/memory/memory3?/valid_zones
/sys/devices/system/memory/memory32/valid_zones:Normal
/sys/devices/system/memory/memory33/valid_zones:Normal Movable
/sys/devices/system/memory/memory34/valid_zones:Movable Normal

This is an awkward semantic because an udev event is sent as soon as the
block is onlined and an udev handler might want to online it based on
some policy (e.g. association with a node) but it will inherently race
with new blocks showing up.
This patch changes the physical online phase to not associate pages with
any zone at all. All the pages are just marked reserved and wait for
the onlining phase to be associated with the zone as per the online
request. There are only two requirements:
- existing ZONE_NORMAL and ZONE_MOVABLE cannot overlap
- ZONE_NORMAL precedes ZONE_MOVABLE in physical addresses
The latter is not an inherent requirement and can be changed in the
future. It preserves the current behavior and makes the code slightly
simpler. This is subject to change in the future.
This means that the same physical online steps as above will lead to the
following state:

Normal Movable

/sys/devices/system/memory/memory32/valid_zones:Normal Movable
/sys/devices/system/memory/memory33/valid_zones:Normal Movable

/sys/devices/system/memory/memory32/valid_zones:Normal Movable
/sys/devices/system/memory/memory33/valid_zones:Normal Movable
/sys/devices/system/memory/memory34/valid_zones:Normal Movable

/sys/devices/system/memory/memory32/valid_zones:Normal Movable
/sys/devices/system/memory/memory33/valid_zones:Normal Movable
/sys/devices/system/memory/memory34/valid_zones:Movable

Implementation:
The current move_pfn_range is reimplemented to check the above
requirements (allow_online_pfn_range) and then updates the respective
zone (move_pfn_range_to_zone), the pgdat and links all the pages in the
pfn range with the zone/node. __add_pages is updated to not require the
zone and only initializes sections in the range. This allowed us to
simplify the arch_add_memory code (s390 could get rid of quite a bit of
code).
devm_memremap_pages is the only user of arch_add_memory which relies on
the zone association, because it only hooks into the memory hotplug half
way. It uses it to associate the new memory with ZONE_DEVICE but doesn't
allow it to be {on,off}lined via sysfs. This means that this particular
code path has to call move_pfn_range_to_zone explicitly.
The original zone shifting code is kept in place and will be removed in
the follow-up patch for an easier review.
Please note that this patch also changes the original behavior: offlining
a memory block adjacent to another zone (Normal vs. Movable) used to
allow changing its movable type. This will be handled later.
[richard.weiyang@gmail.com: simplify zone_intersects()]
Link: http://lkml.kernel.org/r/20170616092335.5177-1-richard.weiyang@gmail.com
[richard.weiyang@gmail.com: remove duplicate call for set_page_links]
Link: http://lkml.kernel.org/r/20170616092335.5177-2-richard.weiyang@gmail.com
[akpm@linux-foundation.org: remove unused local `i']
Link: http://lkml.kernel.org/r/20170515085827.16474-12-mhocko@kernel.org
Signed-off-by: Michal Hocko
Signed-off-by: Wei Yang
Tested-by: Dan Williams
Tested-by: Reza Arbab
Acked-by: Heiko Carstens # For s390 bits
Acked-by: Vlastimil Babka
Cc: Martin Schwidefsky
Cc: Andi Kleen
Cc: Andrea Arcangeli
Cc: Balbir Singh
Cc: Daniel Kiper
Cc: David Rientjes
Cc: Igor Mammedov
Cc: Jerome Glisse
Cc: Joonsoo Kim
Cc: Mel Gorman
Cc: Tobias Regnery
Cc: Toshi Kani
Cc: Vitaly Kuznetsov
Cc: Xishi Qiu
Cc: Yasuaki Ishimatsu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
__pageblock_pfn_to_page has two users currently, set_zone_contiguous
which checks whether the given zone contains holes and
pageblock_pfn_to_page which then carefully returns a first valid page
from the given pfn range for the given zone. This doesn't handle zones
which are not fully populated though. Memory pageblocks can be offlined
or might not have been onlined yet. In such a case the zone should be
considered to have holes otherwise pfn walkers can touch and play with
offline pages.
Current callers of pageblock_pfn_to_page in compaction seem to work
properly right now because they only isolate PageBuddy
(isolate_freepages_block) or PageLRU resp. __PageMovable
(isolate_migratepages_block), which will always be false for these pages.
It would be safer to skip these pages altogether, though.
In order to do this, the patch adds a new memory section state
(SECTION_IS_ONLINE) which is set in memory_present (during boot) or in
online_pages_range during memory hotplug. Similarly, offline_mem_sections
clears the bit and is called when the memory range is offlined.
A pfn_to_online_page helper is then added which checks the mem section
and only returns a page if it is onlined already (see the sketch after
the fixup notes below).
Use the new helper in __pageblock_pfn_to_page and skip the whole page
block in such a case.
[mhocko@suse.com: check valid section number in pfn_to_online_page (Vlastimil),
mark sections online after all struct pages are initialized in
online_pages_range (Vlastimil)]
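A rough sketch of the helper described above (illustrative; online_section_nr()
is assumed to test the new SECTION_IS_ONLINE bit):

    static inline struct page *pfn_to_online_page(unsigned long pfn)
    {
            unsigned long nr = pfn_to_section_nr(pfn);

            /* only hand out pages that belong to a valid, online section */
            if (nr < NR_MEM_SECTIONS && online_section_nr(nr))
                    return pfn_to_page(pfn);
            return NULL;
    }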
Link: http://lkml.kernel.org/r/20170518164210.GD18333@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20170515085827.16474-8-mhocko@kernel.org
Signed-off-by: Michal Hocko
Acked-by: Vlastimil Babka
Cc: Andi Kleen
Cc: Andrea Arcangeli
Cc: Balbir Singh
Cc: Dan Williams
Cc: Daniel Kiper
Cc: David Rientjes
Cc: Heiko Carstens
Cc: Igor Mammedov
Cc: Jerome Glisse
Cc: Joonsoo Kim
Cc: Martin Schwidefsky
Cc: Mel Gorman
Cc: Reza Arbab
Cc: Tobias Regnery
Cc: Toshi Kani
Cc: Vitaly Kuznetsov
Cc: Xishi Qiu
Cc: Yasuaki Ishimatsu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
There are a number of times that we loop over NR_MEM_SECTIONS, looking
for section_present() on each section. But, when we have very large
physical address spaces (large MAX_PHYSMEM_BITS), NR_MEM_SECTIONS
becomes very large, making the loops quite long.
With MAX_PHYSMEM_BITS=46 and a section size of 128MB, the current loops
are 512k iterations, which we barely notice on modern hardware. But
raising MAX_PHYSMEM_BITS higher (like we will see on systems that support
5-level paging) makes this 64x longer, and we start to notice, especially
on slower systems like simulators. A 10-second delay for 512k iterations
is annoying, but a 640-second delay is crippling.
This does not help if we have extremely sparse physical address spaces,
but those are quite rare. We expect that most of the "slow" systems where
this matters will also be quite small and non-sparse.
To fix this, we track the highest section we've ever encountered. This
lets us know when we will *never* see another section_present(), and lets
us break out of the loops earlier (see the sketch below).
Doing the whole for_each_present_section_nr() macro is probably overkill,
but it will ensure that any future loop iterations that we grow are more
likely to be correct.
Kirill said "It shaved almost 40 seconds from boot time in qemu with
5-level paging enabled for me".
Link: http://lkml.kernel.org/r/20170504174434.C45A4735@viggo.jf.intel.com
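A rough sketch of the tracking and the early break (illustrative only, not a
verbatim quote of the patch):

    int __highest_present_section_nr;

    static void section_mark_present(struct mem_section *ms)
    {
            int section_nr = __section_nr(ms);

            /* remember the highest section we have ever marked present */
            if (section_nr > __highest_present_section_nr)
                    __highest_present_section_nr = section_nr;

            ms->section_mem_map |= SECTION_MARKED_PRESENT;
    }

    static int next_present_section_nr(int section_nr)
    {
            do {
                    section_nr++;
                    if (present_section_nr(section_nr))
                            return section_nr;
            } while ((section_nr < NR_MEM_SECTIONS) &&
                     (section_nr <= __highest_present_section_nr));

            return -1;      /* nothing present past the highest section */
    }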
Signed-off-by: Dave Hansen
Tested-by: Kirill A. Shutemov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
04 May, 2017
1 commit
-
The current implementation calculates usemap_size in two steps:
  * calculate the number of bytes to cover these bits
  * calculate the number of "unsigned long" to cover these bytes
It would be clearer to:
  * calculate the number of "unsigned long" to cover these bits
  * multiply that by sizeof(unsigned long)
This patch refines usemap_size() a little to make it easier to understand.
Link: http://lkml.kernel.org/r/20170310043713.96871-1-richard.weiyang@gmail.com
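A before/after sketch of the calculation (illustrative; SECTION_BLOCKFLAGS_BITS
is the number of pageblock flag bits per section):

    /* before: bits -> bytes, then round bytes up to unsigned longs */
    static unsigned long usemap_size_old(void)
    {
            unsigned long size_bytes;

            size_bytes = roundup(SECTION_BLOCKFLAGS_BITS, 8) / 8;
            size_bytes = roundup(size_bytes, sizeof(unsigned long));
            return size_bytes;
    }

    /* after: bits -> unsigned longs, then multiply by their size */
    static unsigned long usemap_size_new(void)
    {
            return BITS_TO_LONGS(SECTION_BLOCKFLAGS_BITS) * sizeof(unsigned long);
    }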
Signed-off-by: Wei Yang
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
23 Feb, 2017
2 commits
-
To identify that the pages of a page table are allocated from the bootmem
allocator, a magic number is set in page->lru.next.
But the page->lru list is initialized in reserve_bootmem_region(), so when
calling free_pagetable(), the function cannot find the magic number of the
pages, and free_pagetable() frees the pages by free_reserved_page(), not
put_page_bootmem().
But if the pages are allocated from the bootmem allocator and used as a
page table, the pages have the private flag set. So before freeing the
pages, we should clear the private flag with put_page_bootmem().
Before applying commit 7bfec6f47bb0 ("mm, page_alloc: check multiple
page fields with a single branch"), we could see the following visible
issue:
BUG: Bad page state in process kworker/u1024:1
page:ffffea103cfd8040 count:0 mapcount:0 mappi
flags: 0x6fffff80000800(private)
page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
bad because of flags: 0x800(private)
Call Trace:
[...] dump_stack+0x63/0x87
[...] bad_page+0x114/0x130
[...] free_pages_prepare+0x299/0x2d0
[...] free_hot_cold_page+0x31/0x150
[...] __free_pages+0x25/0x30
[...] free_pagetable+0x6f/0xb4
[...] remove_pagetable+0x379/0x7ff
[...] vmemmap_free+0x10/0x20
[...] sparse_remove_one_section+0x149/0x180
[...] __remove_pages+0x2e9/0x4f0
[...] arch_remove_memory+0x63/0xc0
[...] remove_memory+0x8c/0xc0
[...] acpi_memory_device_remove+0x79/0xa5
[...] acpi_bus_trim+0x5a/0x8d
[...] acpi_bus_trim+0x38/0x8d
[...] acpi_device_hotplug+0x1b7/0x418
[...] acpi_hotplug_work_fn+0x1e/0x29
[...] process_one_work+0x152/0x400
[...] worker_thread+0x125/0x4b0
[...] kthread+0xd8/0xf0
[...] ret_from_fork+0x22/0x40
And the issue still silently occurs.
Until the pages of a page table allocated from the bootmem allocator are
freed, page->freelist is never used. So the patch sets the magic number
in page->freelist instead of page->lru.next.
[isimatu.yasuaki@jp.fujitsu.com: fix merge issue]
Link: http://lkml.kernel.org/r/722b1cc4-93ac-dd8b-2be2-7a7e313b3b0b@gmail.com
Link: http://lkml.kernel.org/r/2c29bd9f-5b67-02d0-18a3-8828e78bbb6f@gmail.com
Signed-off-by: Yasuaki Ishimatsu
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: H. Peter Anvin
Cc: Dave Hansen
Cc: Vlastimil Babka
Cc: Mel Gorman
Cc: Xishi Qiu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
free_map_bootmem() uses page->private directly to set the
removing_section_nr argument. But to get the page->private value,
page_private() has been provided.
So free_map_bootmem() should use page_private() instead of
page->private.
Link: http://lkml.kernel.org/r/1d34eaa5-a506-8b7a-6471-490c345deef8@gmail.com
Signed-off-by: Yasuaki Ishimatsu
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: H. Peter Anvin
Cc: Dave Hansen
Cc: Vlastimil Babka
Cc: Mel Gorman
Cc: Xishi Qiu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
03 Aug, 2016
1 commit
-
There was only one use of __initdata_refok and __exit_refok.
__init_refok was used 46 times, against 82 for __ref.
Those definitions have been obsolete since commit 312b1485fb50 ("Introduce
new section reference annotations tags: __ref, __refdata, __refconst").
This patch removes the following compatibility definitions and replaces
them treewide:
/* compatibility defines */
#define __init_refok __ref
#define __initdata_refok __refdata
#define __exit_refok __ref
I can also provide separate patches if necessary.
(One patch per tree, and check in 1 month or 2 to remove the old definitions.)
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/1466796271-3043-1-git-send-email-fabf@skynet.be
Signed-off-by: Fabian Frederick
Cc: Ingo Molnar
Cc: Sam Ravnborg
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
29 Jul, 2016
1 commit
-
When CONFIG_SPARSEMEM_EXTREME is disabled, __section_nr can get the
section number with a subtraction directly.
Link: http://lkml.kernel.org/r/1468988310-11560-1-git-send-email-zhouchengming1@huawei.com
Signed-off-by: Zhou Chengming
Cc: Dave Hansen
Cc: Tejun Heo
Cc: Hanjun Guo
Cc: Li Bin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
18 Mar, 2016
1 commit
-
Most of the mm subsystem uses pr_<level>, so make it consistent.
Miscellanea:
- Realign arguments
- Add missing newline to format
- kmemleak-test.c has a "kmemleak: " prefix added to the
"Kmemleak testing" logging message via pr_fmtSigned-off-by: Joe Perches
Acked-by: Tejun Heo [percpu]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds