16 May, 2018
1 commit
-
commit 27227c733852f71008e9bf165950bb2edaed3a90 upstream.
Memory hotplug and hotremove operate with per-block granularity. If the
machine has a large amount of memory (more than 64G), the size of a
memory block can span multiple sections. By mistake, during hotremove
we set only the first section to offline state.

The bug was discovered because a kernel selftest started to fail after
commit "mm/memory_hotplug: optimize probe routine":
https://lkml.kernel.org/r/20180423011247.GK5563@yexl-desktop

But the bug is older than that commit; in that optimization we also
added a check for sections to be in a proper state during hotplug
operations, which is what exposed the bug.

Link: http://lkml.kernel.org/r/20180427145257.15222-1-pasha.tatashin@oracle.com
Fixes: 2d070eab2e82 ("mm: consider zone which is not fully populated to have holes")
Signed-off-by: Pavel Tatashin
Acked-by: Michal Hocko
Reviewed-by: Andrew Morton
Cc: Vlastimil Babka
Cc: Steven Sistare
Cc: Daniel Jordan
Cc: "Kirill A. Shutemov"
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
10 Jan, 2018
1 commit
-
commit d09cfbbfa0f761a97687828b5afb27b56cbf2e19 upstream.
In commit 83e3c48729d9 ("mm/sparsemem: Allocate mem_section at runtime
for CONFIG_SPARSEMEM_EXTREME=y") mem_section is allocated at runtime to
save memory.

It allocates the first dimension of the array with
sizeof(struct mem_section), which costs extra memory; it should be
sizeof(struct mem_section *).
Fix it.
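A minimal userspace sketch of the difference (the struct layout and root
count here are illustrative stand-ins, not the kernel's real ones): the
first dimension of mem_section is an array of pointers to root arrays,
so it must be sized by the pointer type, not the struct type.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-in for the kernel's struct mem_section. */
struct mem_section {
        unsigned long section_mem_map;
        unsigned long usage;
};

#define NR_SECTION_ROOTS 2048   /* illustrative value */

/* Correct: each slot holds a pointer to a lazily allocated root array,
 * so the first dimension is sized with sizeof(struct mem_section *).
 * The bug sized it with sizeof(struct mem_section), over-allocating. */
static struct mem_section **alloc_mem_section(void)
{
        return calloc(NR_SECTION_ROOTS, sizeof(struct mem_section *));
}
```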
Link: http://lkml.kernel.org/r/1513932498-20350-1-git-send-email-bhe@redhat.com
Fixes: 83e3c48729 ("mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y")
Signed-off-by: Baoquan He
Tested-by: Dave Young
Acked-by: Kirill A. Shutemov
Cc: Kirill A. Shutemov
Cc: Ingo Molnar
Cc: Andy Lutomirski
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Atsushi Kumagai
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
25 Dec, 2017
2 commits
-
commit 629a359bdb0e0652a8227b4ff3125431995fec6e upstream.
Since commit:
83e3c48729d9 ("mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y")
we allocate the mem_section array dynamically in sparse_memory_present_with_active_regions(),
but some architectures, like arm64, don't call the routine to initialize sparsemem.Let's move the initialization into memory_present() it should cover all
architectures.Reported-and-tested-by: Sudeep Holla
Tested-by: Bjorn Andersson
Signed-off-by: Kirill A. Shutemov
Acked-by: Will Deacon
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-mm@kvack.org
Fixes: 83e3c48729d9 ("mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y")
Link: http://lkml.kernel.org/r/20171107083337.89952-1-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar
Cc: Dan Rue
Cc: Naresh Kamboju
Signed-off-by: Greg Kroah-Hartman
-
commit 83e3c48729d9ebb7af5a31a504f3fd6aff0348c4 upstream.
Size of the mem_section[] array depends on the size of the physical address space.
In preparation for boot-time switching between paging modes on x86-64
we need to make the allocation of mem_section[] dynamic, because otherwise
we waste a lot of RAM: with CONFIG_NODE_SHIFT=10, mem_section[] size is 32kB
for 4-level paging and 2MB for 5-level paging mode.

The patch allocates the array on the first call to
sparse_memory_present_with_active_regions().
Signed-off-by: Kirill A. Shutemov
Cc: Andrew Morton
Cc: Andy Lutomirski
Cc: Borislav Petkov
Cc: Cyrill Gorcunov
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20170929140821.37654-2-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman
02 Nov, 2017
1 commit
-
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boilerplate text.

This patch is based on work done by Thomas Gleixner, Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side-by-side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
  lines of source.
- File already had some variant of a license header in it (even if
Reviewed-by: Philippe Ombredanne
Reviewed-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman
09 Sep, 2017
1 commit
-
online_mem_sections() accidentally marks online only the first section
in the given range. This is a typo which hasn't been noticed because I
haven't tested large 2GB blocks previously. All users of
pfn_to_online_page would get confused on the rest of the pfn range
in the block.

All we need to fix this is to use the iterator (pfn) rather than
start_pfn.
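The fix can be sketched in a small userspace model (the constants and
array here are hypothetical simplifications of the kernel's section
bookkeeping, not its real data structures):

```c
#include <assert.h>
#include <stdbool.h>

#define PAGES_PER_SECTION 32768UL   /* 128MB sections with 4kB pages */
#define NR_SECTIONS 32

static bool section_online[NR_SECTIONS];

/* Mark every section covered by [start_pfn, end_pfn) online. The typo
 * indexed with start_pfn inside the loop, so only the first section of
 * a multi-section (e.g. 2GB) block was ever marked; the fix indexes
 * with the loop iterator pfn. */
static void online_mem_sections(unsigned long start_pfn,
                                unsigned long end_pfn)
{
        unsigned long pfn;

        for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION)
                section_online[pfn / PAGES_PER_SECTION] = true; /* pfn, not start_pfn */
}
```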
Link: http://lkml.kernel.org/r/20170904112210.3401-1-mhocko@kernel.org
Fixes: 2d070eab2e82 ("mm: consider zone which is not fully populated to have holes")
Signed-off-by: Michal Hocko
Acked-by: Vlastimil Babka
Cc: Anshuman Khandual
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
07 Sep, 2017
1 commit
-
Commit f52407ce2dea ("memory hotplug: alloc page from other node in
memory online") has introduced N_HIGH_MEMORY checks to only use NUMA
aware allocations when there is some memory present because the
respective node might not have any memory yet at the time and so it
could fail or even OOM.

Things have changed since then, though. Zonelists are now always
initialized before we do any allocations, even for hotplug (see
959ecc48fc75 ("mm/memory_hotplug.c: fix building of node hotplug
zonelist")).

Therefore these checks are not really needed. In fact, the caller of the
allocator should never care about whether the node is populated, because
that might change at any time.

Link: http://lkml.kernel.org/r/20170721143915.14161-10-mhocko@kernel.org
Signed-off-by: Michal Hocko
Acked-by: Vlastimil Babka
Cc: Shaohua Li
Cc: Joonsoo Kim
Cc: Johannes Weiner
Cc: Mel Gorman
Cc: Toshi Kani
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
07 Jul, 2017
3 commits
-
The current memory hotplug implementation relies on having all the
struct pages associate with a zone/node during the physical hotplug
phase (arch_add_memory->__add_pages->__add_section->__add_zone). In the
vast majority of cases this means that they are added to ZONE_NORMAL.
This has been so since 9d99aaa31f59 ("[PATCH] x86_64: Support memory
hotadd without sparsemem") and it wasn't a big deal back then because
movable onlining didn't exist yet.

Much later memory hotplug wanted to (ab)use ZONE_MOVABLE for movable
onlining 511c2aba8f07 ("mm, memory-hotplug: dynamic configure movable
memory and portion memory") and then things got more complicated.
Rather than reconsidering the zone association which was no longer
needed (because the memory hotplug already depended on SPARSEMEM) a
convoluted semantic of zone shifting has been developed. Only the
currently last memblock or the one adjacent to the zone_movable can be
onlined movable. This essentially means that the online type changes as
the new memblocks are added.

Let's simulate memory hot online manually:

$ echo 0x100000000 > /sys/devices/system/memory/probe
$ grep . /sys/devices/system/memory/memory32/valid_zones
Normal Movable

$ echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe
$ grep . /sys/devices/system/memory/memory3?/valid_zones
/sys/devices/system/memory/memory32/valid_zones:Normal
/sys/devices/system/memory/memory33/valid_zones:Normal Movable

$ echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe
$ grep . /sys/devices/system/memory/memory3?/valid_zones
/sys/devices/system/memory/memory32/valid_zones:Normal
/sys/devices/system/memory/memory33/valid_zones:Normal
/sys/devices/system/memory/memory34/valid_zones:Normal Movable

$ echo online_movable > /sys/devices/system/memory/memory34/state
$ grep . /sys/devices/system/memory/memory3?/valid_zones
/sys/devices/system/memory/memory32/valid_zones:Normal
/sys/devices/system/memory/memory33/valid_zones:Normal Movable
/sys/devices/system/memory/memory34/valid_zones:Movable Normal

This is an awkward semantic because a udev event is sent as soon as the
block is onlined and a udev handler might want to online it based on
some policy (e.g. association with a node) but it will inherently race
with new blocks showing up.

This patch changes the physical online phase to not associate pages with
any zone at all. All the pages are just marked reserved and wait for
the onlining phase to be associated with the zone as per the online
request. There are only two requirements:

- existing ZONE_NORMAL and ZONE_MOVABLE cannot overlap
- ZONE_NORMAL precedes ZONE_MOVABLE in physical addresses

The latter is not an inherent requirement and can be changed in the
future; it preserves the current behavior and makes the code slightly
simpler, but is subject to change.

This means that the same physical online steps as above will lead to the
following state:

Normal Movable

/sys/devices/system/memory/memory32/valid_zones:Normal Movable
/sys/devices/system/memory/memory33/valid_zones:Normal Movable

/sys/devices/system/memory/memory32/valid_zones:Normal Movable
/sys/devices/system/memory/memory33/valid_zones:Normal Movable
/sys/devices/system/memory/memory34/valid_zones:Normal Movable

/sys/devices/system/memory/memory32/valid_zones:Normal Movable
/sys/devices/system/memory/memory33/valid_zones:Normal Movable
/sys/devices/system/memory/memory34/valid_zones:Movable

Implementation:
The current move_pfn_range is reimplemented to check the above
requirements (allow_online_pfn_range) and then updates the respective
zone (move_pfn_range_to_zone), the pgdat and links all the pages in the
pfn range with the zone/node. __add_pages is updated to not require the
zone and only initializes sections in the range. This allowed us to
simplify the arch_add_memory code (s390 could get rid of quite some of
its code).

devm_memremap_pages is the only user of arch_add_memory which relies on
the zone association, because it only hooks into memory hotplug half
way. It uses the association to attach the new memory to ZONE_DEVICE but
doesn't allow it to be {on,off}lined via sysfs. This means that this
particular code path has to call move_pfn_range_to_zone explicitly.

The original zone shifting code is kept in place and will be removed in
a follow-up patch for easier review.

Please note that this patch also changes the original behavior:
offlining a memory block adjacent to another zone (Normal vs. Movable)
used to allow changing its movable type. This will be handled later.

[richard.weiyang@gmail.com: simplify zone_intersects()]
Link: http://lkml.kernel.org/r/20170616092335.5177-1-richard.weiyang@gmail.com
[richard.weiyang@gmail.com: remove duplicate call for set_page_links]
Link: http://lkml.kernel.org/r/20170616092335.5177-2-richard.weiyang@gmail.com
[akpm@linux-foundation.org: remove unused local `i']
Link: http://lkml.kernel.org/r/20170515085827.16474-12-mhocko@kernel.org
Signed-off-by: Michal Hocko
Signed-off-by: Wei Yang
Tested-by: Dan Williams
Tested-by: Reza Arbab
Acked-by: Heiko Carstens # For s390 bits
Acked-by: Vlastimil Babka
Cc: Martin Schwidefsky
Cc: Andi Kleen
Cc: Andrea Arcangeli
Cc: Balbir Singh
Cc: Daniel Kiper
Cc: David Rientjes
Cc: Igor Mammedov
Cc: Jerome Glisse
Cc: Joonsoo Kim
Cc: Mel Gorman
Cc: Tobias Regnery
Cc: Toshi Kani
Cc: Vitaly Kuznetsov
Cc: Xishi Qiu
Cc: Yasuaki Ishimatsu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
__pageblock_pfn_to_page has two users currently, set_zone_contiguous
which checks whether the given zone contains holes and
pageblock_pfn_to_page which then carefully returns a first valid page
from the given pfn range for the given zone. This doesn't handle zones
which are not fully populated though. Memory pageblocks can be offlined
or might not have been onlined yet. In such a case the zone should be
considered to have holes, otherwise pfn walkers can touch and play with
offline pages.

Current callers of pageblock_pfn_to_page in compaction seem to work
properly right now because they only isolate PageBuddy
(isolate_freepages_block) or PageLRU resp. __PageMovable
(isolate_migratepages_block), which will always be false for these
pages. It would be safer to skip these pages altogether, though.

In order to do this, the patch adds a new memory section state
(SECTION_IS_ONLINE) which is set in memory_present (during boot time) or
in online_pages_range during memory hotplug. Similarly,
offline_mem_sections clears the bit and is called when the memory
range is offlined.

A pfn_to_online_page helper is then added which checks the mem section
and only returns a page if it is already online.

Use the new helper in __pageblock_pfn_to_page and skip the whole page
block in such a case.

[mhocko@suse.com: check valid section number in pfn_to_online_page (Vlastimil),
mark sections online after all struct pages are initialized in
online_pages_range (Vlastimil)]
Link: http://lkml.kernel.org/r/20170518164210.GD18333@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20170515085827.16474-8-mhocko@kernel.org
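The helper's logic can be sketched as follows (a simplified userspace
model; the flag value, array sizes, and struct page layout are
assumptions for illustration, not the kernel's real definitions):

```c
#include <assert.h>
#include <stddef.h>

#define PAGES_PER_SECTION 32768UL
#define NR_SECTIONS 16
#define SECTION_IS_ONLINE 0x2UL   /* assumed flag bit for this sketch */

struct page { unsigned long flags; };

static unsigned long section_flags[NR_SECTIONS];
static struct page pages[NR_SECTIONS * PAGES_PER_SECTION];

/* Return a page only for a valid, online section; otherwise NULL, so
 * pfn walkers never touch offline (or never-onlined) pages. */
static struct page *pfn_to_online_page(unsigned long pfn)
{
        unsigned long nr = pfn / PAGES_PER_SECTION;

        if (nr >= NR_SECTIONS)                        /* invalid section */
                return NULL;
        if (!(section_flags[nr] & SECTION_IS_ONLINE)) /* offline section */
                return NULL;
        return &pages[pfn];
}
```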
Signed-off-by: Michal Hocko
Acked-by: Vlastimil Babka
Cc: Andi Kleen
Cc: Andrea Arcangeli
Cc: Balbir Singh
Cc: Dan Williams
Cc: Daniel Kiper
Cc: David Rientjes
Cc: Heiko Carstens
Cc: Igor Mammedov
Cc: Jerome Glisse
Cc: Joonsoo Kim
Cc: Martin Schwidefsky
Cc: Mel Gorman
Cc: Reza Arbab
Cc: Tobias Regnery
Cc: Toshi Kani
Cc: Vitaly Kuznetsov
Cc: Xishi Qiu
Cc: Yasuaki Ishimatsu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
There are a number of times that we loop over NR_MEM_SECTIONS, looking
for section_present() on each section. But, when we have very large
physical address spaces (large MAX_PHYSMEM_BITS), NR_MEM_SECTIONS
becomes very large, making the loops quite long.

With MAX_PHYSMEM_BITS=46 and a section size of 128MB, the current loops
are 512k iterations, which we barely notice on modern hardware. But
raising MAX_PHYSMEM_BITS higher (like we will see on systems that
support 5-level paging) makes this 64x longer and we start to notice,
especially on slower systems like simulators. A 10-second delay for
512k iterations is annoying. But a 640-second delay is crippling.

This does not help if we have extremely sparse physical address spaces,
but those are quite rare. We expect that most of the "slow" systems
where this matters will also be quite small and non-sparse.

To fix this, we track the highest section we've ever encountered. This
lets us know when we will *never* see another section_present(), and
lets us break out of the loops earlier.

Doing the whole for_each_present_section_nr() macro is probably
overkill, but it will ensure that any future loop iterations that we
grow are more likely to be correct.

Kirill said: "It shaved almost 40 seconds from boot time in qemu with
5-level paging enabled for me".

Link: http://lkml.kernel.org/r/20170504174434.C45A4735@viggo.jf.intel.com
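The high-water-mark idea can be sketched as a small userspace model
(array sizes and function names are illustrative, not the kernel's
exact code):

```c
#include <assert.h>
#include <stdbool.h>

#define NR_MEM_SECTIONS (1UL << 19)   /* 512k sections, as in the example */

static bool present[NR_MEM_SECTIONS];
static unsigned long __highest_present_section_nr;

/* Record the highest section number ever marked present. */
static void section_mark_present(unsigned long nr)
{
        if (nr > __highest_present_section_nr)
                __highest_present_section_nr = nr;
        present[nr] = true;
}

/* Scans can stop at the high-water mark instead of walking all
 * NR_MEM_SECTIONS entries; beyond it, section_present() is never true. */
static unsigned long count_present_sections(void)
{
        unsigned long nr, count = 0;

        for (nr = 0; nr <= __highest_present_section_nr; nr++)
                if (present[nr])
                        count++;
        return count;
}
```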
Signed-off-by: Dave Hansen
Tested-by: Kirill A. Shutemov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
04 May, 2017
1 commit
-
The current implementation calculates usemap_size in two steps:
* calculate the number of bytes to cover these bits
* calculate the number of "unsigned long" to cover these bytes

It would be clearer to:
* calculate the number of "unsigned long" to cover these bits
* multiply it by sizeof(unsigned long)

This patch refines usemap_size() a little to make it easier to
understand.

Link: http://lkml.kernel.org/r/20170310043713.96871-1-richard.weiyang@gmail.com
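Both calculations yield the same size; a sketch of the two (the bit
count is an illustrative constant, not the kernel's actual
SECTION_BLOCKFLAGS_BITS value):

```c
#include <assert.h>

#define SECTION_BLOCKFLAGS_BITS 256UL   /* illustrative bit count */
#define BITS_PER_BYTE 8UL

/* Old: bits -> bytes, then bytes -> unsigned longs. */
static unsigned long usemap_size_old(void)
{
        unsigned long bytes =
                (SECTION_BLOCKFLAGS_BITS + BITS_PER_BYTE - 1) / BITS_PER_BYTE;
        unsigned long longs =
                (bytes + sizeof(unsigned long) - 1) / sizeof(unsigned long);
        return longs * sizeof(unsigned long);
}

/* New: bits -> unsigned longs directly, then multiply by the size. */
static unsigned long usemap_size_new(void)
{
        unsigned long bits_per_long = BITS_PER_BYTE * sizeof(unsigned long);
        unsigned long longs =
                (SECTION_BLOCKFLAGS_BITS + bits_per_long - 1) / bits_per_long;
        return longs * sizeof(unsigned long);
}
```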
Signed-off-by: Wei Yang
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
23 Feb, 2017
2 commits
-
To identify that pages of a page table are allocated from the bootmem
allocator, a magic number is set in page->lru.next.

But the page->lru list is initialized in reserve_bootmem_region(), so
when calling free_pagetable(), the function cannot find the magic number
of the pages, and free_pagetable() frees the pages by
free_reserved_page(), not put_page_bootmem().

But if the pages are allocated from the bootmem allocator and used as a
page table, the pages have the private flag set. So before freeing the
pages, we should clear the private flag via put_page_bootmem().

Before applying commit 7bfec6f47bb0 ("mm, page_alloc: check multiple
page fields with a single branch"), we could see the following visible
issue:

BUG: Bad page state in process kworker/u1024:1
page:ffffea103cfd8040 count:0 mapcount:0 mappi
flags: 0x6fffff80000800(private)
page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
bad because of flags: 0x800(private)
Call Trace:
[...] dump_stack+0x63/0x87
[...] bad_page+0x114/0x130
[...] free_pages_prepare+0x299/0x2d0
[...] free_hot_cold_page+0x31/0x150
[...] __free_pages+0x25/0x30
[...] free_pagetable+0x6f/0xb4
[...] remove_pagetable+0x379/0x7ff
[...] vmemmap_free+0x10/0x20
[...] sparse_remove_one_section+0x149/0x180
[...] __remove_pages+0x2e9/0x4f0
[...] arch_remove_memory+0x63/0xc0
[...] remove_memory+0x8c/0xc0
[...] acpi_memory_device_remove+0x79/0xa5
[...] acpi_bus_trim+0x5a/0x8d
[...] acpi_bus_trim+0x38/0x8d
[...] acpi_device_hotplug+0x1b7/0x418
[...] acpi_hotplug_work_fn+0x1e/0x29
[...] process_one_work+0x152/0x400
[...] worker_thread+0x125/0x4b0
[...] kthread+0xd8/0xf0
[...] ret_from_fork+0x22/0x40

And the issue still occurs silently.

Until the pages of a page table allocated from the bootmem allocator are
freed, page->freelist is never used. So the patch sets the magic number
in page->freelist instead of page->lru.next.

[isimatu.yasuaki@jp.fujitsu.com: fix merge issue]
Link: http://lkml.kernel.org/r/722b1cc4-93ac-dd8b-2be2-7a7e313b3b0b@gmail.com
Link: http://lkml.kernel.org/r/2c29bd9f-5b67-02d0-18a3-8828e78bbb6f@gmail.com
Signed-off-by: Yasuaki Ishimatsu
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: H. Peter Anvin
Cc: Dave Hansen
Cc: Vlastimil Babka
Cc: Mel Gorman
Cc: Xishi Qiu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
free_map_bootmem() uses page->private directly to set the
removing_section_nr argument. But to get the page->private value,
the page_private() accessor has been provided.

So free_map_bootmem() should use page_private() instead of
page->private.

Link: http://lkml.kernel.org/r/1d34eaa5-a506-8b7a-6471-490c345deef8@gmail.com
Signed-off-by: Yasuaki Ishimatsu
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: H. Peter Anvin
Cc: Dave Hansen
Cc: Vlastimil Babka
Cc: Mel Gorman
Cc: Xishi Qiu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
03 Aug, 2016
1 commit
-
There was only one use each of __initdata_refok and __exit_refok;
__init_refok was used 46 times, against 82 for __ref.
Those definitions are obsolete since commit 312b1485fb50 ("Introduce new
section reference annotations tags: __ref, __refdata, __refconst").

This patch removes the following compatibility definitions and replaces
them treewide:

/* compatibility defines */
#define __init_refok __ref
#define __initdata_refok __refdata
#define __exit_refok __ref

I can also provide separate patches if necessary.
(One patch per tree, then check in a month or two to remove the old
definitions.)

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/1466796271-3043-1-git-send-email-fabf@skynet.be
Signed-off-by: Fabian Frederick
Cc: Ingo Molnar
Cc: Sam Ravnborg
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
29 Jul, 2016
1 commit
-
When CONFIG_SPARSEMEM_EXTREME is disabled, __section_nr can get the
section number with a direct subtraction.

Link: http://lkml.kernel.org/r/1468988310-11560-1-git-send-email-zhouchengming1@huawei.com
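A sketch of the idea (simplified struct and a hypothetical section
count): without SPARSEMEM_EXTREME, mem_section is one flat, statically
sized array, so a section's number is just its offset from the array
base, with no need to search through root pointers.

```c
#include <assert.h>

/* Simplified stand-in for the kernel's struct mem_section. */
struct mem_section { unsigned long section_mem_map; };

#define NR_MEM_SECTIONS 128   /* illustrative value */

/* Flat array, as in the !CONFIG_SPARSEMEM_EXTREME case. */
static struct mem_section mem_section[NR_MEM_SECTIONS];

/* The section number is the pointer's offset from the array base. */
static unsigned long __section_nr(struct mem_section *ms)
{
        return (unsigned long)(ms - mem_section);
}
```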
Signed-off-by: Zhou Chengming
Cc: Dave Hansen
Cc: Tejun Heo
Cc: Hanjun Guo
Cc: Li Bin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
18 Mar, 2016
2 commits
-
Most of the mm subsystem uses pr_<level> so make it consistent.
Miscellanea:
- Realign arguments
- Add missing newline to format
- kmemleak-test.c has a "kmemleak: " prefix added to the
"Kmemleak testing" logging message via pr_fmt

Signed-off-by: Joe Perches
Acked-by: Tejun Heo [percpu]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Kernel style prefers a single string over split strings when the string
is 'user-visible'.

Miscellanea:
- Add a missing newline
- Realign arguments

Signed-off-by: Joe Perches
Acked-by: Tejun Heo [percpu]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
16 Jan, 2016
1 commit
-
In support of providing struct page for large persistent memory
capacities, use struct vmem_altmap to change the default policy for
allocating memory for the memmap array. The default vmemmap_populate()
allocates page table storage area from the page allocator. Given
persistent memory capacities relative to DRAM it may not be feasible to
store the memmap in 'System Memory'. Instead vmem_altmap represents
pre-allocated "device pages" to satisfy vmemmap_alloc_block_buf()
requests.

Signed-off-by: Dan Williams
Reported-by: kbuild test robot
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Dave Hansen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
08 Apr, 2014
1 commit
-
To increase compiler portability there is <linux/compiler.h>, which
provides convenience macros for various gcc constructs, e.g. __weak for
__attribute__((weak)). I've replaced all instances of gcc attributes
with the right macro in the memory management (/mm) subsystem.

[akpm@linux-foundation.org: while-we're-there consistency tweaks]
Signed-off-by: Gideon Israel Dsouza
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
02 Apr, 2014
1 commit
-
retmain -> remain
Signed-off-by: Li Zhong
Signed-off-by: Jiri Kosina
22 Jan, 2014
1 commit
-
Switch to memblock interfaces for the early memory allocator instead of
the bootmem allocator. No functional change in behavior from the
bootmem users' point of view.

Archs already converted to NO_BOOTMEM now directly use memblock
interfaces instead of bootmem wrappers built on top of memblock. And for
the archs which still use bootmem, these new APIs just fall back to the
existing bootmem APIs.

Signed-off-by: Santosh Shilimkar
Cc: "Rafael J. Wysocki"
Cc: Arnd Bergmann
Cc: Christoph Lameter
Cc: Greg Kroah-Hartman
Cc: Grygorii Strashko
Cc: H. Peter Anvin
Cc: Johannes Weiner
Cc: KAMEZAWA Hiroyuki
Cc: Konrad Rzeszutek Wilk
Cc: Michal Hocko
Cc: Paul Walmsley
Cc: Pavel Machek
Cc: Russell King
Cc: Tejun Heo
Cc: Tony Lindgren
Cc: Yinghai Lu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
13 Nov, 2013
2 commits
-
We pass the number of pages which hold page structs of a memory section
to free_map_bootmem(). This is right when !CONFIG_SPARSEMEM_VMEMMAP but
wrong when CONFIG_SPARSEMEM_VMEMMAP. When CONFIG_SPARSEMEM_VMEMMAP, we
should pass the number of pages of a memory section to free_map_bootmem.

So the fix is to remove the nr_pages parameter. When
CONFIG_SPARSEMEM_VMEMMAP, we directly use the predefined macro
PAGES_PER_SECTION in free_map_bootmem. When !CONFIG_SPARSEMEM_VMEMMAP,
we calculate the number of pages needed to hold the page structs for a
memory section and use that value in free_map_bootmem().

This was found by reading the code, and I have no machine that supports
memory hot-remove to test the bug on now.

Signed-off-by: Zhang Yanfei
Reviewed-by: Wanpeng Li
Cc: Wen Congyang
Cc: Tang Chen
Cc: Toshi Kani
Cc: Yasuaki Ishimatsu
Cc: Yinghai Lu
Cc: Yasunori Goto
Cc: Andy Whitcroft
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
The functions below are always invoked to operate on one memory section:
- sparse_add_one_section()
- kmalloc_section_memmap()
- __kmalloc_section_memmap()
- __kfree_section_memmap()

So it is redundant to always pass a nr_pages parameter, which is the
number of pages in one section. We can directly use the predefined macro
PAGES_PER_SECTION instead of passing the parameter.

Signed-off-by: Zhang Yanfei
Cc: Wen Congyang
Cc: Tang Chen
Cc: Toshi Kani
Cc: Yasuaki Ishimatsu
Cc: Yinghai Lu
Cc: Yasunori Goto
Cc: Andy Whitcroft
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
12 Sep, 2013
1 commit
-
After commit 9bdac9142407 ("sparsemem: Put mem map for one node
together."), vmemmap for one node will be allocated together, its logic
is similar to the memory allocation for pageblock flags. This patch
introduces alloc_usemap_and_memmap to extract the common logic of memory
allocation for pageblock flags and vmemmap.

Signed-off-by: Wanpeng Li
Cc: Dave Hansen
Cc: Rik van Riel
Cc: Fengguang Wu
Cc: Joonsoo Kim
Cc: Johannes Weiner
Cc: Tejun Heo
Cc: Yasuaki Ishimatsu
Cc: David Rientjes
Cc: KOSAKI Motohiro
Cc: Jiri Kosina
Cc: Yinghai Lu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
10 Jul, 2013
1 commit
-
With CONFIG_MEMORY_HOTREMOVE unset, there is a compile warning:

mm/sparse.c:755: warning: `clear_hwpoisoned_pages' defined but not used

Bisecting it ended up pointing to 4edd7ceff ("mm, hotplug: avoid
compiling memory hotremove functions when disabled").

This is because the commit above put sparse_remove_one_section() within
the protection of CONFIG_MEMORY_HOTREMOVE, but the only user of
clear_hwpoisoned_pages() is sparse_remove_one_section(), and it is not
within the protection of CONFIG_MEMORY_HOTREMOVE.

So putting clear_hwpoisoned_pages within CONFIG_MEMORY_HOTREMOVE should
fix the warning.

Signed-off-by: Zhang Yanfei
Cc: David Rientjes
Acked-by: Toshi Kani
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
05 Jul, 2013
1 commit
-
Pull trivial tree updates from Jiri Kosina:
"The usual stuff from trivial tree"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
treewide: relase -> release
Documentation/cgroups/memory.txt: fix stat file documentation
sysctl/net.txt: delete reference to obsolete 2.4.x kernel
spinlock_api_smp.h: fix preprocessor comments
treewide: Fix typo in printk
doc: device tree: clarify stuff in usage-model.txt.
open firmware: "/aliasas" -> "/aliases"
md: bcache: Fixed a typo with the word 'arithmetic'
irq/generic-chip: fix a few kernel-doc entries
frv: Convert use of typedef ctl_table to struct ctl_table
sgi: xpc: Convert use of typedef ctl_table to struct ctl_table
doc: clk: Fix incorrect wording
Documentation/arm/IXP4xx fix a typo
Documentation/networking/ieee802154 fix a typo
Documentation/DocBook/media/v4l fix a typo
Documentation/video4linux/si476x.txt fix a typo
Documentation/virtual/kvm/api.txt fix a typo
Documentation/early-userspace/README fix a typo
Documentation/video4linux/soc-camera.txt fix a typo
lguest: fix CONFIG_PAE -> CONFIG_x86_PAE in comment
...
04 Jul, 2013
1 commit
-
Instead of leaving a hidden trap for the next person who comes along and
wants to add something to mem_section, add a big fat warning about it
needing to be a power-of-2, and insert a BUILD_BUG_ON() in sparse_init()
to catch mistakes.

Right now non-power-of-2 mem_sections cause a number of WARNs at boot
(which don't clearly point to the size of mem_section as the issue), but
the system limps on (temporarily, at least).

This is based upon Dave Hansen's earlier RFC where he ran into the same
issue:
"sparsemem: fix boot when SECTIONS_PER_ROOT is not power-of-2"
http://lkml.indiana.edu/hypermail/linux/kernel/1205.2/03077.html

Signed-off-by: Cody P Schafer
Acked-by: Dave Hansen
Cc: Jiang Liu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
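The power-of-2 requirement above can be sketched in plain C (the struct
layout is an illustrative stand-in; the kernel uses its own BUILD_BUG_ON
macro rather than C11 _Static_assert):

```c
#include <assert.h>

/* Stand-in for struct mem_section: two words, so 16 bytes on 64-bit,
 * which is a power of 2 as required. */
struct mem_section {
        unsigned long section_mem_map;
        unsigned long usage;
};

/* Runtime form of the check. */
static int is_power_of_2(unsigned long n)
{
        return n != 0 && (n & (n - 1)) == 0;
}

/* Compile-time form, mirroring the BUILD_BUG_ON added to sparse_init():
 * SECTIONS_PER_ROOT is derived by dividing a page by this struct's
 * size, so a non-power-of-2 size breaks the root/index arithmetic. */
_Static_assert((sizeof(struct mem_section) &
                (sizeof(struct mem_section) - 1)) == 0,
               "mem_section size must be a power of 2");
```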
28 May, 2013
1 commit
-
The ret variable is not used in the function, so remove it and
directly return 0 at the end of the function.

Signed-off-by: Zhang Yanfei
Signed-off-by: Jiri Kosina
30 Apr, 2013
2 commits
-
__remove_pages() is only necessary for CONFIG_MEMORY_HOTREMOVE. PowerPC
pseries will return -EOPNOTSUPP if unsupported.

Adding an #ifdef causes several other functions it depends on to also
become unnecessary, which saves in .text when disabled (it's disabled in
most defconfigs besides powerpc, including x86). remove_memory_block()
becomes static since it is not referenced outside of
drivers/base/memory.c.

Build tested on x86 and powerpc with CONFIG_MEMORY_HOTREMOVE both
enabled and disabled.

Signed-off-by: David Rientjes
Acked-by: Toshi Kani
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: Greg Kroah-Hartman
Cc: Wen Congyang
Cc: Tang Chen
Cc: Yasuaki Ishimatsu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
The sparse code, when asking the architecture to populate the vmemmap,
specifies the section range as a starting page and a number of pages.

This is an awkward interface, because none of the arch-specific code
actually thinks of the range in terms of 'struct page' units and always
translates it to bytes first.

In addition, later patches mix huge page and regular page backing for
the vmemmap. For this, they need to call vmemmap_populate_basepages()
on sub-section ranges with PAGE_SIZE and PMD_SIZE in mind. But these
are not necessarily multiples of the 'struct page' size and so this unit
is too coarse.

Just translate the section range into bytes once in the generic sparse
code, then pass byte ranges down the stack.

Signed-off-by: Johannes Weiner
Cc: Ben Hutchings
Cc: Bernhard Schmidt
Cc: Johannes Weiner
Cc: Russell King
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Cc: Benjamin Herrenschmidt
Cc: "Luck, Tony"
Cc: Heiko Carstens
Acked-by: David S. Miller
Tested-by: David S. Miller
Cc: Wu Fengguang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
24 Feb, 2013
4 commits
-
Since MCE is an x86 concept, and this code is in mm/, it would be better
to use the name num_poisoned_pages instead of mce_bad_pages.

[akpm@linux-foundation.org: fix mm/sparse.c]
Signed-off-by: Xishi Qiu
Signed-off-by: Jiang Liu
Suggested-by: Borislav Petkov
Reviewed-by: Wanpeng Li
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
The usemap could also be allocated as compound pages, so we should also
consider compound pages when freeing the memmap.

If we don't fix it, there could be problems when we free vmemmap
pagetables which are stored in compound pages. The old pagetables will
not be freed properly, and when we add the memory again, no new
pagetable will be created. The old pagetable entry is then reused, and
the kernel will panic.

The call trace looks like the following:

BUG: unable to handle kernel paging request at ffffea0040000000
BUG: unable to handle kernel paging request at ffffea0040000000
IP: [] sparse_add_one_section+0xef/0x166
PGD 7ff7d4067 PUD 78e035067 PMD 78e11d067 PTE 0
Oops: 0002 [#1] SMP
Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle iptable_filter ip_tables bridge stp llc sunrpc binfmt_misc dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel microcode pcspkr sg lpc_ich mfd_core i2c_i801 i2c_core i7core_edac edac_core ioatdma e1000e igb dca ptp pps_core sd_mod crc_t10dif megaraid_sas mptsas mptscsih mptbase scsi_transport_sas scsi_mod
CPU 0
Pid: 4, comm: kworker/0:0 Tainted: G W 3.8.0-rc3-phy-hot-remove+ #3 FUJITSU-SV PRIMEQUEST 1800E/SB
RIP: 0010:[] [] sparse_add_one_section+0xef/0x166
RSP: 0018:ffff8807bdcb35d8 EFLAGS: 00010006
RAX: 0000000000000000 RBX: 0000000000000200 RCX: 0000000000200000
RDX: ffff88078df01148 RSI: 0000000000000282 RDI: ffffea0040000000
RBP: ffff8807bdcb3618 R08: 4cf05005b019467a R09: 0cd98fa09631467a
R10: 0000000000000000 R11: 0000000000030e20 R12: 0000000000008000
R13: ffffea0040000000 R14: ffff88078df66248 R15: ffff88078ea13b10
FS: 0000000000000000(0000) GS:ffff8807c1a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffffea0040000000 CR3: 0000000001c0c000 CR4: 00000000000007f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kworker/0:0 (pid: 4, threadinfo ffff8807bdcb2000, task ffff8807bde18000)
Call Trace:
__add_pages+0x85/0x120
arch_add_memory+0x71/0xf0
add_memory+0xd6/0x1f0
acpi_memory_device_add+0x170/0x20c
acpi_device_probe+0x50/0x18a
really_probe+0x6c/0x320
driver_probe_device+0x47/0xa0
__device_attach+0x53/0x60
bus_for_each_drv+0x6c/0xa0
device_attach+0xa8/0xc0
bus_probe_device+0xb0/0xe0
device_add+0x301/0x570
device_register+0x1e/0x30
acpi_device_register+0x1d8/0x27c
acpi_add_single_object+0x1df/0x2b9
acpi_bus_check_add+0x112/0x18f
acpi_ns_walk_namespace+0x105/0x255
acpi_walk_namespace+0xcf/0x118
acpi_bus_scan+0x5b/0x7c
acpi_bus_add+0x2a/0x2c
container_notify_cb+0x112/0x1a9
acpi_ev_notify_dispatch+0x46/0x61
acpi_os_execute_deferred+0x27/0x34
process_one_work+0x20e/0x5c0
worker_thread+0x12e/0x370
kthread+0xee/0x100
ret_from_fork+0x7c/0xb0
Code: 00 00 48 89 df 48 89 45 c8 e8 3e 71 b1 ff 48 89 c2 48 8b 75 c8 b8 ef ff ff ff f6 02 01 75 4b 49 63 cc 31 c0 4c 89 ef 48 c1 e1 06 aa 48 8b 02 48 83 c8 01 48 85 d2 48 89 02 74 29 a8 01 74 25
RIP [] sparse_add_one_section+0xef/0x166
RSP
CR2: ffffea0040000000
---[ end trace e7f94e3a34c442d4 ]---
Kernel panic - not syncing: Fatal exception

Signed-off-by: Wen Congyang
Signed-off-by: Tang Chen
Cc: Jiang Liu
Cc: Jianguo Wu
Cc: Kamezawa Hiroyuki
Cc: Lai Jiangshan
Cc: Yasuaki Ishimatsu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
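The compound-page check described in the usemap commit above can be sketched in userspace; `fake_page` and its fields are hypothetical stand-ins for `struct page`:

```c
/* Hypothetical stand-in for struct page: a compound allocation records
 * its order on the head page. When freeing the usemap we must free the
 * whole 2^order block, not just a single page. */
struct fake_page {
        int is_compound_head; /* nonzero for a compound head page */
        int compound_order;   /* valid only on the head page */
};

static unsigned long pages_to_free(const struct fake_page *pg)
{
        if (pg->is_compound_head)
                return 1UL << pg->compound_order; /* whole compound block */
        return 1; /* ordinary single page */
}
```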
Introduce a new API, vmemmap_free(), to free and remove vmemmap
pagetables. Since pagetable implementations differ, each architecture
has to provide its own version of vmemmap_free(), just like
vmemmap_populate().

Note: vmemmap_free() is not implemented for ia64, ppc, s390, and sparc.

[mhocko@suse.cz: fix implicit declaration of remove_pagetable]
Signed-off-by: Yasuaki Ishimatsu
Signed-off-by: Jianguo Wu
Signed-off-by: Wen Congyang
Signed-off-by: Tang Chen
Cc: KOSAKI Motohiro
Cc: Jiang Liu
Cc: Kamezawa Hiroyuki
Cc: Lai Jiangshan
Cc: Wu Jianguo
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Signed-off-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
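The per-architecture hook pattern that the vmemmap_free() commit above introduces can be sketched with a function pointer standing in for the arch override; the names and signature here are illustrative:

```c
/* Each architecture supplies its own teardown, mirroring
 * vmemmap_populate(); a function pointer models the arch override. */
typedef void (*vmemmap_free_fn)(unsigned long start, unsigned long end);

static unsigned long freed_bytes; /* stand-in for freed pagetable memory */

static void arch_vmemmap_free(unsigned long start, unsigned long end)
{
        freed_bytes += end - start; /* tear down pagetables for [start, end) */
}

static void remove_vmemmap_range(vmemmap_free_fn free_fn,
                                 unsigned long start, unsigned long end)
{
        free_fn(start, end); /* generic code calls into the arch hook */
}
```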
In __remove_section(), we took pgdat_resize_lock when calling
sparse_remove_one_section(). This lock disables irqs. But we don't
need to hold it across the whole function: if we do some work to free
pagetables in free_section_usemap(), we need to call flush_tlb_all(),
which requires irqs to be enabled. Otherwise the WARN_ON_ONCE() in
smp_call_function_many() will be triggered.

If we lock the whole of sparse_remove_one_section(), we get this call
trace:
------------[ cut here ]------------
WARNING: at kernel/smp.c:461 smp_call_function_many+0xbd/0x260()
Hardware name: PRIMEQUEST 1800E
......
Call Trace:
smp_call_function_many+0xbd/0x260
smp_call_function+0x3b/0x50
on_each_cpu+0x3b/0xc0
flush_tlb_all+0x1c/0x20
remove_pagetable+0x14e/0x1d0
vmemmap_free+0x18/0x20
sparse_remove_one_section+0xf7/0x100
__remove_section+0xa2/0xb0
__remove_pages+0xa0/0xd0
arch_remove_memory+0x6b/0xc0
remove_memory+0xb8/0xf0
acpi_memory_device_remove+0x53/0x96
acpi_device_remove+0x90/0xb2
__device_release_driver+0x7c/0xf0
device_release_driver+0x2f/0x50
acpi_bus_remove+0x32/0x6d
acpi_bus_trim+0x91/0x102
acpi_bus_hot_remove_device+0x88/0x16b
acpi_os_execute_deferred+0x27/0x34
process_one_work+0x20e/0x5c0
worker_thread+0x12e/0x370
kthread+0xee/0x100
ret_from_fork+0x7c/0xb0
---[ end trace 25e85300f542aa01 ]---

Signed-off-by: Tang Chen
Signed-off-by: Lai Jiangshan
Signed-off-by: Wen Congyang
Acked-by: KAMEZAWA Hiroyuki
Cc: KOSAKI Motohiro
Cc: Jiang Liu
Cc: Jianguo Wu
Cc: Wu Jianguo
Cc: Yasuaki Ishimatsu
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
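The locking fix above can be sketched with a flag modelling the irq state (all names are illustrative): hold the irq-disabling lock only around the pointer manipulation, and flush the TLB after the lock is dropped.

```c
static int irqs_off; /* models the irq state toggled by the lock */

static void resize_lock_irq(void)   { irqs_off = 1; }
static void resize_unlock_irq(void) { irqs_off = 0; }

/* flush_tlb_all() must run with irqs enabled; fail otherwise, mirroring
 * the WARN_ON_ONCE() in smp_call_function_many(). */
static int flush_tlb_all_sketch(void)
{
        return irqs_off ? -1 : 0;
}

static int sparse_remove_one_section_sketch(void)
{
        resize_lock_irq();
        /* detach the section's mem_map/usemap pointers here */
        resize_unlock_irq();
        /* free pagetables and flush only after irqs are back on */
        return flush_tlb_all_sketch();
}
```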
12 Dec, 2012
2 commits
-
If sparse memory vmemmap is enabled, we can't free the memory used to
store struct page when a memory device is hot-removed, because that
memory may hold struct pages managing memory that doesn't belong to
this device. When we hot-add this memory device again, we reuse the
memory to store struct page, and those struct pages may contain
obsolete information, giving a bad-page state:

init_memory_mapping: [mem 0x80000000-0x9fffffff]
Built 2 zonelists in Node order, mobility grouping on. Total pages: 547617
Policy zone: Normal
BUG: Bad page state in process bash pfn:9b6dc
page:ffffea0002200020 count:0 mapcount:0 mapping: (null) index:0xfdfdfdfdfdfdfdfd
page flags: 0x2fdfdfdfd5df9fd(locked|referenced|uptodate|dirty|lru|active|slab|owner_priv_1|private|private_2|writeback|head|tail|swapcache|reclaim|swapbacked|unevictable|uncached|compound_lock)
Modules linked in: netconsole acpiphp pci_hotplug acpi_memhotplug loop kvm_amd kvm microcode tpm_tis tpm tpm_bios evdev psmouse serio_raw i2c_piix4 i2c_core parport_pc parport processor button thermal_sys ext3 jbd mbcache sg sr_mod cdrom ata_generic virtio_net ata_piix virtio_blk libata virtio_pci virtio_ring virtio scsi_mod
Pid: 988, comm: bash Not tainted 3.6.0-rc7-guest #12
Call Trace:
[] ? bad_page+0xb0/0x100
[] ? free_pages_prepare+0xb3/0x100
[] ? free_hot_cold_page+0x48/0x1a0
[] ? online_pages_range+0x68/0xa0
[] ? __online_page_increment_counters+0x10/0x10
[] ? walk_system_ram_range+0x101/0x110
[] ? online_pages+0x1a5/0x2b0
[] ? __memory_block_change_state+0x20d/0x270
[] ? store_mem_state+0xb6/0xf0
[] ? sysfs_write_file+0xd2/0x160
[] ? vfs_write+0xaa/0x160
[] ? sys_write+0x47/0x90
[] ? async_page_fault+0x25/0x30
[] ? system_call_fastpath+0x16/0x1b
Disabling lock debugging due to kernel taint

This patch clears the memory used to store struct page to avoid this
unexpected error.
Signed-off-by: Wen Congyang
Cc: David Rientjes
Cc: Jiang Liu
Cc: Minchan Kim
Acked-by: KOSAKI Motohiro
Cc: Yasuaki Ishimatsu
Reported-by: Vasilis Liaskovitis
Cc: Dave Hansen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
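The fix in the bad-page-state commit above amounts to clearing the reused range; a minimal userspace sketch, with a buffer standing in for the vmemmap range:

```c
#include <string.h>

/* Sketch: before a hot-added range is reused to store struct pages,
 * clear it so stale contents from a previous hot-remove (the 0xfd
 * poison visible in the oops above) cannot leak into the new memmap. */
static void init_reused_memmap(unsigned char *memmap, size_t size)
{
        memset(memmap, 0, size); /* wipe obsolete struct page data */
}
```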
When we hot-remove a memory device, we free the memory used to store
struct page. If a page is hwpoisoned, we should decrease mce_bad_pages
accordingly.

[akpm@linux-foundation.org: cleanup ifdefs]
Signed-off-by: Wen Congyang
Cc: David Rientjes
Cc: Jiang Liu
Cc: Len Brown
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: Christoph Lameter
Cc: Minchan Kim
Cc: KOSAKI Motohiro
Cc: Yasuaki Ishimatsu
Cc: Dave Hansen
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
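The bookkeeping the commit above adds can be sketched in a few lines; the names are stand-ins for the kernel counters:

```c
static long poisoned_pages = 1; /* stand-in for mce_bad_pages */

/* When freeing the memmap of a hwpoisoned page on hot-remove, the
 * global counter must be decremented to stay accurate. */
static void free_page_sketch(int hwpoisoned)
{
        if (hwpoisoned)
                poisoned_pages--;
}
```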
01 Dec, 2012
1 commit
-
With CONFIG_DEBUG_VIRTUAL and CONFIG_SPARSEMEM_VMEMMAP enabled, memory
hot-remove hits a kernel BUG at arch/x86/mm/physaddr.c:20.

It is caused by the virt_to_page() call in free_section_usemap():
virt_to_page() is only valid for kernel direct-mapping addresses, but
sparse-vmemmap uses vmemmap addresses, so it goes wrong here.

------------[ cut here ]------------
kernel BUG at arch/x86/mm/physaddr.c:20!
invalid opcode: 0000 [#1] SMP
Modules linked in: acpihp_drv acpihp_slot edd cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse vfat fat loop dm_mod coretemp kvm crc32c_intel ipv6 ixgbe igb iTCO_wdt i7core_edac edac_core pcspkr iTCO_vendor_support ioatdma microcode joydev sr_mod i2c_i801 dca lpc_ich mfd_core mdio tpm_tis i2c_core hid_generic tpm cdrom sg tpm_bios rtc_cmos button ext3 jbd mbcache usbhid hid uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif processor thermal_sys hwmon scsi_dh_alua scsi_dh_hp_sw scsi_dh_rdac scsi_dh_emc scsi_dh ata_generic ata_piix libata megaraid_sas scsi_mod
CPU 39
Pid: 6454, comm: sh Not tainted 3.7.0-rc1-acpihp-final+ #45 QCI QSSC-S4R/QSSC-S4R
RIP: 0010:[] [] __phys_addr+0x88/0x90
RSP: 0018:ffff8804440d7c08 EFLAGS: 00010006
RAX: 0000000000000006 RBX: ffffea0012000000 RCX: 000000000000002c
...

Signed-off-by: Jianguo Wu
Signed-off-by: Jiang Liu
Reviewed-by: Wen Congyang
Acked-by: Johannes Weiner
Reviewed-by: Yasuaki Ishimatsu
Reviewed-by: Michal Hocko
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
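The region distinction behind the bug above can be sketched as an address-range check; the bounds follow the x86-64 layout but are illustrative here:

```c
/* Sketch: virt_to_page() is only valid for direct-mapping addresses,
 * so the free path must first check which region an address is in.
 * The vmemmap region on x86-64 starts at 0xffffea0000000000. */
#define VMEMMAP_START 0xffffea0000000000UL
#define VMEMMAP_END   0xffffeaffffffffffUL

static int addr_is_vmemmap(unsigned long addr)
{
        return addr >= VMEMMAP_START && addr <= VMEMMAP_END;
}
```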
01 Aug, 2012
3 commits
-
sparse_index_init() uses the index_init_lock spinlock to protect root
mem_section assignment. The lock is no longer necessary because the
function is called only during boot (during paging init, which is
executed from a single CPU) and from the hotplug code (by add_memory()
via arch_add_memory()), which uses mem_hotplug_mutex.

The lock was introduced by 28ae55c9 ("sparsemem extreme: hotplug
preparation"), when sparse_index_init() was used only during boot.
Later, when the hotplug code (and add_memory()) was introduced, there
was no synchronization, so it was probably possible to online more
sections from the same root (though I am not 100% sure about that).
The first synchronization was added by 6ad696d2 ("mm: allow memory
hotplug and hibernation in the same kernel"), which was later replaced
by the mem_hotplug_mutex - 20d6c96b ("mem-hotplug: introduce
{un}lock_memory_hotplug()").

Let's remove the lock, as it is not needed and it makes the code more
confusing.

[mhocko@suse.cz: changelog]
Signed-off-by: Gavin Shan
Reviewed-by: Michal Hocko
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
__section_nr() was implemented to retrieve the memory section number
corresponding to a given descriptor. It is possible that the specified
descriptor doesn't exist in the global array, so add a check for that
and report an error in such a case.

Signed-off-by: Gavin Shan
Acked-by: David Rientjes
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
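The check added by the __section_nr() commit above can be sketched against a flattened global array; the layout and names are simplified for illustration:

```c
/* Sketch: derive a section number from its descriptor pointer,
 * returning -1 when the descriptor is not inside the global array. */
#define NR_SECTIONS 8

struct mem_section_sketch { unsigned long section_mem_map; };

static struct mem_section_sketch sections[NR_SECTIONS];

static long section_nr_sketch(const struct mem_section_sketch *ms)
{
        if (ms < &sections[0] || ms >= &sections[NR_SECTIONS])
                return -1; /* not a descriptor from the global array */
        return ms - &sections[0];
}
```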
With CONFIG_SPARSEMEM_EXTREME, the two levels of memory section
descriptors are allocated from slab or bootmem. When allocating, let
the slab/bootmem allocator clear the memory chunk; we needn't clear it
explicitly.

Signed-off-by: Gavin Shan
Reviewed-by: Michal Hocko
Acked-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
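The cleanup above amounts to requesting zeroed memory from the allocator rather than clearing it by hand; here calloc stands in for kzalloc/alloc_bootmem in a userspace sketch:

```c
#include <stdlib.h>

/* Sketch: ask the allocator for zeroed memory instead of allocating
 * and then clearing the chunk explicitly. */
static void *alloc_section_roots(size_t nr, size_t size)
{
        return calloc(nr, size); /* allocator zeroes the chunk for us */
}
```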