09 Sep, 2015
13 commits
-
Merge second patch-bomb from Andrew Morton:
"Almost all of the rest of MM. There was an unusually large amount of
MM material this time"
* emailed patches from Andrew Morton: (141 commits)
zpool: remove no-op module init/exit
mm: zbud: constify the zbud_ops
mm: zpool: constify the zpool_ops
mm: swap: zswap: maybe_preload & refactoring
zram: unify error reporting
zsmalloc: remove null check from destroy_handle_cache()
zsmalloc: do not take class lock in zs_shrinker_count()
zsmalloc: use class->pages_per_zspage
zsmalloc: consider ZS_ALMOST_FULL as migrate source
zsmalloc: partial page ordering within a fullness_list
zsmalloc: use shrinker to trigger auto-compaction
zsmalloc: account the number of compacted pages
zsmalloc/zram: introduce zs_pool_stats api
zsmalloc: cosmetic compaction code adjustments
zsmalloc: introduce zs_can_compact() function
zsmalloc: always keep per-class stats
zsmalloc: drop unused variable `nr_to_migrate'
mm/memblock.c: fix comment in __next_mem_range()
mm/page_alloc.c: fix type information of memoryless node
memory-hotplug: fix comments in zone_spanned_pages_in_node() and zone_spanned_pages_in_node()
... -
For a memoryless node, the output of get_pfn_range_for_nid() is all zero.
It will display mem from 0 to -1.
Signed-off-by: Zhen Lei
Acked-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
When hot adding a node from add_memory(), we will add memblock first, so
the node is not empty. But when called from cpu_up(), the node should
be empty.
Signed-off-by: Xishi Qiu
Cc: Tang Chen
Cc: Yasuaki Ishimatsu
Cc: Naoya Horiguchi
Cc: Vlastimil Babka
Cc: Taku Izumi
Acked-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
We use sysctl_lowmem_reserve_ratio rather than
sysctl_lower_zone_reserve_ratio to determine how aggressive the kernel
is in defending lowmem from the possibility of being captured into
pinned user memory. To avoid misleading readers, correct it in some comments.
Signed-off-by: Yaowei Bai
Acked-by: Michal Hocko
Acked-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The comment says that the per-cpu batchsize and zone watermarks are
determined by present_pages which is definitely wrong, they are both
calculated from managed_pages. Fix it.
Signed-off-by: Yaowei Bai
Acked-by: Michal Hocko
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
alloc_pages_exact_node() was introduced in commit 6484eb3e2a81 ("page
allocator: do not check NUMA node ID when the caller knows the node is
valid") as an optimized variant of alloc_pages_node(), that doesn't
fall back to the current node for nid == NUMA_NO_NODE. Unfortunately the
name of the function can easily suggest that the allocation is
restricted to the given node and fails otherwise. In truth, the node is
only preferred, unless __GFP_THISNODE is passed among the gfp flags.
The misleading name has led to mistakes in the past, see for example
commits 5265047ac301 ("mm, thp: really limit transparent hugepage
allocation to local node") and b360edb43f8e ("mm, mempolicy:
migrate_to_node should only migrate to node").
Another issue with the name is that there's a family of
alloc_pages_exact*() functions where 'exact' means exact size (instead
of page order), which leads to more confusion.
To prevent further mistakes, this patch effectively renames
alloc_pages_exact_node() to __alloc_pages_node() to better convey that
it's an optimized variant of alloc_pages_node() not intended for general
usage. Both functions get described in comments.
It has also been considered to really provide a convenience function for
allocations restricted to a node, but the major opinion seems to be that
__GFP_THISNODE already provides that functionality and we shouldn't
duplicate the API needlessly. The number of users would be small
anyway.
Existing callers of alloc_pages_exact_node() are simply converted to
call __alloc_pages_node(), with the exception of sba_alloc_coherent()
which open-codes the check for NUMA_NO_NODE, so it is converted to use
alloc_pages_node() instead. This means it no longer performs some
VM_BUG_ON checks, and since the current check for nid in
alloc_pages_node() uses a 'nid < 0' comparison (which includes
NUMA_NO_NODE), it may hide wrong values which would be previously
exposed. Both differences will be rectified by the next patch.
To sum up, this patch makes no functional changes, except temporarily
hiding potentially buggy callers. Restricting the checks in
alloc_pages_node() is left for the next patch which can in turn expose
more existing buggy callers.
Signed-off-by: Vlastimil Babka
Acked-by: Johannes Weiner
Acked-by: Robin Holt
Acked-by: Michal Hocko
Acked-by: Christoph Lameter
Acked-by: Michael Ellerman
Cc: Mel Gorman
Cc: David Rientjes
Cc: Greg Thelen
Cc: Aneesh Kumar K.V
Cc: Pekka Enberg
Cc: Joonsoo Kim
Cc: Naoya Horiguchi
Cc: Tony Luck
Cc: Fenghua Yu
Cc: Arnd Bergmann
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: Gleb Natapov
Cc: Paolo Bonzini
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Cliff Whickman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
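To make the naming issue concrete, here is a minimal sketch of the three call patterns involved; the helper function is hypothetical, and only the allocator calls and flags come from the description above:
#include <linux/gfp.h>

/* Hypothetical helper, for illustration only. */
static void numa_alloc_examples(int nid)
{
	/*
	 * Node is only preferred: the allocator may fall back to other
	 * nodes, and nid == NUMA_NO_NODE means "use the current node".
	 */
	struct page *preferred = alloc_pages_node(nid, GFP_KERNEL, 0);

	/*
	 * Optimized variant for callers that already know nid is a valid
	 * node (never NUMA_NO_NODE); the node is still only preferred.
	 */
	struct page *fast = __alloc_pages_node(nid, GFP_KERNEL, 0);

	/* A hard restriction to nid needs __GFP_THISNODE instead. */
	struct page *strict = alloc_pages_node(nid, GFP_KERNEL | __GFP_THISNODE, 0);

	if (preferred)
		__free_pages(preferred, 0);
	if (fast)
		__free_pages(fast, 0);
	if (strict)
		__free_pages(strict, 0);
}
-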
The pair of get/set_freepage_migratetype() functions are used to cache
pageblock migratetype for a page put on a pcplist, so that it does not
have to be retrieved again when the page is put on a free list (e.g.
when pcplists become full). Historically it was also assumed that the
value is accurate for pages on freelists (as the functions' names
unfortunately suggest), but that cannot be guaranteed without affecting
various allocator fast paths. It is in fact not needed and all such
uses have been removed.
The last remaining (but pointless) usage related to pages of freelists
is in move_freepages(), which this patch removes.
To prevent further confusion, rename the functions to
get/set_pcppage_migratetype() and expand their description. Since all
the users are now in mm/page_alloc.c, move the functions there from the
shared header.
Signed-off-by: Vlastimil Babka
Acked-by: David Rientjes
Acked-by: Joonsoo Kim
Cc: Minchan Kim
Acked-by: Michal Nazarewicz
Cc: Laura Abbott
Reviewed-by: Naoya Horiguchi
Cc: Seungho Park
Cc: Johannes Weiner
Cc: "Kirill A. Shutemov"
Acked-by: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
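As a rough sketch of what the renamed helpers look like after the move into mm/page_alloc.c (the exact struct page field used for the cache is an assumption here):
/*
 * Sketch only: cache the pageblock's migratetype while the page sits on a
 * per-cpu list, so it does not have to be looked up again when the page is
 * later flushed to a free list. The value is meaningful only for pages on
 * pcplists, not for pages on free lists.
 */
static inline int get_pcppage_migratetype(struct page *page)
{
	return page->index;
}

static inline void set_pcppage_migratetype(struct page *page, int migratetype)
{
	page->index = migratetype;
}
-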
__test_page_isolated_in_pageblock() is used to verify whether all pages in a
pageblock were either successfully isolated or are hwpoisoned. Two of the
possible page states that are tested are, however, bogus and misleading.
Both tests rely on get_freepage_migratetype(page), which however has no
guarantees about pages on freelists. Specifically, it doesn't guarantee
that the migratetype returned by the function actually matches the
migratetype of the freelist that the page is on. Such guarantee is not
its purpose and would have negative impact on allocator performance.
The first test checks whether the freepage_migratetype equals
MIGRATE_ISOLATE, supposedly to catch races between page isolation and
allocator activity. These races should be fixed nowadays with
51bb1a4093 ("mm/page_alloc: add freepage on isolate pageblock to correct
buddy list") and related patches. As explained above, the check
wouldn't be able to catch them reliably anyway. For the same reason
false positives can happen, although they are harmless, as the
move_freepages() call would just move the page to the same freelist it's
already on. So removing the test is not a bug fix, just cleanup. After
this patch, we assume that all PageBuddy pages are on the correct
freelist and that the races were really fixed. A truly reliable
verification in the form of e.g. VM_BUG_ON() would be complicated and
is arguably not needed.
The second test (page_count(page) == 0 && get_freepage_migratetype(page)
== MIGRATE_ISOLATE) is probably supposed (the code comes from a big
memory isolation patch from 2007) to catch pages on MIGRATE_ISOLATE
pcplists. However, pcplists don't contain MIGRATE_ISOLATE freepages
nowadays, those are freed directly to free lists, so the check is
obsolete. Remove it as well.
Signed-off-by: Vlastimil Babka
Acked-by: Joonsoo Kim
Cc: Minchan Kim
Acked-by: Michal Nazarewicz
Cc: Laura Abbott
Reviewed-by: Naoya Horiguchi
Cc: Seungho Park
Cc: Johannes Weiner
Cc: "Kirill A. Shutemov"
Acked-by: Mel Gorman
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The force_kill member of struct oom_control isn't needed if an order of -1
is used instead. This is the same as order == -1 in struct
compact_control which requires full memory compaction.
This patch introduces no functional change.
Signed-off-by: David Rientjes
Cc: Sergey Senozhatsky
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
There are essential elements to an oom context that are passed around to
multiple functions.
Organize these elements into a new struct, struct oom_control, that
specifies the context for an oom condition.
This patch introduces no functional change.
Signed-off-by: David Rientjes
Acked-by: Michal Hocko
Cc: Sergey Senozhatsky
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
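A rough sketch of the kind of context structure described here; the field list is an approximation pieced together from the two oom entries above, not the exact definition:
struct oom_control {
	struct zonelist *zonelist;	/* allocation context of the failing allocation */
	nodemask_t *nodemask;		/* allocation constraints */
	gfp_t gfp_mask;			/* gfp mask of the failing allocation */
	int order;			/* requested order; -1 means a full, forced OOM,
					 * replacing the force_kill flag (see the
					 * previous entry) */
};
-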
Commit febd5949e134 ("mm/memory hotplug: init the zone's size when
calculating node totalpages") refines the function
free_area_init_core().
After doing so, these two parameters are not used anymore.
This patch removes these two parameters.
Signed-off-by: Wei Yang
Cc: Gu Zheng
Acked-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
nr_node_ids records the highest possible node id, which is calculated by
scanning the bitmap node_states[N_POSSIBLE]. The current implementation
scans the bitmap from the beginning, which means it walks the whole bitmap.
This patch reverses the order by scanning from the end with find_last_bit().
Signed-off-by: Wei Yang
Cc: Tejun Heo
Acked-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
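Roughly, the change amounts to something like the following (a sketch, not the exact kernel code):
#include <linux/bitops.h>
#include <linux/nodemask.h>

/*
 * Instead of walking every possible node id from bit 0, find the highest
 * set bit in the possible-node mask directly; find_last_bit() returns the
 * index of the last set bit (or the size if no bit is set).
 */
static void __init setup_nr_node_ids(void)
{
	unsigned int highest;

	highest = find_last_bit(node_possible_map.bits, MAX_NUMNODES);
	nr_node_ids = highest + 1;
}
-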
Pull libnvdimm updates from Dan Williams:
"This update has successfully completed a 0day-kbuild run and has
appeared in a linux-next release. The changes outside of the typical
drivers/nvdimm/ and drivers/acpi/nfit.[ch] paths are related to the
removal of IORESOURCE_CACHEABLE, the introduction of memremap(), and
the introduction of ZONE_DEVICE + devm_memremap_pages().
Summary:
- Introduce ZONE_DEVICE and devm_memremap_pages() as a generic
mechanism for adding device-driver-discovered memory regions to the
kernel's direct map.
This facility is used by the pmem driver to enable pfn_to_page()
operations on the page frames returned by DAX ('direct_access' in
'struct block_device_operations').
For now, the 'memmap' allocation for these "device" pages comes
from "System RAM". Support for allocating the memmap from device
memory will arrive in a later kernel.
- Introduce memremap() to replace usages of ioremap_cache() and
ioremap_wt(). memremap() drops the __iomem annotation for these
mappings to memory that do not have i/o side effects. The
replacement of ioremap_cache() with memremap() is limited to the
pmem driver to ease merging the api change in v4.3.
Completion of the conversion is targeted for v4.4.
- Similar to the usage of memcpy_to_pmem() + wmb_pmem() in the pmem
driver, update the VFS DAX implementation and PMEM api to provide
persistence guarantees for kernel operations on a DAX mapping.
- Convert the ACPI NFIT 'BLK' driver to map the block apertures as
cacheable to improve performance.
- Miscellaneous updates and fixes to libnvdimm including support for
issuing "address range scrub" commands, clarifying the optimal
'sector size' of pmem devices, a clarification of the usage of the
ACPI '_STA' (status) property for DIMM devices, and other minor
fixes"* tag 'libnvdimm-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (34 commits)
libnvdimm, pmem: direct map legacy pmem by default
libnvdimm, pmem: 'struct page' for pmem
libnvdimm, pfn: 'struct page' provider infrastructure
x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB
add devm_memremap_pages
mm: ZONE_DEVICE for "device memory"
mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h
dax: drop size parameter to ->direct_access()
nd_blk: change aperture mapping from WC to WB
nvdimm: change to use generic kvfree()
pmem, dax: have direct_access use __pmem annotation
dax: update I/O path to do proper PMEM flushing
pmem: add copy_from_iter_pmem() and clear_pmem()
pmem, x86: clean up conditional pmem includes
pmem: remove layer when calling arch_has_wmb_pmem()
pmem, x86: move x86 PMEM API to new pmem.h header
libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option
pmem: switch to devm_ allocations
devres: add devm_memremap
libnvdimm, btt: write and validate parent_uuid
...
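For the memremap() interface mentioned in the pull message above, a hedged usage sketch; the helper and the resource handling are made up, and only memremap() itself and the write-back mapping flag come from the description:
#include <linux/io.h>
#include <linux/ioport.h>

/*
 * Map a memory-like region without the __iomem annotation. MEMREMAP_WB
 * asks for an ordinary cacheable (write-back) mapping; the region must
 * have no I/O side effects.
 */
static void *map_pmem_region(struct resource *res)
{
	void *addr = memremap(res->start, resource_size(res), MEMREMAP_WB);

	if (!addr)
		return NULL;

	/* addr can now be used like normal memory, e.g. memcpy()'d. */
	return addr;
}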
28 Aug, 2015
1 commit
-
While pmem is usable as a block device or via DAX mappings to userspace
there are several usage scenarios that can not target pmem due to its
lack of struct page coverage. In preparation for "hot plugging" pmem
into the vmemmap add ZONE_DEVICE as a new zone to tag these pages
separately from the ones that are subject to standard page allocations.
Importantly "device memory" can be removed at will by userspace
unbinding the driver of the device.
Having a separate zone prevents allocation and otherwise marks these
pages that are distinct from typical uniform memory. Device memory has
different lifetime and performance characteristics than RAM. However,
since we have run out of ZONES_SHIFT bits this functionality currently
depends on sacrificing ZONE_DMA.
Cc: H. Peter Anvin
Cc: Ingo Molnar
Cc: Dave Hansen
Cc: Rik van Riel
Cc: Mel Gorman
Cc: Jerome Glisse
[hch: various simplifications in the arch interface]
Signed-off-by: Christoph Hellwig
Signed-off-by: Dan Williams
22 Aug, 2015
1 commit
-
Commit c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb") added
checks for page->pfmemalloc to __skb_fill_page_desc():
	if (page->pfmemalloc && !page->mapping)
		skb->pfmemalloc = true;
It assumes page->mapping == NULL implies that page->pfmemalloc can be
trusted. However, __delete_from_page_cache() can set page->mapping
to NULL and leave page->index value alone. Due to being in union, a
non-zero page->index will be interpreted as true page->pfmemalloc.
So the assumption is invalid if the networking code can see such a page.
And it seems it can. We have encountered this with an NFS-over-loopback
setup when such a page is attached to a new skbuff. There is no copying
going on in this case, so the page confuses __skb_fill_page_desc, which
interprets the index as the pfmemalloc flag, and the network stack drops
packets that have been allocated using the reserves unless they are to
be queued on sockets handling the swapping (which is the case here). That
leads to hangs when the NFS client waits for a response from the server
which has been dropped and thus never arrives.
The struct page is already heavily packed so rather than finding another
hole to put it in, let's do a trick instead. We can reuse the index
again but define it to an impossible value (-1UL). This is the page
index so it should never see the value that large. Replace all direct
users of page->pfmemalloc by page_is_pfmemalloc which will hide this
nastiness from unspoiled eyes.
The information will get lost if somebody wants to use page->index
obviously but that was the case before and the original code expected
that the information should be persisted somewhere else if that is
really needed (e.g. what SLAB and SLUB do).[akpm@linux-foundation.org: fix blooper in slub]
Fixes: c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb")
Signed-off-by: Michal Hocko
Debugged-by: Vlastimil Babka
Debugged-by: Jiri Bohac
Cc: Eric Dumazet
Cc: David Miller
Acked-by: Mel Gorman
Cc: [3.6+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
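The trick described above boils down to helpers along these lines (a sketch; the precise definitions may differ):
/* -1UL is an impossible page index, so it can double as the flag value. */
static inline void set_page_pfmemalloc(struct page *page)
{
	page->index = -1UL;
}

static inline void clear_page_pfmemalloc(struct page *page)
{
	page->index = 0;
}

static inline bool page_is_pfmemalloc(struct page *page)
{
	/*
	 * Only the page allocator sets this, and only while the page is
	 * not on the LRU or in the page cache, so the value is reliable
	 * for callers holding a freshly allocated page, unlike the old
	 * direct page->pfmemalloc/page->index overlap.
	 */
	return page->index == -1UL;
}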
15 Aug, 2015
1 commit
-
When we add a new node, the edge of memory may be wrong.
e.g. system has 4 nodes, and node3 is movable, node3 mem:[24G-32G],
1. hotremove the node3,
2. then hotadd node3 with a part of memory, mem:[26G-30G],
3. call hotadd_new_pgdat()
free_area_init_node()
get_pfn_range_for_nid()
4. it will return wrong start_pfn and end_pfn, because we have not
updated the memblock.
This patch also fixes a BUG_ON during hot-addition, please see
http://marc.info/?l=linux-kernel&m=142961156129456&w=2
Signed-off-by: Xishi Qiu
Cc: Yasuaki Ishimatsu
Cc: Kamezawa Hiroyuki
Cc: Taku Izumi
Cc: Tang Chen
Cc: Gu Zheng
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
07 Aug, 2015
4 commits
-
The race condition addressed in commit add05cecef80 ("mm: soft-offline:
don't free target page in successful page migration") was not closed
completely, because that can happen not only for soft-offline, but also
for hard-offline. Consider that a slab page is about to be freed into
buddy pool, and then an uncorrected memory error hits the page just
after entering __free_one_page(), then VM_BUG_ON_PAGE(page->flags &
PAGE_FLAGS_CHECK_AT_PREP) is triggered, despite the fact that it's not
necessary because the data on the affected page is not consumed.
To solve it, this patch drops __PG_HWPOISON from page flag checks at
allocation/free time. I think it's justified because the __PG_HWPOISON
flag is defined to prevent the page from being reused, and setting it
outside the page's alloc-free cycle is a designed behavior (not a bug).
For recent months, I was annoyed about a BUG_ON when a soft-offlined page
remains on lru cache list for a while, which is avoided by calling
put_page() instead of putback_lru_page() in page migration's success
path. This means that this patch reverts a major change from commit
add05cecef80 about the new refcounting rule of soft-offlined pages, so
"reuse window" revives. This will be closed by a subsequent patch.Signed-off-by: Naoya Horiguchi
Cc: Andi Kleen
Cc: Dean Nelson
Cc: Tony Luck
Cc: "Kirill A. Shutemov"
Cc: Hugh Dickins
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Dave Hansen reported the following:
My laptop has been behaving strangely with 4.2-rc2. Once I log
in to my X session, I start getting all kinds of strange errors
from applications and see this in my dmesg:
VFS: file-max limit 8192 reached
The problem is that the file-max is calculated before memory is fully
initialised and miscalculates how much memory the kernel is using. This
patch recalculates file-max after deferred memory initialisation. Note
that using memory hotplug infrastructure would not have avoided this
problem as the value is not recalculated after memory hot-add.
4.1: files_stat.max_files = 6582781
4.2-rc2: files_stat.max_files = 8192
4.2-rc2 patched: files_stat.max_files = 6562467
There are small differences between the patched kernel and 4.1, but not
enough to matter.
Signed-off-by: Mel Gorman
Reported-by: Dave Hansen
Cc: Nicolai Stange
Cc: Dave Hansen
Cc: Alex Ng
Cc: Fengguang Wu
Cc: Peter Zijlstra (Intel)
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Commit 0e1cc95b4cc7 ("mm: meminit: finish initialisation of struct pages
before basic setup") introduced a rwsem to signal completion of the
initialization workers.
Lockdep complains about possible recursive locking:
=============================================
[ INFO: possible recursive locking detected ]
4.1.0-12802-g1dc51b8 #3 Not tainted
---------------------------------------------
swapper/0/1 is trying to acquire lock:
(pgdat_init_rwsem){++++.+},
at: [] page_alloc_init_late+0xc7/0xe6
but task is already holding lock:
(pgdat_init_rwsem){++++.+},
at: [] page_alloc_init_late+0x3e/0xe6
Replace the rwsem by a completion together with an atomic
"outstanding work counter".[peterz@infradead.org: Barrier removal on the grounds of being pointless]
[mgorman@suse.de: Applied review feedback]
Signed-off-by: Nicolai Stange
Signed-off-by: Mel Gorman
Acked-by: Peter Zijlstra (Intel)
Cc: Dave Hansen
Cc: Alex Ng
Cc: Fengguang Wu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
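A minimal sketch of the replacement pattern, assuming illustrative names (the real code in mm/page_alloc.c may differ in detail):
#include <linux/atomic.h>
#include <linux/completion.h>

static atomic_t pgdat_init_n_undone;
static DECLARE_COMPLETION(pgdat_init_all_done_comp);

/* Each initialisation worker calls this when it finishes. */
static void pgdat_init_report_one_done(void)
{
	/* The last worker to finish wakes the waiter. */
	if (atomic_dec_and_test(&pgdat_init_n_undone))
		complete(&pgdat_init_all_done_comp);
}

/* The boot thread sets the count, starts the workers, then waits;
 * unlike with the rwsem, nothing is acquired recursively here. */
static void wait_for_deferred_init(int nr_workers)
{
	atomic_set(&pgdat_init_n_undone, nr_workers);
	/* ... kick off nr_workers per-node initialisation threads ... */
	wait_for_completion(&pgdat_init_all_done_comp);
}
-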
early_pfn_to_nid() historically was inherently not SMP safe but only
used during boot which is inherently single threaded or during hotplug
which is protected by a giant mutex.
With deferred memory initialisation there was a thread-safe version
introduced and the early_pfn_to_nid would trigger a BUG_ON if used
unsafely. Memory hotplug hit that check. This patch adds a lock to
early_pfn_to_nid() to make it safe to use during hotplug.
Signed-off-by: Mel Gorman
Reported-by: Alex Ng
Tested-by: Alex Ng
Acked-by: Peter Zijlstra (Intel)
Cc: Nicolai Stange
Cc: Dave Hansen
Cc: Fengguang Wu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
18 Jul, 2015
3 commits
-
Currently, we set wrong gfp_mask to page_owner info in case of isolated
freepage by compaction and split page. It causes incorrect mixed
pageblock report that we can get from '/proc/pagetypeinfo'. This metric
is really useful to measure fragmentation effect so should be accurate.
This patch fixes it by setting the correct information.
Without this patch, after a kernel build workload finishes, the number of
mixed pageblocks is 112 among roughly 210 movable pageblocks. With this fix,
the output shows just 57 mixed pageblocks.
Signed-off-by: Joonsoo Kim
Cc: Mel Gorman
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
When I tested my new patches, I found that page pointer which is used
for setting page_owner information is changed. This is because page
pointer is used to set new migratetype in loop. After this work, page
pointer could be out of bound. If this wrong pointer is used for
page_owner, an access violation happens. Below is the error message that I
got.
BUG: unable to handle kernel paging request at 0000000000b00018
IP: [] save_stack_address+0x30/0x40
PGD 1af2d067 PUD 166e0067 PMD 0
Oops: 0002 [#1] SMP
...snip...
Call Trace:
print_context_stack+0xcf/0x100
dump_trace+0x15f/0x320
save_stack_trace+0x2f/0x50
__set_page_owner+0x46/0x70
__isolate_free_page+0x1f7/0x210
split_free_page+0x21/0xb0
isolate_freepages_block+0x1e2/0x410
compaction_alloc+0x22d/0x2d0
migrate_pages+0x289/0x8b0
compact_zone+0x409/0x880
compact_zone_order+0x6d/0x90
try_to_compact_pages+0x110/0x210
__alloc_pages_direct_compact+0x3d/0xe6
__alloc_pages_nodemask+0x6cd/0x9a0
alloc_pages_current+0x91/0x100
runtest_store+0x296/0xa50
simple_attr_write+0xbd/0xe0
__vfs_write+0x28/0xf0
vfs_write+0xa9/0x1b0
SyS_write+0x46/0xb0
system_call_fastpath+0x16/0x75
This patch fixes this error by moving up set_page_owner().
Signed-off-by: Joonsoo Kim
Cc: Mel Gorman
Cc: Vlastimil Babka
Acked-by: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The kbuild test robot reported the following
tree: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
head: 14a6f1989dae9445d4532941bdd6bbad84f4c8da
commit: 3b242c66ccbd60cf47ab0e8992119d9617548c23 x86: mm: enable deferred struct page initialisation on x86-64
date: 3 days ago
config: x86_64-randconfig-x006-201527 (attached as .config)
reproduce:
git checkout 3b242c66ccbd60cf47ab0e8992119d9617548c23
# save the attached .config to linux build tree
make ARCH=x86_64
All warnings (new ones prefixed by >>):
mm/page_alloc.c: In function 'early_page_uninitialised':
>> mm/page_alloc.c:247:6: warning: unused variable 'nid' [-Wunused-variable]
int nid = early_pfn_to_nid(pfn);
It's due to the NODE_DATA macro ignoring the nid parameter on !NUMA
configurations. This patch avoids the warning by not declaring nid.
Signed-off-by: Mel Gorman
Reported-by: Wu Fengguang
Cc: "Kirill A. Shutemov"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
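For illustration, the warning arises roughly like this (a sketch, not the exact kernel code; the function name and the warning line come from the report above):
/*
 * On !NUMA configurations NODE_DATA() ignores its argument, roughly:
 *
 *	#define NODE_DATA(nid)	(&contig_page_data)
 *
 * so a local whose only use is as that macro's argument ends up unused
 * after preprocessing:
 */
static inline bool early_page_uninitialised(unsigned long pfn)
{
	int nid = early_pfn_to_nid(pfn);	/* warning: unused variable 'nid' */

	if (pfn >= NODE_DATA(nid)->first_deferred_pfn)
		return true;

	return false;
}

/* The patch avoids the warning simply by not declaring the local 'nid'. */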
01 Jul, 2015
12 commits
-
Waiman Long reported that 24TB machines hit OOM during basic setup when
struct page initialisation was deferred. One approach is to initialise
memory on demand but it interferes with page allocator paths. This patch
creates dedicated threads to initialise memory before basic setup. It
then blocks on a rw_semaphore until completion, as a wait_queue and counter
would be overkill. This may be slower to boot but it's simpler overall and
also gets rid of a section mangling which existed so kswapd could do the
initialisation.
[akpm@linux-foundation.org: include rwsem.h, use DECLARE_RWSEM, fix comment, remove unneeded cast]
Signed-off-by: Mel Gorman
Cc: Waiman Long
Cc: Dave Hansen
Cc: Scott Norton
Tested-by: Daniel J Blueman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
mminit_verify_page_links() is an extremely paranoid check that was
introduced when memory initialisation was being heavily reworked.
Profiles indicated that up to 10% of parallel memory initialisation was
spent on checking this for every page. The cost could be reduced but in
practice this check only found problems very early during the
initialisation rewrite and has found nothing since. This patch removes an
expensive unnecessary check.
Signed-off-by: Mel Gorman
Tested-by: Nate Zimmer
Tested-by: Waiman Long
Tested-by: Daniel J Blueman
Acked-by: Pekka Enberg
Cc: Robin Holt
Cc: Nate Zimmer
Cc: Dave Hansen
Cc: Waiman Long
Cc: Scott Norton
Cc: "Luck, Tony"
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
During parallel struct page initialisation, ranges are checked for every
PFN unnecessarily which increases boot times. This patch alters when the
ranges are checked.
Signed-off-by: Mel Gorman
Tested-by: Nate Zimmer
Tested-by: Waiman Long
Tested-by: Daniel J Blueman
Acked-by: Pekka Enberg
Cc: Robin Holt
Cc: Nate Zimmer
Cc: Dave Hansen
Cc: Waiman Long
Cc: Scott Norton
Cc: "Luck, Tony"
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Parallel struct page initialisation frees pages one at a time. Try to free
pages as single large pages where possible.
Signed-off-by: Mel Gorman
Tested-by: Nate Zimmer
Tested-by: Waiman Long
Tested-by: Daniel J Blueman
Acked-by: Pekka Enberg
Cc: Robin Holt
Cc: Nate Zimmer
Cc: Dave Hansen
Cc: Waiman Long
Cc: Scott Norton
Cc: "Luck, Tony"
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Deferred struct page initialisation is using pfn_to_page() on every PFN
unnecessarily. This patch minimises the number of lookups and scheduler
checks.
Signed-off-by: Mel Gorman
Tested-by: Nate Zimmer
Tested-by: Waiman Long
Tested-by: Daniel J Blueman
Acked-by: Pekka Enberg
Cc: Robin Holt
Cc: Nate Zimmer
Cc: Dave Hansen
Cc: Waiman Long
Cc: Scott Norton
Cc: "Luck, Tony"
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Only a subset of struct pages are initialised at the moment. When this
patch is applied, kswapd initialises the remaining struct pages in parallel.
This should boot faster by spreading the work to multiple CPUs and
initialising data that is local to the CPU. The user-visible effect on
large machines is that free memory will appear to rapidly increase early
in the lifetime of the system until kswapd reports that all memory is
initialised in the kernel log. Once initialised there should be no other
user-visible effects.
Signed-off-by: Mel Gorman
Tested-by: Nate Zimmer
Tested-by: Waiman Long
Tested-by: Daniel J Blueman
Acked-by: Pekka Enberg
Cc: Robin Holt
Cc: Nate Zimmer
Cc: Dave Hansen
Cc: Waiman Long
Cc: Scott Norton
Cc: "Luck, Tony"
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This patch initialises all low memory struct pages and 2G of the highest
zone on each node during memory initialisation if
CONFIG_DEFERRED_STRUCT_PAGE_INIT is set. That config option cannot be set
but will be available in a later patch. Parallel initialisation of struct
page depends on some features from memory hotplug and it is necessary to
alter section annotations.
Signed-off-by: Mel Gorman
Tested-by: Nate Zimmer
Tested-by: Waiman Long
Tested-by: Daniel J Blueman
Acked-by: Pekka Enberg
Cc: Robin Holt
Cc: Nate Zimmer
Cc: Dave Hansen
Cc: Waiman Long
Cc: Scott Norton
Cc: "Luck, Tony"
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
early_pfn_in_nid() and meminit_pfn_in_nid() are small functions that are
unnecessarily visible outside memory initialisation. As well as
unnecessary visibility, it's unnecessary function call overhead when
initialising pages. This patch moves the helpers inline.
[akpm@linux-foundation.org: fix build]
[mhocko@suse.cz: fix build]
Signed-off-by: Mel Gorman
Tested-by: Nate Zimmer
Tested-by: Waiman Long
Tested-by: Daniel J Blueman
Acked-by: Pekka Enberg
Cc: Robin Holt
Cc: Nate Zimmer
Cc: Dave Hansen
Cc: Waiman Long
Cc: Scott Norton
Cc: "Luck, Tony"
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Thomas Gleixner
Signed-off-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
__early_pfn_to_nid() uses static variables to cache recent lookups, as
memblock lookups are very expensive but it assumes that memory
initialisation is single-threaded. Parallel initialisation of struct
pages will break that assumption so this patch makes __early_pfn_to_nid()
SMP-safe by requiring the caller to cache recent search information.
early_pfn_to_nid() keeps the same interface but is only safe to use early
in boot due to the use of a global static variable. meminit_pfn_in_nid()
is an SMP-safe version that callers must maintain their own state for.
Signed-off-by: Mel Gorman
Tested-by: Nate Zimmer
Tested-by: Waiman Long
Tested-by: Daniel J Blueman
Acked-by: Pekka Enberg
Cc: Robin Holt
Cc: Nate Zimmer
Cc: Dave Hansen
Cc: Waiman Long
Cc: Scott Norton
Cc: "Luck, Tony"
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
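A sketch of the caller-maintained state this describes; the struct name follows the series, but its exact layout and the loop around it are assumptions:
struct mminit_pfnnid_cache {
	unsigned long last_start;	/* start PFN of the last matched memblock range */
	unsigned long last_end;		/* end PFN of that range */
	int last_nid;			/* node id of that range */
};

/* Each initialisation thread keeps its own cache on the stack, so the
 * expensive memblock search is skipped for consecutive PFNs without any
 * shared global state. init_pfn_range() is a hypothetical caller: */
static void init_pfn_range(unsigned long start_pfn, unsigned long end_pfn, int nid)
{
	struct mminit_pfnnid_cache nid_init_state = { };
	unsigned long pfn;

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state))
			continue;
		/* ... initialise the struct page for pfn ... */
	}
}
-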
__free_pages_bootmem prepares a page for release to the buddy allocator
and assumes that the struct page is initialised. Parallel initialisation
of struct pages defers initialisation and __free_pages_bootmem can be
called for struct pages that cannot yet map struct page to PFN. This
patch passes PFN to __free_pages_bootmem with no other functional change.
Signed-off-by: Mel Gorman
Tested-by: Nate Zimmer
Tested-by: Waiman Long
Tested-by: Daniel J Blueman
Acked-by: Pekka Enberg
Cc: Robin Holt
Cc: Nate Zimmer
Cc: Dave Hansen
Cc: Waiman Long
Cc: Scott Norton
Cc: "Luck, Tony"
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Currently each page struct is set as reserved upon initialization. This
patch leaves the reserved bit clear and only sets the reserved bit when it
is known the memory was allocated by the bootmem allocator. This makes it
easier to distinguish between uninitialised struct pages and reserved
struct pages in later patches.
Signed-off-by: Robin Holt
Signed-off-by: Nathan Zimmer
Signed-off-by: Mel Gorman
Tested-by: Nate Zimmer
Tested-by: Waiman Long
Tested-by: Daniel J Blueman
Acked-by: Pekka Enberg
Cc: Robin Holt
Cc: Dave Hansen
Cc: Waiman Long
Cc: Scott Norton
Cc: "Luck, Tony"
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Currently, memmap_init_zone() has all the smarts for initializing a single
page. A subset of this is required for parallel page initialisation and
so this patch breaks up the monolithic function in preparation.
Signed-off-by: Robin Holt
Signed-off-by: Nathan Zimmer
Signed-off-by: Mel Gorman
Tested-by: Nate Zimmer
Tested-by: Waiman Long
Tested-by: Daniel J Blueman
Acked-by: Pekka Enberg
Cc: Robin Holt
Cc: Dave Hansen
Cc: Waiman Long
Cc: Scott Norton
Cc: "Luck, Tony"
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
25 Jun, 2015
5 commits
-
Merge first patchbomb from Andrew Morton:
- a few misc things
- ocfs2 updates
- kernel/watchdog.c feature work (took ages to get right)
- most of MM. A few tricky bits are held up and probably won't make 4.2.
* emailed patches from Andrew Morton: (91 commits)
mm: kmemleak_alloc_percpu() should follow the gfp from per_alloc()
mm, thp: respect MPOL_PREFERRED policy with non-local node
tmpfs: truncate prealloc blocks past i_size
mm/memory hotplug: print the last vmemmap region at the end of hot add memory
mm/mmap.c: optimization of do_mmap_pgoff function
mm: kmemleak: optimise kmemleak_lock acquiring during kmemleak_scan
mm: kmemleak: avoid deadlock on the kmemleak object insertion error path
mm: kmemleak: do not acquire scan_mutex in kmemleak_do_cleanup()
mm: kmemleak: fix delete_object_*() race when called on the same memory block
mm: kmemleak: allow safe memory scanning during kmemleak disabling
memcg: convert mem_cgroup->under_oom from atomic_t to int
memcg: remove unused mem_cgroup->oom_wakeups
frontswap: allow multiple backends
x86, mirror: x86 enabling - find mirrored memory ranges
mm/memblock: allocate boot time data structures from mirrored memory
mm/memblock: add extra "flags" to memblock to allow selection of memory based on attribute
mm: do not ignore mapping_gfp_mask in page cache allocation paths
mm/cma.c: fix typos in comments
mm/oom_kill.c: print points as unsigned int
mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages
... -
The should_alloc_retry() function was meant to encapsulate retry
conditions of the allocator slowpath, but there are still checks
remaining in the main function, and much of how the retrying is
performed also depends on the OOM killer progress. The physical
separation of those conditions makes the code hard to follow.
Inline the should_alloc_retry() checks. Notes:
- The __GFP_NOFAIL check is already done in __alloc_pages_may_oom(),
replace it with looping on OOM killer progress
- The pm_suspended_storage() check is meant to skip the OOM killer
when reclaim has no IO available, move to __alloc_pages_may_oom()
- The order
Acked-by: Michal Hocko
Cc: David Rientjes
Cc: Tetsuo Handa
Cc: Andrea Arcangeli
Cc: Dave Chinner
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The zonelist locking and the oom_sem are two overlapping locks that are
used to serialize global OOM killing against different things.
The historical zonelist locking serializes OOM kills from allocations with
overlapping zonelists against each other to prevent killing more tasks
than necessary in the same memory domain. Only when neither tasklists nor
zonelists from two concurrent OOM kills overlap (tasks in separate memcgs
bound to separate nodes) are OOM kills allowed to execute in parallel.
The younger oom_sem is a read-write lock to serialize OOM killing against
the PM code trying to disable the OOM killer altogether.
However, the OOM killer is a fairly cold error path; there is really no
reason to optimize for highly performant and concurrent OOM kills. And
the oom_sem is just flat-out redundant.
Replace both locking schemes with a single global mutex serializing OOM
kills regardless of context.
Signed-off-by: Johannes Weiner
Acked-by: Michal Hocko
Acked-by: David Rientjes
Cc: Tetsuo Handa
Cc: Andrea Arcangeli
Cc: Dave Chinner
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
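A minimal sketch of the resulting locking scheme; the helper and victim-kill function are hypothetical, and only the single global mutex and the trylock-and-back-off pattern come from the description:
#include <linux/mutex.h>

static DEFINE_MUTEX(oom_lock);

static void oom_kill_one_victim(void)
{
	/* hypothetical stand-in for the real victim selection and kill */
}

static bool oom_kill_from_allocation(void)
{
	/*
	 * The OOM killer is a cold error path, so one global mutex is
	 * enough. If another OOM kill (or the PM code disabling the OOM
	 * killer) already holds it, back off and let reclaim retry
	 * rather than queueing a second kill in parallel.
	 */
	if (!mutex_trylock(&oom_lock))
		return false;

	oom_kill_one_victim();

	mutex_unlock(&oom_lock);
	return true;
}
-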
Init the zone's size when calculating node totalpages to avoid duplicated
operations in free_area_init_core().
Signed-off-by: Gu Zheng
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
It's been five years now since the KM_* kmap flags were removed, and
we can call clear_highpage from any context. So we remove prep_zero_pages
accordingly.
Signed-off-by: Anisse Astier
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds