09 Sep, 2015

40 commits

  • s/succees/success/

    Signed-off-by: Alexander Kuleshov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Kuleshov
     
    We cache isolate_start_pfn before entering isolate_migratepages(). If a
    pageblock is skipped in isolate_migratepages() for whatever reason,
    cc->migrate_pfn can end up far beyond isolate_start_pfn, and we then
    flush freed pages needlessly. For example, the following scenario is
    possible:

    - assume order-9 compaction, pageblock order is 9
    - isolate_start_pfn is 0x200
    - isolate_migratepages()
      - skip a number of pageblocks
      - start to isolate from pfn 0x600
      - cc->migrate_pfn = 0x620
      - return
    - last_migrated_pfn is set to 0x200
    - check flushing condition
      - current_block_start is set to 0x600
      - last_migrated_pfn < current_block_start then do useless flush

    This spurious flush helps neither performance nor the success rate, so
    this patch fixes it. The simplest way to know the exact position where we
    start isolating migratable pages is to cache it in isolate_migratepages()
    before entering the actual isolation. This patch implements that and
    fixes the problem.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: David Rientjes
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • alloc_pages_node() might fail when called with NUMA_NO_NODE and
    __GFP_THISNODE on a CPU belonging to a memoryless node. To make the
    local-node fallback more robust and prevent such situations, use
    numa_mem_id(), which was introduced for similar scenarios in the slab
    context.
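
    A minimal hedged sketch of the idea (not the verbatim kernel code;
    __alloc_pages_node() is the debug-checking helper described in the next
    entry):

        /*
         * Resolve NUMA_NO_NODE to the nearest node that actually has memory,
         * so a later __GFP_THISNODE allocation cannot target a memoryless node.
         */
        static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
                                                    unsigned int order)
        {
                if (nid == NUMA_NO_NODE)
                        nid = numa_mem_id();    /* local memory node, never memoryless */

                return __alloc_pages_node(nid, gfp_mask, order);
        }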

    Suggested-by: Christoph Lameter
    Signed-off-by: Vlastimil Babka
    Acked-by: David Rientjes
    Acked-by: Mel Gorman
    Acked-by: Christoph Lameter
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Perform the same debug checks in alloc_pages_node() as are done in
    __alloc_pages_node(), by making the former function a wrapper of the
    latter one.

    In addition to better diagnostics in DEBUG_VM builds for situations
    which have been already fatal (e.g. out-of-bounds node id), there are
    two visible changes for potential existing buggy callers of
    alloc_pages_node():

    - calling alloc_pages_node() with any negative nid (e.g. due to arithmetic
      overflow) was treated as passing NUMA_NO_NODE, and the fallback to the
      local node was applied. This will now be fatal.
    - calling alloc_pages_node() with an offline node will now be checked in
      DEBUG_VM builds. Since it's not fatal if the node was previously online,
      and this patch may expose some existing buggy callers, change the
      VM_BUG_ON in __alloc_pages_node() to VM_WARN_ON.
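
    A hedged sketch of the checks described above (simplified; the exact
    allocator entry point may differ):

        static inline struct page *__alloc_pages_node(int nid, gfp_t gfp_mask,
                                                      unsigned int order)
        {
                /* An out-of-bounds nid stays fatal in DEBUG_VM builds... */
                VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
                /* ...while an offline nid only warns, as it may have been online before. */
                VM_WARN_ON(!node_online(nid));

                return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
        }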

    Signed-off-by: Vlastimil Babka
    Acked-by: David Rientjes
    Acked-by: Johannes Weiner
    Acked-by: Christoph Lameter
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • alloc_pages_exact_node() was introduced in commit 6484eb3e2a81 ("page
    allocator: do not check NUMA node ID when the caller knows the node is
    valid") as an optimized variant of alloc_pages_node(), that doesn't
    fallback to current node for nid == NUMA_NO_NODE. Unfortunately the
    name of the function can easily suggest that the allocation is
    restricted to the given node and fails otherwise. In truth, the node is
    only preferred, unless __GFP_THISNODE is passed among the gfp flags.

    The misleading name has led to mistakes in the past, see for example
    commits 5265047ac301 ("mm, thp: really limit transparent hugepage
    allocation to local node") and b360edb43f8e ("mm, mempolicy:
    migrate_to_node should only migrate to node").

    Another issue with the name is that there's a family of
    alloc_pages_exact*() functions where 'exact' means exact size (instead
    of page order), which leads to more confusion.

    To prevent further mistakes, this patch effectively renames
    alloc_pages_exact_node() to __alloc_pages_node() to better convey that
    it's an optimized variant of alloc_pages_node() not intended for general
    usage. Both functions get described in comments.

    Providing a real convenience function for allocations restricted to a
    node was also considered, but the prevailing opinion is that
    __GFP_THISNODE already provides that functionality and we shouldn't
    duplicate the API needlessly. The number of users would be small
    anyway.
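
    As a hedged illustration (hypothetical caller, not from the patch), an
    allocation genuinely restricted to a node is simply:

        /* Fail rather than fall back to another node. */
        page = __alloc_pages_node(nid, GFP_KERNEL | __GFP_THISNODE, 0);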

    Existing callers of alloc_pages_exact_node() are simply converted to
    call __alloc_pages_node(), with the exception of sba_alloc_coherent()
    which open-codes the check for NUMA_NO_NODE, so it is converted to use
    alloc_pages_node() instead. This means it no longer performs some
    VM_BUG_ON checks, and since the current check for nid in
    alloc_pages_node() uses a 'nid < 0' comparison (which includes
    NUMA_NO_NODE), it may hide wrong values that would previously have been
    exposed.

    Both differences will be rectified by the next patch.

    To sum up, this patch makes no functional changes, except temporarily
    hiding potentially buggy callers. Restricting the checks in
    alloc_pages_node() is left for the next patch which can in turn expose
    more existing buggy callers.

    Signed-off-by: Vlastimil Babka
    Acked-by: Johannes Weiner
    Acked-by: Robin Holt
    Acked-by: Michal Hocko
    Acked-by: Christoph Lameter
    Acked-by: Michael Ellerman
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Greg Thelen
    Cc: Aneesh Kumar K.V
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc: Naoya Horiguchi
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Gleb Natapov
    Cc: Paolo Bonzini
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Cliff Whickman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • This is merely a politeness: I've not found that shrink_page_list()
    leads to deadlock with the page it holds locked across
    wait_on_page_writeback(); but nevertheless, why hold others off by
    keeping the page locked there?

    And while we're at it: remove the mistaken "not " from the commentary on
    this Case 3 (and a distracting blank line from Case 2, if I may).

    Signed-off-by: Hugh Dickins
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • If the list_head is empty then we'll have called list_lru_from_kmem for
    nothing. Move that call inside of the list_empty if block.
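
    A hedged sketch of the reordering, shown for list_lru_del() (fragment
    only; field and helper names as in the memcg-aware list_lru code):

        spin_lock(&nlru->lock);
        if (!list_empty(item)) {
                /* Only look up the per-memcg list when there is work to do. */
                l = list_lru_from_kmem(nlru, item);
                list_del_init(item);
                l->nr_items--;
                spin_unlock(&nlru->lock);
                return true;
        }
        spin_unlock(&nlru->lock);
        return false;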

    Signed-off-by: Jeff Layton
    Reviewed-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Layton
     
    In the log_early() function, crt_early_log should also be incremented
    when 'crt_early_log >= ARRAY_SIZE(early_log)'. Otherwise the count
    reported by kmemleak_init() is one less than the actual number.

    Then, in kmemleak_init(), if the early_log buffer size equals the actual
    number, kmemleak will initialize successfully, so change the warning
    condition to 'crt_early_log > ARRAY_SIZE(early_log)'.
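
    A hedged sketch of the two changes (simplified; the warning text is
    illustrative):

        /* In log_early(): keep counting even when the buffer is full. */
        if (crt_early_log >= ARRAY_SIZE(early_log)) {
                crt_early_log++;
                kmemleak_disable();
                return;
        }

        /* In kmemleak_init(): only warn when the buffer was actually exceeded. */
        if (crt_early_log > ARRAY_SIZE(early_log))
                pr_warning("Early log buffer exceeded (%d), please increase "
                           "DEBUG_KMEMLEAK_EARLY_LOG_SIZE\n", crt_early_log);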

    Signed-off-by: Wang Kai
    Acked-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Kai
     
    __split_vma() needs neither the out_err label nor the initialization of err.

    copy_vma() can return NULL directly when kmem_cache_alloc() fails.

    Signed-off-by: Chen Gang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
    Shmem uses shmem_recalc_inode() to update i_blocks when it allocates a
    page, undoes a range, or swaps. But the mm can drop a clean page without
    notifying shmem, which makes fstat sometimes return an out-of-date block
    count.

    The problem can be partially solved by adding an
    inode_operations->getattr which calls shmem_recalc_inode() to update
    i_blocks for fstat.

    shmem_recalc_inode() also updates the counters used by statfs and
    vm_committed_as. For those, the situation is unchanged: they still
    suffer from the discrepancy after a clean page is dropped and before the
    function is called by the aforementioned triggers.
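
    A hedged sketch of such a ->getattr (assuming the pre-4.11 getattr
    signature; simplified):

        static int shmem_getattr(struct vfsmount *mnt, struct dentry *dentry,
                                 struct kstat *stat)
        {
                struct inode *inode = dentry->d_inode;
                struct shmem_inode_info *info = SHMEM_I(inode);

                /* Refresh i_blocks before it is copied into the kstat. */
                spin_lock(&info->lock);
                shmem_recalc_inode(inode);
                spin_unlock(&info->lock);

                generic_fillattr(inode, stat);
                return 0;
        }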

    Signed-off-by: Yu Zhao
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yu Zhao
     
    Since commit e3239ff92a17 ("memblock: Rename memblock_region to
    memblock_type and memblock_property to memblock_region"), most local
    variables of type memblock_type have been named 'type'. This commit
    renames the remaining local variables of that type accordingly, for
    consistency.

    Signed-off-by: Alexander Kuleshov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Kuleshov
     
    memory_failure() can be called on any page at any time, which means that
    we can't eliminate the possibility of containment failure. In such cases
    the best option is to leak the page intentionally (and never touch it
    later).

    We have an unpoison function for testing, and it cannot handle such
    containment-failed pages, which results in a kernel panic (visible with
    various calltraces). So this patch limits the unpoisonable pages to
    properly contained pages and ignores any others.

    Testers should keep in mind that there are pages which cannot be
    unpoisoned when writing test programs.

    Signed-off-by: Naoya Horiguchi
    Tested-by: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Wanpeng Li reported a race between soft_offline_page() and
    unpoison_memory(), which causes the following kernel panic:

    BUG: Bad page state in process bash pfn:97000
    page:ffffea00025c0000 count:0 mapcount:1 mapping: (null) index:0x7f4fdbe00
    flags: 0x1fffff80080048(uptodate|active|swapbacked)
    page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
    bad because of flags:
    flags: 0x40(active)
    Modules linked in: snd_hda_codec_hdmi i915 rpcsec_gss_krb5 nfsv4 dns_resolver bnep rfcomm nfsd bluetooth auth_rpcgss nfs_acl nfs rfkill lockd grace sunrpc i2c_algo_bit drm_kms_helper snd_hda_codec_realtek snd_hda_codec_generic drm snd_hda_intel fscache snd_hda_codec x86_pkg_temp_thermal coretemp kvm_intel snd_hda_core snd_hwdep kvm snd_pcm snd_seq_dummy snd_seq_oss crct10dif_pclmul snd_seq_midi crc32_pclmul snd_seq_midi_event ghash_clmulni_intel snd_rawmidi aesni_intel lrw gf128mul snd_seq glue_helper ablk_helper snd_seq_device cryptd fuse snd_timer dcdbas serio_raw mei_me parport_pc snd mei ppdev i2c_core video lp soundcore parport lpc_ich shpchp mfd_core ext4 mbcache jbd2 sd_mod e1000e ahci ptp libahci crc32c_intel libata pps_core
    CPU: 3 PID: 2211 Comm: bash Not tainted 4.2.0-rc5-mm1+ #45
    Hardware name: Dell Inc. OptiPlex 7020/0F5C5X, BIOS A03 01/08/2015
    Call Trace:
    dump_stack+0x48/0x5c
    bad_page+0xe6/0x140
    free_pages_prepare+0x2f9/0x320
    ? uncharge_list+0xdd/0x100
    free_hot_cold_page+0x40/0x170
    __put_single_page+0x20/0x30
    put_page+0x25/0x40
    unmap_and_move+0x1a6/0x1f0
    migrate_pages+0x100/0x1d0
    ? kill_procs+0x100/0x100
    ? unlock_page+0x6f/0x90
    __soft_offline_page+0x127/0x2a0
    soft_offline_page+0xa6/0x200

    This race is explained like below:

    CPU0                                    CPU1

    soft_offline_page
      __soft_offline_page
        TestSetPageHWPoison
                                            unpoison_memory
                                              PageHWPoison check (true)
                                              TestClearPageHWPoison
                                              put_page -> release refcount held by
                                                          get_hwpoison_page in unpoison_memory
        put_page -> release refcount held by
                    isolate_lru_page in __soft_offline_page
        migrate_pages

    The second put_page() releases the refcount held by isolate_lru_page(),
    which leads to unmap_and_move() dropping the last refcount of the page
    while its mapcount is still 1, since try_to_unmap() is not called when
    only one user maps the page. In any case, the page refcount and mapcount
    would still end up inconsistent if the page were mapped by multiple
    users.

    This race was introduced by commit 4491f71260 ("mm/memory-failure: set
    PageHWPoison before migrate_pages()"), which focuses on preventing the
    reuse of a successfully migrated page. Before this commit we prevented
    the reuse by changing the migratetype to MIGRATE_ISOLATE during soft
    offlining, which has the following problems, so simply reverting the
    commit is not the best option:

    1) it doesn't eliminate the reuse completely, because
    set_migratetype_isolate() can fail to set MIGRATE_ISOLATE to the
    target page if the pageblock of the page contains one or more
    unmovable pages (i.e. has_unmovable_pages() returns true).

    2) the original code changes migratetype to MIGRATE_ISOLATE
    forcibly, and sets it to MIGRATE_MOVABLE forcibly after soft offline,
    regardless of the original migratetype state, which could impact
    other subsystems like memory hotplug or compaction.

    This patch moves SetPageHWPoison just after put_page() in
    unmap_and_move(), which closes the reported race window and minimizes
    another race window between SetPageHWPoison and reallocation (which
    causes the reuse of a soft-offlined page). The latter race window still
    exists, but that is acceptable: it is rare and, even if it happens, it is
    effectively the same as the ordinary "containment failure" case, so
    keeping the window open is tolerable.
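
    A hedged sketch of the reordering (not the verbatim diff;
    num_poisoned_pages_inc() is the wrapper from the related patch in this
    series):

        /* In unmap_and_move(), for the memory-failure case only: */
        if (reason == MR_MEMORY_FAILURE) {
                /* Drop migration's reference first, then poison the page. */
                put_page(page);
                if (!TestSetPageHWPoison(page))
                        num_poisoned_pages_inc();
        }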

    Fixes: 4491f71260 ("mm/memory-failure: set PageHWPoison before migrate_pages()")
    Signed-off-by: Wanpeng Li
    Signed-off-by: Naoya Horiguchi
    Reported-by: Wanpeng Li
    Tested-by: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • num_poisoned_pages counter will be changed outside mm/memory-failure.c
    by a subsequent patch, so this patch prepares wrappers to manipulate it.
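
    A hedged sketch of the wrappers (assuming num_poisoned_pages remains an
    atomic_long_t):

        static inline void num_poisoned_pages_inc(void)
        {
                atomic_long_inc(&num_poisoned_pages);
        }

        static inline void num_poisoned_pages_dec(void)
        {
                atomic_long_dec(&num_poisoned_pages);
        }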

    Signed-off-by: Naoya Horiguchi
    Tested-by: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Replace most instances of put_page() in memory error handling with
    put_hwpoison_page().

    Signed-off-by: Wanpeng Li
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
    Hwpoison injection takes a refcount of the target page and another
    refcount of the head page of a THP if the target page is a tail page of
    that THP. However, the current code doesn't release the refcount of the
    head page if injection into the THP is rejected by the hwpoison filter.

    Fix it by dropping the refcount of the head page if the target page is
    the tail page of a THP and injection into it is not allowed.

    Signed-off-by: Wanpeng Li
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • Introduce put_hwpoison_page to put refcount for memory error handling.
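
    A hedged sketch of the helper (simplified):

        void put_hwpoison_page(struct page *page)
        {
                struct page *head = compound_head(page);

                /*
                 * Compound pages (hugetlb and THP) are pinned on the head page,
                 * so the reference taken by get_hwpoison_page() must be dropped
                 * there, not on the tail page that was passed in.
                 */
                if (PageHuge(head) || PageTransHuge(head)) {
                        put_page(head);
                        return;
                }

                put_page(page);
        }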

    Signed-off-by: Wanpeng Li
    Suggested-by: Naoya Horiguchi
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • There is a race between madvise_hwpoison path and memory_failure:

    CPU0                                    CPU1

    madvise_hwpoison
      get_user_pages_fast
      PageHWPoison check (false)
                                            memory_failure
                                              TestSetPageHWPoison
      soft_offline_page
        PageHWPoison check (true)
        return -EBUSY (without put_page)
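
    A hedged sketch of the obvious remedy in the soft_offline_page() path
    (hypothetical placement; the actual fix may instead use the
    put_hwpoison_page() helper from this series):

        if (PageHWPoison(page)) {
                pr_info("soft offline: %#lx page already poisoned\n", pfn);
                if (flags & MF_COUNT_INCREASED)
                        put_page(page); /* drop the reference taken by the caller */
                return -EBUSY;
        }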

    Signed-off-by: Wanpeng Li
    Suggested-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
    THP pages get a refcount in madvise_hwpoison() via the MF_COUNT_INCREASED
    flag; however, that refcount is still held when splitting the THP fails.

    Fix it by dropping the refcount of the THP page when the split fails.

    Signed-off-by: Wanpeng Li
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • The early_ioremap library now has a generic copy_from_early_mem()
    function. Use the generic copy function for x86 relocate_initrd().

    [akpm@linux-foundation.org: remove MAX_MAP_CHUNK define, per Yinghai Lu]
    Signed-off-by: Mark Salter
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Arnd Bergmann
    Cc: Ard Biesheuvel
    Cc: Mark Rutland
    Cc: Russell King
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Salter
     
  • The use of mem= could leave part or all of the initrd outside of the
    kernel linear map. This will lead to an error when unpacking the initrd
    and a probable failure to boot. This patch catches that situation and
    relocates the initrd to be fully within the linear map.

    Signed-off-by: Mark Salter
    Acked-by: Will Deacon
    Cc: Catalin Marinas
    Cc: Arnd Bergmann
    Cc: Ard Biesheuvel
    Cc: Mark Rutland
    Cc: Russell King
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Salter
     
    When booting an arm64 kernel with an initrd using UEFI/grub, use of mem=
    will likely cut off part or all of the initrd. This leaves it outside
    the kernel linear map, which leads to failure when unpacking. The x86
    code has a similar need to relocate an initrd outside of mapped memory
    in some cases.

    The current x86 code uses early_memremap() to copy the original initrd
    from unmapped to mapped RAM. This patchset creates a generic
    copy_from_early_mem() utility based on that x86 code and has arm64 and
    x86 share it in their respective initrd relocation code.

    This patch (of 3):

    In some early boot circumstances, it may be necessary to copy from RAM
    outside the kernel linear mapping to mapped RAM. The need to relocate
    an initrd is one example in the x86 code. This patch creates a helper
    function based on current x86 code.
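
    A hedged sketch of the helper, closely following the x86 code it is
    based on (MAX_MAP_CHUNK is assumed to be bounded by the fixmap BTMAP
    area):

        #define MAX_MAP_CHUNK   (NR_FIX_BTMAPS << PAGE_SHIFT)

        void __init copy_from_early_mem(void *dest, phys_addr_t src,
                                        unsigned long size)
        {
                unsigned long slop, clen;
                char *p;

                while (size) {
                        /* Map at most one chunk at a time, page-aligned. */
                        slop = offset_in_page(src);
                        clen = size;
                        if (clen > MAX_MAP_CHUNK - slop)
                                clen = MAX_MAP_CHUNK - slop;
                        p = early_memremap(src & PAGE_MASK, clen + slop);
                        memcpy(dest, p + slop, clen);
                        early_memunmap(p, clen + slop);
                        dest += clen;
                        src += clen;
                        size -= clen;
                }
        }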

    Signed-off-by: Mark Salter
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Arnd Bergmann
    Cc: Ard Biesheuvel
    Cc: Mark Rutland
    Cc: Russell King
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Salter
     
    The URL for libhugetlbfs has changed. Also, put a stronger emphasis on
    using libhugetlbfs for hugetlb regression testing.

    Signed-off-by: Mike Kravetz
    Acked-by: Naoya Horiguchi
    Cc: Joern Engel
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
    The hugetlb selftests provide minimal coverage. Have the run script point
    people at libhugetlbfs for better regression testing.

    Signed-off-by: Mike Kravetz
    Acked-by: Naoya Horiguchi
    Cc: Joern Engel
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • This manually reverts 7e50533d4b842 ("selftests: add hugetlbfstest").

    The hugetlbfstest test depends on hugetlb pages being counted in a
    task's rss. This functionality is not in the kernel, so the test will
    always fail. Remove test to avoid confusion.

    Signed-off-by: Mike Kravetz
    Acked-by: Naoya Horiguchi
    Cc: Joern Engel
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • The compaction free scanner is looking for PageBuddy() pages and
    skipping all others. For large compound pages such as THP or hugetlbfs,
    we can save a lot of iterations if we skip them at once using their
    compound_order(). This is generally unsafe and we can read a bogus
    value of order due to a race, but if we are careful, the only danger is
    skipping too much.
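
    A hedged sketch of the skip in the free scanner (blockpfn and cursor
    stand for the scanner's local position variables; fragment only):

        if (PageCompound(page)) {
                unsigned int comp_order = compound_order(page);

                /* The racy read is tolerated; insane values are filtered out. */
                if (likely(comp_order < MAX_ORDER)) {
                        blockpfn += (1UL << comp_order) - 1;
                        cursor += (1UL << comp_order) - 1;
                }

                goto isolate_fail;
        }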

    When tested with stress-highalloc from mmtests on 4GB system with 1GB
    hugetlbfs pages, the vmstat compact_free_scanned count decreased by at
    least 15%.

    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Acked-by: Joonsoo Kim
    Acked-by: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The compaction migrate scanner tries to skip THP pages by their order,
    to reduce number of iterations for pages it cannot isolate. The check
    is only done if PageLRU() is true, which means it applies to THP pages,
    but not e.g. hugetlbfs pages or any other non-LRU compound pages, which
    we have to iterate by base pages.

    This limitation comes from the assumption that it's only safe to read
    compound_order() when we have the zone's lru_lock and THP cannot be
    split under us. But the only danger (after filtering out order values
    that are not below MAX_ORDER, to prevent overflows) is that we skip too
    much or too little after reading a bogus compound_order() due to a rare
    race. This is the same reasoning as patch 99c0fd5e51c4 ("mm,
    compaction: skip buddy pages by their order in the migrate scanner")
    introduced for unsafely reading PageBuddy() order.

    After this patch, all pages are tested for PageCompound() and we skip
    them by compound_order(). The test is done after the test for
    balloon_page_movable(), as we don't want to assume whether balloon pages
    (or other pages with their own isolation and migration implementation,
    should a generic API get implemented) are compound or not.
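
    A hedged sketch of the migrate-scanner counterpart (low_pfn stands for
    the scanner's position; fragment only):

        if (PageCompound(page)) {
                unsigned int comp_order = compound_order(page);

                /* Filter out bogus values read without any protection. */
                if (likely(comp_order < MAX_ORDER))
                        low_pfn += (1UL << comp_order) - 1;

                goto isolate_fail;
        }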

    When tested with stress-highalloc from mmtests on 4GB system with 1GB
    hugetlbfs pages, the vmstat compact_migrate_scanned count decreased by
    15%.

    [kirill.shutemov@linux.intel.com: change PageTransHuge checks to PageCompound for different series was squashed here]
    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Acked-by: Mel Gorman
    Acked-by: Joonsoo Kim
    Acked-by: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
    Resetting the cached compaction scanner positions is now open-coded in
    __reset_isolation_suitable() and compact_finished(). Encapsulate this
    functionality in a new function, reset_cached_positions().
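
    A hedged sketch of the new helper:

        static void reset_cached_positions(struct zone *zone)
        {
                /* Point both scanners back at the zone edges. */
                zone->compact_cached_migrate_pfn[0] = zone->zone_start_pfn;
                zone->compact_cached_migrate_pfn[1] = zone->zone_start_pfn;
                zone->compact_cached_free_pfn = zone_end_pfn(zone);
        }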

    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Acked-by: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Handling the position where compaction free scanner should restart
    (stored in cc->free_pfn) got more complex with commit e14c720efdd7 ("mm,
    compaction: remember position within pageblock in free pages scanner").
    Currently the position is updated in each loop iteration of
    isolate_freepages(), although it should be enough to update it only when
    breaking from the loop. There's also an extra check outside the loop that
    updates the position in case we have met the migration scanner.

    This can be simplified if we move the test for having isolated enough
    from the for-loop header next to the test for contention, and
    determining the restart position only in these cases. We can reuse the
    isolate_start_pfn variable for this instead of setting cc->free_pfn
    directly. Outside the loop, we can simply set cc->free_pfn to current
    value of isolate_start_pfn without any extra check.

    Also add a VM_BUG_ON to catch possible mistakes in the future, in case we
    later add a new condition that terminates isolate_freepages_block()
    prematurely without also considering the condition in
    isolate_freepages().

    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Acked-by: Mel Gorman
    Acked-by: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
    Assorted compaction cleanups and optimizations. The interesting patches
    are 4 and 5. In patch 4, skipping of compound pages in a single iteration
    is improved for the migration scanner, so it also works for !PageLRU
    compound pages such as hugetlbfs, slab etc. Patch 5 introduces this kind
    of skipping in the free scanner. The trick is that we can read
    compound_order() without any protection, if we are careful to filter out
    values larger than MAX_ORDER. The only danger is that we skip too much.
    The same trick was already used for reading the freepage order in the
    migrate scanner.

    To demonstrate improvements of Patches 4 and 5 I've run stress-highalloc
    from mmtests, set to simulate THP allocations (including __GFP_COMP) on
    a 4GB system where 1GB was occupied by hugetlbfs pages. I'll include
    just the relevant stats:

                                       Patch 3    Patch 4    Patch 5

    Compaction stalls                     7523       7529       7515
    Compaction success                     323        304        322
    Compaction failures                   7200       7224       7192
    Page migrate success                247778     264395     240737
    Page migrate failure                 15358      33184      21621
    Compaction pages isolated           906928     980192     909983
    Compaction migrate scanned         2005277    1692805    1498800
    Compaction free scanned           13255284   11539986    9011276
    Compaction cost                        288        305        277

    With 5 iterations per patch, the results are still noisy, but we can see
    that Patch 4 does reduce migrate_scanned by 15% thanks to skipping the
    hugetlbfs pages at once. Interestingly, free_scanned is also reduced
    and I have no idea why. Patch 5 further reduces free_scanned as
    expected, by 15%. Other stats are unaffected modulo noise.

    [1] https://lkml.org/lkml/2015/1/19/158

    This patch (of 5):

    Compaction should finish when the migration and free scanners meet, i.e.
    they reach the same pageblock. Currently however, the test in
    compact_finished() simply compares the exact pfns, which may yield a
    false negative when the free scanner position is in the middle of a
    pageblock and the migration scanner reaches the beginning of the same
    pageblock.

    This hasn't been a problem until commit e14c720efdd7 ("mm, compaction:
    remember position within pageblock in free pages scanner") allowed the
    free scanner position to be in the middle of a pageblock between
    invocations. The hot-fix 1d5bfe1ffb5b ("mm, compaction: prevent
    infinite loop in compact_zone") prevented the issue by adding a special
    check in the migration scanner to satisfy the current detection of
    scanners meeting.

    However, the proper fix is to make the detection more robust. This
    patch introduces the compact_scanners_met() function that returns true
    when the free scanner position is in the same or lower pageblock than
    the migration scanner. The special case in isolate_migratepages()
    introduced by 1d5bfe1ffb5b is removed.
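
    A hedged sketch of the new test:

        /*
         * Scanners have met when the free scanner is in the same or a lower
         * pageblock than the migration scanner.
         */
        static inline bool compact_scanners_met(struct compact_control *cc)
        {
                return (cc->free_pfn >> pageblock_order)
                        <= (cc->migrate_pfn >> pageblock_order);
        }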

    Suggested-by: Joonsoo Kim
    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Acked-by: Mel Gorman
    Acked-by: Joonsoo Kim
    Acked-by: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Acked-by: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
    Add a [pci|dma]_pool_zalloc coccinelle check. It replaces instances of
    [pci|dma]_pool_alloc() followed by memset(0) with
    [pci|dma]_pool_zalloc().
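
    The transformation the semantic patch performs, sketched in C (pool,
    flags, dma and size are placeholder names):

        /* before */
        p = dma_pool_alloc(pool, flags, &dma);
        if (p)
                memset(p, 0, size);

        /* after */
        p = dma_pool_zalloc(pool, flags, &dma);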

    Signed-off-by: Sean O. Stalley
    Acked-by: Julia Lawall
    Cc: Vinod Koul
    Cc: Bjorn Helgaas
    Cc: Gilles Muller
    Cc: Nicolas Palix
    Cc: Michal Marek
    Cc: Sebastian Andrzej Siewior
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sean O. Stalley
     
  • Add a wrapper function for pci_pool_alloc() to get zeroed memory.
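
    A hedged sketch of the wrapper (pci_pool is a thin alias layer over
    dma_pool):

        #define pci_pool_zalloc(pool, flags, handle) \
                dma_pool_zalloc(pool, flags, handle)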

    Signed-off-by: Sean O. Stalley
    Cc: Vinod Koul
    Cc: Bjorn Helgaas
    Cc: Gilles Muller
    Cc: Nicolas Palix
    Cc: Michal Marek
    Cc: Sebastian Andrzej Siewior
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sean O. Stalley
     
  • Add a wrapper function for dma_pool_alloc() to get zeroed memory.
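
    A hedged sketch of the wrapper:

        static inline void *dma_pool_zalloc(struct dma_pool *pool, gfp_t mem_flags,
                                            dma_addr_t *handle)
        {
                return dma_pool_alloc(pool, mem_flags | __GFP_ZERO, handle);
        }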

    Signed-off-by: Sean O. Stalley
    Cc: Vinod Koul
    Cc: Bjorn Helgaas
    Cc: Gilles Muller
    Cc: Nicolas Palix
    Cc: Michal Marek
    Cc: Sebastian Andrzej Siewior
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sean O. Stalley
     
    Currently a call to dma_pool_alloc() with the __GFP_ZERO flag returns a
    non-zeroed memory region.

    This patchset adds support for the __GFP_ZERO flag to dma_pool_alloc(),
    adds 2 wrapper functions for allocating zeroed memory from a pool, and
    provides a coccinelle script for finding & replacing instances of
    dma_pool_alloc() followed by memset(0) with a single dma_pool_zalloc()
    call.

    There was some concern that this always calls memset() to zero, instead
    of passing __GFP_ZERO into the page allocator.
    [https://lkml.org/lkml/2015/7/15/881]

    I ran a test on my system to get an idea of how often dma_pool_alloc()
    calls into pool_alloc_page().

    After Boot: [ 30.119863] alloc_calls:541, page_allocs:7
    After an hour: [ 3600.951031] alloc_calls:9566, page_allocs:12
    After copying 1GB file onto a USB drive:
    [ 4260.657148] alloc_calls:17225, page_allocs:12

    It doesn't look like dma_pool_alloc() calls down to the page allocator
    very often (at least on my system).

    This patch (of 4):

    Currently the __GFP_ZERO flag is ignored by dma_pool_alloc().
    Make dma_pool_alloc() zero the memory if this flag is set.
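
    A hedged sketch of the change inside dma_pool_alloc() (simplified;
    retval is the block about to be returned):

        if (mem_flags & __GFP_ZERO)
                memset(retval, 0, pool->size);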

    Signed-off-by: Sean O. Stalley
    Acked-by: David Rientjes
    Cc: Vinod Koul
    Cc: Bjorn Helgaas
    Cc: Gilles Muller
    Cc: Nicolas Palix
    Cc: Michal Marek
    Cc: Sebastian Andrzej Siewior
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sean O. Stalley
     
    reclaim_clean_pages_from_list() assumes that shrink_page_list() returns
    the number of pages removed from the candidate list. But
    shrink_page_list() puts back mlocked pages without passing them to the
    caller and without counting them as nr_reclaimed. This leaves
    nr_isolated elevated.

    To fix this, this patch changes shrink_page_list() to pass unevictable
    pages back to the caller, which will take care of those pages.

    Minchan said:

    It fixes two issues.

    1. With unevictable page, cma_alloc will be successful.

    Exactly speaking, cma_alloc of current kernel will fail due to
    unevictable pages.

    2. fix leaking of NR_ISOLATED counter of vmstat

    With it, too_many_isolated works. Otherwise, it could hang until the
    process gets SIGKILL.
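
    A hedged sketch of the change at the cull_mlocked path of
    shrink_page_list() (ret_pages is the list handed back to the caller;
    not the verbatim diff):

        cull_mlocked:
                if (PageSwapCache(page))
                        try_to_free_swap(page);
                unlock_page(page);
                /* Hand the page back instead of putting it back on the LRU here. */
                list_add(&page->lru, &ret_pages);
                continue;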

    Signed-off-by: Jaewon Kim
    Acked-by: Minchan Kim
    Cc: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jaewon Kim
     
  • If transparent huge pages are enabled, we can isolate many more pages
    than we actually need to scan, because we count both single and huge
    pages equally in isolate_lru_pages().

    Since commit 5bc7b8aca942d ("mm: thp: add split tail pages to shrink
    page list in page reclaim"), we scan all the tail pages immediately
    after a huge page split (see shrink_page_list()). As a result, we can
    reclaim up to SWAP_CLUSTER_MAX * HPAGE_PMD_NR (512 MB) in one run!

    This is easy to catch on memcg reclaim with zswap enabled. The latter
    makes swapout instant so that if we happen to scan an unreferenced huge
    page we will evict both its head and tail pages immediately, which is
    likely to result in excessive reclaim.
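
    One plausible way to cap the batch, per the description above (hedged
    sketch, not necessarily the upstream diff):

        /* Inside the isolation loop: a THP counts as the number of base
         * pages it covers, so stop once enough base pages were taken. */
        nr_taken += hpage_nr_pages(page);
        if (nr_taken >= nr_to_scan)
                break;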

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Reviewed-by: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • __nocast does no good for vm_flags_t. It only produces useless sparse
    warnings.

    Let's drop it.
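
    The change, sketched (assuming the typedef lives in mm_types.h):

        /* before */
        typedef unsigned long __nocast vm_flags_t;

        /* after */
        typedef unsigned long vm_flags_t;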

    Signed-off-by: Kirill A. Shutemov
    Cc: Oleg Nesterov
    Acked-by: David Rientjes
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    Bootmem isn't popular any more, but some architectures still use it, and
    freeing to bootmem after calling free_all_bootmem_core() can end up
    scribbling over random memory. Instead, make sure the kernel generates a
    warning in this case by ensuring the node_bootmem_map field is non-NULL
    when we are freeing or marking bootmem.

    An instance of this bug was just fixed in the tile architecture ("tile:
    use free_bootmem_late() for initrd") and catching this case more widely
    seems like a good thing.

    Signed-off-by: Chris Metcalf
    Acked-by: Mel Gorman
    Cc: Yasuaki Ishimatsu
    Cc: Pekka Enberg
    Cc: Paul McQuade
    Cc: Tang Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Metcalf
     
    Nowadays, set/unset_migratetype_isolate() are defined and used only in
    mm/page_isolation.c, so let's limit their scope to that file.

    Signed-off-by: Naoya Horiguchi
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
    This check was introduced as part of commit 6f4576e3687 ("mempolicy:
    apply page table walker on queue_pages_range()") and was then duplicated
    by commit 48684a65b4e ("mm: pagewalk: fix misbehavior of walk_page_range
    for vma(VM_PFNMAP)"), which reintroduced it earlier, in
    queue_pages_test_walk().

    Signed-off-by: Aristeu Rozanski
    Acked-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Acked-by: Cyrill Gorcunov
    Cc: Dave Hansen
    Cc: Kirill A. Shutemov
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aristeu Rozanski