10 Oct, 2014

40 commits

  • struct compact_control currently converts the gfp mask to a migratetype,
    but we need the entire gfp mask in a follow-up patch.

    Pass the entire gfp mask as part of struct compact_control.

    Signed-off-by: David Rientjes
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Zhang Yanfei
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The page allocator has gfp flags (like __GFP_WAIT) and alloc flags (like
    ALLOC_CPUSET) that have separate semantics.

    The function allocflags_to_migratetype() actually takes gfp flags, not
    alloc flags, and returns a migratetype. Rename it to
    gfpflags_to_migratetype().

    Signed-off-by: David Rientjes
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Zhang Yanfei
    Reviewed-by: Naoya Horiguchi
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The migration scanner skips PageBuddy pages, but does not consider their
    order as checking page_order() is generally unsafe without holding the
    zone->lock, and acquiring the lock just for the check wouldn't be a good
    tradeoff.

    Still, this could avoid some iterations over the rest of the buddy page,
    and if we are careful, the race window between the PageBuddy() check and
    page_order() is small, and the worst thing that can happen is that we skip
    too much and miss some isolation candidates. This is not that bad, as
    compaction can already fail for many other reasons like parallel
    allocations, and those have a much larger race window.

    This patch therefore makes the migration scanner obtain the buddy page
    order and use it to skip the whole buddy page, if the order appears to be
    in the valid range.

    It's important that page_order() is read only once, so that the value
    used in the checks and in the pfn calculation is the same. But in theory
    the compiler can replace the local variable with multiple inlined calls to
    page_order(). Therefore, the patch introduces page_order_unsafe(), which
    uses ACCESS_ONCE to prevent this (a minimal sketch follows after the
    sign-offs).

    Testing with stress-highalloc from mmtests shows a 15% reduction in the
    number of pages scanned by the migration scanner. The reduction is >60%
    with __GFP_NO_KSWAPD allocations, along with success rates better by a few
    percent.

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Zhang Yanfei
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
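
    A minimal sketch of the ACCESS_ONCE idea follows (the struct is a
    stand-in for illustration; in the kernel the order of a buddy page is
    kept in page->private and MAX_ORDER bounds the valid range):

    #define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

    struct page { unsigned long private; };          /* stand-in type */

    /* Read the racy order exactly once; the volatile access stops the
     * compiler from re-reading page->private between the range check
     * and the pfn arithmetic. */
    static unsigned long page_order_unsafe(struct page *page)
    {
            return ACCESS_ONCE(page->private);
    }

    /* Scanner usage, roughly: skip the rest of the buddy page when the
     * value looks sane, otherwise fall back to per-page scanning.
     *
     *      unsigned long freepage_order = page_order_unsafe(page);
     *
     *      if (freepage_order < MAX_ORDER)
     *              low_pfn += (1UL << freepage_order) - 1;
     */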
     
  • Unlike the migration scanner, the free scanner remembers the beginning of
    the last scanned pageblock in cc->free_pfn. It might therefore be
    uselessly rescanning pages when called several times during a single
    compaction. This might have been useful when pages were returned to the
    buddy allocator after a failed migration, but this is no longer the case.

    This patch changes the meaning of cc->free_pfn so that if it points to the
    middle of a pageblock, that pageblock is scanned only from cc->free_pfn to
    the end. isolate_freepages_block() will record the pfn of the last page
    it looked at, which is then used to update cc->free_pfn.

    In the mmtests stress-highalloc benchmark, this has resulted in lowering
    the ratio between pages scanned by both scanners, from 2.5 free pages per
    migrate page, to 2.25 free pages per migrate page, without affecting
    success rates.

    With __GFP_NO_KSWAPD allocations, this appears to result in a worse ratio
    (2.1 instead of 1.8), but page migration successes increased by 10%, so
    this could mean that more useful work can be done until need_resched()
    aborts this kind of compaction.

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Zhang Yanfei
    Reviewed-by: Naoya Horiguchi
    Acked-by: David Rientjes
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Compaction scanners try to lock zone locks as late as possible by checking
    many page or pageblock properties opportunistically without the lock and
    skipping them if not suitable. For pages that pass the initial checks,
    some properties have to be checked again safely under the lock. However,
    if the lock was already held from a previous iteration in the initial
    checks, the rechecks are unnecessary.

    This patch therefore skips the rechecks when the lock was already held.
    This is now possible, since we no longer (potentially) drop and reacquire
    the lock between the initial checks and the safe rechecks. A sketch of
    the resulting shape follows after the sign-offs.

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Zhang Yanfei
    Reviewed-by: Naoya Horiguchi
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
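
    A hedged sketch of the resulting shape in the free scanner (helper names
    follow the related patches in this series; details simplified):

    /* 'locked' persists across loop iterations.  The safe recheck runs
     * only on the iteration that actually takes the lock; afterwards the
     * lock is not dropped mid-pageblock, so the earlier opportunistic
     * checks remain valid and are not repeated. */
    if (!locked) {
            locked = compact_trylock_irqsave(&cc->zone->lock, &flags, cc);
            if (!locked)
                    break;

            /* Recheck this is still a buddy page, under the lock */
            if (!PageBuddy(page))
                    goto isolate_fail;
    }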
     
  • Compaction scanners regularly check for lock contention and need_resched()
    through the compact_checklock_irqsave() function. However, if there is no
    contention, the lock can be held and IRQs disabled for a potentially long
    time.

    This has been addressed by commit b2eef8c0d091 ("mm: compaction: minimise
    the time IRQs are disabled while isolating pages for migration") for the
    migration scanner. However, the refactoring done by commit 2a1402aa044b
    ("mm: compaction: acquire the zone->lru_lock as late as possible") has
    changed the conditions so that the lock is dropped only when there's
    contention on the lock or need_resched() is true. Also, need_resched() is
    checked only when the lock is already held. The comment "give a chance to
    irqs before checking need_resched" is therefore misleading, as IRQs remain
    disabled when the check is done.

    This patch restores the behavior intended by commit b2eef8c0d091 and also
    tries to better balance and make more deterministic the time spent by
    checking for contention vs the time the scanners might run between the
    checks. It also avoids situations where the checks have not been done
    often enough before. The result should be avoiding both too frequent and
    too infrequent contention checking, and especially the potentially
    long-running scans with IRQs disabled and no checking for need_resched()
    or a pending fatal signal, which can happen when many consecutive pages or
    pageblocks fail the preliminary tests and never reach the later call site
    of compact_checklock_irqsave(), as explained below.

    Before the patch:

    In the migration scanner, compact_checklock_irqsave() was called on each
    loop iteration, if reached. If not reached, some lower-frequency checking
    could still be done if the lock was already held, but this would not
    result in aborting contended async compaction until reaching
    compact_checklock_irqsave() or the end of the pageblock. In the free
    scanner, it was similar but completely without the periodic checking, so
    the lock could potentially be held until the end of the pageblock.

    After the patch, in both scanners:

    The periodical check is done as the first thing in the loop on each
    SWAP_CLUSTER_MAX aligned pfn, using the new compact_unlock_should_abort()
    function, which always unlocks the lock (if locked) and aborts async
    compaction if scheduling is needed. It also aborts any type of compaction
    when a fatal signal is pending.

    The compact_checklock_irqsave() function is replaced with a slightly
    different compact_trylock_irqsave(). The biggest difference is that the
    function is not called at all if the lock is already held. The periodical
    need_resched() checking is left solely to compact_unlock_should_abort().
    The lock contention avoidance for async compaction is achieved by the
    periodical unlock by compact_unlock_should_abort() and by using trylock in
    compact_trylock_irqsave() and aborting when trylock fails. Sync
    compaction does not use trylock. A sketch of the resulting check pattern
    follows after the sign-offs.

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Zhang Yanfei
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
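
    A hedged sketch of the periodic check at the top of the scanner loops
    (argument order and details are illustrative):

    /* On every SWAP_CLUSTER_MAX aligned pfn: drop the lock if it is held,
     * reschedule if needed, and abort async compaction on need_resched()
     * or any compaction on a pending fatal signal. */
    if (!(low_pfn % SWAP_CLUSTER_MAX)
        && compact_unlock_should_abort(&zone->lru_lock, flags,
                                       &locked, cc))
            break;

    /* Taking the lock later uses trylock for async compaction, aborting
     * if it fails; sync compaction simply spins on the lock: */
    if (!locked)
            locked = compact_trylock_irqsave(&zone->lru_lock, &flags, cc);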
     
  • Async compaction aborts when it detects zone lock contention or
    need_resched() is true. David Rientjes has reported that in practice,
    most direct async compactions for THP allocation abort due to
    need_resched(). This means that a second direct compaction is never
    attempted, which might be OK for a page fault, but khugepaged is intended
    to attempt a sync compaction in such cases, and here it won't.

    This patch replaces "bool contended" in compact_control with an int that
    distinguishes between aborting due to need_resched() and aborting due to
    lock contention. This allows propagating the abort through all compaction
    functions as before, but passing the abort reason up to
    __alloc_pages_slowpath() which decides when to continue with direct
    reclaim and another compaction attempt.

    Another problem is that try_to_compact_pages() did not act upon the
    reported contention (both need_resched() and lock contention) immediately
    and would proceed with another zone from the zonelist. When
    need_resched() is true, that means initializing another zone compaction,
    only to check need_resched() again in isolate_migratepages() and abort.
    For zone lock contention, the unintended consequence is that the lock
    contended status reported back to the allocator is determined from the
    last zone where compaction was attempted, which is rather arbitrary.

    This patch fixes the problem in the following way:
    - async compaction of a zone aborting due to need_resched() or a pending
    fatal signal means that further zones should not be tried. We report
    COMPACT_CONTENDED_SCHED to the allocator.
    - aborting zone compaction due to lock contention means we can still try
    another zone, since it has a different set of locks. We report back
    COMPACT_CONTENDED_LOCK only if compaction was aborted due to lock
    contention in *all* of the zones where it was attempted (a sketch of the
    new contended states follows after the sign-offs).

    As a result of these fixes, khugepaged will proceed with second sync
    compaction as intended, when the preceding async compaction aborted due to
    need_resched(). Page fault compactions aborting due to need_resched()
    will spare some cycles previously wasted by initializing another zone
    compaction only to abort again. Lock contention will be reported only
    when compaction in all zones aborted due to lock contention, and therefore
    it's not a good idea to try again after reclaim.

    In stress-highalloc from mmtests configured to use __GFP_NO_KSWAPD, this
    has improved number of THP collapse allocations by 10%, which shows
    positive effect on khugepaged. The benchmark's own success rates are
    unchanged, as its allocations are not made by khugepaged. Numbers of
    compact_stall
    and compact_fail events have however decreased by 20%, with
    compact_success still a bit improved, which is good. With benchmark
    configured not to use __GFP_NO_KSWAPD, there is 6% improvement in THP
    collapse allocations, and only slight improvement in stalls and failures.

    [akpm@linux-foundation.org: fix warnings]
    Reported-by: David Rientjes
    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
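
    A hedged sketch of the new abort reasons (SCHED and LOCK are described
    above; the NONE name and the exact plumbing are assumptions):

    /* compact_control.contended becomes an int carrying a reason. */
    enum compact_contended {
            COMPACT_CONTENDED_NONE = 0, /* no contention detected */
            COMPACT_CONTENDED_SCHED,    /* need_resched() or fatal signal */
            COMPACT_CONTENDED_LOCK,     /* zone lock or lru_lock contention */
    };

    /* In try_to_compact_pages(): a SCHED abort ends the zonelist walk
     * immediately; a LOCK abort lets the next zone be tried, and LOCK is
     * reported to the allocator only if every attempted zone ended with a
     * LOCK abort. */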
     
  • The unification of the migrate and free scanner families of functions has
    highlighted a difference in how the scanners ensure they only isolate
    pages of the intended zone. This is important for taking zone lock or lru
    lock of the correct zone. Due to nodes overlapping, it is however
    possible to encounter a different zone within the range of the zone being
    compacted.

    The free scanner, since its inception by commit 748446bb6b5a ("mm:
    compaction: memory compaction core"), has been checking the zone of the
    first valid page in a pageblock, and skipping the whole pageblock if the
    zone does not match.

    This checking was completely missing from the migration scanner at first,
    and later added by commit dc9086004b3d ("mm: compaction: check for
    overlapping nodes during isolation for migration") in a reaction to a bug
    report. But the zone comparison in the migration scanner is done once per
    scanned page, which is more defensive and thus more costly than a check
    per pageblock.

    This patch unifies the checking done in both scanners to once per
    pageblock, through a new pageblock_pfn_to_page() function, which also
    includes pfn_valid() checks (a sketch of the helper follows after the
    sign-offs). It is more defensive than the current free scanner checks, as
    it checks both the first and last page of the pageblock, but less
    defensive than the migration scanner's per-page checks. It assumes that
    node overlapping may result (on some architectures) in a boundary between
    two nodes falling into the middle of a pageblock, but that there cannot be
    a node0 node1 node0 interleaving within a single pageblock.

    The result is more code being shared and a bit less per-page CPU cost in
    the migration scanner.

    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
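
    A hedged sketch of the helper (simplified from the description above;
    pfn_valid(), pfn_to_page() and page_zone() are the existing primitives):

    /* Return the first page of the pageblock only if both its first and
     * last pfn are valid and belong to the zone being compacted;
     * otherwise return NULL so the caller skips the whole pageblock. */
    static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
                                              unsigned long end_pfn,
                                              struct zone *zone)
    {
            struct page *start_page, *end_page;

            end_pfn--;                      /* last pfn inside the block */

            if (!pfn_valid(start_pfn) || !pfn_valid(end_pfn))
                    return NULL;

            start_page = pfn_to_page(start_pfn);
            end_page = pfn_to_page(end_pfn);

            if (page_zone(start_page) != zone || page_zone(end_page) != zone)
                    return NULL;

            return start_page;
    }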
     
  • isolate_migratepages_range() is the main function of the compaction
    scanner, called either on a single pageblock by isolate_migratepages()
    during regular compaction, or on an arbitrary range by CMA's
    __alloc_contig_migrate_range(). It currently performs two pageblock-wide
    compaction suitability checks, and because of the CMA call path, it tracks
    whether it has crossed a pageblock boundary in order to repeat those
    checks.

    However, closer inspection shows that those checks are always true for CMA:
    - isolation_suitable() is true because CMA sets cc->ignore_skip_hint to true
    - migrate_async_suitable() check is skipped because CMA uses sync compaction

    We can therefore move the compaction-specific checks to
    isolate_migratepages() and simplify isolate_migratepages_range().
    Furthermore, we can mimic the freepage scanner family of functions, which
    has the isolate_freepages_block() function called both by compaction from
    isolate_freepages() and by CMA from isolate_freepages_range(), where each
    use case adds its own specific glue code. This allows further code
    simplification.

    Thus, we rename isolate_migratepages_range() to
    isolate_migratepages_block() and limit its functionality to a single
    pageblock (or its subset). For CMA, a new different
    isolate_migratepages_range() is created as a CMA-specific wrapper for the
    _block() function. The checks specific to compaction are moved to
    isolate_migratepages(). As part of the unification of these two families
    of functions, we remove the redundant zone parameter where applicable,
    since zone pointer is already passed in cc->zone.

    Furthermore, going back to compact_zone() and compact_finished() when a
    pageblock is found unsuitable (now by isolate_migratepages()) is wasteful
    - the checks are meant to skip pageblocks quickly. The patch therefore
    also introduces a simple loop into isolate_migratepages() so that it does
    not return immediately on failed pageblock checks, but keeps going until
    isolate_migratepages_range() gets called once. Similarly to
    isolate_freepages(), the function periodically checks if it needs to
    reschedule or abort async compaction.

    [iamjoonsoo.kim@lge.com: fix isolated page counting bug in compaction]
    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • isolate_freepages_block() rechecks if the pageblock is suitable to be a
    target for migration after it has taken the zone->lock. However, the
    check has been optimized to occur only once per pageblock, and
    compact_checklock_irqsave() might be dropping and reacquiring lock, which
    means somebody else might have changed the pageblock's migratetype
    meanwhile.

    Furthermore, nothing prevents the migratetype from changing right after
    isolate_freepages_block() has finished isolating. Given how imperfect
    this is, it's simpler to just rely on the check done in
    isolate_freepages() without the lock, and not pretend that the recheck
    under the lock guarantees anything. It is just a heuristic after all.

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Zhang Yanfei
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The compact_stall vmstat counter counts the number of allocations stalled
    by direct compaction. It does not count when all attempted zones had
    deferred compaction, but it does count when all zones skipped compaction.
    The skipping is decided based on a very early check of
    compaction_suitable(), which looks at watermarks and memory fragmentation.
    Therefore it makes sense not to count skipped compactions as stalls.
    Moreover, compact_success or compact_fail is also already not being
    counted when compaction was skipped, so this patch changes the
    compact_stall counting to match the other two.

    Additionally, restructure __alloc_pages_direct_compact() code for better
    readability.

    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • When direct sync compaction is often unsuccessful, it may become deferred
    for some time to avoid further useless attempts, both sync and async.
    Successful high-order allocations un-defer compaction, while further
    unsuccessful compaction attempts prolong the compaction deferred period.

    Currently the checking and setting of deferred status is performed only on
    the preferred zone of the allocation that invoked direct compaction. But
    compaction itself is attempted on all eligible zones in the zonelist, so
    the behavior is suboptimal and may lead to scenarios where 1) compaction
    is attempted uselessly, or 2) it's not attempted despite a good chance of
    succeeding, as shown in the examples below:

    1) A direct compaction with Normal preferred zone failed and set
    deferred compaction for the Normal zone. Another unrelated direct
    compaction with DMA32 as preferred zone will attempt to compact DMA32
    zone even though the first compaction attempt also included DMA32 zone.

    In another scenario, compaction with Normal preferred zone failed to
    compact Normal zone, but succeeded in the DMA32 zone, so it will not
    defer compaction. In the next attempt, it will try Normal zone which
    will fail again, instead of skipping Normal zone and trying DMA32
    directly.

    2) Kswapd will balance DMA32 zone and reset defer status based on
    watermarks looking good. A direct compaction with preferred Normal
    zone will skip compaction of all zones including DMA32 because Normal
    was still deferred. The allocation might have succeeded in DMA32, but
    won't.

    This patch makes compaction deferring work on an individual zone basis
    instead of on the preferred zone (a simplified sketch follows after the
    sign-offs). For each zone, it checks compaction_deferred() to decide if
    the zone should be skipped. If watermarks fail after compacting the zone,
    defer_compaction() is called. The zone where watermarks passed can still
    be deferred when the allocation attempt is unsuccessful. When allocation
    is successful, compaction_defer_reset() is called for the zone containing
    the allocated page. This approach should approximate calling
    defer_compaction() only on zones where compaction was attempted and did
    not yield an allocated page. There might be corner cases, but that is
    inevitable as long as the decision to stop compacting does not guarantee
    that a page will be allocated.

    Due to a new COMPACT_DEFERRED return value, some functions relying
    implicitly on COMPACT_SKIPPED = 0 had to be updated, with comments made
    more accurate. The did_some_progress output parameter of
    __alloc_pages_direct_compact() is removed completely, as the caller
    actually does not use it after compaction sets it - it is only considered
    when direct reclaim sets it.

    During testing on a two-node machine with a single very small Normal zone
    on node 1, this patch has improved success rates in the stress-highalloc
    mmtests benchmark. The success rates here were previously made worse by
    commit 3a025760fc15 ("mm: page_alloc: spill to remote nodes before waking
    kswapd"), as kswapd was no longer resetting the deferred compaction for
    the Normal zone often enough, and the DMA32 zones on both nodes were thus
    not considered for compaction. On a different machine, success rates were
    improved with __GFP_NO_KSWAPD allocations.

    [akpm@linux-foundation.org: fix CONFIG_COMPACTION=n build]
    Signed-off-by: Vlastimil Babka
    Acked-by: Minchan Kim
    Reviewed-by: Zhang Yanfei
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
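
    A simplified sketch of the per-zone logic in try_to_compact_pages()
    (the exact conditions are condensed; compaction_deferred(),
    defer_compaction() and compaction_defer_reset() are the existing
    helpers):

    for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
                                    nodemask) {
            if (compaction_deferred(zone, order))
                    continue;               /* skip just this zone */

            status = compact_zone_order(zone, order, gfp_mask, mode,
                                        &zone_contended);

            /* Watermarks still failing after compacting this zone:
             * defer its further compaction. */
            if (!zone_watermark_ok(zone, order, low_wmark_pages(zone),
                                   high_zoneidx, alloc_flags))
                    defer_compaction(zone, order);
    }

    /* Later, when the allocation actually succeeds: */
    compaction_defer_reset(page_zone(page), order, true);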
     
  • When allocating a huge page for collapsing, khugepaged currently holds
    mmap_sem for reading on the mm where the collapse occurs. Afterwards the
    read lock is dropped before the write lock is taken on the same mmap_sem.

    Holding mmap_sem during the whole huge page allocation is therefore
    useless; the vma needs to be rechecked after taking the write lock anyway.
    Furthermore, huge page allocation might involve a rather long sync
    compaction, and thus block any mmap_sem writers and, e.g., affect
    workloads that perform frequent m(un)map or mprotect operations.

    This patch simply releases the read lock before allocating a huge page.
    It also deletes an outdated comment that assumed vma must be stable, as it
    was using alloc_hugepage_vma(). This is no longer true since commit
    9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target node").

    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Sequential read from a block device is expected to be equal to or faster
    than reading the same data from a file on a filesystem. But this is not
    the case, due to the lack of an effective readpages() in the block
    device's address space operations.

    This implements the readpages() operation for block devices by using
    mpage_readpages(), which can create multipage BIOs instead of a BIO per
    page and reduces system CPU time consumption (a sketch of the hook
    follows after the sign-offs).

    Install 1GB of RAM disk storage:

    # modprobe scsi_debug dev_size_mb=1024 delay=0

    Sequential read from file on a filesystem:

    # mkfs.ext4 /dev/$DEV
    # mount /dev/$DEV /mnt
    # fio --name=t --size=512m --rw=read --filename=/mnt/file
    ...
    read : io=524288KB, bw=2133.4MB/s, iops=546133, runt= 240msec

    Sequential read from a block device:
    # fio --name=t --size=512m --rw=read --filename=/dev/$DEV
    ...
    (Without this commit)
    read : io=524288KB, bw=1700.2MB/s, iops=435455, runt= 301msec

    (With this commit)
    read : io=524288KB, bw=2160.4MB/s, iops=553046, runt= 237msec

    Signed-off-by: Akinobu Mita
    Cc: Jens Axboe
    Cc: Alexander Viro
    Cc: Jeff Moyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
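
    The wiring is small; a hedged sketch (blkdev_get_block stands for the
    block device's get_block helper in fs/block_dev.c):

    /* Hook mpage_readpages() up as the block device's readpages() so
     * readahead can build multipage BIOs instead of one BIO per page. */
    static int blkdev_readpages(struct file *file,
                                struct address_space *mapping,
                                struct list_head *pages, unsigned nr_pages)
    {
            return mpage_readpages(mapping, pages, nr_pages,
                                   blkdev_get_block);
    }

    /* ...and def_blk_aops gains a .readpages = blkdev_readpages entry. */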
     
  • Add a guard_bio_eod() check to the mpage code in order to allow us to do
    IO even on the odd last sectors of a device, even if the block size is
    some multiple of the physical sector size.

    Using mpage_readpages() for block devices requires this guard check.

    Signed-off-by: Akinobu Mita
    Cc: Jens Axboe
    Cc: Alexander Viro
    Cc: Jeff Moyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • This patchset implements the readpages() operation for block devices by
    using mpage_readpages(), which can create multipage BIOs instead of a BIO
    per page and reduces system CPU time consumption.

    This patch (of 3):

    guard_bh_eod() is used in submit_bh() to allow us to do IO even on the odd
    last sectors of a device, even if the block size is some multiple of the
    physical sector size. This makes guard_bh_eod() more generic and renames
    it to guard_bio_eod() so that we can use it without a struct buffer_head
    argument.

    The reason for this change is that using mpage_readpages() for block
    devices requires adding this guard check in the mpage code.

    Signed-off-by: Akinobu Mita
    Cc: Jens Axboe
    Cc: Alexander Viro
    Cc: Jeff Moyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • The check for ALLOC_CMA in __alloc_pages_nodemask() derives the
    migratetype from gfp_mask in each retry pass, although the migratetype
    variable already has the value determined and it does not change. Use
    the variable and perform the check only once. Also convert the #ifdef
    CONFIG_CMA to IS_ENABLED() (a sketch follows after the sign-offs).

    Signed-off-by: Vlastimil Babka
    Acked-by: David Rientjes
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: "Srivatsa S. Bhat"
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
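
    A hedged sketch of the simplified check (variable and flag names as in
    the description above):

    /* migratetype is derived from gfp_mask once and does not change
     * across retries, so reuse it; IS_ENABLED() replaces the #ifdef. */
    int migratetype = gfpflags_to_migratetype(gfp_mask);

    if (IS_ENABLED(CONFIG_CMA) && migratetype == MIGRATE_MOVABLE)
            alloc_flags |= ALLOC_CMA;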
     
  • DMA-mapping supports CMA regions placed either in low or high memory, so
    there is no longer any need to limit default CMA regions to low memory.
    The real limit is still defined by the architecture-specific DMA limit.

    Signed-off-by: Marek Szyprowski
    Reported-by: Russell King - ARM Linux
    Acked-by: Michal Nazarewicz
    Cc: Daniel Drake
    Cc: Minchan Kim
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marek Szyprowski
     
  • Russell King recently noticed that limiting the default CMA region to low
    memory on the ARM architecture causes serious memory management issues on
    machines having a lot of memory (which is mainly available as high
    memory). More information can be found in the following thread:
    http://thread.gmane.org/gmane.linux.ports.arm.kernel/348441/

    These two patches remove this limit, letting the kernel put the default
    CMA region into high memory when this is possible (there is enough high
    memory available and the architecture-specific DMA limit fits).

    This should solve strange OOM issues on systems with lots of RAM (i.e.
    >1GiB) and a large (>256M) CMA area.

    This patch (of 2):

    Automatically allocated regions should not cross the low/high memory
    boundary, because such regions cannot later be correctly initialized, due
    to spanning two memory zones. This patch adds a check for this case and
    simple code for moving the region to low memory if the automatically
    selected address might not fit completely into high memory.

    Signed-off-by: Marek Szyprowski
    Acked-by: Michal Nazarewicz
    Cc: Daniel Drake
    Cc: Minchan Kim
    Cc: Russell King
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marek Szyprowski
     
  • Neither CMA nor noncoherent allocations support atomic allocations.
    Add a dedicated atomic pool to support this.

    Reviewed-by: Catalin Marinas
    Signed-off-by: Laura Abbott
    Cc: Arnd Bergmann
    Cc: David Riley
    Cc: Olof Johansson
    Cc: Ritesh Harjain
    Cc: Russell King
    Cc: Thierry Reding
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laura Abbott
     
  • ARM currently uses a bitmap for tracking atomic allocations. genalloc
    already handles this type of memory pool allocation, so switch to using
    that instead (a sketch of the pattern follows after the sign-offs).

    Signed-off-by: Laura Abbott
    Reviewed-by: Catalin Marinas
    Cc: Arnd Bergmann
    Cc: David Riley
    Cc: Olof Johansson
    Cc: Ritesh Harjain
    Cc: Russell King
    Cc: Thierry Reding
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laura Abbott
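
    A hedged sketch of the genalloc pattern that replaces the bitmap (the
    calls are the standard lib/genalloc.c API; the setup values shown are
    illustrative):

    struct gen_pool *atomic_pool;

    /* Create a pool over the preallocated atomic-DMA region... */
    atomic_pool = gen_pool_create(PAGE_SHIFT, -1);
    gen_pool_add_virt(atomic_pool, (unsigned long)vaddr,
                      page_to_phys(page), atomic_pool_size, -1);

    /* ...then serve atomic allocations from it. */
    unsigned long addr = gen_pool_alloc(atomic_pool, size);
    /* ...and release with gen_pool_free(atomic_pool, addr, size). */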
     
  • For architectures without coherent DMA, memory for DMA may need to be
    remapped with coherent attributes. Factor out the remapping code from
    arm and put it in a common location to reduce code duplication.

    As part of this, the arm APIs are now migrated away from
    ioremap_page_range to the common APIs which use map_vm_area for remapping.
    This should be an equivalent change, and using map_vm_area is more correct
    as ioremap_page_range is intended to bring I/O addresses into the CPU
    space, not regular kernel-managed memory.

    Signed-off-by: Laura Abbott
    Reviewed-by: Catalin Marinas
    Cc: Arnd Bergmann
    Cc: David Riley
    Cc: Olof Johansson
    Cc: Ritesh Harjain
    Cc: Russell King
    Cc: Thierry Reding
    Cc: Will Deacon
    Cc: James Hogan
    Cc: Laura Abbott
    Cc: Mitchel Humpherys
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laura Abbott
     
  • After allocating an address from a particular genpool, there is no good
    way to verify if that address actually belongs to a genpool. Introduce
    addr_in_gen_pool(), which returns whether an address plus size falls
    completely within the genpool range (a usage sketch follows after the
    sign-offs).

    Signed-off-by: Laura Abbott
    Acked-by: Will Deacon
    Reviewed-by: Olof Johansson
    Reviewed-by: Catalin Marinas
    Cc: Arnd Bergmann
    Cc: David Riley
    Cc: Ritesh Harjain
    Cc: Russell King
    Cc: Thierry Reding
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laura Abbott
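
    A hedged usage sketch (the helper takes the pool, a start address and a
    size, and returns whether the whole range lies inside the pool):

    /* Only hand the buffer back to the pool if it really came from it. */
    if (addr_in_gen_pool(atomic_pool, (unsigned long)cpu_addr, size))
            gen_pool_free(atomic_pool, (unsigned long)cpu_addr, size);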
     
  • One of the more common allocation algorithms is to align the start
    address of the allocation to the order of the requested size. Add this
    as an algorithm option for genalloc (a sketch follows after the
    sign-offs).

    Signed-off-by: Laura Abbott
    Acked-by: Will Deacon
    Acked-by: Olof Johansson
    Reviewed-by: Catalin Marinas
    Cc: Arnd Bergmann
    Cc: David Riley
    Cc: Ritesh Harjain
    Cc: Russell King
    Cc: Thierry Reding
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laura Abbott
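
    A hedged usage sketch (gen_pool_set_algo() is the existing genalloc
    hook; the algorithm name is assumed to be
    gen_pool_first_fit_order_align):

    /* Make allocations from this pool align their start address to the
     * order of the requested size, e.g. a 64KiB request comes back
     * 64KiB aligned. */
    gen_pool_set_algo(pool, gen_pool_first_fit_order_align, NULL);

    addr = gen_pool_alloc(pool, SZ_64K);    /* now 64KiB aligned */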
     
  • ARCH_USES_NUMA_PROT_NONE was defined for architectures that implemented
    _PAGE_NUMA using _PROT_NONE. This saved using an additional PTE bit and
    relied on the fact that PROT_NONE vmas were skipped by the NUMA hinting
    fault scanner. This was found to be conceptually confusing with a lot of
    implicit assumptions and it was asked that an alternative be found.

    Commit c46a7c81 "x86: define _PAGE_NUMA by reusing software bits on the
    PMD and PTE levels" redefined _PAGE_NUMA on x86 to be one of the swap PTE
    bits and shrunk the maximum possible swap size but it did not go far
    enough. There are no architectures that reuse _PROT_NONE as _PROT_NUMA
    but the relics still exist.

    This patch removes ARCH_USES_NUMA_PROT_NONE and removes some unnecessary
    duplication in powerpc vs the generic implementation by defining the types
    the core NUMA helpers expected to exist from x86 with their ppc64
    equivalent. This necessitated that a PTE bit mask be created that
    identified the bits that distinguish present from NUMA pte entries but it
    is expected this will only differ between arches based on _PAGE_PROTNONE.
    The naming for the generic helpers was taken from x86 originally but ppc64
    has types that are equivalent for the purposes of the helper so they are
    mapped instead of duplicating code.

    Signed-off-by: Mel Gorman
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Cyrill Gorcunov
    Reviewed-by: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently memory-hotplug has two limits:

    1. If the memory block is in ZONE_NORMAL, you can change it to
    ZONE_MOVABLE, but this memory block must be adjacent to ZONE_MOVABLE.

    2. If the memory block is in ZONE_MOVABLE, you can change it to
    ZONE_NORMAL, but this memory block must be adjacent to ZONE_NORMAL.

    With this patch, we can easily know which zone a memory block can be
    onlined to, and don't need to know the above two limits.

    Updated the related Documentation.

    [akpm@linux-foundation.org: use conventional comment layout]
    [akpm@linux-foundation.org: fix build with CONFIG_MEMORY_HOTREMOVE=n]
    [akpm@linux-foundation.org: remove unused local zone_prev]
    Signed-off-by: Zhang Zhen
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Toshi Kani
    Cc: Yasuaki Ishimatsu
    Cc: Naoya Horiguchi
    Cc: Wang Nan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Zhen
     
  • Signed-off-by: vishnu.ps
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    vishnu.ps
     
  • Because of a chicken-and-egg problem, initialization of SLAB is really
    complicated. We need to allocate the cpu cache through SLAB to make the
    kmem_cache work, but before initialization of kmem_cache, allocation
    through SLAB is impossible.

    On the other hand, SLUB does its initialization in a simpler way. It
    uses the percpu allocator to allocate the cpu cache, so there is no
    chicken-and-egg problem.

    So, this patch tries to use the percpu allocator in SLAB (a simplified
    sketch follows after the sign-offs). This simplifies the initialization
    step in SLAB so that we can maintain the SLAB code more easily.

    In my testing there is no performance difference.

    This implementation relies on the percpu allocator. Because the percpu
    allocator uses vmalloc address space, vmalloc address space could be
    exhausted by this change on a many-cpu system with a *32 bit* kernel.
    This implementation can cover 1024 cpus in the worst case, by the
    following calculation.

    Worst: 1024 cpus * 4 bytes for pointer * 300 kmem_caches *
    120 objects per cpu_cache = 140 MB
    Normal: 1024 cpus * 4 bytes for pointer * 150 kmem_caches(slab merge) *
    80 objects per cpu_cache = 46 MB

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Jeremiah Mahler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
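
    A simplified sketch of the idea (struct array_cache is SLAB's per-cpu
    cache type; the setup shown is illustrative rather than the exact
    patch):

    /* Allocate the per-cpu caches with the percpu allocator instead of
     * through SLAB itself, avoiding the bootstrap cycle. */
    size_t size = sizeof(struct array_cache) + entries * sizeof(void *);
    struct array_cache __percpu *cpu_cache;
    int cpu;

    cpu_cache = __alloc_percpu(size, sizeof(void *));
    if (!cpu_cache)
            return NULL;

    for_each_possible_cpu(cpu)
            init_arraycache(per_cpu_ptr(cpu_cache, cpu), entries,
                            batchcount);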
     
  • Slab merge is a good feature for reducing fragmentation. If a newly
    created slab cache has a similar size and properties to an existing one,
    this feature reuses it rather than creating a new one. As a result,
    objects are packed into fewer slabs, so fragmentation is reduced.

    Below is the result of my testing.

    * After boot, sleep 20; cat /proc/meminfo | grep Slab

    (before) Slab: 25136 kB

    (after)  Slab: 24364 kB

    We can save 3% of the memory used by slab.

    To support this feature in SLAB, we need to implement SLAB-specific
    kmem_cache_flags() and __kmem_cache_alias(), because SLUB implements some
    SLUB-specific processing related to debug flags and object size changes
    in these functions.

    Signed-off-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Slab merge is a good feature for reducing fragmentation. Currently it is
    only applied to SLUB, but it would be good to apply it to SLAB too. This
    patch is a preparation step for that, commonizing the slab merge logic.

    Signed-off-by: Joonsoo Kim
    Cc: Randy Dunlap
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Fix a bug (discovered with kmemcheck) in for_each_kmem_cache_node(). The
    for loop reads the "node" array before verifying that the index is within
    range, which results in a kmemcheck warning (a sketch of the pattern
    follows after the sign-offs).

    Signed-off-by: Mikulas Patocka
    Reviewed-by: Pekka Enberg
    Acked-by: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
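
    A generic sketch of the pattern (the real code is the
    for_each_kmem_cache_node() macro in mm/slab.h; names simplified):

    /* Buggy shape: the comma expression reads the array element before
     * the index bound is checked, so get_node() can be called with
     * node == nr_node_ids, one past the end. */
    for (node = 0; n = get_node(s, node), node < nr_node_ids; node++)
            if (n)
                    do_something(n);

    /* Fixed shape: check the bound first, read the element afterwards. */
    for (node = 0; node < nr_node_ids; node++)
            if ((n = get_node(s, node)))
                    do_something(n);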
     
  • …ask_struct allocations")

    After discussions with Tejun, we don't want to spread the use of
    cpu_to_mem() (and thus knowledge of allocators/NUMA topology details) into
    callers, but would rather ensure the callees correctly handle memoryless
    nodes. With the previous patches ("topology: add support for
    node_to_mem_node() to determine the fallback node" and "slub: fallback to
    node_to_mem_node() node if allocating on memoryless node") adding and
    using node_to_mem_node(), we can safely undo part of the change to the
    kthread logic from 81c98869faa5.

    Signed-off-by: Nishanth Aravamudan
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Han Pingtian
    Cc: Pekka Enberg
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Michael Ellerman
    Cc: Anton Blanchard
    Cc: Christoph Lameter
    Cc: Wanpeng Li
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     
  • Update the SLUB code to search for partial slabs on the nearest node with
    memory in the presence of memoryless nodes. Additionally, do not consider
    it to be an ALLOC_NODE_MISMATCH (and deactivate the slab) when a
    memoryless-node specified allocation goes off-node.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Nishanth Aravamudan
    Cc: David Rientjes
    Cc: Han Pingtian
    Cc: Pekka Enberg
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Michael Ellerman
    Cc: Anton Blanchard
    Cc: Christoph Lameter
    Cc: Wanpeng Li
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Anton noticed (http://www.spinics.net/lists/linux-mm/msg67489.html) that
    on ppc LPARs with memoryless nodes, a large amount of memory was consumed
    by slabs and was marked unreclaimable. He tracked it down to slab
    deactivations in the SLUB core when we allocate remotely, leading to poor
    efficiency whenever memoryless nodes are present.

    After much discussion, Joonsoo provided a few patches that help
    significantly. They don't resolve the problem altogether:

    - memory hotplug still needs testing; that is, when a memoryless node
    becomes memory-ful, we want to do the right thing (dtrt)
    - there are other reasons for going off-node than memoryless nodes,
    e.g., fully exhausted local nodes

    Neither case is resolved with this series, but I don't think that should
    block their acceptance, as they can be explored/resolved with follow-on
    patches.

    The series consists of:

    [1/3] topology: add support for node_to_mem_node() to determine the
    fallback node

    [2/3] slub: fallback to node_to_mem_node() node if allocating on
    memoryless node

    - Joonsoo's patches to cache the nearest node with memory for each
    NUMA node

    [3/3] Partial revert of 81c98869faa5 ("kthread: ensure locality of
    task_struct allocations")

    - At Tejun's request, keep the knowledge of memoryless node fallback
    to the allocator core.

    This patch (of 3):

    We need to determine the fallback node in the slub allocator if the
    allocation target node is a memoryless node. Without it, SLUB wrongly
    selects a node which has no memory and can't use a partial slab there,
    because of the node mismatch. The introduced function,
    node_to_mem_node(X), returns a node Y with memory at the nearest distance.
    If X is a memoryless node, it returns the nearest node with memory; if X
    is a normal node, it returns X itself (a sketch follows after the
    sign-offs).

    We will use this function in the following patch to determine the
    fallback node.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Nishanth Aravamudan
    Cc: David Rientjes
    Cc: Han Pingtian
    Cc: Pekka Enberg
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Michael Ellerman
    Cc: Anton Blanchard
    Cc: Christoph Lameter
    Cc: Wanpeng Li
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
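
    A hedged sketch of the helper (the per-node map name mirrors the one
    used by numa_mem_id(); treat it as illustrative):

    #ifdef CONFIG_HAVE_MEMORYLESS_NODES
    static inline int node_to_mem_node(int node)
    {
            return _node_numa_mem_[node];  /* cached nearest node with memory */
    }
    #else
    static inline int node_to_mem_node(int node)
    {
            return node;                   /* a normal node maps to itself */
    }
    #endif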
     
  • Tracing of mergeable slabs as well as uses of failslab are confusing since
    the objects of multiple slab caches will be affected. Moreover this
    creates a situation where a mergeable slab will become unmergeable.

    If tracing or failslab testing is desired then it may be best to switch
    merging off for starters.

    Signed-off-by: Christoph Lameter
    Tested-by: WANG Chao
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • cache_free_alien() is a rarely used function, needed only on node
    mismatch. But it is defined with the inline attribute, so it is inlined
    into __cache_free(), which is the core free function of the slab
    allocator. It uselessly makes the kmem_cache_free()/kfree() functions
    large. What we really need to inline is just the node-match check, so
    this patch factors out the other parts of cache_free_alien() to reduce
    the code size of kmem_cache_free()/kfree() (a sketch follows after the
    sign-offs).

    * Before:
    nm -S mm/slab.o | grep -e "T kfree" -e "T kmem_cache_free"
    00000000000011e0 0000000000000228 T kfree
    0000000000000670 0000000000000216 T kmem_cache_free

    * After:
    nm -S mm/slab.o | grep -e "T kfree" -e "T kmem_cache_free"
    0000000000001110 00000000000001b5 T kfree
    0000000000000750 0000000000000181 T kmem_cache_free

    You can see the slightly reduced text size: 0x228->0x1b5, 0x216->0x181.

    Signed-off-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
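
    A hedged sketch of the split (bodies simplified; only the cheap
    node-match test stays inline):

    /* Out of line: the rarely taken alien-cache handling moves here. */
    static int __cache_free_alien(struct kmem_cache *cachep, void *objp,
                                  int node, int page_node)
    {
            /* ...original cache_free_alien() body... */
            return 1;
    }

    /* Inline: only the node-match check, so kfree()/kmem_cache_free()
     * stay small. */
    static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)
    {
            int page_node = page_to_nid(virt_to_page(objp));
            int node = numa_mem_id();

            if (likely(node == page_node))
                    return 0;               /* common case */

            return __cache_free_alien(cachep, objp, node, page_node);
    }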
     
  • The intention of __ac_put_obj() is that it doesn't affect anything if
    sk_memalloc_socks() is disabled. But because __ac_put_obj() is so small,
    the compiler inlines it into ac_put_obj() and affects the code size of
    the free path. This patch adds the noinline keyword to __ac_put_obj() so
    that it does not disrupt the normal free path at all.

    nm -S slab-orig.o |
    grep -e "t cache_alloc_refill" -e "T kfree" -e "T kmem_cache_free"

    0000000000001e80 00000000000002f5 t cache_alloc_refill
    0000000000001230 0000000000000258 T kfree
    0000000000000690 000000000000024c T kmem_cache_free

    nm -S slab-patched.o |
    grep -e "t cache_alloc_refill" -e "T kfree" -e "T kmem_cache_free"

    0000000000001e00 00000000000002e5 t cache_alloc_refill
    00000000000011e0 0000000000000228 T kfree
    0000000000000670 0000000000000216 T kmem_cache_free

    cache_alloc_refill: 0x2f5->0x2e5
    kfree: 0x258->0x228
    kmem_cache_free: 0x24c->0x216

    The code size of each function is reduced slightly.

    Signed-off-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Now, due to a likely keyword, the compiled code of cache_flusharray() is
    placed in the unlikely.text section. Although it is an uncommon case
    compared to freeing to the cpu cache, it is a more common case than
    free_block(). But free_block() is in the normal text section. This patch
    fixes this odd situation by removing the likely keyword.

    Signed-off-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Now, we track the caller if tracing or slab debugging is enabled. If they
    are disabled, we could save the overhead of passing one extra argument by
    calling __kmalloc(_node)(), but I think that would be marginal.
    Furthermore, the default slab allocator, SLUB, doesn't use this technique,
    so I think it's okay to change this situation.

    After this change, we can turn CONFIG_DEBUG_SLAB on/off without a full
    kernel build and remove some complicated '#if' definitions. It looks
    more beneficial to me.

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • We don't need to keep the kmem_cache definition in include/linux/slab.h
    if we don't need to inline kmem_cache_size(). According to my code
    inspection, this function is only called from lc_create() in
    lib/lru_cache.c, which may be called during the initialization phase of
    something, so we don't need to inline it. Therefore, move it to
    slab_common.c and move the kmem_cache definition to an internal header.

    After this change, we can change the kmem_cache definition easily without
    a full kernel build. For instance, we can turn CONFIG_SLUB_STATS on/off
    without a full kernel build.

    [akpm@linux-foundation.org: export kmem_cache_size() to modules]
    [rdunlap@infradead.org: add header files to fix kmemcheck.c build errors]
    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Zhang Yanfei
    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim