10 Oct, 2014

40 commits

  • struct compact_control currently converts the gfp mask to a migratetype,
    but we need the entire gfp mask in a follow-up patch.

    Pass the entire gfp mask as part of struct compact_control.

    Signed-off-by: David Rientjes
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Zhang Yanfei
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The page allocator has gfp flags (like __GFP_WAIT) and alloc flags (like
    ALLOC_CPUSET) that have separate semantics.

    The function allocflags_to_migratetype() actually takes gfp flags, not
    alloc flags, and returns a migratetype. Rename it to
    gfpflags_to_migratetype().

    Signed-off-by: David Rientjes
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Zhang Yanfei
    Reviewed-by: Naoya Horiguchi
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The migration scanner skips PageBuddy pages, but does not consider their
    order as checking page_order() is generally unsafe without holding the
    zone->lock, and acquiring the lock just for the check wouldn't be a good
    tradeoff.

    Still, this could avoid some iterations over the rest of the buddy page,
    and if we are careful, the race window between the PageBuddy() check and
    page_order() is small, and the worst thing that can happen is that we skip
    too much and miss some isolation candidates. This is not that bad, as
    compaction can already fail for many other reasons like parallel
    allocations, and those have a much larger race window.

    This patch therefore makes the migration scanner obtain the buddy page
    order and use it to skip the whole buddy page, if the order appears to be
    in the valid range.

    It's important that page_order() is read only once, so that the value
    used in the checks and in the pfn calculation is the same. But in theory
    the compiler can replace the local variable with multiple inlined calls to
    page_order(). Therefore, the patch introduces page_order_unsafe(), which
    uses ACCESS_ONCE to prevent this (a minimal sketch follows after the
    sign-offs).

    Testing with stress-highalloc from mmtests shows a 15% reduction in the
    number of pages scanned by the migration scanner. The reduction is >60%
    with __GFP_NO_KSWAPD allocations, along with success rates better by a few
    percent.

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Zhang Yanfei
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
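
    A minimal sketch of the ACCESS_ONCE idea follows (the struct is a
    stand-in for illustration; in the kernel the order of a buddy page is
    kept in page->private and MAX_ORDER bounds the valid range):

    #define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

    struct page { unsigned long private; };          /* stand-in type */

    /* Read the racy order exactly once; the volatile access stops the
     * compiler from re-reading page->private between the range check
     * and the pfn arithmetic. */
    static unsigned long page_order_unsafe(struct page *page)
    {
            return ACCESS_ONCE(page->private);
    }

    /* Scanner usage, roughly: skip the rest of the buddy page when the
     * value looks sane, otherwise fall back to per-page scanning.
     *
     *      unsigned long freepage_order = page_order_unsafe(page);
     *
     *      if (freepage_order < MAX_ORDER)
     *              low_pfn += (1UL << freepage_order) - 1;
     */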
     
  • Unlike the migration scanner, the free scanner remembers the beginning of
    the last scanned pageblock in cc->free_pfn. It might therefore be
    uselessly rescanning pages when called several times during a single
    compaction. This might have been useful when pages were returned to the
    buddy allocator after a failed migration, but this is no longer the case.

    This patch changes the meaning of cc->free_pfn so that if it points to the
    middle of a pageblock, that pageblock is scanned only from cc->free_pfn to
    the end. isolate_freepages_block() will record the pfn of the last page
    it looked at, which is then used to update cc->free_pfn.

    In the mmtests stress-highalloc benchmark, this has resulted in lowering
    the ratio between pages scanned by both scanners, from 2.5 free pages per
    migrate page, to 2.25 free pages per migrate page, without affecting
    success rates.

    With __GFP_NO_KSWAPD allocations, this appears to result in a worse ratio
    (2.1 instead of 1.8), but page migration successes increased by 10%, so
    this could mean that more useful work can be done until need_resched()
    aborts this kind of compaction.

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Zhang Yanfei
    Reviewed-by: Naoya Horiguchi
    Acked-by: David Rientjes
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Compaction scanners try to lock zone locks as late as possible by checking
    many page or pageblock properties opportunistically without the lock and
    skipping them if not suitable. For pages that pass the initial checks,
    some properties have to be checked again safely under the lock. However,
    if the lock was already held from a previous iteration in the initial
    checks, the rechecks are unnecessary.

    This patch therefore skips the rechecks when the lock was already held.
    This is now possible, since we no longer (potentially) drop and reacquire
    the lock between the initial checks and the safe rechecks. A sketch of
    the resulting shape follows after the sign-offs.

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Zhang Yanfei
    Reviewed-by: Naoya Horiguchi
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
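
    A hedged sketch of the resulting shape in the free scanner (helper names
    follow the related patches in this series; details simplified):

    /* 'locked' persists across loop iterations.  The safe recheck runs
     * only on the iteration that actually takes the lock; afterwards the
     * lock is not dropped mid-pageblock, so the earlier opportunistic
     * checks remain valid and are not repeated. */
    if (!locked) {
            locked = compact_trylock_irqsave(&cc->zone->lock, &flags, cc);
            if (!locked)
                    break;

            /* Recheck this is still a buddy page, under the lock */
            if (!PageBuddy(page))
                    goto isolate_fail;
    }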
     
  • Compaction scanners regularly check for lock contention and need_resched()
    through the compact_checklock_irqsave() function. However, if there is no
    contention, the lock can be held and IRQs disabled for a potentially long
    time.

    This has been addressed by commit b2eef8c0d091 ("mm: compaction: minimise
    the time IRQs are disabled while isolating pages for migration") for the
    migration scanner. However, the refactoring done by commit 2a1402aa044b
    ("mm: compaction: acquire the zone->lru_lock as late as possible") has
    changed the conditions so that the lock is dropped only when there's
    contention on the lock or need_resched() is true. Also, need_resched() is
    checked only when the lock is already held. The comment "give a chance to
    irqs before checking need_resched" is therefore misleading, as IRQs remain
    disabled when the check is done.

    This patch restores the behavior intended by commit b2eef8c0d091 and also
    tries to better balance and make more deterministic the time spent by
    checking for contention vs the time the scanners might run between the
    checks. It also avoids situations where the checks have not been done
    often enough before. The result should be avoiding both too frequent and
    too infrequent contention checking, and especially the potentially
    long-running scans with IRQs disabled and no checking for need_resched()
    or a pending fatal signal, which can happen when many consecutive pages or
    pageblocks fail the preliminary tests and never reach the later call site
    of compact_checklock_irqsave(), as explained below.

    Before the patch:

    In the migration scanner, compact_checklock_irqsave() was called on each
    loop iteration, if reached. If not reached, some lower-frequency checking
    could still be done if the lock was already held, but this would not
    result in aborting contended async compaction until reaching
    compact_checklock_irqsave() or the end of the pageblock. In the free
    scanner, it was similar but completely without the periodic checking, so
    the lock could potentially be held until the end of the pageblock.

    After the patch, in both scanners:

    The periodical check is done as the first thing in the loop on each
    SWAP_CLUSTER_MAX aligned pfn, using the new compact_unlock_should_abort()
    function, which always unlocks the lock (if locked) and aborts async
    compaction if scheduling is needed. It also aborts any type of compaction
    when a fatal signal is pending.

    The compact_checklock_irqsave() function is replaced with a slightly
    different compact_trylock_irqsave(). The biggest difference is that the
    function is not called at all if the lock is already held. The periodical
    need_resched() checking is left solely to compact_unlock_should_abort().
    The lock contention avoidance for async compaction is achieved by the
    periodical unlock by compact_unlock_should_abort() and by using trylock in
    compact_trylock_irqsave() and aborting when trylock fails. Sync
    compaction does not use trylock. A sketch of the resulting check pattern
    follows after the sign-offs.

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Zhang Yanfei
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
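
    A hedged sketch of the periodic check at the top of the scanner loops
    (argument order and details are illustrative):

    /* On every SWAP_CLUSTER_MAX aligned pfn: drop the lock if it is held,
     * reschedule if needed, and abort async compaction on need_resched()
     * or any compaction on a pending fatal signal. */
    if (!(low_pfn % SWAP_CLUSTER_MAX)
        && compact_unlock_should_abort(&zone->lru_lock, flags,
                                       &locked, cc))
            break;

    /* Taking the lock later uses trylock for async compaction, aborting
     * if it fails; sync compaction simply spins on the lock: */
    if (!locked)
            locked = compact_trylock_irqsave(&zone->lru_lock, &flags, cc);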
     
  • Async compaction aborts when it detects zone lock contention or
    need_resched() is true. David Rientjes has reported that in practice,
    most direct async compactions for THP allocation abort due to
    need_resched(). This means that a second direct compaction is never
    attempted, which might be OK for a page fault, but khugepaged is intended
    to attempt a sync compaction in such cases, and here it won't.

    This patch replaces "bool contended" in compact_control with an int that
    distinguishes between aborting due to need_resched() and aborting due to
    lock contention. This allows propagating the abort through all compaction
    functions as before, but passing the abort reason up to
    __alloc_pages_slowpath() which decides when to continue with direct
    reclaim and another compaction attempt.

    Another problem is that try_to_compact_pages() did not act upon the
    reported contention (both need_resched() and lock contention) immediately
    and would proceed with another zone from the zonelist. When
    need_resched() is true, that means initializing another zone compaction,
    only to check need_resched() again in isolate_migratepages() and abort.
    For zone lock contention, the unintended consequence is that the lock
    contended status reported back to the allocator is determined from the
    last zone where compaction was attempted, which is rather arbitrary.

    This patch fixes the problem in the following way:
    - async compaction of a zone aborting due to need_resched() or a pending
    fatal signal means that further zones should not be tried. We report
    COMPACT_CONTENDED_SCHED to the allocator.
    - aborting zone compaction due to lock contention means we can still try
    another zone, since it has a different set of locks. We report back
    COMPACT_CONTENDED_LOCK only if compaction was aborted due to lock
    contention in *all* of the zones where it was attempted (a sketch of the
    new contended states follows after the sign-offs).

    As a result of these fixes, khugepaged will proceed with second sync
    compaction as intended, when the preceding async compaction aborted due to
    need_resched(). Page fault compactions aborting due to need_resched()
    will spare some cycles previously wasted by initializing another zone
    compaction only to abort again. Lock contention will be reported only
    when compaction in all zones aborted due to lock contention, and therefore
    it's not a good idea to try again after reclaim.

    In stress-highalloc from mmtests configured to use __GFP_NO_KSWAPD, this
    has improved number of THP collapse allocations by 10%, which shows
    positive effect on khugepaged. The benchmark's own success rates are
    unchanged, as its allocations are not made by khugepaged. Numbers of
    compact_stall
    and compact_fail events have however decreased by 20%, with
    compact_success still a bit improved, which is good. With benchmark
    configured not to use __GFP_NO_KSWAPD, there is 6% improvement in THP
    collapse allocations, and only slight improvement in stalls and failures.

    [akpm@linux-foundation.org: fix warnings]
    Reported-by: David Rientjes
    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
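
    A hedged sketch of the new abort reasons (SCHED and LOCK are described
    above; the NONE name and the exact plumbing are assumptions):

    /* compact_control.contended becomes an int carrying a reason. */
    enum compact_contended {
            COMPACT_CONTENDED_NONE = 0, /* no contention detected */
            COMPACT_CONTENDED_SCHED,    /* need_resched() or fatal signal */
            COMPACT_CONTENDED_LOCK,     /* zone lock or lru_lock contention */
    };

    /* In try_to_compact_pages(): a SCHED abort ends the zonelist walk
     * immediately; a LOCK abort lets the next zone be tried, and LOCK is
     * reported to the allocator only if every attempted zone ended with a
     * LOCK abort. */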
     
  • The unification of the migrate and free scanner families of functions has
    highlighted a difference in how the scanners ensure they only isolate
    pages of the intended zone. This is important for taking zone lock or lru
    lock of the correct zone. Due to nodes overlapping, it is however
    possible to encounter a different zone within the range of the zone being
    compacted.

    The free scanner, since its inception by commit 748446bb6b5a ("mm:
    compaction: memory compaction core"), has been checking the zone of the
    first valid page in a pageblock, and skipping the whole pageblock if the
    zone does not match.

    This checking was completely missing from the migration scanner at first,
    and later added by commit dc9086004b3d ("mm: compaction: check for
    overlapping nodes during isolation for migration") in a reaction to a bug
    report. But the zone comparison in the migration scanner is done once per
    scanned page, which is more defensive and thus more costly than a check
    per pageblock.

    This patch unifies the checking done in both scanners to once per
    pageblock, through a new pageblock_pfn_to_page() function, which also
    includes pfn_valid() checks (a sketch of the helper follows after the
    sign-offs). It is more defensive than the current free scanner checks, as
    it checks both the first and last page of the pageblock, but less
    defensive than the migration scanner's per-page checks. It assumes that
    node overlapping may result (on some architectures) in a boundary between
    two nodes falling into the middle of a pageblock, but that there cannot be
    a node0 node1 node0 interleaving within a single pageblock.

    The result is more code being shared and a bit less per-page CPU cost in
    the migration scanner.

    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
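
    A hedged sketch of the helper (simplified from the description above;
    pfn_valid(), pfn_to_page() and page_zone() are the existing primitives):

    /* Return the first page of the pageblock only if both its first and
     * last pfn are valid and belong to the zone being compacted;
     * otherwise return NULL so the caller skips the whole pageblock. */
    static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
                                              unsigned long end_pfn,
                                              struct zone *zone)
    {
            struct page *start_page, *end_page;

            end_pfn--;                      /* last pfn inside the block */

            if (!pfn_valid(start_pfn) || !pfn_valid(end_pfn))
                    return NULL;

            start_page = pfn_to_page(start_pfn);
            end_page = pfn_to_page(end_pfn);

            if (page_zone(start_page) != zone || page_zone(end_page) != zone)
                    return NULL;

            return start_page;
    }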
     
  • isolate_migratepages_range() is the main function of the compaction
    scanner, called either on a single pageblock by isolate_migratepages()
    during regular compaction, or on an arbitrary range by CMA's
    __alloc_contig_migrate_range(). It currently performs two pageblock-wide
    compaction suitability checks, and because of the CMA call path, it tracks
    whether it has crossed a pageblock boundary in order to repeat those
    checks.

    However, closer inspection shows that those checks are always true for CMA:
    - isolation_suitable() is true because CMA sets cc->ignore_skip_hint to true
    - migrate_async_suitable() check is skipped because CMA uses sync compaction

    We can therefore move the compaction-specific checks to
    isolate_migratepages() and simplify isolate_migratepages_range().
    Furthermore, we can mimic the freepage scanner family of functions, which
    has the isolate_freepages_block() function called both by compaction from
    isolate_freepages() and by CMA from isolate_freepages_range(), where each
    use case adds its own specific glue code. This allows further code
    simplification.

    Thus, we rename isolate_migratepages_range() to
    isolate_migratepages_block() and limit its functionality to a single
    pageblock (or its subset). For CMA, a new different
    isolate_migratepages_range() is created as a CMA-specific wrapper for the
    _block() function. The checks specific to compaction are moved to
    isolate_migratepages(). As part of the unification of these two families
    of functions, we remove the redundant zone parameter where applicable,
    since zone pointer is already passed in cc->zone.

    Furthermore, going back to compact_zone() and compact_finished() when a
    pageblock is found unsuitable (now by isolate_migratepages()) is wasteful
    - the checks are meant to skip pageblocks quickly. The patch therefore
    also introduces a simple loop into isolate_migratepages() so that it does
    not return immediately on failed pageblock checks, but keeps going until
    isolate_migratepages_range() gets called once. Similarly to
    isolate_freepages(), the function periodically checks if it needs to
    reschedule or abort async compaction.

    [iamjoonsoo.kim@lge.com: fix isolated page counting bug in compaction]
    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • isolate_freepages_block() rechecks if the pageblock is suitable to be a
    target for migration after it has taken the zone->lock. However, the
    check has been optimized to occur only once per pageblock, and
    compact_checklock_irqsave() might be dropping and reacquiring lock, which
    means somebody else might have changed the pageblock's migratetype
    meanwhile.

    Furthermore, nothing prevents the migratetype from changing right after
    isolate_freepages_block() has finished isolating. Given how imperfect
    this is, it's simpler to just rely on the check done in
    isolate_freepages() without the lock, and not pretend that the recheck
    under the lock guarantees anything. It is just a heuristic after all.

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Zhang Yanfei
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The compact_stall vmstat counter counts the number of allocations stalled
    by direct compaction. It does not count when all attempted zones had
    deferred compaction, but it does count when all zones skipped compaction.
    The skipping is decided based on a very early check of
    compaction_suitable(), which looks at watermarks and memory fragmentation.
    Therefore it makes sense not to count skipped compactions as stalls.
    Moreover, compact_success or compact_fail is also already not being
    counted when compaction was skipped, so this patch changes the
    compact_stall counting to match the other two.

    Additionally, restructure __alloc_pages_direct_compact() code for better
    readability.

    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • When direct sync compaction is often unsuccessful, it may become deferred
    for some time to avoid further useless attempts, both sync and async.
    Successful high-order allocations un-defer compaction, while further
    unsuccessful compaction attempts prolong the compaction deferred period.

    Currently the checking and setting of deferred status is performed only on
    the preferred zone of the allocation that invoked direct compaction. But
    compaction itself is attempted on all eligible zones in the zonelist, so
    the behavior is suboptimal and may lead to scenarios where 1) compaction
    is attempted uselessly, or 2) it's not attempted despite a good chance of
    succeeding, as shown in the examples below:

    1) A direct compaction with Normal preferred zone failed and set
    deferred compaction for the Normal zone. Another unrelated direct
    compaction with DMA32 as preferred zone will attempt to compact DMA32
    zone even though the first compaction attempt also included DMA32 zone.

    In another scenario, compaction with Normal preferred zone failed to
    compact Normal zone, but succeeded in the DMA32 zone, so it will not
    defer compaction. In the next attempt, it will try Normal zone which
    will fail again, instead of skipping Normal zone and trying DMA32
    directly.

    2) Kswapd will balance DMA32 zone and reset defer status based on
    watermarks looking good. A direct compaction with preferred Normal
    zone will skip compaction of all zones including DMA32 because Normal
    was still deferred. The allocation might have succeeded in DMA32, but
    won't.

    This patch makes compaction deferring work on an individual zone basis
    instead of on the preferred zone (a simplified sketch follows after the
    sign-offs). For each zone, it checks compaction_deferred() to decide if
    the zone should be skipped. If watermarks fail after compacting the zone,
    defer_compaction() is called. The zone where watermarks passed can still
    be deferred when the allocation attempt is unsuccessful. When allocation
    is successful, compaction_defer_reset() is called for the zone containing
    the allocated page. This approach should approximate calling
    defer_compaction() only on zones where compaction was attempted and did
    not yield an allocated page. There might be corner cases, but that is
    inevitable as long as the decision to stop compacting does not guarantee
    that a page will be allocated.

    Due to a new COMPACT_DEFERRED return value, some functions relying
    implicitly on COMPACT_SKIPPED = 0 had to be updated, with comments made
    more accurate. The did_some_progress output parameter of
    __alloc_pages_direct_compact() is removed completely, as the caller
    actually does not use it after compaction sets it - it is only considered
    when direct reclaim sets it.

    During testing on a two-node machine with a single very small Normal zone
    on node 1, this patch has improved success rates in the stress-highalloc
    mmtests benchmark. The success rates here were previously made worse by
    commit 3a025760fc15 ("mm: page_alloc: spill to remote nodes before waking
    kswapd"), as kswapd was no longer resetting the deferred compaction for
    the Normal zone often enough, and the DMA32 zones on both nodes were thus
    not considered for compaction. On a different machine, success rates were
    improved with __GFP_NO_KSWAPD allocations.

    [akpm@linux-foundation.org: fix CONFIG_COMPACTION=n build]
    Signed-off-by: Vlastimil Babka
    Acked-by: Minchan Kim
    Reviewed-by: Zhang Yanfei
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
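
    A simplified sketch of the per-zone logic in try_to_compact_pages()
    (the exact conditions are condensed; compaction_deferred(),
    defer_compaction() and compaction_defer_reset() are the existing
    helpers):

    for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
                                    nodemask) {
            if (compaction_deferred(zone, order))
                    continue;               /* skip just this zone */

            status = compact_zone_order(zone, order, gfp_mask, mode,
                                        &zone_contended);

            /* Watermarks still failing after compacting this zone:
             * defer its further compaction. */
            if (!zone_watermark_ok(zone, order, low_wmark_pages(zone),
                                   high_zoneidx, alloc_flags))
                    defer_compaction(zone, order);
    }

    /* Later, when the allocation actually succeeds: */
    compaction_defer_reset(page_zone(page), order, true);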
     
  • When allocating a huge page for collapsing, khugepaged currently holds
    mmap_sem for reading on the mm where the collapse occurs. Afterwards the
    read lock is dropped before the write lock is taken on the same mmap_sem.

    Holding mmap_sem during the whole huge page allocation is therefore
    useless; the vma needs to be rechecked after taking the write lock anyway.
    Furthermore, huge page allocation might involve a rather long sync
    compaction, and thus block any mmap_sem writers and, e.g., affect
    workloads that perform frequent m(un)map or mprotect operations.

    This patch simply releases the read lock before allocating a huge page.
    It also deletes an outdated comment that assumed vma must be stable, as it
    was using alloc_hugepage_vma(). This is no longer true since commit
    9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target node").

    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Sequential read from a block device is expected to be equal to or faster
    than reading the same data from a file on a filesystem. But this is not
    the case, due to the lack of an effective readpages() in the block
    device's address space operations.

    This implements the readpages() operation for block devices by using
    mpage_readpages(), which can create multipage BIOs instead of a BIO per
    page and reduces system CPU time consumption (a sketch of the hook
    follows after the sign-offs).

    Install 1GB of RAM disk storage:

    # modprobe scsi_debug dev_size_mb=1024 delay=0

    Sequential read from file on a filesystem:

    # mkfs.ext4 /dev/$DEV
    # mount /dev/$DEV /mnt
    # fio --name=t --size=512m --rw=read --filename=/mnt/file
    ...
    read : io=524288KB, bw=2133.4MB/s, iops=546133, runt= 240msec

    Sequential read from a block device:
    # fio --name=t --size=512m --rw=read --filename=/dev/$DEV
    ...
    (Without this commit)
    read : io=524288KB, bw=1700.2MB/s, iops=435455, runt= 301msec

    (With this commit)
    read : io=524288KB, bw=2160.4MB/s, iops=553046, runt= 237msec

    Signed-off-by: Akinobu Mita
    Cc: Jens Axboe
    Cc: Alexander Viro
    Cc: Jeff Moyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
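
    The wiring is small; a hedged sketch (blkdev_get_block stands for the
    block device's get_block helper in fs/block_dev.c):

    /* Hook mpage_readpages() up as the block device's readpages() so
     * readahead can build multipage BIOs instead of one BIO per page. */
    static int blkdev_readpages(struct file *file,
                                struct address_space *mapping,
                                struct list_head *pages, unsigned nr_pages)
    {
            return mpage_readpages(mapping, pages, nr_pages,
                                   blkdev_get_block);
    }

    /* ...and def_blk_aops gains a .readpages = blkdev_readpages entry. */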
     
  • Add a guard_bio_eod() check to the mpage code in order to allow us to do
    IO even on the odd last sectors of a device, even if the block size is
    some multiple of the physical sector size.

    Using mpage_readpages() for block devices requires this guard check.

    Signed-off-by: Akinobu Mita
    Cc: Jens Axboe
    Cc: Alexander Viro
    Cc: Jeff Moyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • This patchset implements the readpages() operation for block devices by
    using mpage_readpages(), which can create multipage BIOs instead of a BIO
    per page and reduces system CPU time consumption.

    This patch (of 3):

    guard_bh_eod() is used in submit_bh() to allow us to do IO even on the odd
    last sectors of a device, even if the block size is some multiple of the
    physical sector size. This makes guard_bh_eod() more generic and renames
    it to guard_bio_eod() so that we can use it without a struct buffer_head
    argument.

    The reason for this change is that using mpage_readpages() for block
    devices requires adding this guard check in the mpage code.

    Signed-off-by: Akinobu Mita
    Cc: Jens Axboe
    Cc: Alexander Viro
    Cc: Jeff Moyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • The check for ALLOC_CMA in __alloc_pages_nodemask() derives the
    migratetype from gfp_mask in each retry pass, although the migratetype
    variable already has the value determined and it does not change. Use
    the variable and perform the check only once. Also convert the #ifdef
    CONFIG_CMA to IS_ENABLED() (a sketch follows after the sign-offs).

    Signed-off-by: Vlastimil Babka
    Acked-by: David Rientjes
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: "Srivatsa S. Bhat"
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
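
    A hedged sketch of the simplified check (variable and flag names as in
    the description above):

    /* migratetype is derived from gfp_mask once and does not change
     * across retries, so reuse it; IS_ENABLED() replaces the #ifdef. */
    int migratetype = gfpflags_to_migratetype(gfp_mask);

    if (IS_ENABLED(CONFIG_CMA) && migratetype == MIGRATE_MOVABLE)
            alloc_flags |= ALLOC_CMA;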
     
  • DMA-mapping supports CMA regions placed either in low or high memory, so
    there is no longer any need to limit default CMA regions to low memory.
    The real limit is still defined by the architecture-specific DMA limit.

    Signed-off-by: Marek Szyprowski
    Reported-by: Russell King - ARM Linux
    Acked-by: Michal Nazarewicz
    Cc: Daniel Drake
    Cc: Minchan Kim
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marek Szyprowski
     
  • Russell King recently noticed that limiting the default CMA region to low
    memory on the ARM architecture causes serious memory management issues on
    machines having a lot of memory (which is mainly available as high
    memory). More information can be found in the following thread:
    http://thread.gmane.org/gmane.linux.ports.arm.kernel/348441/

    These two patches remove this limit, letting the kernel put the default
    CMA region into high memory when this is possible (there is enough high
    memory available and the architecture-specific DMA limit fits).

    This should solve strange OOM issues on systems with lots of RAM (i.e.
    >1GiB) and a large (>256M) CMA area.

    This patch (of 2):

    Automatically allocated regions should not cross the low/high memory
    boundary, because such regions cannot later be correctly initialized, due
    to spanning two memory zones. This patch adds a check for this case and
    simple code for moving the region to low memory if the automatically
    selected address might not fit completely into high memory.

    Signed-off-by: Marek Szyprowski
    Acked-by: Michal Nazarewicz
    Cc: Daniel Drake
    Cc: Minchan Kim
    Cc: Russell King
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marek Szyprowski
     
  • Neither CMA nor noncoherent allocations support atomic allocations.
    Add a dedicated atomic pool to support this.

    Reviewed-by: Catalin Marinas
    Signed-off-by: Laura Abbott
    Cc: Arnd Bergmann
    Cc: David Riley
    Cc: Olof Johansson
    Cc: Ritesh Harjain
    Cc: Russell King
    Cc: Thierry Reding
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laura Abbott
     
  • ARM currently uses a bitmap for tracking atomic allocations. genalloc
    already handles this type of memory pool allocation, so switch to using
    that instead (a sketch of the pattern follows after the sign-offs).

    Signed-off-by: Laura Abbott
    Reviewed-by: Catalin Marinas
    Cc: Arnd Bergmann
    Cc: David Riley
    Cc: Olof Johansson
    Cc: Ritesh Harjain
    Cc: Russell King
    Cc: Thierry Reding
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laura Abbott
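
    A hedged sketch of the genalloc pattern that replaces the bitmap (the
    calls are the standard lib/genalloc.c API; the setup values shown are
    illustrative):

    struct gen_pool *atomic_pool;

    /* Create a pool over the preallocated atomic-DMA region... */
    atomic_pool = gen_pool_create(PAGE_SHIFT, -1);
    gen_pool_add_virt(atomic_pool, (unsigned long)vaddr,
                      page_to_phys(page), atomic_pool_size, -1);

    /* ...then serve atomic allocations from it. */
    unsigned long addr = gen_pool_alloc(atomic_pool, size);
    /* ...and release with gen_pool_free(atomic_pool, addr, size). */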
     
  • For architectures without coherent DMA, memory for DMA may need to be
    remapped with coherent attributes. Factor out the remapping code from
    arm and put it in a common location to reduce code duplication.

    As part of this, the arm APIs are now migrated away from
    ioremap_page_range to the common APIs which use map_vm_area for remapping.
    This should be an equivalent change, and using map_vm_area is more correct
    as ioremap_page_range is intended to bring I/O addresses into the CPU
    space, not regular kernel-managed memory.

    Signed-off-by: Laura Abbott
    Reviewed-by: Catalin Marinas
    Cc: Arnd Bergmann
    Cc: David Riley
    Cc: Olof Johansson
    Cc: Ritesh Harjain
    Cc: Russell King
    Cc: Thierry Reding
    Cc: Will Deacon
    Cc: James Hogan
    Cc: Laura Abbott
    Cc: Mitchel Humpherys
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laura Abbott
     
  • After allocating an address from a particular genpool, there is no good
    way to verify if that address actually belongs to a genpool. Introduce
    addr_in_gen_pool(), which returns whether an address plus size falls
    completely within the genpool range (a usage sketch follows after the
    sign-offs).

    Signed-off-by: Laura Abbott
    Acked-by: Will Deacon
    Reviewed-by: Olof Johansson
    Reviewed-by: Catalin Marinas
    Cc: Arnd Bergmann
    Cc: David Riley
    Cc: Ritesh Harjain
    Cc: Russell King
    Cc: Thierry Reding
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laura Abbott
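
    A hedged usage sketch (the helper takes the pool, a start address and a
    size, and returns whether the whole range lies inside the pool):

    /* Only hand the buffer back to the pool if it really came from it. */
    if (addr_in_gen_pool(atomic_pool, (unsigned long)cpu_addr, size))
            gen_pool_free(atomic_pool, (unsigned long)cpu_addr, size);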
     
  • One of the more common allocation algorithms is to align the start
    address of the allocation to the order of the requested size. Add this
    as an algorithm option for genalloc (a sketch follows after the
    sign-offs).

    Signed-off-by: Laura Abbott
    Acked-by: Will Deacon
    Acked-by: Olof Johansson
    Reviewed-by: Catalin Marinas
    Cc: Arnd Bergmann
    Cc: David Riley
    Cc: Ritesh Harjain
    Cc: Russell King
    Cc: Thierry Reding
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laura Abbott
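
    A hedged usage sketch (gen_pool_set_algo() is the existing genalloc
    hook; the algorithm name is assumed to be
    gen_pool_first_fit_order_align):

    /* Make allocations from this pool align their start address to the
     * order of the requested size, e.g. a 64KiB request comes back
     * 64KiB aligned. */
    gen_pool_set_algo(pool, gen_pool_first_fit_order_align, NULL);

    addr = gen_pool_alloc(pool, SZ_64K);    /* now 64KiB aligned */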
     
  • ARCH_USES_NUMA_PROT_NONE was defined for architectures that implemented
    _PAGE_NUMA using _PROT_NONE. This saved using an additional PTE bit and
    relied on the fact that PROT_NONE vmas were skipped by the NUMA hinting
    fault scanner. This was found to be conceptually confusing with a lot of
    implicit assumptions and it was asked that an alternative be found.

    Commit c46a7c81 "x86: define _PAGE_NUMA by reusing software bits on the
    PMD and PTE levels" redefined _PAGE_NUMA on x86 to be one of the swap PTE
    bits and shrunk the maximum possible swap size but it did not go far
    enough. There are no architectures that reuse _PROT_NONE as _PROT_NUMA
    but the relics still exist.

    This patch removes ARCH_USES_NUMA_PROT_NONE and removes some unnecessary
    duplication in powerpc vs the generic implementation by defining the types
    the core NUMA helpers expected to exist from x86 with their ppc64
    equivalent. This necessitated that a PTE bit mask be created that
    identified the bits that distinguish present from NUMA pte entries but it
    is expected this will only differ between arches based on _PAGE_PROTNONE.
    The naming for the generic helpers was taken from x86 originally but ppc64
    has types that are equivalent for the purposes of the helper so they are
    mapped instead of duplicating code.

    Signed-off-by: Mel Gorman
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Cyrill Gorcunov
    Reviewed-by: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently memory-hotplug has two limits:

    1. If the memory block is in ZONE_NORMAL, you can change it to
    ZONE_MOVABLE, but this memory block must be adjacent to ZONE_MOVABLE.

    2. If the memory block is in ZONE_MOVABLE, you can change it to
    ZONE_NORMAL, but this memory block must be adjacent to ZONE_NORMAL.

    With this patch, we can easily know which zone a memory block can be
    onlined to, and don't need to know the above two limits.

    Updated the related Documentation.

    [akpm@linux-foundation.org: use conventional comment layout]
    [akpm@linux-foundation.org: fix build with CONFIG_MEMORY_HOTREMOVE=n]
    [akpm@linux-foundation.org: remove unused local zone_prev]
    Signed-off-by: Zhang Zhen
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Toshi Kani
    Cc: Yasuaki Ishimatsu
    Cc: Naoya Horiguchi
    Cc: Wang Nan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Zhen
     
  • Signed-off-by: vishnu.ps
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    vishnu.ps
     
  • Because of a chicken-and-egg problem, initialization of SLAB is really
    complicated. We need to allocate the cpu cache through SLAB to make the
    kmem_cache work, but before initialization of kmem_cache, allocation
    through SLAB is impossible.

    On the other hand, SLUB does its initialization in a simpler way. It
    uses the percpu allocator to allocate the cpu cache, so there is no
    chicken-and-egg problem.

    So, this patch tries to use the percpu allocator in SLAB (a simplified
    sketch follows after the sign-offs). This simplifies the initialization
    step in SLAB so that we can maintain the SLAB code more easily.

    In my testing there is no performance difference.

    This implementation relies on the percpu allocator. Because the percpu
    allocator uses vmalloc address space, vmalloc address space could be
    exhausted by this change on a many-cpu system with a *32 bit* kernel.
    This implementation can cover 1024 cpus in the worst case, by the
    following calculation.

    Worst: 1024 cpus * 4 bytes for pointer * 300 kmem_caches *
    120 objects per cpu_cache = 140 MB
    Normal: 1024 cpus * 4 bytes for pointer * 150 kmem_caches(slab merge) *
    80 objects per cpu_cache = 46 MB

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Jeremiah Mahler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
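
    A simplified sketch of the idea (struct array_cache is SLAB's per-cpu
    cache type; the setup shown is illustrative rather than the exact
    patch):

    /* Allocate the per-cpu caches with the percpu allocator instead of
     * through SLAB itself, avoiding the bootstrap cycle. */
    size_t size = sizeof(struct array_cache) + entries * sizeof(void *);
    struct array_cache __percpu *cpu_cache;
    int cpu;

    cpu_cache = __alloc_percpu(size, sizeof(void *));
    if (!cpu_cache)
            return NULL;

    for_each_possible_cpu(cpu)
            init_arraycache(per_cpu_ptr(cpu_cache, cpu), entries,
                            batchcount);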
     
  • Slab merge is a good feature for reducing fragmentation. If a newly
    created slab cache has a similar size and properties to an existing one,
    this feature reuses it rather than creating a new one. As a result,
    objects are packed into fewer slabs, so fragmentation is reduced.

    Below is the result of my testing.

    * After boot, sleep 20; cat /proc/meminfo | grep Slab

    (before) Slab: 25136 kB

    (after)  Slab: 24364 kB

    We can save 3% of the memory used by slab.

    To support this feature in SLAB, we need to implement SLAB-specific
    kmem_cache_flags() and __kmem_cache_alias(), because SLUB implements some
    SLUB-specific processing related to debug flags and object size changes
    in these functions.

    Signed-off-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Slab merge is a good feature for reducing fragmentation. Currently it is
    only applied to SLUB, but it would be good to apply it to SLAB too. This
    patch is a preparation step for that, commonizing the slab merge logic.

    Signed-off-by: Joonsoo Kim
    Cc: Randy Dunlap
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Fix a bug (discovered with kmemcheck) in for_each_kmem_cache_node(). The
    for loop reads the "node" array before verifying that the index is within
    range, which results in a kmemcheck warning (a sketch of the pattern
    follows after the sign-offs).

    Signed-off-by: Mikulas Patocka
    Reviewed-by: Pekka Enberg
    Acked-by: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
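
    A generic sketch of the pattern (the real code is the
    for_each_kmem_cache_node() macro in mm/slab.h; names simplified):

    /* Buggy shape: the comma expression reads the array element before
     * the index bound is checked, so get_node() can be called with
     * node == nr_node_ids, one past the end. */
    for (node = 0; n = get_node(s, node), node < nr_node_ids; node++)
            if (n)
                    do_something(n);

    /* Fixed shape: check the bound first, read the element afterwards. */
    for (node = 0; node < nr_node_ids; node++)
            if ((n = get_node(s, node)))
                    do_something(n);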
     
  • …ask_struct allocations")

    After discussions with Tejun, we don't want to spread the use of
    cpu_to_mem() (and thus knowledge of allocators/NUMA topology details) into
    callers, but would rather ensure the callees correctly handle memoryless
    nodes. With the previous patches ("topology: add support for
    node_to_mem_node() to determine the fallback node" and "slub: fallback to
    node_to_mem_node() node if allocating on memoryless node") adding and
    using node_to_mem_node(), we can safely undo part of the change to the
    kthread logic from 81c98869faa5.

    Signed-off-by: Nishanth Aravamudan
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Han Pingtian
    Cc: Pekka Enberg
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Michael Ellerman
    Cc: Anton Blanchard
    Cc: Christoph Lameter
    Cc: Wanpeng Li
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     
  • Update the SLUB code to search for partial slabs on the nearest node with
    memory in the presence of memoryless nodes. Additionally, do not consider
    it to be an ALLOC_NODE_MISMATCH (and deactivate the slab) when a
    memoryless-node specified allocation goes off-node.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Nishanth Aravamudan
    Cc: David Rientjes
    Cc: Han Pingtian
    Cc: Pekka Enberg
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Michael Ellerman
    Cc: Anton Blanchard
    Cc: Christoph Lameter
    Cc: Wanpeng Li
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Anton noticed (http://www.spinics.net/lists/linux-mm/msg67489.html) that
    on ppc LPARs with memoryless nodes, a large amount of memory was consumed
    by slabs and was marked unreclaimable. He tracked it down to slab
    deactivations in the SLUB core when we allocate remotely, leading to poor
    efficiency whenever memoryless nodes are present.

    After much discussion, Joonsoo provided a few patches that help
    significantly. They don't resolve the problem altogether:

    - memory hotplug still needs testing; that is, when a memoryless node
    becomes memory-ful, we want to do the right thing (dtrt)
    - there are other reasons for going off-node than memoryless nodes,
    e.g., fully exhausted local nodes

    Neither case is resolved with this series, but I don't think that should
    block their acceptance, as they can be explored/resolved with follow-on
    patches.

    The series consists of:

    [1/3] topology: add support for node_to_mem_node() to determine the
    fallback node

    [2/3] slub: fallback to node_to_mem_node() node if allocating on
    memoryless node

    - Joonsoo's patches to cache the nearest node with memory for each
    NUMA node

    [3/3] Partial revert of 81c98869faa5 ("kthread: ensure locality of
    task_struct allocations")

    - At Tejun's request, keep the knowledge of memoryless node fallback
    to the allocator core.

    This patch (of 3):

    We need to determine the fallback node in the slub allocator if the
    allocation target node is a memoryless node. Without it, SLUB wrongly
    selects a node which has no memory and can't use a partial slab there,
    because of the node mismatch. The introduced function,
    node_to_mem_node(X), returns a node Y with memory at the nearest distance.
    If X is a memoryless node, it returns the nearest node with memory; if X
    is a normal node, it returns X itself (a sketch follows after the
    sign-offs).

    We will use this function in the following patch to determine the
    fallback node.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Nishanth Aravamudan
    Cc: David Rientjes
    Cc: Han Pingtian
    Cc: Pekka Enberg
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Michael Ellerman
    Cc: Anton Blanchard
    Cc: Christoph Lameter
    Cc: Wanpeng Li
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
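
    A hedged sketch of the helper (the per-node map name mirrors the one
    used by numa_mem_id(); treat it as illustrative):

    #ifdef CONFIG_HAVE_MEMORYLESS_NODES
    static inline int node_to_mem_node(int node)
    {
            return _node_numa_mem_[node];  /* cached nearest node with memory */
    }
    #else
    static inline int node_to_mem_node(int node)
    {
            return node;                   /* a normal node maps to itself */
    }
    #endif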
     
  • Tracing of mergeable slabs as well as uses of failslab are confusing since
    the objects of multiple slab caches will be affected. Moreover this
    creates a situation where a mergeable slab will become unmergeable.

    If tracing or failslab testing is desired then it may be best to switch
    merging off for starters.

    Signed-off-by: Christoph Lameter
    Tested-by: WANG Chao
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • cache_free_alien() is a rarely used function, needed only on node
    mismatch. But it is defined with the inline attribute, so it is inlined
    into __cache_free(), which is the core free function of the slab
    allocator. It uselessly makes the kmem_cache_free()/kfree() functions
    large. What we really need to inline is just the node-match check, so
    this patch factors out the other parts of cache_free_alien() to reduce
    the code size of kmem_cache_free()/kfree() (a sketch follows after the
    sign-offs).

    * Before:
    nm -S mm/slab.o | grep -e "T kfree" -e "T kmem_cache_free"
    00000000000011e0 0000000000000228 T kfree
    0000000000000670 0000000000000216 T kmem_cache_free

    * After:
    nm -S mm/slab.o | grep -e "T kfree" -e "T kmem_cache_free"
    0000000000001110 00000000000001b5 T kfree
    0000000000000750 0000000000000181 T kmem_cache_free

    You can see the slightly reduced text size: 0x228->0x1b5, 0x216->0x181.

    Signed-off-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
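
    A hedged sketch of the split (bodies simplified; only the cheap
    node-match test stays inline):

    /* Out of line: the rarely taken alien-cache handling moves here. */
    static int __cache_free_alien(struct kmem_cache *cachep, void *objp,
                                  int node, int page_node)
    {
            /* ...original cache_free_alien() body... */
            return 1;
    }

    /* Inline: only the node-match check, so kfree()/kmem_cache_free()
     * stay small. */
    static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)
    {
            int page_node = page_to_nid(virt_to_page(objp));
            int node = numa_mem_id();

            if (likely(node == page_node))
                    return 0;               /* common case */

            return __cache_free_alien(cachep, objp, node, page_node);
    }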
     
  • The intention of __ac_put_obj() is that it doesn't affect anything if
    sk_memalloc_socks() is disabled. But because __ac_put_obj() is so small,
    the compiler inlines it into ac_put_obj() and affects the code size of
    the free path. This patch adds the noinline keyword to __ac_put_obj() so
    that it does not disrupt the normal free path at all.

    nm -S slab-orig.o |
    grep -e "t cache_alloc_refill" -e "T kfree" -e "T kmem_cache_free"

    0000000000001e80 00000000000002f5 t cache_alloc_refill
    0000000000001230 0000000000000258 T kfree
    0000000000000690 000000000000024c T kmem_cache_free

    nm -S slab-patched.o |
    grep -e "t cache_alloc_refill" -e "T kfree" -e "T kmem_cache_free"

    0000000000001e00 00000000000002e5 t cache_alloc_refill
    00000000000011e0 0000000000000228 T kfree
    0000000000000670 0000000000000216 T kmem_cache_free

    cache_alloc_refill: 0x2f5->0x2e5
    kfree: 0x258->0x228
    kmem_cache_free: 0x24c->0x216

    The code size of each function is reduced slightly.

    Signed-off-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Now, due to a likely keyword, the compiled code of cache_flusharray() is
    placed in the unlikely.text section. Although it is an uncommon case
    compared to freeing to the cpu cache, it is a more common case than
    free_block(). But free_block() is in the normal text section. This patch
    fixes this odd situation by removing the likely keyword.

    Signed-off-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Now, we track the caller if tracing or slab debugging is enabled. If they
    are disabled, we could save the overhead of passing one extra argument by
    calling __kmalloc(_node)(), but I think that would be marginal.
    Furthermore, the default slab allocator, SLUB, doesn't use this technique,
    so I think it's okay to change this situation.

    After this change, we can turn CONFIG_DEBUG_SLAB on/off without a full
    kernel build and remove some complicated '#if' definitions. It looks
    more beneficial to me.

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • We don't need to keep the kmem_cache definition in include/linux/slab.h
    if we don't need to inline kmem_cache_size(). According to my code
    inspection, this function is only called from lc_create() in
    lib/lru_cache.c, which may be called during the initialization phase of
    something, so we don't need to inline it. Therefore, move it to
    slab_common.c and move the kmem_cache definition to an internal header.

    After this change, we can change the kmem_cache definition easily without
    a full kernel build. For instance, we can turn CONFIG_SLUB_STATS on/off
    without a full kernel build.

    [akpm@linux-foundation.org: export kmem_cache_size() to modules]
    [rdunlap@infradead.org: add header files to fix kmemcheck.c build errors]
    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Zhang Yanfei
    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim