26 Sep, 2014

40 commits

  • commit 615d6e8756c87149f2d4c1b93d471bca002bd849 upstream.

    This patch is a continuation of efforts trying to optimize find_vma(),
    avoiding potentially expensive rbtree walks to locate a vma upon faults.
    The original approach (https://lkml.org/lkml/2013/11/1/410), where the
    largest vma was also cached, ended up being too specific and random,
thus further comparison with other approaches was needed. There are
two things to consider when dealing with this, the cache hit rate and
the latency of find_vma(). Improving the hit rate does not necessarily
translate into finding the vma any faster, as the overhead of any fancy
caching scheme can be too high to consider.

    We currently cache the last used vma for the whole address space, which
    provides a nice optimization, reducing the total cycles in find_vma() by
    up to 250%, for workloads with good locality. On the other hand, this
    simple scheme is pretty much useless for workloads with poor locality.
    Analyzing ebizzy runs shows that, no matter how many threads are
    running, the mmap_cache hit rate is less than 2%, and in many situations
    below 1%.

    The proposed approach is to replace this scheme with a small per-thread
    cache, maximizing hit rates at a very low maintenance cost.
    Invalidations are performed by simply bumping up a 32-bit sequence
    number. The only expensive operation is in the rare case of a seq
    number overflow, where all caches that share the same address space are
    flushed. Upon a miss, the proposed replacement policy is based on the
    page number that contains the virtual address in question. Concretely,
    the following results are seen on an 80 core, 8 socket x86-64 box:

    1) System bootup: Most programs are single threaded, so the per-thread
    scheme does improve ~50% hit rate by just adding a few more slots to
    the cache.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline | 50.61% | 19.90 |
    | patched | 73.45% | 13.58 |
    +----------------+----------+------------------+

    2) Kernel build: This one is already pretty good with the current
    approach as we're dealing with good locality.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline | 75.28% | 11.03 |
    | patched | 88.09% | 9.31 |
    +----------------+----------+------------------+

    3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline | 70.66% | 17.14 |
    | patched | 91.15% | 12.57 |
    +----------------+----------+------------------+

    4) Ebizzy: There's a fair amount of variation from run to run, but this
    approach always shows nearly perfect hit rates, while baseline is just
about non-existent. The number of cycles can fluctuate anywhere from
~60 to ~116 billion for the baseline scheme, but this approach reduces
it considerably. For instance, with 80 threads:

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline | 1.06% | 91.54 |
    | patched | 99.97% | 14.18 |
    +----------------+----------+------------------+
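
    As a rough, hedged sketch of the mechanism described above (per-thread
    slots, sequence-number invalidation, page-number replacement policy);
    struct and helper names here are illustrative, not the upstream ones:

        #define VMACACHE_BITS  2
        #define VMACACHE_SIZE  (1U << VMACACHE_BITS)

        /* per-thread cache: a snapshot of the mm-wide sequence number plus a few slots */
        struct vmacache_sketch {
                u32 seqnum;                              /* copy of mm->vmacache_seqnum */
                struct vm_area_struct *vmas[VMACACHE_SIZE];
        };

        /*
         * Invalidation: bump the 32-bit mm-wide sequence number so all
         * per-thread copies go stale; only on the rare overflow do the
         * caches sharing this mm need an explicit flush.
         */
        static inline void vmacache_sketch_invalidate(struct mm_struct *mm)
        {
                mm->vmacache_seqnum++;
        }

        /* replacement policy on a miss: pick the slot by the page number of the address */
        static inline unsigned int vmacache_sketch_idx(unsigned long addr)
        {
                return (addr >> PAGE_SHIFT) & (VMACACHE_SIZE - 1);
        }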

    [akpm@linux-foundation.org: fix nommu build, per Davidlohr]
    [akpm@linux-foundation.org: document vmacache_valid() logic]
    [akpm@linux-foundation.org: attempt to untangle header files]
    [akpm@linux-foundation.org: add vmacache_find() BUG_ON]
    [hughd@google.com: add vmacache_valid_mm() (from Oleg)]
    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: adjust and enhance comments]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Acked-by: Linus Torvalds
    Reviewed-by: Michel Lespinasse
    Cc: Oleg Nesterov
    Tested-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Davidlohr Bueso
     
  • commit 83da7510058736c09a14b9c17ec7d851940a4332 upstream.

    Seems to be called with preemption enabled. Therefore it must use
    mod_zone_page_state instead.

    Signed-off-by: Christoph Lameter
    Reported-by: Grygorii Strashko
    Tested-by: Grygorii Strashko
    Cc: Tejun Heo
    Cc: Santosh Shilimkar
    Cc: Ingo Molnar
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Christoph Lameter
     
  • commit d5bc5fd3fcb7b8dfb431694a8c8052466504c10c upstream.

The name `max_pass' is misleading, because this variable actually keeps
the estimated number of freeable objects, not the maximal number of
objects we can scan in this pass, which can be twice that. Rename it to
reflect its actual meaning.

    Signed-off-by: Vladimir Davydov
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Vladimir Davydov
     
  • commit 99120b772b52853f9a2b829a21dd44d9b20558f1 upstream.

    When direct reclaim is executed by a process bound to a set of NUMA
    nodes, we should scan only those nodes when possible, but currently we
    will scan kmem from all online nodes even if the kmem shrinker is NUMA
    aware. That said, binding a process to a particular NUMA node won't
    prevent it from shrinking inode/dentry caches from other nodes, which is
    not good. Fix this.

    Signed-off-by: Vladimir Davydov
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Dave Chinner
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Vladimir Davydov
     
  • commit 7fcbbaf18392f0b17c95e2f033c8ccf87eecde1d upstream.

    In some testing I ran today (some fio jobs that spread over two nodes),
    we end up spending 40% of the time in filemap_check_errors(). That
    smells fishy. Looking further, this is basically what happens:

blkdev_aio_read()
  generic_file_aio_read()
    filemap_write_and_wait_range()
      if (!mapping->nr_pages)
        filemap_check_errors()

    and filemap_check_errors() always attempts two test_and_clear_bit() on
    the mapping flags, thus dirtying it for every single invocation. The
    patch below tests each of these bits before clearing them, avoiding this
    issue. In my test case (4-socket box), performance went from 1.7M IOPS
    to 4.0M IOPS.
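
    A hedged sketch of the test-before-clear idea (assuming the AS_EIO and
    AS_ENOSPC mapping flag bits): only perform the atomic read-modify-write
    when the bit is actually set, so the common error-free path no longer
    dirties the flags word on every call:

        int filemap_check_errors(struct address_space *mapping)
        {
                int ret = 0;

                /* only clear the error bits if they are actually set */
                if (test_bit(AS_ENOSPC, &mapping->flags) &&
                    test_and_clear_bit(AS_ENOSPC, &mapping->flags))
                        ret = -ENOSPC;
                if (test_bit(AS_EIO, &mapping->flags) &&
                    test_and_clear_bit(AS_EIO, &mapping->flags))
                        ret = -EIO;
                return ret;
        }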

    Signed-off-by: Jens Axboe
    Acked-by: Jeff Moyer
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Jens Axboe
     
  • commit d26914d11751b23ca2e8747725f2cae10c2f2c1b upstream.

Since put_mems_allowed() is strictly optional (it is a seqcount retry), we
don't need to evaluate the function if the allocation was in fact
successful, saving a smp_rmb, some loads and comparisons on some relatively
fast paths.

Since the naming of get/put_mems_allowed() does suggest a mandatory
pairing, rename the interface, as suggested by Mel, to resemble the
seqcount interface.

    This gives us: read_mems_allowed_begin() and read_mems_allowed_retry(),
    where it is important to note that the return value of the latter call
    is inverted from its previous incarnation.
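
    A hedged usage sketch of the renamed interface in an allocation path
    (the allocation call and variable names here are only examples); note
    that read_mems_allowed_retry() returns true when mems_allowed changed
    and the failed allocation should be retried, the opposite sense of the
    old put_mems_allowed() return value:

        struct page *page;
        unsigned int cpuset_mems_cookie;

        do {
                cpuset_mems_cookie = read_mems_allowed_begin();
                page = __alloc_pages_nodemask(gfp, order, zonelist, nodemask);
                /* retry only if the allocation failed and mems_allowed changed */
        } while (!page && read_mems_allowed_retry(cpuset_mems_cookie));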

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit 6d2be915e589b58cb11418cbe1f22ff90732b6ac upstream.

Currently max_sane_readahead() returns zero on a CPU whose NUMA node
has no local memory, which leads to readahead failure. Fix this
readahead failure by returning the minimum of (requested pages, 512).
Users running applications that need readahead, such as streaming
applications, on a memoryless CPU see a considerable boost in performance.
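
    A sketch of the fix as described (512 pages with 4k pages, i.e. 2MB,
    expressed PAGE_CACHE_SIZE-independently per Andrew's note below); shown
    here only as an illustration of the capping idea:

        #define MAX_READAHEAD   ((512 * 4096) / PAGE_CACHE_SIZE)

        unsigned long max_sane_readahead(unsigned long nr)
        {
                /* no longer scaled by the local node's free pages, which can be zero */
                return min(nr, MAX_READAHEAD);
        }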

    Result:

    fadvise experiment with FADV_WILLNEED on a PPC machine having memoryless
    CPU with 1GB testfile (12 iterations) yielded around 46.66% improvement.

    fadvise experiment with FADV_WILLNEED on a x240 machine with 1GB
    testfile 32GB* 4G RAM numa machine (12 iterations) showed no impact on
    the normal NUMA cases w/ patch.

Kernel    Avg      Stddev
base      7.4975   3.92%
patched   7.4174   3.26%

    [Andrew: making return value PAGE_SIZE independent]
    Suggested-by: Linus Torvalds
    Signed-off-by: Raghavendra K T
    Acked-by: Jan Kara
    Cc: Wu Fengguang
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Raghavendra K T
     
  • commit 91ca9186484809c57303b33778d841cc28f696ed upstream.

    The cached pageblock hint should be ignored when triggering compaction
    through /proc/sys/vm/compact_memory so all eligible memory is isolated.
Manually invoking compaction is known to be expensive; there's no need
to skip pageblocks based on heuristics (mainly for debugging).

    Signed-off-by: David Rientjes
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    David Rientjes
     
  • commit da1c67a76f7cf2b3404823d24f9f10fa91aa5dc5 upstream.

    The conditions that control the isolation mode in
    isolate_migratepages_range() do not change during the iteration, so
    extract them out and only define the value once.

This actually does have an effect: gcc doesn't optimize it itself because
of cc->sync.
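
    A hedged sketch of the hoisted computation (flag names as used by the
    isolation code of that era); the point is that the mode is computed
    once, outside the per-pfn loop, instead of being re-evaluated for every
    page:

        /* computed once before the pfn loop in isolate_migratepages_range() */
        const isolate_mode_t mode = (!cc->sync ? ISOLATE_ASYNC_MIGRATE : 0) |
                                    (unevictable ? ISOLATE_UNEVICTABLE : 0);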

    Signed-off-by: David Rientjes
    Cc: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: Vlastimil Babka
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    David Rientjes
     
  • commit b6c750163c0d138f5041d95fcdbd1094b6928057 upstream.

    It is just for clean-up to reduce code size and improve readability.
    There is no functional change.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Joonsoo Kim
     
  • commit c122b2087ab94192f2b937e47b563a9c4e688ece upstream.

isolation_suitable() and migrate_async_suitable() are used to be sure
that a pageblock range is fine to be migrated. There is no need to call
them on every page. The current code does well when the pageblock is not
suitable, but not when it is suitable.

1) It re-checks isolation_suitable() on each page of a pageblock that was
already established as suitable.
2) It re-checks migrate_async_suitable() on each page of a pageblock that
was not entered through the next_pageblock: label, because
last_pageblock_nr is not otherwise updated.

This patch fixes the situation by 1) calling isolation_suitable() only once
per pageblock and 2) always updating last_pageblock_nr to the pageblock
that was just checked.

Additionally, move the PageBuddy() check after the pageblock unit check,
since the pageblock check is the first thing we should do and this makes
things simpler.

    [vbabka@suse.cz: rephrase commit description]
    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Joonsoo Kim
     
  • commit be1aa03b973c7dcdc576f3503f7a60429825c35d upstream.

It is odd to drop the spinlock when we scan the (SWAP_CLUSTER_MAX - 1)th
pfn page. This may result in the situation below while isolating
migratepages.

1. Try to isolate pfn pages 0x0 ~ 0x200.
2. When low_pfn is 0x1ff, ((low_pfn+1) % SWAP_CLUSTER_MAX) == 0, so drop
the spinlock.
3. Then, to complete isolating, retry to acquire the lock.

It is better to use the SWAP_CLUSTER_MAX'th pfn as the criterion for
dropping the lock. This does no harm for the 0x0 pfn, because at that
point the locked variable would be false.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Joonsoo Kim
     
  • commit 01ead5340bcf5f3a1cd2452c75516d0ef4d908d7 upstream.

suitable_migration_target() checks whether a pageblock is a suitable
migration target. In isolate_freepages_block() it is called on every
page, which is inefficient, so make it be called once per pageblock.

suitable_migration_target() also checks whether the page is high-order or
not, but its criterion for high-order is pageblock order. So calling it
once within a pageblock range causes no problem.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Joonsoo Kim
     
  • commit 7d348b9ea64db0a315d777ce7d4b06697f946503 upstream.

The purpose of compaction is to get a high-order page. Currently, if we
find a high-order page while searching for migration target pages, we
break it into order-0 pages and use them as migration targets. This is
contrary to the purpose of compaction, so disallow high-order pages from
being used as migration targets.

Additionally, clean up the logic in suitable_migration_target() to simplify
the code. There are no functional changes from this clean-up.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Joonsoo Kim
     
  • commit 119d6d59dcc0980dcd581fdadb6b2033b512a473 upstream.

Page migration will fail for memory that is pinned in memory with, for
example, get_user_pages(). In this case, it is unnecessary to take
zone->lru_lock or isolate the page and pass it to page migration, which
will ultimately fail.

    This is a racy check, the page can still change from under us, but in
    that case we'll just fail later when attempting to move the page.

    This avoids very expensive memory compaction when faulting transparent
    hugepages after pinning a lot of memory with a Mellanox driver.
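
    A hedged sketch of the racy skip described above, wrapped here in an
    illustrative helper (not an upstream function name): an anonymous page
    with more references than mappings is pinned (e.g. by get_user_pages())
    and migration would fail anyway, so skip it before taking zone->lru_lock:

        static bool compaction_skip_pinned_page(struct page *page)
        {
                /*
                 * Admittedly racy: the counts can change under us, but then
                 * the migration attempt simply fails later as it would have
                 * anyway.
                 */
                return !page_mapping(page) &&
                       page_count(page) > page_mapcount(page);
        }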

    On a 128GB machine and pinning ~120GB of memory, before this patch we
    see the enormous disparity in the number of page migration failures
    because of the pinning (from /proc/vmstat):

    compact_pages_moved 8450
    compact_pagemigrate_failed 15614415

    0.05% of pages isolated are successfully migrated and explicitly
    triggering memory compaction takes 102 seconds. After the patch:

    compact_pages_moved 9197
    compact_pagemigrate_failed 7

    99.9% of pages isolated are now successfully migrated in this
    configuration and memory compaction takes less than one second.

    Signed-off-by: David Rientjes
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Rik van Riel
    Cc: Greg Thelen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    David Rientjes
     
  • commit 943dca1a1fcbccb58de944669b833fd38a6c809b upstream.

Yasuaki Ishimatsu reported that memory hot-add spent more than 5 _hours_
on a 9TB memory machine since onlining memory sections is too slow. And
we found out that setup_zone_migrate_reserve spent >90% of the time.

The problem is that setup_zone_migrate_reserve scans all pageblocks
unconditionally, but it is only necessary if the number of reserved
blocks was reduced (i.e. memory hot remove).

Moreover, the maximum MIGRATE_RESERVE per zone is currently 2. It means
that the number of reserved pageblocks is almost always unchanged.

    This patch adds zone->nr_migrate_reserve_block to maintain the number of
    MIGRATE_RESERVE pageblocks and it reduces the overhead of
    setup_zone_migrate_reserve dramatically. The following table shows time
    of onlining a memory section.

Amount of memory     | 128GB | 192GB | 256GB |
---------------------|-------|-------|-------|
linux-3.12           |  23.9 |  31.4 |  44.5 |
This patch           |   8.3 |   8.3 |   8.6 |
Mel's proposal patch |  10.9 |  19.2 |  31.3 |
(time in milliseconds)

    128GB : 4 nodes and each node has 32GB of memory
    192GB : 6 nodes and each node has 32GB of memory
    256GB : 8 nodes and each node has 32GB of memory

    (*1) Mel proposed his idea by the following threads.
    https://lkml.org/lkml/2013/10/30/272

    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Yasuaki Ishimatsu
    Reported-by: Yasuaki Ishimatsu
    Tested-by: Yasuaki Ishimatsu
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Yasuaki Ishimatsu
     
  • commit ec97097bca147d5718a5d2c024d1ec740b10096d upstream.

    If a shrinker is not NUMA-aware, shrink_slab() should call it exactly
    once with nid=0, but currently it is not true: if node 0 is not set in
    the nodemask or if it is not online, we will not call such shrinkers at
    all. As a result some slabs will be left untouched under some
    circumstances. Let us fix it.
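
    A hedged sketch of the intended behaviour inside shrink_slab()'s
    per-shrinker loop (shrink_control field names as of that kernel): a
    shrinker without SHRINKER_NUMA_AWARE is invoked exactly once with nid 0,
    regardless of which nodes are in the reclaim nodemask:

        if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) {
                shrinkctl->nid = 0;          /* call exactly once, node id ignored */
                freed += shrink_slab_node(shrinkctl, shrinker,
                                          nr_pages_scanned, lru_pages);
                continue;
        }

        for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan)
                if (node_online(shrinkctl->nid))
                        freed += shrink_slab_node(shrinkctl, shrinker,
                                                  nr_pages_scanned, lru_pages);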

    Signed-off-by: Vladimir Davydov
    Reported-by: Dave Chinner
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Vladimir Davydov
     
  • commit 0b1fb40a3b1291f2f12f13f644ac95cf756a00e6 upstream.

    When reclaiming kmem, we currently don't scan slabs that have less than
    batch_size objects (see shrink_slab_node()):

while (total_scan >= batch_size) {
        shrinkctl->nr_to_scan = batch_size;
        shrinker->scan_objects(shrinker, shrinkctl);
        total_scan -= batch_size;
}

If there are only a few shrinkers available, such a behavior won't cause
any problems, because the batch_size is usually small, but if we have a
lot of slab shrinkers, which is perfectly possible since FS shrinkers
are now per-superblock, we can end up with hundreds of megabytes of
practically unreclaimable kmem objects. For instance, mounting a
thousand ext2 FS images with a hundred files in each and iterating
over all the files using du(1) will result in about 200 MB of FS caches
that cannot be dropped even with the aid of the vm.drop_caches sysctl!

    This problem was initially pointed out by Glauber Costa [*]. Glauber
    proposed to fix it by making the shrink_slab() always take at least one
    pass, to put it simply, turning the scan loop above to a do{}while()
    loop. However, this proposal was rejected, because it could result in
    more aggressive and frequent slab shrinking even under low memory
    pressure when total_scan is naturally very small.

    This patch is a slightly modified version of Glauber's approach.
    Similarly to Glauber's patch, it makes shrink_slab() scan less than
    batch_size objects, but only if the total number of objects we want to
    scan (total_scan) is greater than the total number of objects available
(max_pass). Since total_scan is biased to half of max_pass when the current
delta is small:

    if (delta < max_pass / 4)
            total_scan = min(total_scan, max_pass / 2);

    this is only possible if we are scanning at high prio. That said, this
    patch shouldn't change the vmscan behaviour if the memory pressure is
    low, but if we are tight on memory, we will do our best by trying to
    reclaim all available objects, which sounds reasonable.

    [*] http://www.spinics.net/lists/cgroups/msg06913.html
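
    A hedged sketch of the adjusted scan loop (compare with the while loop
    quoted above; error handling such as SHRINK_STOP is omitted): the extra
    "total_scan >= max_pass" condition lets one final, smaller-than-batch_size
    pass run when we want to scan more objects than actually exist, so small
    caches are no longer left untouched forever:

        while (total_scan >= batch_size ||
               total_scan >= max_pass) {
                unsigned long nr_to_scan = min(batch_size, total_scan);

                shrinkctl->nr_to_scan = nr_to_scan;
                shrinker->scan_objects(shrinker, shrinkctl);
                total_scan -= nr_to_scan;
        }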

    Signed-off-by: Vladimir Davydov
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Dave Chinner
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Vladimir Davydov
     
  • commit 579f82901f6f41256642936d7e632f3979ad76d4 upstream.

    This is a patch to improve swap readahead algorithm. It's from Hugh and
    I slightly changed it.

    Hugh's original changelog:

    swapin readahead does a blind readahead, whether or not the swapin is
    sequential. This may be ok on harddisk, because large reads have
    relatively small costs, and if the readahead pages are unneeded they can
    be reclaimed easily - though, what if their allocation forced reclaim of
    useful pages? But on SSD devices large reads are more expensive than
    small ones: if the readahead pages are unneeded, reading them in caused
    significant overhead.

    This patch adds very simplistic random read detection. Stealing the
    PageReadahead technique from Konstantin Khlebnikov's patch, avoiding the
    vma/anon_vma sophistications of Shaohua Li's patch, swapin_nr_pages()
    simply looks at readahead's current success rate, and narrows or widens
    its readahead window accordingly. There is little science to its
    heuristic: it's about as stupid as can be whilst remaining effective.
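
    A much-simplified, hedged sketch of the heuristic (not the exact
    upstream swapin_nr_pages(); the constants here are illustrative): the
    window grows with recent readahead hits, is capped at 1 << page_cluster,
    and never shrinks by more than half at a time, reflecting Shaohua's
    tweak mentioned further below:

        static atomic_t swapin_readahead_hits = ATOMIC_INIT(4);
        static unsigned int last_readahead_pages = 8;

        static unsigned int swapin_nr_pages_sketch(void)
        {
                unsigned int max_pages = 1 << page_cluster;
                unsigned int hits = atomic_xchg(&swapin_readahead_hits, 0);
                unsigned int pages = roundup_pow_of_two(hits + 2);

                if (pages > max_pages)
                        pages = max_pages;                 /* never exceed 1 << page_cluster */
                if (pages < last_readahead_pages / 2)
                        pages = last_readahead_pages / 2;  /* don't shrink too fast */
                last_readahead_pages = pages;
                return pages;
        }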

    The table below shows elapsed times (in centiseconds) when running a
    single repetitive swapping load across a 1000MB mapping in 900MB ram
    with 1GB swap (the harddisk tests had taken painfully too long when I
    used mem=500M, but SSD shows similar results for that).

    Vanilla is the 3.6-rc7 kernel on which I started; Shaohua denotes his
    Sep 3 patch in mmotm and linux-next; HughOld denotes my Oct 1 patch
    which Shaohua showed to be defective; HughNew this Nov 14 patch, with
    page_cluster as usual at default of 3 (8-page reads); HughPC4 this same
    patch with page_cluster 4 (16-page reads); HughPC0 with page_cluster 0
    (1-page reads: no readahead).

    HDD for swapping to harddisk, SSD for swapping to VertexII SSD. Seq for
    sequential access to the mapping, cycling five times around; Rand for
    the same number of random touches. Anon for a MAP_PRIVATE anon mapping;
    Shmem for a MAP_SHARED anon mapping, equivalent to tmpfs.

    One weakness of Shaohua's vma/anon_vma approach was that it did not
    optimize Shmem: seen below. Konstantin's approach was perhaps mistuned,
    50% slower on Seq: did not compete and is not shown below.

    HDD Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
    Seq Anon 73921 76210 75611 76904 78191 121542
    Seq Shmem 73601 73176 73855 72947 74543 118322
    Rand Anon 895392 831243 871569 845197 846496 841680
    Rand Shmem 1058375 1053486 827935 764955 764376 756489

    SSD Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
    Seq Anon 24634 24198 24673 25107 21614 70018
    Seq Shmem 24959 24932 25052 25703 22030 69678
    Rand Anon 43014 26146 28075 25989 26935 25901
    Rand Shmem 45349 45215 28249 24268 24138 24332

    These tests are, of course, two extremes of a very simple case: under
    heavier mixed loads I've not yet observed any consistent improvement or
    degradation, and wider testing would be welcome.

    Shaohua Li:

Tests show Vanilla is slightly better in the sequential workload than
Hugh's patch. I observed that with Hugh's patch the readahead size is
sometimes shrunk too fast (from 8 to 1 immediately) in the sequential
workload if there is no hit. And in such a case, continuing to do
readahead is actually good.

I didn't prepare a sophisticated algorithm for the sequential workload
because so far we can't guarantee that sequentially accessed pages are
swapped out sequentially. So I slightly changed Hugh's heuristic: don't
shrink the readahead size too fast.

    Here is my test result (unit second, 3 runs average):
    Vanilla Hugh New
    Seq 356 370 360
    Random 4525 2447 2444

The attached graph is the swapin/swapout throughput I collected with
'vmstat 2'. The first part is running a random workload (till around 1200
on the x-axis) and the second part is running a sequential workload.
Swapin and swapout throughput are almost identical in steady state in
both workloads; this is the expected behavior, while in Vanilla swapin
is much bigger than swapout, especially in the random workload (because
of wrong readahead).

    Original patches by: Shaohua Li and Konstantin Khlebnikov.

    [fengguang.wu@intel.com: swapin_nr_pages() can be static]
    Signed-off-by: Hugh Dickins
    Signed-off-by: Shaohua Li
    Signed-off-by: Fengguang Wu
    Cc: Rik van Riel
    Cc: Wu Fengguang
    Cc: Minchan Kim
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Shaohua Li
     
  • commit 55b7c4c99f6a448f72179297fe6432544f220063 upstream.

Compaction used to start its migrate and free page scanners at the zone's
    lowest and highest pfn, respectively. Later, caching was introduced to
    remember the scanners' progress across compaction attempts so that
    pageblocks are not re-scanned uselessly. Additionally, pageblocks where
    isolation failed are marked to be quickly skipped when encountered again
    in future compactions.

Currently, both the reset of cached pfn's and the clearing of the
pageblock skip information for a zone are done in __reset_isolation_suitable().
    This function gets called when:

    - compaction is restarting after being deferred
    - compact_blockskip_flush flag is set in compact_finished() when the scanners
    meet (and not again cleared when direct compaction succeeds in allocation)
    and kswapd acts upon this flag before going to sleep

    This behavior is suboptimal for several reasons:

    - when direct sync compaction is called after async compaction fails (in the
    allocation slowpath), it will effectively do nothing, unless kswapd
    happens to process the compact_blockskip_flush flag meanwhile. This is racy
    and goes against the purpose of sync compaction to more thoroughly retry
    the compaction of a zone where async compaction has failed.
    The restart-after-deferring path cannot help here as deferring happens only
    after the sync compaction fails. It is also done only for the preferred
    zone, while the compaction might be done for a fallback zone.

    - the mechanism of marking pageblock to be skipped has little value since the
    cached pfn's are reset only together with the pageblock skip flags. This
    effectively limits pageblock skip usage to parallel compactions.

    This patch changes compact_finished() so that cached pfn's are reset
    immediately when the scanners meet. Clearing pageblock skip flags is
    unchanged, as well as the other situations where cached pfn's are reset.
This allows the sync-after-async compaction to retry pageblocks not
marked as skipped, such as !MIGRATE_MOVABLE blocks that async
compaction now skips without marking them.

    Signed-off-by: Vlastimil Babka
    Cc: Rik van Riel
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Vlastimil Babka
     
  • commit 50b5b094e683f8e51e82c6dfe97b1608cf97e6c0 upstream.

    Compaction temporarily marks pageblocks where it fails to isolate pages
    as to-be-skipped in further compactions, in order to improve efficiency.
    One of the reasons to fail isolating pages is that isolation is not
    attempted in pageblocks that are not of MIGRATE_MOVABLE (or CMA) type.

    The problem is that blocks skipped due to not being MIGRATE_MOVABLE in
    async compaction become skipped due to the temporary mark also in future
    sync compaction. Moreover, this may follow quite soon during
__alloc_pages_slowpath, without much time for kswapd to clear the
    pageblock skip marks. This goes against the idea that sync compaction
    should try to scan these blocks more thoroughly than the async
    compaction.

    The fix is to ensure in async compaction that these !MIGRATE_MOVABLE
    blocks are not marked to be skipped. Note this should not affect
    performance or locking impact of further async compactions, as skipping
    a block due to being !MIGRATE_MOVABLE is done soon after skipping a
    block marked to be skipped, both without locking.

    Signed-off-by: Vlastimil Babka
    Cc: Rik van Riel
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Vlastimil Babka
     
  • commit de6c60a6c115acaa721cfd499e028a413d1fcbf3 upstream.

    Currently there are several functions to manipulate the deferred
    compaction state variables. The remaining case where the variables are
    touched directly is when a successful allocation occurs in direct
    compaction, or is expected to be successful in the future by kswapd.
    Here, the lowest order that is expected to fail is updated, and in the
    case of successful allocation, the deferred status and counter is reset
    completely.

    Create a new function compaction_defer_reset() to encapsulate this
    functionality and make it easier to understand the code. No functional
    change.
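
    A hedged sketch of what such a helper looks like (zone field names as
    used by the existing defer functions): on success the considered/defer
    counters are cleared, and the lowest order expected to fail is raised
    above the current order:

        static inline void compaction_defer_reset(struct zone *zone, int order,
                                                  bool alloc_success)
        {
                if (alloc_success) {
                        zone->compact_considered = 0;
                        zone->compact_defer_shift = 0;
                }
                if (order >= zone->compact_order_failed)
                        zone->compact_order_failed = order + 1;
        }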

    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Vlastimil Babka
     
  • commit 0eb927c0ab789d3d7d69f68acb850f69d4e7c36f upstream.

    The broad goal of the series is to improve allocation success rates for
    huge pages through memory compaction, while trying not to increase the
    compaction overhead. The original objective was to reintroduce
    capturing of high-order pages freed by the compaction, before they are
    split by concurrent activity. However, several bugs and opportunities
    for simple improvements were found in the current implementation, mostly
    through extra tracepoints (which are however too ugly for now to be
    considered for sending).

    The patches mostly deal with two mechanisms that reduce compaction
    overhead, which is caching the progress of migrate and free scanners,
    and marking pageblocks where isolation failed to be skipped during
    further scans.

Patch 1 (from mgorman) adds tracepoints that allow calculating the time spent
in compaction and potentially debugging scanner pfn values.

Patch 2 encapsulates some of the functionality for handling deferred
compaction for better maintainability, without a functional change.

    Patch 3 fixes a bug where cached scanner pfn's are sometimes reset only after
    they have been read to initialize a compaction run.

    Patch 4 fixes a bug where scanners meeting is sometimes not properly detected
    and can lead to multiple compaction attempts quitting early without
    doing any work.

    Patch 5 improves the chances of sync compaction to process pageblocks that
    async compaction has skipped due to being !MIGRATE_MOVABLE.

    Patch 6 improves the chances of sync direct compaction to actually do anything
    when called after async compaction fails during allocation slowpath.

The impact of the patches was validated using mmtests's stress-highalloc
benchmark on an x86_64 machine with 4GB memory.

    Due to instability of the results (mostly related to the bugs fixed by
    patches 2 and 3), 10 iterations were performed, taking min,mean,max
    values for success rates and mean values for time and vmstat-based
    metrics.

    First, the default GFP_HIGHUSER_MOVABLE allocations were tested with the
    patches stacked on top of v3.13-rc2. Patch 2 is OK to serve as baseline
    due to no functional changes in 1 and 2. Comments below.

    stress-highalloc
    3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2
    2-nothp 3-nothp 4-nothp 5-nothp 6-nothp
    Success 1 Min 9.00 ( 0.00%) 10.00 (-11.11%) 43.00 (-377.78%) 43.00 (-377.78%) 33.00 (-266.67%)
    Success 1 Mean 27.50 ( 0.00%) 25.30 ( 8.00%) 45.50 (-65.45%) 45.90 (-66.91%) 46.30 (-68.36%)
    Success 1 Max 36.00 ( 0.00%) 36.00 ( 0.00%) 47.00 (-30.56%) 48.00 (-33.33%) 52.00 (-44.44%)
    Success 2 Min 10.00 ( 0.00%) 8.00 ( 20.00%) 46.00 (-360.00%) 45.00 (-350.00%) 35.00 (-250.00%)
    Success 2 Mean 26.40 ( 0.00%) 23.50 ( 10.98%) 47.30 (-79.17%) 47.60 (-80.30%) 48.10 (-82.20%)
    Success 2 Max 34.00 ( 0.00%) 33.00 ( 2.94%) 48.00 (-41.18%) 50.00 (-47.06%) 54.00 (-58.82%)
    Success 3 Min 65.00 ( 0.00%) 63.00 ( 3.08%) 85.00 (-30.77%) 84.00 (-29.23%) 85.00 (-30.77%)
    Success 3 Mean 76.70 ( 0.00%) 70.50 ( 8.08%) 86.20 (-12.39%) 85.50 (-11.47%) 86.00 (-12.13%)
    Success 3 Max 87.00 ( 0.00%) 86.00 ( 1.15%) 88.00 ( -1.15%) 87.00 ( 0.00%) 87.00 ( 0.00%)

    3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2
    2-nothp 3-nothp 4-nothp 5-nothp 6-nothp
    User 6437.72 6459.76 5960.32 5974.55 6019.67
    System 1049.65 1049.09 1029.32 1031.47 1032.31
    Elapsed 1856.77 1874.48 1949.97 1994.22 1983.15

    3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2
    2-nothp 3-nothp 4-nothp 5-nothp 6-nothp
    Minor Faults 253952267 254581900 250030122 250507333 250157829
    Major Faults 420 407 506 530 530
    Swap Ins 4 9 9 6 6
    Swap Outs 398 375 345 346 333
    Direct pages scanned 197538 189017 298574 287019 299063
    Kswapd pages scanned 1809843 1801308 1846674 1873184 1861089
    Kswapd pages reclaimed 1806972 1798684 1844219 1870509 1858622
    Direct pages reclaimed 197227 188829 298380 286822 298835
    Kswapd efficiency 99% 99% 99% 99% 99%
    Kswapd velocity 953.382 970.449 952.243 934.569 922.286
    Direct efficiency 99% 99% 99% 99% 99%
    Direct velocity 104.058 101.832 153.961 143.200 148.205
    Percentage direct scans 9% 9% 13% 13% 13%
    Zone normal velocity 347.289 359.676 348.063 339.933 332.983
    Zone dma32 velocity 710.151 712.605 758.140 737.835 737.507
    Zone dma velocity 0.000 0.000 0.000 0.000 0.000
    Page writes by reclaim 557.600 429.000 353.600 426.400 381.800
    Page writes file 159 53 7 79 48
    Page writes anon 398 375 345 346 333
    Page reclaim immediate 825 644 411 575 420
    Sector Reads 2781750 2769780 2878547 2939128 2910483
    Sector Writes 12080843 12083351 12012892 12002132 12010745
    Page rescued immediate 0 0 0 0 0
    Slabs scanned 1575654 1545344 1778406 1786700 1794073
    Direct inode steals 9657 10037 15795 14104 14645
    Kswapd inode steals 46857 46335 50543 50716 51796
    Kswapd skipped wait 0 0 0 0 0
    THP fault alloc 97 91 81 71 77
    THP collapse alloc 456 506 546 544 565
    THP splits 6 5 5 4 4
    THP fault fallback 0 1 0 0 0
    THP collapse fail 14 14 12 13 12
    Compaction stalls 1006 980 1537 1536 1548
    Compaction success 303 284 562 559 578
    Compaction failures 702 696 974 976 969
    Page migrate success 1177325 1070077 3927538 3781870 3877057
    Page migrate failure 0 0 0 0 0
    Compaction pages isolated 2547248 2306457 8301218 8008500 8200674
    Compaction migrate scanned 42290478 38832618 153961130 154143900 159141197
    Compaction free scanned 89199429 79189151 356529027 351943166 356326727
    Compaction cost 1566 1426 5312 5156 5294
    NUMA PTE updates 0 0 0 0 0
    NUMA hint faults 0 0 0 0 0
    NUMA hint local faults 0 0 0 0 0
    NUMA hint local percent 100 100 100 100 100
    NUMA pages migrated 0 0 0 0 0
    AutoNUMA cost 0 0 0 0 0

    Observations:

    - The "Success 3" line is allocation success rate with system idle
    (phases 1 and 2 are with background interference). I used to get stable
    values around 85% with vanilla 3.11. The lower min and mean values came
    with 3.12. This was bisected to commit 81c0a2bb ("mm: page_alloc: fair
    zone allocator policy") As explained in comment for patch 3, I don't
    think the commit is wrong, but that it makes the effect of compaction
    bugs worse. From patch 3 onwards, the results are OK and match the 3.11
    results.

    - Patch 4 also clearly helps phases 1 and 2, and exceeds any results
    I've seen with 3.11 (I didn't measure it that thoroughly then, but it
    was never above 40%).

    - Compaction cost and number of scanned pages is higher, especially due
    to patch 4. However, keep in mind that patches 3 and 4 fix existing
    bugs in the current design of compaction overhead mitigation, they do
    not change it. If overhead is found unacceptable, then it should be
    decreased differently (and consistently, not due to random conditions)
    than the current implementation does. In contrast, patches 5 and 6
    (which are not strictly bug fixes) do not increase the overhead (but
    also not success rates). This might be a limitation of the
    stress-highalloc benchmark as it's quite uniform.

Another set of results is from configuring stress-highalloc to allocate
with flags similar to what THP uses:
    (GFP_HIGHUSER_MOVABLE|__GFP_NOMEMALLOC|__GFP_NORETRY|__GFP_NO_KSWAPD)

    stress-highalloc
    3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2
    2-thp 3-thp 4-thp 5-thp 6-thp
    Success 1 Min 2.00 ( 0.00%) 7.00 (-250.00%) 18.00 (-800.00%) 19.00 (-850.00%) 26.00 (-1200.00%)
    Success 1 Mean 19.20 ( 0.00%) 17.80 ( 7.29%) 29.20 (-52.08%) 29.90 (-55.73%) 32.80 (-70.83%)
    Success 1 Max 27.00 ( 0.00%) 29.00 ( -7.41%) 35.00 (-29.63%) 36.00 (-33.33%) 37.00 (-37.04%)
    Success 2 Min 3.00 ( 0.00%) 8.00 (-166.67%) 21.00 (-600.00%) 21.00 (-600.00%) 32.00 (-966.67%)
    Success 2 Mean 19.30 ( 0.00%) 17.90 ( 7.25%) 32.20 (-66.84%) 32.60 (-68.91%) 35.70 (-84.97%)
    Success 2 Max 27.00 ( 0.00%) 30.00 (-11.11%) 36.00 (-33.33%) 37.00 (-37.04%) 39.00 (-44.44%)
    Success 3 Min 62.00 ( 0.00%) 62.00 ( 0.00%) 85.00 (-37.10%) 75.00 (-20.97%) 64.00 ( -3.23%)
    Success 3 Mean 66.30 ( 0.00%) 65.50 ( 1.21%) 85.60 (-29.11%) 83.40 (-25.79%) 83.50 (-25.94%)
    Success 3 Max 70.00 ( 0.00%) 69.00 ( 1.43%) 87.00 (-24.29%) 86.00 (-22.86%) 87.00 (-24.29%)

    3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2
    2-thp 3-thp 4-thp 5-thp 6-thp
    User 6547.93 6475.85 6265.54 6289.46 6189.96
    System 1053.42 1047.28 1043.23 1042.73 1038.73
    Elapsed 1835.43 1821.96 1908.67 1912.74 1956.38

    3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2
    2-thp 3-thp 4-thp 5-thp 6-thp
    Minor Faults 256805673 253106328 253222299 249830289 251184418
    Major Faults 395 375 423 434 448
    Swap Ins 12 10 10 12 9
    Swap Outs 530 537 487 455 415
    Direct pages scanned 71859 86046 153244 152764 190713
    Kswapd pages scanned 1900994 1870240 1898012 1892864 1880520
    Kswapd pages reclaimed 1897814 1867428 1894939 1890125 1877924
    Direct pages reclaimed 71766 85908 153167 152643 190600
    Kswapd efficiency 99% 99% 99% 99% 99%
    Kswapd velocity 1029.000 1067.782 1000.091 991.049 951.218
    Direct efficiency 99% 99% 99% 99% 99%
    Direct velocity 38.897 49.127 80.747 79.983 96.468
    Percentage direct scans 3% 4% 7% 7% 9%
    Zone normal velocity 351.377 372.494 348.910 341.689 335.310
    Zone dma32 velocity 716.520 744.414 731.928 729.343 712.377
    Zone dma velocity 0.000 0.000 0.000 0.000 0.000
    Page writes by reclaim 669.300 604.000 545.700 538.900 429.900
    Page writes file 138 66 58 83 14
    Page writes anon 530 537 487 455 415
    Page reclaim immediate 806 655 772 548 517
    Sector Reads 2711956 2703239 2811602 2818248 2839459
    Sector Writes 12163238 12018662 12038248 11954736 11994892
    Page rescued immediate 0 0 0 0 0
    Slabs scanned 1385088 1388364 1507968 1513292 1558656
    Direct inode steals 1739 2564 4622 5496 6007
    Kswapd inode steals 47461 46406 47804 48013 48466
    Kswapd skipped wait 0 0 0 0 0
    THP fault alloc 110 82 84 69 70
    THP collapse alloc 445 482 467 462 539
    THP splits 6 5 4 5 3
    THP fault fallback 3 0 0 0 0
    THP collapse fail 15 14 14 14 13
    Compaction stalls 659 685 1033 1073 1111
    Compaction success 222 225 410 427 456
    Compaction failures 436 460 622 646 655
    Page migrate success 446594 439978 1085640 1095062 1131716
    Page migrate failure 0 0 0 0 0
    Compaction pages isolated 1029475 1013490 2453074 2482698 2565400
    Compaction migrate scanned 9955461 11344259 24375202 27978356 30494204
    Compaction free scanned 27715272 28544654 80150615 82898631 85756132
    Compaction cost 552 555 1344 1379 1436
    NUMA PTE updates 0 0 0 0 0
    NUMA hint faults 0 0 0 0 0
    NUMA hint local faults 0 0 0 0 0
    NUMA hint local percent 100 100 100 100 100
    NUMA pages migrated 0 0 0 0 0
    AutoNUMA cost 0 0 0 0 0

    There are some differences from the previous results for THP-like allocations:

    - Here, the bad result for unpatched kernel in phase 3 is much more
    consistent to be between 65-70% and not related to the "regression" in
    3.12. Still there is the improvement from patch 4 onwards, which brings
    it on par with simple GFP_HIGHUSER_MOVABLE allocations.

- Compaction costs have increased, but nowhere near as much as in the
non-THP case. Again, the patches should be worth the gained
determinism.

    - Patches 5 and 6 somewhat increase the number of migrate-scanned pages.
    This is most likely due to __GFP_NO_KSWAPD flag, which means the cached
    pfn's and pageblock skip bits are not reset by kswapd that often (at
    least in phase 3 where no concurrent activity would wake up kswapd) and
    the patches thus help the sync-after-async compaction. It doesn't
    however show that the sync compaction would help so much with success
    rates, which can be again seen as a limitation of the benchmark
    scenario.

    This patch (of 6):

    Add two tracepoints for compaction begin and end of a zone. Using this it
    is possible to calculate how much time a workload is spending within
    compaction and potentially debug problems related to cached pfns for
    scanning. In combination with the direct reclaim and slab trace points it
    should be possible to estimate most allocation-related overhead for a
    workload.

    Signed-off-by: Mel Gorman
    Signed-off-by: Vlastimil Babka
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit 71b54f8263860a37dd9f50f81880a9d681fd9c10 upstream.

    When choosing between doing an address space or ranged flush,
    the x86 implementation of flush_tlb_mm_range takes into account
    whether there are any large pages in the range. A per-page
flush typically requires fewer entries than would be covered by a
single large page, so the check is redundant.

    There is one potential exception. THP migration flushes single
    THP entries and it conceivably would benefit from flushing a
    single entry instead of the mm. However, this flush is after a
    THP allocation, copy and page table update potentially with any
    other threads serialised behind it. In comparison to that, the
    flush is noise. It makes more sense to optimise balancing to
    require fewer flushes than to optimise the flush itself.

    This patch deletes the redundant huge page check.

    Signed-off-by: Mel Gorman
    Tested-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Cc: Alex Shi
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-sgei1drpOcburujPsfh6ovmo@git.kernel.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit 15aa368255f249df0b2af630c9487bb5471bd7da upstream.

    NR_TLB_LOCAL_FLUSH_ALL is not always accounted for correctly and
    the comparison with total_vm is done before taking
    tlb_flushall_shift into account. Clean it up.

    Signed-off-by: Mel Gorman
    Tested-by: Davidlohr Bueso
    Reviewed-by: Alex Shi
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/n/tip-Iz5gcahrgskIldvukulzi0hh@git.kernel.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit ec65993443736a5091b68e80ff1734548944a4b8 upstream.

    Bisection between 3.11 and 3.12 fingered commit 9824cf97 ("mm:
    vmstats: tlb flush counters") to cause overhead problems.

    The counters are undeniably useful but how often do we really
    need to debug TLB flush related issues? It does not justify
    taking the penalty everywhere so make it a debugging option.

    Signed-off-by: Mel Gorman
    Tested-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Cc: Hugh Dickins
    Cc: Alex Shi
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-XzxjntugxuwpxXhcrxqqh53b@git.kernel.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit 0cbef29a782162a3896487901eca4550bfa397ef upstream.

    When __rmqueue_fallback() doesn't find a free block with the required size
    it splits a larger page and puts the rest of the page onto the free list.

But it has one serious mistake. When putting back, __rmqueue_fallback()
always uses start_migratetype if the type is not CMA. However,
__rmqueue_fallback() is only called when all of the start_migratetype
queue is empty. That said, __rmqueue_fallback() always puts back memory to
the wrong queue except when try_to_steal_freepages() changed the pageblock
type (i.e. the requested size is smaller than half of a pageblock). The
end result is that the antifragmentation framework increases fragmentation
instead of decreasing it.

    Mel's original anti fragmentation does the right thing. But commit
    47118af076f6 ("mm: mmzone: MIGRATE_CMA migration type added") broke it.

    This patch restores sane and old behavior. It also removes an incorrect
    comment which was introduced by commit fef903efcf0c ("mm/page_alloc.c:
    restructure free-page stealing code and fix a bug").

    Signed-off-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    KOSAKI Motohiro
     
  • commit 52c8f6a5aeb0bdd396849ecaa72d96f8175528f5 upstream.

In general, every tracepoint should have zero overhead when it is disabled.
However, trace_mm_page_alloc_extfrag() is one exception. It evaluates
"new_type == start_migratetype" even if the tracepoint is disabled.

However, the code can be moved into the tracepoint's TP_fast_assign(), and
TP_fast_assign() exists for exactly this purpose. This patch does that.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    KOSAKI Motohiro
     
  • commit af248a0c67457e5c6d2bcf288f07b4b2ed064f1f upstream.

The kernel's readahead algorithm sometimes interprets random read
accesses as sequential and triggers unnecessary data prefetching from
the storage device (impacting random read average latency).

In order to identify sequential cache read misses, the readahead
algorithm intends to check whether offset - previous offset == 1
(trivial sequential reads) or offset - previous offset == 0 (sequential
reads not aligned on page boundary):

    if (offset - (ra->prev_pos >> PAGE_CACHE_SHIFT) <= 1UL)

The current offset is stored in the "offset" variable of type "pgoff_t"
(unsigned long), while the previous offset is stored in "ra->prev_pos"
of type "loff_t" (long long), so the operands of the comparison are
implicitly converted to long long. Consequently, when the previous
offset is greater than the current offset (which happens on random
patterns), the if condition is true and the access is wrongly
interpreted as sequential. An unnecessary data prefetch is triggered,
impacting the average random read latency.

Storing the previous offset value in a "pgoff_t" variable (unsigned
long) fixes the sequential read detection logic.

    Signed-off-by: Damien Ramonda
    Reviewed-by: Fengguang Wu
    Acked-by: Pierre Tardy
    Acked-by: David Cohen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Damien Ramonda
     
  • commit 18ab4d4ced0817421e6db6940374cc39d28d65da upstream.

    Originally get_swap_page() started iterating through the singly-linked
    list of swap_info_structs using swap_list.next or highest_priority_index,
    which both were intended to point to the highest priority active swap
    target that was not full. The first patch in this series changed the
    singly-linked list to a doubly-linked list, and removed the logic to start
    at the highest priority non-full entry; it starts scanning at the highest
    priority entry each time, even if the entry is full.

    Replace the manually ordered swap_list_head with a plist, swap_active_head.
    Add a new plist, swap_avail_head. The original swap_active_head plist
    contains all active swap_info_structs, as before, while the new
    swap_avail_head plist contains only swap_info_structs that are active and
    available, i.e. not full. Add a new spinlock, swap_avail_lock, to protect
    the swap_avail_head list.

Mel Gorman suggested using plists since they internally handle ordering
the list entries based on priority, which is exactly what swap was doing
manually. All the ordering code is now removed, and swap_info_struct
entries are simply added to their corresponding plist and automatically
ordered correctly.

    Using a new plist for available swap_info_structs simplifies and
    optimizes get_swap_page(), which no longer has to iterate over full
swap_info_structs. Using a new spinlock for the swap_avail_head plist
allows each swap_info_struct to add or remove itself from the plist
when it becomes full or not-full; previously it could not do so because
the swap_info_struct->lock is held when it changes from full to
not-full, and the swap_lock protecting the main swap_active_head must
be ordered before any swap_info_struct->lock.

    Signed-off-by: Dan Streetman
    Acked-by: Mel Gorman
    Cc: Shaohua Li
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Dan Streetman
    Cc: Michal Hocko
    Cc: Christian Ehrhardt
    Cc: Weijie Yang
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Bob Liu
    Cc: Paul Gortmaker
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Dan Streetman
     
  • commit a75f232ce0fe38bd01301899ecd97ffd0254316a upstream.

    Add plist_requeue(), which moves the specified plist_node after all other
    same-priority plist_nodes in the list. This is essentially an optimized
    plist_del() followed by plist_add().

    This is needed by swap, which (with the next patch in this set) uses a
plist of available swap devices. When a swap device (either a swap
partition or swap file) is added to the system with swapon(), the device
is added to a plist, ordered by the swap device's priority. When swap
    needs to allocate a page from one of the swap devices, it takes the page
    from the first swap device on the plist, which is the highest priority
    swap device. The swap device is left in the plist until all its pages are
    used, and then removed from the plist when it becomes full.

    However, as described in man 2 swapon, swap must allocate pages from swap
    devices with the same priority in round-robin order; to do this, on each
    swap page allocation, swap uses a page from the first swap device in the
    plist, and then calls plist_requeue() to move that swap device entry to
    after any other same-priority swap devices. The next swap page allocation
    will again use a page from the first swap device in the plist and requeue
    it, and so on, resulting in round-robin usage of equal-priority swap
    devices.
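
    A hedged sketch of that allocation-time requeue (roughly what
    get_swap_page() does once the next patch adds swap_avail_head; the lock,
    list and member names follow that description):

        spin_lock(&swap_avail_lock);
        si = plist_first_entry(&swap_avail_head, struct swap_info_struct,
                               avail_list);
        /* rotate si behind any other swap devices of the same priority */
        plist_requeue(&si->avail_list, &swap_avail_head);
        spin_unlock(&swap_avail_lock);
        /* ... then try to allocate the swap page from si ... */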

    Also add plist_test_requeue() test function, for use by plist_test() to
    test plist_requeue() function.

    Signed-off-by: Dan Streetman
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Acked-by: Mel Gorman
    Cc: Paul Gortmaker
    Cc: Thomas Gleixner
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Dan Streetman
    Cc: Michal Hocko
    Cc: Christian Ehrhardt
    Cc: Weijie Yang
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Dan Streetman
     
  • commit fd16618e12a05df79a3439d72d5ffdac5d34f3da upstream.

    Add PLIST_HEAD() to plist.h, equivalent to LIST_HEAD() from list.h, to
    define and initialize a struct plist_head.

    Add plist_for_each_continue() and plist_for_each_entry_continue(),
    equivalent to list_for_each_continue() and list_for_each_entry_continue(),
    to iterate over a plist continuing after the current position.

    Add plist_prev() and plist_next(), equivalent to (struct list_head*)->prev
    and ->next, implemented by list_prev_entry() and list_next_entry(), to
    access the prev/next struct plist_node entry. These are needed because
    unlike struct list_head, direct access of the prev/next struct plist_node
    isn't possible; the list must be navigated via the contained struct
    list_head. e.g. instead of accessing the prev by list_prev_entry(node,
    node_list) it can be accessed by plist_prev(node).
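
    A small, hedged usage sketch of the new helpers alongside the existing
    plist API (kernel context assumed):

        static PLIST_HEAD(demo_head);           /* define and initialize a plist_head */

        static void plist_helpers_demo(void)
        {
                struct plist_node a, b;
                struct plist_node *n;

                plist_node_init(&a, 1);
                plist_node_init(&b, 2);
                plist_add(&a, &demo_head);
                plist_add(&b, &demo_head);

                n = plist_first(&demo_head);    /* lowest priority value comes first */
                n = plist_next(n);              /* step forward without touching node_list */

                /* iterate over the remaining nodes, continuing after 'n' */
                plist_for_each_continue(n, &demo_head)
                        ;
        }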

    Signed-off-by: Dan Streetman
    Acked-by: Mel Gorman
    Cc: Paul Gortmaker
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Dan Streetman
    Cc: Michal Hocko
    Cc: Christian Ehrhardt
    Cc: Weijie Yang
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Bob Liu
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Dan Streetman
     
  • commit adfab836f4908deb049a5128082719e689eed964 upstream.

    The logic controlling the singly-linked list of swap_info_struct entries
    for all active, i.e. swapon'ed, swap targets is rather complex, because:

    - it stores the entries in priority order
    - there is a pointer to the highest priority entry
    - there is a pointer to the highest priority not-full entry
    - there is a highest_priority_index variable set outside the swap_lock
    - swap entries of equal priority should be used equally

    this complexity leads to bugs such as: https://lkml.org/lkml/2014/2/13/181
    where different priority swap targets are incorrectly used equally.

    That bug probably could be solved with the existing singly-linked lists,
    but I think it would only add more complexity to the already difficult to
    understand get_swap_page() swap_list iteration logic.

    The first patch changes from a singly-linked list to a doubly-linked list
    using list_heads; the highest_priority_index and related code are removed
    and get_swap_page() starts each iteration at the highest priority
    swap_info entry, even if it's full. While this does introduce unnecessary
    list iteration (i.e. Schlemiel the painter's algorithm) in the case where
    one or more of the highest priority entries are full, the iteration and
    manipulation code is much simpler and behaves correctly re: the above bug;
    and the fourth patch removes the unnecessary iteration.

    The second patch adds some minor plist helper functions; nothing new
    really, just functions to match existing regular list functions. These
    are used by the next two patches.

    The third patch adds plist_requeue(), which is used by get_swap_page() in
    the next patch - it performs the requeueing of same-priority entries
    (which moves the entry to the end of its priority in the plist), so that
    all equal-priority swap_info_structs get used equally.
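
    A rough sketch of how that requeueing slots into the allocation path (the
    swap_avail_head/avail_list names follow the fourth-patch description
    below; locking and the restart handling are left out):

    swp_entry_t sketch_alloc_from_plist(void)
    {
            struct swap_info_struct *si, *next;
            unsigned long offset;

            plist_for_each_entry_safe(si, next, &swap_avail_head, avail_list) {
                    /*
                     * Move this entry behind any others of the same priority,
                     * so the next allocation picks a different equal-priority
                     * swap device.
                     */
                    plist_requeue(&si->avail_list, &swap_avail_head);

                    offset = scan_swap_map(si, SWAP_HAS_CACHE);
                    if (offset)
                            return swp_entry(si->type, offset);
            }
            return (swp_entry_t) { 0 };     /* all devices full */
    }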

    The fourth patch converts the main list into a plist, and adds a new plist
    that contains only swap_info entries that are both active and not full.
    As Mel suggested, using plists allows removing all the ordering code from
    swap: plists handle ordering automatically. The list naming is also
    clarified now that there are two lists, with the original list changed
    from swap_list_head to swap_active_head and the new list named
    swap_avail_head. A new spinlock is also added for the new list, so
    swap_info entries can be added or removed from the new list immediately as
    they become full or not full.

    This patch (of 4):

    Replace the singly-linked list tracking active, i.e. swapon'ed,
    swap_info_struct entries with a doubly-linked list using struct
    list_heads. Simplify the logic iterating and manipulating the list of
    entries, especially get_swap_page(), by using standard list_head
    functions, and removing the highest priority iteration logic.

    The change fixes the bug:
    https://lkml.org/lkml/2014/2/13/181
    in which different priority swap entries after the highest priority entry
    are incorrectly used equally in pairs. The swap behavior is now as
    advertised, i.e. different priority swap entries are used in order, and
    equal priority swap targets are used concurrently.
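
    In effect, the allocation path becomes a plain walk of a priority-sorted
    list; a simplified sketch (locking and the full/not-full bookkeeping
    described above are left out):

    /* Active swap devices, kept sorted by priority, highest first. */
    static LIST_HEAD(swap_list_head);

    swp_entry_t sketch_alloc_from_list(void)
    {
            struct swap_info_struct *si;
            unsigned long offset;

            list_for_each_entry(si, &swap_list_head, list) {
                    if (!(si->flags & SWP_WRITEOK))
                            continue;               /* not usable */

                    offset = scan_swap_map(si, SWAP_HAS_CACHE);
                    if (offset)                     /* 0 => device full */
                            return swp_entry(si->type, offset);
            }
            return (swp_entry_t) { 0 };             /* out of swap */
    }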

    Signed-off-by: Dan Streetman
    Acked-by: Mel Gorman
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Dan Streetman
    Cc: Michal Hocko
    Cc: Christian Ehrhardt
    Cc: Weijie Yang
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Bob Liu
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Paul Gortmaker
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Dan Streetman
     
  • commit 70ef57e6c22c3323dce179b7d0d433c479266612 upstream.

    We had a report about strange OOM killer strikes on a PPC machine
    although there was a lot of free swap and tons of anonymous memory
    which could have been swapped out. In the end it turned out that the
    OOM was a side effect of zone reclaim, which wasn't unmapping and
    swapping out, so the system was pushed to the OOM. Although this
    sounds like a bug somewhere in the kswapd vs. zone reclaim vs. direct
    reclaim interaction, numactl output on the hardware in question
    suggests that zone reclaim should not have been enabled in the first
    place:

    node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    node 0 size: 0 MB
    node 0 free: 0 MB
    node 2 cpus:
    node 2 size: 7168 MB
    node 2 free: 6019 MB
    node distances:
    node   0   2
      0:  10  40
      2:  40  10

    So all the CPUs are associated with node 0, which doesn't have any
    memory, while node 2 contains all the available memory. The node
    distances cause zone_reclaim_mode to be enabled automatically.

    Zone reclaim is intended to keep allocations local, but this doesn't
    make any sense on memoryless nodes. So let's exclude such nodes from
    init_zone_allows_reclaim(), which evaluates zone reclaim behavior and
    the suitable reclaim_nodes.
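
    Concretely, the change amounts to considering only nodes that actually
    have memory when reclaim_nodes is built; a sketch along these lines (not
    the verbatim diff):

    static void init_zone_allows_reclaim(int nid)
    {
            int i;

            /*
             * Only consider nodes with memory; a distant memoryless node
             * must not flip zone_reclaim_mode on for everybody.
             */
            for_each_node_state(i, N_MEMORY)
                    if (node_distance(nid, i) <= RECLAIM_DISTANCE)
                            node_set(i, NODE_DATA(nid)->reclaim_nodes);
                    else
                            zone_reclaim_mode = 1;
    }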

    Signed-off-by: Michal Hocko
    Acked-by: David Rientjes
    Acked-by: Nishanth Aravamudan
    Tested-by: Nishanth Aravamudan
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Michal Hocko
     
  • commit 457c1b27ed56ec472d202731b12417bff023594a upstream.

    Currently, I am seeing the following when I `mount -t hugetlbfs /none
    /dev/hugetlbfs` and then simply do `ls /dev/hugetlbfs`. I think it's
    related to the fact that hugetlbfs is not correctly setting itself up
    in this state:

    Unable to handle kernel paging request for data at address 0x00000031
    Faulting instruction address: 0xc000000000245710
    Oops: Kernel access of bad area, sig: 11 [#1]
    SMP NR_CPUS=2048 NUMA pSeries
    ....

    In KVM guests on Power, in a guest not backed by hugepages, we see the
    following:

    AnonHugePages: 0 kB
    HugePages_Total: 0
    HugePages_Free: 0
    HugePages_Rsvd: 0
    HugePages_Surp: 0
    Hugepagesize: 64 kB

    HPAGE_SHIFT == 0 in this configuration, which indicates that hugepages
    are not supported at boot-time, but this is only checked in
    hugetlb_init(). Extract the check to a helper function, and use it in a
    few relevant places.
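
    One way the helper can be factored out, keying off the HPAGE_SHIFT == 0
    convention described above (a sketch, not necessarily the exact shape of
    the patch):

    /*
     * Architectures such as powerpc set HPAGE_SHIFT to 0 at boot when
     * huge pages are unavailable, so support reduces to this one test.
     */
    static inline bool hugepages_supported(void)
    {
            return HPAGE_SHIFT != 0;
    }

    static int __init hugetlb_init_sketch(void)
    {
            if (!hugepages_supported())
                    return 0;       /* nothing to register on this system */

            /* normal hugetlb / hugetlbfs initialization continues here */
            return 0;
    }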

    This does make hugetlbfs not supported (not registered at all) in this
    environment. I believe this is fine, as there are no valid hugepages
    and that won't change at runtime.

    [akpm@linux-foundation.org: use pr_info(), per Mel]
    [akpm@linux-foundation.org: fix build when HPAGE_SHIFT is undefined]
    Signed-off-by: Nishanth Aravamudan
    Reviewed-by: Aneesh Kumar K.V
    Acked-by: Mel Gorman
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Nishanth Aravamudan
     
  • commit 887843961c4b4681ee993c36d4997bf4b4aa8253 upstream.

    Fix some "Bad rss-counter state" reports on exit, arising from the
    interaction between page migration and remap_file_pages(): zap_pte()
    must count a migration entry when zapping it.

    And yes, it is possible (though very unusual) to find an anon page or
    swap entry in a VM_SHARED nonlinear mapping: coming from that horrid
    get_user_pages(write, force) case which COWs even in a shared mapping.
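
    Conceptually, the non-present branch of zap_pte() has to account for a
    migration entry before freeing it; a sketch of the idea (not the exact
    diff):

    /* pte is not present and not a file pte at this point. */
    swp_entry_t entry = pte_to_swp_entry(pte);

    if (is_migration_entry(entry)) {
            struct page *page = migration_entry_to_page(entry);

            /*
             * The migration entry still accounts a page in this mm, so
             * drop the rss counter just as for a present pte; otherwise
             * exit complains about "Bad rss-counter state".
             */
            dec_mm_counter(mm, PageAnon(page) ? MM_ANONPAGES : MM_FILEPAGES);
    }
    free_swap_and_cache(entry);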

    Signed-off-by: Hugh Dickins
    Tested-by: Sasha Levin
    Tested-by: Dave Jones
    Cc: Cyrill Gorcunov
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Hugh Dickins
     
  • commit da8c757b080ee84f219fa2368cb5dd23ac304fc0 upstream.

    If you echo -1 > /proc/sys/vm/min_free_kbytes, the system will hang.
    Changing proc_dointvec() to proc_dointvec_minmax() in
    min_free_kbytes_sysctl_handler() prevents this from happening.

    mhocko said:

    : You can still do echo $BIG_VALUE > /proc/sys/vm/min_free_kbytes and make
    : your machine unusable but I agree that proc_dointvec_minmax is more
    : suitable here as we already have:
    :
    : .proc_handler = min_free_kbytes_sysctl_handler,
    : .extra1 = &zero,
    :
    : It used to work properly but then 6fce56ec91b5 ("sysctl: Remove references
    : to ctl_name and strategy from the generic sysctl table") has removed
    : sysctl_intvec strategy and so extra1 is ignored.
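
    Since .extra1 already points at zero in the sysctl table, switching to
    the minmax variant is enough to reject negative values; a simplified
    sketch of the handler:

    int min_free_kbytes_sysctl_handler(struct ctl_table *table, int write,
                    void __user *buffer, size_t *length, loff_t *ppos)
    {
            int rc;

            /* Enforces .extra1 (= &zero), so e.g. -1 is rejected. */
            rc = proc_dointvec_minmax(table, write, buffer, length, ppos);
            if (rc)
                    return rc;

            if (write)
                    setup_per_zone_wmarks();
            return 0;
    }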

    Signed-off-by: Han Pingtian
    Acked-by: Michal Hocko
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Han Pingtian
     
  • commit 73293c2f900d0adbb6a415b312cd57976d5ae242 upstream.

    We check pfmemalloc per slab, not per page; you can see this in
    is_slab_pfmemalloc(). So pages other than the slab's first page don't
    need the pfmemalloc state set or cleared.

    Therefore we should check the pfmemalloc page flag of the first page,
    but the current implementation doesn't do that. virt_to_head_page(obj)
    just returns the 'struct page' of that object, not the one of the
    first page, since SLAB doesn't use __GFP_COMP when CONFIG_MMU is set.
    To get the 'struct page' of the first page, we first get the slab and
    then look it up via virt_to_head_page(slab->s_mem).
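
    A minimal sketch of that lookup, assuming SLAB's internal virt_to_slab()
    helper (the function names here are made up for the example):

    /*
     * SLAB pages are not allocated with __GFP_COMP, so
     * virt_to_head_page(objp) cannot find the slab's first page.
     * Go through the slab descriptor and its s_mem pointer instead.
     */
    static struct page *slab_first_page(void *objp)
    {
            struct slab *slabp = virt_to_slab(objp);

            return virt_to_head_page(slabp->s_mem);
    }

    /* The pfmemalloc flag is then tested/cleared only on that page. */
    static bool obj_is_pfmemalloc(void *objp)
    {
            return PageSlabPfmemalloc(slab_first_page(objp));
    }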

    Acked-by: Andi Kleen
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Pekka Enberg
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Joonsoo Kim
     
  • commit 9f1b868a13ac36bd207a571f5ea1193d823ab18d upstream.

    Khugepaged will scan/free HPAGE_PMD_NR normal pages and replace them
    with a hugepage allocated from the node of the first scanned normal
    page, but this policy is too rough and may produce unexpected results
    for higher-level users.

    The problem is that the original page balancing among all nodes is
    broken once khugepaged has started. Consider the case where the first
    scanned normal page is allocated from node A while most of the other
    scanned normal pages are allocated from node B or C: khugepaged will
    always allocate the hugepage from node A, causing extra memory
    pressure on node A that was not there before khugepaged started.

    This patch tries to fix the problem by making khugepaged allocate the
    hugepage from the node with the highest count of scanned normal pages,
    so that the effect on the original page balancing is minimized.

    The other problem is that if the scanned normal pages are allocated
    equally from nodes A, B and C, node A will still suffer extra memory
    pressure after khugepaged starts.

    Andrew Davidoff reported a related issue several days ago. He wanted
    his application to interleave among all nodes; "numactl
    --interleave=all ./test" was used to run the testcase, but the result
    wasn't as expected.

    cat /proc/2814/numa_maps:
    7f50bd440000 interleave:0-3 anon=51403 dirty=51403 N0=435 N1=435 N2=435 N3=50098

    The end result showed that most pages came from node 3 instead of
    being interleaved among nodes 0-3, which was unreasonable.

    This patch also fixes this issue by allocating hugepages round-robin
    from all nodes that have the same count; after this patch the result
    was as expected:

    7f78399c0000 interleave:0-3 anon=51403 dirty=51403 N0=12723 N1=12723 N2=13235 N3=12722

    The simple testcase is like this:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main() {
            char *p;
            int i;
            int j;

            for (i = 0; i < 200; i++) {
                    /* allocate and touch 1 MB per iteration */
                    p = (char *)malloc(1048576);
                    printf("malloc done\n");

                    if (p == 0) {
                            printf("Out of memory\n");
                            return 1;
                    }
                    for (j = 0; j < 1048576; j++) {
                            p[j] = 'A';
                    }
                    printf("touched memory\n");

                    sleep(1);
            }
            printf("enter sleep\n");
            while (1) {
                    sleep(100);
            }
    }
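
    For reference, the resulting target-node selection looks roughly like the
    sketch below, where khugepaged_node_load[] is the per-node count of
    normal pages hit during the scan:

    static int khugepaged_find_target_node(void)
    {
            static int last_khugepaged_target_node = NUMA_NO_NODE;
            int nid, target_node = 0, max_value = 0;

            /* Pick the first node with the highest scanned-page count. */
            for (nid = 0; nid < MAX_NUMNODES; nid++)
                    if (khugepaged_node_load[nid] > max_value) {
                            max_value = khugepaged_node_load[nid];
                            target_node = nid;
                    }

            /* If several nodes tie on that count, rotate among them. */
            if (target_node <= last_khugepaged_target_node)
                    for (nid = last_khugepaged_target_node + 1;
                         nid < MAX_NUMNODES; nid++)
                            if (khugepaged_node_load[nid] == max_value) {
                                    target_node = nid;
                                    break;
                            }

            last_khugepaged_target_node = target_node;
            return target_node;
    }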

    [akpm@linux-foundation.org: make last_khugepaged_target_node local to khugepaged_find_target_node()]
    Reported-by: Andrew Davidoff
    Tested-by: Andrew Davidoff
    Signed-off-by: Bob Liu
    Cc: Andrea Arcangeli
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Yasuaki Ishimatsu
    Cc: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Bob Liu
     
  • commit 10dc4155c7714f508fe2e4667164925ea971fb25 upstream.

    Move alloc_hugepage() to a better place; there is no need for a separate
    #ifndef CONFIG_NUMA block.

    Signed-off-by: Bob Liu
    Reviewed-by: Yasuaki Ishimatsu
    Acked-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Andrew Davidoff
    Cc: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Bob Liu