26 Sep, 2014

28 commits

  • commit b745bc85f21ea707e4ea1a91948055fa3e72c77b upstream.

    cold is used as a bool, so make it one. Make the likely case the "if" part
    of the block instead of the "else", as the optimisation manual says this
    is preferred.
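
    A minimal sketch of the two changes, with a hypothetical free_one()
    standing in for the real mm/page_alloc.c call sites:

        #include <stdbool.h>

        /* Stand-ins for the kernel's branch-prediction hints. */
        #define likely(x)   __builtin_expect(!!(x), 1)
        #define unlikely(x) __builtin_expect(!!(x), 0)

        /* Illustrative only: 'cold' carried as a bool, with the likely (hot)
         * case as the "if" branch rather than the "else". */
        static void free_one(int page, bool cold)
        {
            if (likely(!cold)) {
                /* hot page: place at the head of the per-cpu list */
            } else {
                /* cold page: place at the tail */
            }
            (void)page;
        }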

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit dc4b0caff24d9b2918e9f27bc65499ee63187eba upstream.

    In the free path we calculate page_to_pfn multiple times. Reduce that.
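
    The shape of the change, as a hedged sketch (not the actual diff): derive
    the pfn once in the outer function and hand it down, instead of each
    helper calling page_to_pfn() again.

        /* Sketch only; function and parameter names are illustrative. */
        static void __free_one_page(struct page *page, unsigned long pfn,
                                    struct zone *zone, unsigned int order)
        {
            unsigned long buddy_pfn = pfn ^ (1UL << order); /* reuse the cached pfn */
            /* ... merge with the buddy, no further page_to_pfn() calls ... */
        }

        static void free_one_page(struct zone *zone, struct page *page,
                                  unsigned int order)
        {
            unsigned long pfn = page_to_pfn(page);  /* computed exactly once */

            __free_one_page(page, pfn, zone, order);
        }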

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit 7aeb09f9104b760fc53c98cb7d20d06640baf9e6 upstream.

    X86 prefers the use of unsigned types for iterators and there is a
    tendency to mix whether a signed or unsigned type is used for page order.
    This converts a number of sites in mm/page_alloc.c to use unsigned int for
    order where possible.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit 664eeddeef6539247691197c1ac124d4aa872ab6 upstream.

    If cpusets are not in use then we still check a global variable on every
    page allocation. Use jump labels to avoid the overhead.
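
    A sketch of the jump-label pattern; the names below (cpusets_enabled_key,
    cpusets_enabled()) are illustrative assumptions, see include/linux/cpuset.h
    for the real helper:

        #include <linux/jump_label.h>

        extern struct static_key cpusets_enabled_key;

        /* With no cpusets in use, static_key_false() compiles down to a
         * patched-out branch instead of a global variable load and compare. */
        static inline bool cpusets_enabled(void)
        {
            return static_key_false(&cpusets_enabled_key);
        }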

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit ea5e9539abf1258f23e725cb9cb25aa74efa29eb upstream.

    This patch exposes the jump_label reference count in preparation for the
    next patch. cpusets care about both whether the jump_label is enabled and
    how many users of cpusets there currently are.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit 2329d3751b082b4fd354f334a88662d72abac52d upstream.

    In mm/swap.c, __lru_cache_add() is exported, but actually there are no
    users outside this file.

    This patch unexports __lru_cache_add(), and makes it static. It also
    exports lru_cache_add_file(), as it is used by cifs and fuse, which can
    be loaded as modules.

    Signed-off-by: Jianyu Zhan
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: Shaohua Li
    Cc: Bob Liu
    Cc: Seth Jennings
    Cc: Joonsoo Kim
    Cc: Rafael Aquini
    Cc: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Khalid Aziz
    Cc: Christoph Hellwig
    Reviewed-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Jianyu Zhan
     
  • commit f8c9301fa5a2a8b873c67f2a3d8230d5c13f61b7 upstream.

    During compaction, update_nr_listpages() has been used to count remaining
    non-migrated and free pages after a call to migrate_pages(). The
    freepages counting has become unnecessary, and it turns out that
    migratepages counting is also unnecessary in most cases.

    The only situation when it's needed to count cc->migratepages is when
    migrate_pages() returns with a negative error code. Otherwise, the
    non-negative return value is the number of pages that were not migrated,
    which is exactly the count of remaining pages in the cc->migratepages
    list.

    Furthermore, any non-zero count is only interesting for the tracepoint of
    mm_compaction_migratepages events, because after that all remaining
    unmigrated pages are put back and their count is set to 0.

    This patch therefore removes update_nr_listpages() completely, and changes
    the tracepoint definition so that the manual counting is done only when
    the tracepoint is enabled, and only when migrate_pages() returns a
    negative error code.

    Furthermore, migrate_pages() and the tracepoints won't be called when
    there's nothing to migrate. This potentially avoids some wasted cycles
    and reduces the volume of uninteresting mm_compaction_migratepages events
    where "nr_migrated=0 nr_failed=0". In the stress-highalloc mmtest, this
    was about 75% of the events. The mm_compaction_isolate_migratepages event
    is better for determining that nothing was isolated for migration, and
    this one was just duplicating the info.

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Naoya Horiguchi
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Bartlomiej Zolnierkiewicz
    Acked-by: Michal Nazarewicz
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Vlastimil Babka
     
  • commit e0b9daeb453e602a95ea43853dc12d385558ce1f upstream.

    We're going to want to manipulate the migration mode for compaction in the
    page allocator, and currently compact_control's sync field is only a bool.

    Currently, we only do MIGRATE_ASYNC or MIGRATE_SYNC_LIGHT compaction
    depending on the value of this bool. Convert the bool to enum
    migrate_mode and pass the migration mode in directly. Later, we'll want
    to avoid MIGRATE_SYNC_LIGHT for thp allocations in the pagefault patch to
    avoid unnecessary latency.

    This also alters compaction triggered from sysfs, either for the entire
    system or for a node, to force MIGRATE_SYNC.
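
    The three modes named above, sketched as the enum that replaces
    compact_control's bool (comments paraphrase the intended semantics):

        enum migrate_mode {
            MIGRATE_ASYNC,       /* never block; bail out early on contention */
            MIGRATE_SYNC_LIGHT,  /* may block on most operations, but not writeback */
            MIGRATE_SYNC,        /* may block on everything, including writeback */
        };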

    [akpm@linux-foundation.org: fix build]
    [iamjoonsoo.kim@lge.com: use MIGRATE_SYNC in alloc_contig_range()]
    Signed-off-by: David Rientjes
    Suggested-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Greg Thelen
    Cc: Naoya Horiguchi
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    David Rientjes
     
  • commit 35979ef3393110ff3c12c6b94552208d3bdf1a36 upstream.

    Each zone has a cached migration scanner pfn for memory compaction so that
    subsequent calls to memory compaction can start where the previous call
    left off.

    Currently, the compaction migration scanner only updates the per-zone
    cached pfn when pageblocks were not skipped for async compaction. This
    creates a dependency on calling sync compaction to prevent subsequent
    calls to async compaction from scanning an enormous amount of non-MOVABLE
    pageblocks each time it is called. On large machines, this could be
    potentially very expensive.

    This patch adds a per-zone cached migration scanner pfn only for async
    compaction. It is updated every time a pageblock has been scanned in its
    entirety and when no pages from it were successfully isolated. The cached
    migration scanner pfn for sync compaction is updated only when called for
    sync compaction.

    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Reviewed-by: Naoya Horiguchi
    Cc: Greg Thelen
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    David Rientjes
     
  • commit 68711a746345c44ae00c64d8dbac6a9ce13ac54a upstream.

    Memory migration uses a callback defined by the caller to determine how to
    allocate destination pages. When migration fails for a source page,
    however, it frees the destination page back to the system.

    This patch adds a memory migration callback defined by the caller to
    determine how to free destination pages. If a caller, such as memory
    compaction, builds its own freelist for migration targets, this can reuse
    already freed memory instead of scanning additional memory.

    If the caller provides a function to handle freeing of destination pages,
    it is called when page migration fails. If the caller passes NULL then
    freeing back to the system will be handled as usual. This patch
    introduces no functional change.
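
    A hedged sketch of such a callback from the caller's side (names below
    are illustrative; the exact migrate_pages() signature is in
    include/linux/migrate.h):

        /* Called when migration of a source page fails: instead of freeing
         * the unused destination page back to the system, put it on the
         * caller's own freelist so it can be reused as the next target. */
        static void put_unused_target(struct page *page, unsigned long private)
        {
            struct list_head *freelist = (struct list_head *)private;

            list_add(&page->lru, freelist);
        }

        /* Passing NULL instead of put_unused_target keeps the old behaviour
         * of freeing failed destination pages back to the system:
         *
         *   migrate_pages(&migratepages, alloc_target, put_unused_target,
         *                 (unsigned long)&freelist, MIGRATE_SYNC, MR_COMPACTION);
         */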

    Signed-off-by: David Rientjes
    Reviewed-by: Naoya Horiguchi
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Greg Thelen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    David Rientjes
     
  • commit 29f175d125f0f3a9503af8a5596f93d714cceb08 upstream.

    Commit f9acc8c7b35a ("readahead: sanify file_ra_state names") left
    ra_submit with a single function call.

    Move ra_submit to internal.h and inline it to save some stack. Thanks
    to Andrew Morton for commenting on the different versions.
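
    A sketch of the resulting inline helper (argument order assumed; the real
    definition is in mm/internal.h):

        static inline unsigned long ra_submit(struct file_ra_state *ra,
                                              struct address_space *mapping,
                                              struct file *filp)
        {
            /* the single remaining call that ra_submit() wraps */
            return __do_page_cache_readahead(mapping, filp,
                                             ra->start, ra->size,
                                             ra->async_size);
        }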

    Signed-off-by: Fabian Frederick
    Suggested-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Fabian Frederick
     
  • commit 67f9fd91f93c582b7de2ab9325b6e179db77e4d5 upstream.

    This patch removes read_cache_page_async() which wasn't really needed
    anywhere and simplifies the code around it a bit.

    read_cache_page_async() is useful when we want to read a page into the
    cache without waiting for it to complete. This happens when the
    appropriate callback 'filler' doesn't complete its read operation and
    releases the page lock immediately, and instead queues a different
    completion routine to do that. This never actually happened anywhere in
    the code.

    read_cache_page_async() had 3 different callers:

    - read_cache_page() which is the sync version, it would just wait for
    the requested read to complete using wait_on_page_read().

    - JFFS2 would call it from jffs2_gc_fetch_page(), but the filler
    function it supplied doesn't do any async reads, and would complete
    before the filler function returns - making it actually a sync read.

    - CRAMFS would call it using the read_mapping_page_async() wrapper, with
    a similar story to JFFS2 - the filler function doesn't do anything that
    resembles async reads and would always complete before the filler function
    returns.

    To sum it up, the code in mm/filemap.c never took advantage of having
    read_cache_page_async(). While there are filler callbacks that do async
    reads (such as the block one), we always called it via
    read_cache_page().

    This patch adds a mandatory wait for read to complete when adding a new
    page to the cache, and removes read_cache_page_async() and its wrappers.

    Signed-off-by: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Sasha Levin
     
  • commit 0cd6144aadd2afd19d1aca880153530c52957604 upstream.

    shmem mappings already contain exceptional entries where swap slot
    information is remembered.

    To be able to store eviction information for regular page cache, prepare
    every site dealing with the radix trees directly to handle entries other
    than pages.

    The common lookup functions will filter out non-page entries and return
    NULL for page cache holes, just as before. But provide a raw version of
    the API which returns non-page entries as well, and switch shmem over to
    use it.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Johannes Weiner
     
  • commit e7b563bb2a6f4d974208da46200784b9c5b5a47e upstream.

    The radix tree hole searching code is only used for page cache, for
    example the readahead code trying to get a picture of the area
    surrounding a fault.

    It sufficed to rely on the radix tree definition of holes, which is
    "empty tree slot". But this is about to change, though, as shadow page
    descriptors will be stored in the page cache after the actual pages get
    evicted from memory.

    Move the functions over to mm/filemap.c and make them native page cache
    operations, where they can later be adapted to handle the new definition
    of "page cache hole".

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Johannes Weiner
     
  • commit 53c59f262d747ea82e7414774c59a489501186a0 upstream.

    Provide a function that does not just delete an entry at a given index,
    but also allows passing in an expected item. Delete only if that item
    is still located at the specified index.

    This is handy when lockless tree traversals want to delete entries as
    well because they don't have to do a second, locked lookup to verify
    the slot has not changed under them before deleting the entry.
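
    A usage sketch, assuming the signature suggested by the description
    (root, index, expected item); check lib/radix-tree.c for the
    authoritative one:

        /* 'tree', 'index' and 'expected' are assumed to come from an
         * earlier lockless lookup. */
        void *old;

        old = radix_tree_delete_item(&tree, index, expected);
        if (old == expected) {
            /* the slot still held 'expected', and it has now been deleted;
             * no second, locked lookup was needed to re-validate it */
        }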

    Signed-off-by: Johannes Weiner
    Reviewed-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Johannes Weiner
     
  • commit 615d6e8756c87149f2d4c1b93d471bca002bd849 upstream.

    This patch is a continuation of efforts trying to optimize find_vma(),
    avoiding potentially expensive rbtree walks to locate a vma upon faults.
    The original approach (https://lkml.org/lkml/2013/11/1/410), where the
    largest vma was also cached, ended up being too specific and random, so
    further comparison with other approaches was needed. There are two
    things to consider when dealing with this: the cache hit rate and the
    latency of find_vma(). Improving the hit rate does not necessarily
    translate into finding the vma any faster, as the overhead of any fancy
    caching schemes can be too high to consider.

    We currently cache the last used vma for the whole address space, which
    provides a nice optimization, reducing the total cycles in find_vma() by
    up to 250%, for workloads with good locality. On the other hand, this
    simple scheme is pretty much useless for workloads with poor locality.
    Analyzing ebizzy runs shows that, no matter how many threads are
    running, the mmap_cache hit rate is less than 2%, and in many situations
    below 1%.

    The proposed approach is to replace this scheme with a small per-thread
    cache, maximizing hit rates at a very low maintenance cost.
    Invalidations are performed by simply bumping up a 32-bit sequence
    number. The only expensive operation is in the rare case of a seq
    number overflow, where all caches that share the same address space are
    flushed. Upon a miss, the proposed replacement policy is based on the
    page number that contains the virtual address in question (a sketch of
    this appears after the tables below). Concretely, the following results
    are seen on an 80 core, 8 socket x86-64 box:

    1) System bootup: Most programs are single threaded, so the per-thread
    scheme improves the ~50% hit rate just by adding a few more slots to
    the cache.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline | 50.61% | 19.90 |
    | patched | 73.45% | 13.58 |
    +----------------+----------+------------------+

    2) Kernel build: This one is already pretty good with the current
    approach as we're dealing with good locality.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline | 75.28% | 11.03 |
    | patched | 88.09% | 9.31 |
    +----------------+----------+------------------+

    3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline | 70.66% | 17.14 |
    | patched | 91.15% | 12.57 |
    +----------------+----------+------------------+

    4) Ebizzy: There's a fair amount of variation from run to run, but this
    approach always shows nearly perfect hit rates, while baseline is just
    about non-existent. The amounts of cycles can fluctuate between
    anywhere from ~60 to ~116 for the baseline scheme, but this approach
    reduces it considerably. For instance, with 80 threads:

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline | 1.06% | 91.54 |
    | patched | 99.97% | 14.18 |
    +----------------+----------+------------------+
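
    A sketch of the mechanism described above (sizes, field and function
    names are assumptions, not the exact mm/vmacache.c code):

        #define VMACACHE_BITS 2
        #define VMACACHE_SIZE (1U << VMACACHE_BITS)
        #define VMACACHE_MASK (VMACACHE_SIZE - 1)

        /* Replacement on a find_vma() miss: the slot is chosen from the
         * page number of the looked-up address. */
        static void vmacache_update_sketch(unsigned long addr,
                                           struct vm_area_struct *vma)
        {
            current->vmacache[(addr >> PAGE_SHIFT) & VMACACHE_MASK] = vma;
        }

        /* Invalidation: bump the mm's 32-bit sequence number; threads notice
         * the mismatch against their own copy and treat the cache as empty. */
        static void vmacache_invalidate_sketch(struct mm_struct *mm)
        {
            mm->vmacache_seqnum++;
        }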

    [akpm@linux-foundation.org: fix nommu build, per Davidlohr]
    [akpm@linux-foundation.org: document vmacache_valid() logic]
    [akpm@linux-foundation.org: attempt to untangle header files]
    [akpm@linux-foundation.org: add vmacache_find() BUG_ON]
    [hughd@google.com: add vmacache_valid_mm() (from Oleg)]
    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: adjust and enhance comments]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Acked-by: Linus Torvalds
    Reviewed-by: Michel Lespinasse
    Cc: Oleg Nesterov
    Tested-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Davidlohr Bueso
     
  • commit d26914d11751b23ca2e8747725f2cae10c2f2c1b upstream.

    Since put_mems_allowed() is strictly optional - it is a seqcount retry -
    we don't need to evaluate the function if the allocation was in fact
    successful, saving a smp_rmb, some loads and comparisons on some
    relatively fast paths.

    Since the naming of get/put_mems_allowed() does suggest a mandatory
    pairing, rename the interface, as suggested by Mel, to resemble the
    seqcount interface.

    This gives us: read_mems_allowed_begin() and read_mems_allowed_retry(),
    where it is important to note that the return value of the latter call
    is inverted from its previous incarnation.
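
    A usage sketch of the renamed pair (the allocation attempt is a
    placeholder; the real callers are in mm/page_alloc.c and friends):

        unsigned int cookie;
        struct page *page;

        do {
            cookie = read_mems_allowed_begin();
            page = attempt_allocation();   /* hypothetical helper */
            /* read_mems_allowed_retry() returns true when mems_allowed may
             * have changed, i.e. the inverse of the old put_mems_allowed()
             * return value. */
        } while (!page && read_mems_allowed_retry(cookie));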

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit 943dca1a1fcbccb58de944669b833fd38a6c809b upstream.

    Yasuaki Ishimatsu reported that memory hot-add spent more than 5 _hours_
    on a 9TB memory machine since onlining memory sections is too slow. And
    we found out that setup_zone_migrate_reserve spent >90% of the time.

    The problem is, setup_zone_migrate_reserve scans all pageblocks
    unconditionally, but it is only necessary if the number of reserved
    blocks was reduced (i.e. memory hot remove).

    Moreover, maximum MIGRATE_RESERVE per zone is currently 2. It means
    that the number of reserved pageblocks is almost always unchanged.

    This patch adds zone->nr_migrate_reserve_block to maintain the number of
    MIGRATE_RESERVE pageblocks and it reduces the overhead of
    setup_zone_migrate_reserve dramatically. The following table shows time
    of onlining a memory section.

    Amount of memory     | 128GB | 192GB | 256GB |
    ----------------------------------------------
    linux-3.12           |  23.9 |  31.4 |  44.5 |
    This patch           |   8.3 |   8.3 |   8.6 |
    Mel's proposal patch |  10.9 |  19.2 |  31.3 |
    ----------------------------------------------
    (millisecond)

    128GB : 4 nodes and each node has 32GB of memory
    192GB : 6 nodes and each node has 32GB of memory
    256GB : 8 nodes and each node has 32GB of memory

    (*1) Mel proposed his idea in the following thread:
    https://lkml.org/lkml/2013/10/30/272

    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Yasuaki Ishimatsu
    Reported-by: Yasuaki Ishimatsu
    Tested-by: Yasuaki Ishimatsu
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Yasuaki Ishimatsu
     
  • commit 579f82901f6f41256642936d7e632f3979ad76d4 upstream.

    This is a patch to improve swap readahead algorithm. It's from Hugh and
    I slightly changed it.

    Hugh's original changelog:

    swapin readahead does a blind readahead, whether or not the swapin is
    sequential. This may be ok on harddisk, because large reads have
    relatively small costs, and if the readahead pages are unneeded they can
    be reclaimed easily - though, what if their allocation forced reclaim of
    useful pages? But on SSD devices large reads are more expensive than
    small ones: if the readahead pages are unneeded, reading them in caused
    significant overhead.

    This patch adds very simplistic random read detection. Stealing the
    PageReadahead technique from Konstantin Khlebnikov's patch, avoiding the
    vma/anon_vma sophistications of Shaohua Li's patch, swapin_nr_pages()
    simply looks at readahead's current success rate, and narrows or widens
    its readahead window accordingly. There is little science to its
    heuristic: it's about as stupid as can be whilst remaining effective.
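
    The idea, as a deliberately generic sketch (this is not the kernel's
    swapin_nr_pages(); it only illustrates a success-rate-driven window):

        /* Widen the readahead window while recent readahead pages are being
         * hit, narrow it when they go unused; clamp to [1, 2^page_cluster]. */
        static unsigned int adjust_window(unsigned int window, bool recent_hit,
                                          unsigned int page_cluster)
        {
            unsigned int max = 1u << page_cluster;

            if (recent_hit) {
                window *= 2;
                if (window > max)
                    window = max;
            } else if (window > 1) {
                window -= 1;    /* shrink gently rather than collapsing to 1 */
            }
            return window;
        }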

    The table below shows elapsed times (in centiseconds) when running a
    single repetitive swapping load across a 1000MB mapping in 900MB ram
    with 1GB swap (the harddisk tests had taken painfully too long when I
    used mem=500M, but SSD shows similar results for that).

    Vanilla is the 3.6-rc7 kernel on which I started; Shaohua denotes his
    Sep 3 patch in mmotm and linux-next; HughOld denotes my Oct 1 patch
    which Shaohua showed to be defective; HughNew this Nov 14 patch, with
    page_cluster as usual at default of 3 (8-page reads); HughPC4 this same
    patch with page_cluster 4 (16-page reads); HughPC0 with page_cluster 0
    (1-page reads: no readahead).

    HDD for swapping to harddisk, SSD for swapping to VertexII SSD. Seq for
    sequential access to the mapping, cycling five times around; Rand for
    the same number of random touches. Anon for a MAP_PRIVATE anon mapping;
    Shmem for a MAP_SHARED anon mapping, equivalent to tmpfs.

    One weakness of Shaohua's vma/anon_vma approach was that it did not
    optimize Shmem: seen below. Konstantin's approach was perhaps mistuned,
    50% slower on Seq: did not compete and is not shown below.

    HDD          Vanilla  Shaohua  HughOld  HughNew  HughPC4  HughPC0
    Seq Anon       73921    76210    75611    76904    78191   121542
    Seq Shmem      73601    73176    73855    72947    74543   118322
    Rand Anon     895392   831243   871569   845197   846496   841680
    Rand Shmem   1058375  1053486   827935   764955   764376   756489

    SSD          Vanilla  Shaohua  HughOld  HughNew  HughPC4  HughPC0
    Seq Anon       24634    24198    24673    25107    21614    70018
    Seq Shmem      24959    24932    25052    25703    22030    69678
    Rand Anon      43014    26146    28075    25989    26935    25901
    Rand Shmem     45349    45215    28249    24268    24138    24332

    These tests are, of course, two extremes of a very simple case: under
    heavier mixed loads I've not yet observed any consistent improvement or
    degradation, and wider testing would be welcome.

    Shaohua Li:

    Test shows Vanilla is slightly better in sequential workload than Hugh's
    patch. I observed that with Hugh's patch the readahead size is sometimes
    shrunk too fast (from 8 to 1 immediately) in the sequential workload if
    there is no hit. And in such cases, continuing to do readahead is
    actually good.

    I don't prepare a sophisticated algorithm for the sequential workload
    because so far we can't guarantee sequentially accessed pages are swapped
    out sequentially. So I slightly change Hugh's heuristic - don't shrink the
    readahead size too fast.

    Here is my test result (unit second, 3 runs average):
            Vanilla   Hugh    New
    Seq         356    370    360
    Random     4525   2447   2444

    Attached graph is the swapin/swapout throughput I collected with 'vmstat
    2'. The first part is running a random workload (till around 1200 of
    the x-axis) and the second part is running a sequential workload.
    swapin and swapout throughput are almost identical in steady state in
    both workloads. This is the expected behavior, while in Vanilla swapin
    is much bigger than swapout, especially in the random workload (because
    of wrong readahead).

    Original patches by: Shaohua Li and Konstantin Khlebnikov.

    [fengguang.wu@intel.com: swapin_nr_pages() can be static]
    Signed-off-by: Hugh Dickins
    Signed-off-by: Shaohua Li
    Signed-off-by: Fengguang Wu
    Cc: Rik van Riel
    Cc: Wu Fengguang
    Cc: Minchan Kim
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Shaohua Li
     
  • commit de6c60a6c115acaa721cfd499e028a413d1fcbf3 upstream.

    Currently there are several functions to manipulate the deferred
    compaction state variables. The remaining case where the variables are
    touched directly is when a successful allocation occurs in direct
    compaction, or is expected to be successful in the future by kswapd.
    Here, the lowest order that is expected to fail is updated, and in the
    case of successful allocation, the deferred status and counter are reset
    completely.

    Create a new function compaction_defer_reset() to encapsulate this
    functionality and make it easier to understand the code. No functional
    change.
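
    A sketch of the new helper based on the description above (the zone
    field names come from the existing defer-tracking code and should be
    checked against mm/compaction.c):

        static void compaction_defer_reset(struct zone *zone, int order,
                                           bool alloc_success)
        {
            if (alloc_success) {
                /* successful allocation: reset deferred status completely */
                zone->compact_considered = 0;
                zone->compact_defer_shift = 0;
            }
            /* remember the lowest order no longer expected to fail */
            if (order >= zone->compact_order_failed)
                zone->compact_order_failed = order + 1;
        }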

    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Vlastimil Babka
     
  • commit 0eb927c0ab789d3d7d69f68acb850f69d4e7c36f upstream.

    The broad goal of the series is to improve allocation success rates for
    huge pages through memory compaction, while trying not to increase the
    compaction overhead. The original objective was to reintroduce
    capturing of high-order pages freed by the compaction, before they are
    split by concurrent activity. However, several bugs and opportunities
    for simple improvements were found in the current implementation, mostly
    through extra tracepoints (which are however too ugly for now to be
    considered for sending).

    The patches mostly deal with two mechanisms that reduce compaction
    overhead, which is caching the progress of migrate and free scanners,
    and marking pageblocks where isolation failed to be skipped during
    further scans.

    Patch 1 (from mgorman) adds tracepoints that allow calculate time spent in
    compaction and potentially debug scanner pfn values.

    Patch 2 encapsulates some of the functionality for handling deferred
    compaction for better maintainability, without a functional change.

    Patch 3 fixes a bug where cached scanner pfn's are sometimes reset only after
    they have been read to initialize a compaction run.

    Patch 4 fixes a bug where scanners meeting is sometimes not properly detected
    and can lead to multiple compaction attempts quitting early without
    doing any work.

    Patch 5 improves the chances of sync compaction to process pageblocks that
    async compaction has skipped due to being !MIGRATE_MOVABLE.

    Patch 6 improves the chances of sync direct compaction to actually do anything
    when called after async compaction fails during allocation slowpath.

    The impact of the patches was validated using mmtests's stress-highalloc
    benchmark on an x86_64 machine with 4GB of memory.

    Due to instability of the results (mostly related to the bugs fixed by
    patches 2 and 3), 10 iterations were performed, taking min, mean and max
    values for success rates and mean values for time and vmstat-based
    metrics.

    First, the default GFP_HIGHUSER_MOVABLE allocations were tested with the
    patches stacked on top of v3.13-rc2. Patch 2 is OK to serve as baseline
    due to no functional changes in 1 and 2. Comments below.

    stress-highalloc
    3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2
    2-nothp 3-nothp 4-nothp 5-nothp 6-nothp
    Success 1 Min 9.00 ( 0.00%) 10.00 (-11.11%) 43.00 (-377.78%) 43.00 (-377.78%) 33.00 (-266.67%)
    Success 1 Mean 27.50 ( 0.00%) 25.30 ( 8.00%) 45.50 (-65.45%) 45.90 (-66.91%) 46.30 (-68.36%)
    Success 1 Max 36.00 ( 0.00%) 36.00 ( 0.00%) 47.00 (-30.56%) 48.00 (-33.33%) 52.00 (-44.44%)
    Success 2 Min 10.00 ( 0.00%) 8.00 ( 20.00%) 46.00 (-360.00%) 45.00 (-350.00%) 35.00 (-250.00%)
    Success 2 Mean 26.40 ( 0.00%) 23.50 ( 10.98%) 47.30 (-79.17%) 47.60 (-80.30%) 48.10 (-82.20%)
    Success 2 Max 34.00 ( 0.00%) 33.00 ( 2.94%) 48.00 (-41.18%) 50.00 (-47.06%) 54.00 (-58.82%)
    Success 3 Min 65.00 ( 0.00%) 63.00 ( 3.08%) 85.00 (-30.77%) 84.00 (-29.23%) 85.00 (-30.77%)
    Success 3 Mean 76.70 ( 0.00%) 70.50 ( 8.08%) 86.20 (-12.39%) 85.50 (-11.47%) 86.00 (-12.13%)
    Success 3 Max 87.00 ( 0.00%) 86.00 ( 1.15%) 88.00 ( -1.15%) 87.00 ( 0.00%) 87.00 ( 0.00%)

    3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2
    2-nothp 3-nothp 4-nothp 5-nothp 6-nothp
    User 6437.72 6459.76 5960.32 5974.55 6019.67
    System 1049.65 1049.09 1029.32 1031.47 1032.31
    Elapsed 1856.77 1874.48 1949.97 1994.22 1983.15

    3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2
    2-nothp 3-nothp 4-nothp 5-nothp 6-nothp
    Minor Faults 253952267 254581900 250030122 250507333 250157829
    Major Faults 420 407 506 530 530
    Swap Ins 4 9 9 6 6
    Swap Outs 398 375 345 346 333
    Direct pages scanned 197538 189017 298574 287019 299063
    Kswapd pages scanned 1809843 1801308 1846674 1873184 1861089
    Kswapd pages reclaimed 1806972 1798684 1844219 1870509 1858622
    Direct pages reclaimed 197227 188829 298380 286822 298835
    Kswapd efficiency 99% 99% 99% 99% 99%
    Kswapd velocity 953.382 970.449 952.243 934.569 922.286
    Direct efficiency 99% 99% 99% 99% 99%
    Direct velocity 104.058 101.832 153.961 143.200 148.205
    Percentage direct scans 9% 9% 13% 13% 13%
    Zone normal velocity 347.289 359.676 348.063 339.933 332.983
    Zone dma32 velocity 710.151 712.605 758.140 737.835 737.507
    Zone dma velocity 0.000 0.000 0.000 0.000 0.000
    Page writes by reclaim 557.600 429.000 353.600 426.400 381.800
    Page writes file 159 53 7 79 48
    Page writes anon 398 375 345 346 333
    Page reclaim immediate 825 644 411 575 420
    Sector Reads 2781750 2769780 2878547 2939128 2910483
    Sector Writes 12080843 12083351 12012892 12002132 12010745
    Page rescued immediate 0 0 0 0 0
    Slabs scanned 1575654 1545344 1778406 1786700 1794073
    Direct inode steals 9657 10037 15795 14104 14645
    Kswapd inode steals 46857 46335 50543 50716 51796
    Kswapd skipped wait 0 0 0 0 0
    THP fault alloc 97 91 81 71 77
    THP collapse alloc 456 506 546 544 565
    THP splits 6 5 5 4 4
    THP fault fallback 0 1 0 0 0
    THP collapse fail 14 14 12 13 12
    Compaction stalls 1006 980 1537 1536 1548
    Compaction success 303 284 562 559 578
    Compaction failures 702 696 974 976 969
    Page migrate success 1177325 1070077 3927538 3781870 3877057
    Page migrate failure 0 0 0 0 0
    Compaction pages isolated 2547248 2306457 8301218 8008500 8200674
    Compaction migrate scanned 42290478 38832618 153961130 154143900 159141197
    Compaction free scanned 89199429 79189151 356529027 351943166 356326727
    Compaction cost 1566 1426 5312 5156 5294
    NUMA PTE updates 0 0 0 0 0
    NUMA hint faults 0 0 0 0 0
    NUMA hint local faults 0 0 0 0 0
    NUMA hint local percent 100 100 100 100 100
    NUMA pages migrated 0 0 0 0 0
    AutoNUMA cost 0 0 0 0 0

    Observations:

    - The "Success 3" line is allocation success rate with system idle
    (phases 1 and 2 are with background interference). I used to get stable
    values around 85% with vanilla 3.11. The lower min and mean values came
    with 3.12. This was bisected to commit 81c0a2bb ("mm: page_alloc: fair
    zone allocator policy") As explained in comment for patch 3, I don't
    think the commit is wrong, but that it makes the effect of compaction
    bugs worse. From patch 3 onwards, the results are OK and match the 3.11
    results.

    - Patch 4 also clearly helps phases 1 and 2, and exceeds any results
    I've seen with 3.11 (I didn't measure it that thoroughly then, but it
    was never above 40%).

    - Compaction cost and number of scanned pages is higher, especially due
    to patch 4. However, keep in mind that patches 3 and 4 fix existing
    bugs in the current design of compaction overhead mitigation, they do
    not change it. If overhead is found unacceptable, then it should be
    decreased differently (and consistently, not due to random conditions)
    than the current implementation does. In contrast, patches 5 and 6
    (which are not strictly bug fixes) do not increase the overhead (but
    also not success rates). This might be a limitation of the
    stress-highalloc benchmark as it's quite uniform.

    Another set of results is when configuring stress-highalloc to allocate
    with similar flags as THP uses:
    (GFP_HIGHUSER_MOVABLE|__GFP_NOMEMALLOC|__GFP_NORETRY|__GFP_NO_KSWAPD)

    stress-highalloc
    3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2
    2-thp 3-thp 4-thp 5-thp 6-thp
    Success 1 Min 2.00 ( 0.00%) 7.00 (-250.00%) 18.00 (-800.00%) 19.00 (-850.00%) 26.00 (-1200.00%)
    Success 1 Mean 19.20 ( 0.00%) 17.80 ( 7.29%) 29.20 (-52.08%) 29.90 (-55.73%) 32.80 (-70.83%)
    Success 1 Max 27.00 ( 0.00%) 29.00 ( -7.41%) 35.00 (-29.63%) 36.00 (-33.33%) 37.00 (-37.04%)
    Success 2 Min 3.00 ( 0.00%) 8.00 (-166.67%) 21.00 (-600.00%) 21.00 (-600.00%) 32.00 (-966.67%)
    Success 2 Mean 19.30 ( 0.00%) 17.90 ( 7.25%) 32.20 (-66.84%) 32.60 (-68.91%) 35.70 (-84.97%)
    Success 2 Max 27.00 ( 0.00%) 30.00 (-11.11%) 36.00 (-33.33%) 37.00 (-37.04%) 39.00 (-44.44%)
    Success 3 Min 62.00 ( 0.00%) 62.00 ( 0.00%) 85.00 (-37.10%) 75.00 (-20.97%) 64.00 ( -3.23%)
    Success 3 Mean 66.30 ( 0.00%) 65.50 ( 1.21%) 85.60 (-29.11%) 83.40 (-25.79%) 83.50 (-25.94%)
    Success 3 Max 70.00 ( 0.00%) 69.00 ( 1.43%) 87.00 (-24.29%) 86.00 (-22.86%) 87.00 (-24.29%)

    3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2
    2-thp 3-thp 4-thp 5-thp 6-thp
    User 6547.93 6475.85 6265.54 6289.46 6189.96
    System 1053.42 1047.28 1043.23 1042.73 1038.73
    Elapsed 1835.43 1821.96 1908.67 1912.74 1956.38

    3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2
    2-thp 3-thp 4-thp 5-thp 6-thp
    Minor Faults 256805673 253106328 253222299 249830289 251184418
    Major Faults 395 375 423 434 448
    Swap Ins 12 10 10 12 9
    Swap Outs 530 537 487 455 415
    Direct pages scanned 71859 86046 153244 152764 190713
    Kswapd pages scanned 1900994 1870240 1898012 1892864 1880520
    Kswapd pages reclaimed 1897814 1867428 1894939 1890125 1877924
    Direct pages reclaimed 71766 85908 153167 152643 190600
    Kswapd efficiency 99% 99% 99% 99% 99%
    Kswapd velocity 1029.000 1067.782 1000.091 991.049 951.218
    Direct efficiency 99% 99% 99% 99% 99%
    Direct velocity 38.897 49.127 80.747 79.983 96.468
    Percentage direct scans 3% 4% 7% 7% 9%
    Zone normal velocity 351.377 372.494 348.910 341.689 335.310
    Zone dma32 velocity 716.520 744.414 731.928 729.343 712.377
    Zone dma velocity 0.000 0.000 0.000 0.000 0.000
    Page writes by reclaim 669.300 604.000 545.700 538.900 429.900
    Page writes file 138 66 58 83 14
    Page writes anon 530 537 487 455 415
    Page reclaim immediate 806 655 772 548 517
    Sector Reads 2711956 2703239 2811602 2818248 2839459
    Sector Writes 12163238 12018662 12038248 11954736 11994892
    Page rescued immediate 0 0 0 0 0
    Slabs scanned 1385088 1388364 1507968 1513292 1558656
    Direct inode steals 1739 2564 4622 5496 6007
    Kswapd inode steals 47461 46406 47804 48013 48466
    Kswapd skipped wait 0 0 0 0 0
    THP fault alloc 110 82 84 69 70
    THP collapse alloc 445 482 467 462 539
    THP splits 6 5 4 5 3
    THP fault fallback 3 0 0 0 0
    THP collapse fail 15 14 14 14 13
    Compaction stalls 659 685 1033 1073 1111
    Compaction success 222 225 410 427 456
    Compaction failures 436 460 622 646 655
    Page migrate success 446594 439978 1085640 1095062 1131716
    Page migrate failure 0 0 0 0 0
    Compaction pages isolated 1029475 1013490 2453074 2482698 2565400
    Compaction migrate scanned 9955461 11344259 24375202 27978356 30494204
    Compaction free scanned 27715272 28544654 80150615 82898631 85756132
    Compaction cost 552 555 1344 1379 1436
    NUMA PTE updates 0 0 0 0 0
    NUMA hint faults 0 0 0 0 0
    NUMA hint local faults 0 0 0 0 0
    NUMA hint local percent 100 100 100 100 100
    NUMA pages migrated 0 0 0 0 0
    AutoNUMA cost 0 0 0 0 0

    There are some differences from the previous results for THP-like allocations:

    - Here, the bad result for the unpatched kernel in phase 3 is much more
    consistent, between 65-70%, and not related to the "regression" in
    3.12. Still there is the improvement from patch 4 onwards, which brings
    it on par with simple GFP_HIGHUSER_MOVABLE allocations.

    - Compaction costs have increased, but nowhere near as much as in the
    non-THP case. Again, the patches should be worth the gained
    determinism.

    - Patches 5 and 6 somewhat increase the number of migrate-scanned pages.
    This is most likely due to __GFP_NO_KSWAPD flag, which means the cached
    pfn's and pageblock skip bits are not reset by kswapd that often (at
    least in phase 3 where no concurrent activity would wake up kswapd) and
    the patches thus help the sync-after-async compaction. It doesn't
    however show that the sync compaction would help so much with success
    rates, which can be again seen as a limitation of the benchmark
    scenario.

    This patch (of 6):

    Add two tracepoints for compaction begin and end of a zone. Using this it
    is possible to calculate how much time a workload is spending within
    compaction and potentially debug problems related to cached pfns for
    scanning. In combination with the direct reclaim and slab trace points it
    should be possible to estimate most allocation-related overhead for a
    workload.

    Signed-off-by: Mel Gorman
    Signed-off-by: Vlastimil Babka
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit ec65993443736a5091b68e80ff1734548944a4b8 upstream.

    Bisection between 3.11 and 3.12 fingered commit 9824cf97 ("mm:
    vmstats: tlb flush counters") as causing overhead problems.

    The counters are undeniably useful but how often do we really
    need to debug TLB flush related issues? It does not justify
    taking the penalty everywhere so make it a debugging option.

    Signed-off-by: Mel Gorman
    Tested-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Cc: Hugh Dickins
    Cc: Alex Shi
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-XzxjntugxuwpxXhcrxqqh53b@git.kernel.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit 52c8f6a5aeb0bdd396849ecaa72d96f8175528f5 upstream.

    In general, every tracepoint should have zero overhead when it is
    disabled. However, trace_mm_page_alloc_extfrag() is one exception: it
    evaluates "new_type == start_migratetype" even if the tracepoint is
    disabled.

    The code can, however, be moved into the tracepoint's TP_fast_assign(),
    and TP_fast_assign() exists for exactly this purpose. This patch does
    that.
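
    A generic illustration of the pattern (not the actual
    mm_page_alloc_extfrag definition): work placed inside TP_fast_assign()
    runs only when the tracepoint is enabled.

        TRACE_EVENT(example_extfrag,
            TP_PROTO(int new_type, int start_migratetype),
            TP_ARGS(new_type, start_migratetype),
            TP_STRUCT__entry(
                __field(int, change_ownership)
            ),
            TP_fast_assign(
                /* evaluated only on the enabled-tracepoint slow path */
                __entry->change_ownership = (new_type == start_migratetype);
            ),
            TP_printk("change_ownership=%d", __entry->change_ownership)
        );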

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    KOSAKI Motohiro
     
  • commit 18ab4d4ced0817421e6db6940374cc39d28d65da upstream.

    Originally get_swap_page() started iterating through the singly-linked
    list of swap_info_structs using swap_list.next or highest_priority_index,
    which both were intended to point to the highest priority active swap
    target that was not full. The first patch in this series changed the
    singly-linked list to a doubly-linked list, and removed the logic to start
    at the highest priority non-full entry; it starts scanning at the highest
    priority entry each time, even if the entry is full.

    Replace the manually ordered swap_list_head with a plist, swap_active_head.
    Add a new plist, swap_avail_head. The original swap_active_head plist
    contains all active swap_info_structs, as before, while the new
    swap_avail_head plist contains only swap_info_structs that are active and
    available, i.e. not full. Add a new spinlock, swap_avail_lock, to protect
    the swap_avail_head list.

    Mel Gorman suggested using plists since they internally handle ordering
    the list entries based on priority, which is exactly what swap was doing
    manually. All the ordering code is now removed, and swap_info_struct
    entries are simply added to their corresponding plist and automatically
    ordered correctly.

    Using a new plist for available swap_info_structs simplifies and
    optimizes get_swap_page(), which no longer has to iterate over full
    swap_info_structs. Using a new spinlock for the swap_avail_head plist
    allows each swap_info_struct to add or remove itself from the plist
    when it becomes full or not-full; previously it could not do so because
    the swap_info_struct->lock is held when it changes from full to
    not-full, and the swap_lock protecting the main swap_active_head must
    be ordered before any swap_info_struct->lock.

    Signed-off-by: Dan Streetman
    Acked-by: Mel Gorman
    Cc: Shaohua Li
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Dan Streetman
    Cc: Michal Hocko
    Cc: Christian Ehrhardt
    Cc: Weijie Yang
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Bob Liu
    Cc: Paul Gortmaker
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Dan Streetman
     
  • commit a75f232ce0fe38bd01301899ecd97ffd0254316a upstream.

    Add plist_requeue(), which moves the specified plist_node after all other
    same-priority plist_nodes in the list. This is essentially an optimized
    plist_del() followed by plist_add().

    This is needed by swap, which (with the next patch in this set) uses a
    plist of available swap devices. When a swap device (either a swap
    partition or swap file) is added to the system with swapon(), the device
    is added to a plist, ordered by the swap device's priority. When swap
    needs to allocate a page from one of the swap devices, it takes the page
    from the first swap device on the plist, which is the highest priority
    swap device. The swap device is left in the plist until all its pages are
    used, and then removed from the plist when it becomes full.

    However, as described in man 2 swapon, swap must allocate pages from swap
    devices with the same priority in round-robin order; to do this, on each
    swap page allocation, swap uses a page from the first swap device in the
    plist, and then calls plist_requeue() to move that swap device entry to
    after any other same-priority swap devices. The next swap page allocation
    will again use a page from the first swap device in the plist and requeue
    it, and so on, resulting in round-robin usage of equal-priority swap
    devices.
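
    A usage sketch of that round-robin step (list and member names are
    illustrative, not necessarily those used by swap):

        struct swap_info_struct *si;

        si = plist_first_entry(&swap_avail_head, struct swap_info_struct,
                               avail_list);
        /* ... allocate the swap page from 'si' ... */

        /* move 'si' behind any other entries of the same priority so the
         * next allocation picks a different equal-priority device */
        plist_requeue(&si->avail_list, &swap_avail_head);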

    Also add plist_test_requeue() test function, for use by plist_test() to
    test plist_requeue() function.

    Signed-off-by: Dan Streetman
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Acked-by: Mel Gorman
    Cc: Paul Gortmaker
    Cc: Thomas Gleixner
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Dan Streetman
    Cc: Michal Hocko
    Cc: Christian Ehrhardt
    Cc: Weijie Yang
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Dan Streetman
     
  • commit fd16618e12a05df79a3439d72d5ffdac5d34f3da upstream.

    Add PLIST_HEAD() to plist.h, equivalent to LIST_HEAD() from list.h, to
    define and initialize a struct plist_head.

    Add plist_for_each_continue() and plist_for_each_entry_continue(),
    equivalent to list_for_each_continue() and list_for_each_entry_continue(),
    to iterate over a plist continuing after the current position.

    Add plist_prev() and plist_next(), equivalent to (struct list_head*)->prev
    and ->next, implemented by list_prev_entry() and list_next_entry(), to
    access the prev/next struct plist_node entry. These are needed because
    unlike struct list_head, direct access of the prev/next struct plist_node
    isn't possible; the list must be navigated via the contained struct
    list_head. e.g. instead of accessing the prev by list_prev_entry(node,
    node_list) it can be accessed by plist_prev(node).
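
    A brief usage sketch of the helpers named above:

        PLIST_HEAD(my_plist);              /* define + initialise a plist_head */

        /* given a known-valid node 'n' on the list: */
        struct plist_node *nxt = plist_next(n);  /* instead of open-coding     */
        struct plist_node *prv = plist_prev(n);  /* list_*_entry(n, node_list) */

        plist_for_each_continue(nxt, &my_plist) {
            /* iterate over the remainder of the list after 'nxt' */
        }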

    Signed-off-by: Dan Streetman
    Acked-by: Mel Gorman
    Cc: Paul Gortmaker
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Dan Streetman
    Cc: Michal Hocko
    Cc: Christian Ehrhardt
    Cc: Weijie Yang
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Bob Liu
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Dan Streetman
     
  • commit adfab836f4908deb049a5128082719e689eed964 upstream.

    The logic controlling the singly-linked list of swap_info_struct entries
    for all active, i.e. swapon'ed, swap targets is rather complex, because:

    - it stores the entries in priority order
    - there is a pointer to the highest priority entry
    - there is a pointer to the highest priority not-full entry
    - there is a highest_priority_index variable set outside the swap_lock
    - swap entries of equal priority should be used equally

    this complexity leads to bugs such as: https://lkml.org/lkml/2014/2/13/181
    where different priority swap targets are incorrectly used equally.

    That bug probably could be solved with the existing singly-linked lists,
    but I think it would only add more complexity to the already difficult to
    understand get_swap_page() swap_list iteration logic.

    The first patch changes from a singly-linked list to a doubly-linked list
    using list_heads; the highest_priority_index and related code are removed
    and get_swap_page() starts each iteration at the highest priority
    swap_info entry, even if it's full. While this does introduce unnecessary
    list iteration (i.e. Schlemiel the painter's algorithm) in the case where
    one or more of the highest priority entries are full, the iteration and
    manipulation code is much simpler and behaves correctly re: the above bug;
    and the fourth patch removes the unnecessary iteration.

    The second patch adds some minor plist helper functions; nothing new
    really, just functions to match existing regular list functions. These
    are used by the next two patches.

    The third patch adds plist_requeue(), which is used by get_swap_page() in
    the next patch - it performs the requeueing of same-priority entries
    (which moves the entry to the end of its priority in the plist), so that
    all equal-priority swap_info_structs get used equally.

    The fourth patch converts the main list into a plist, and adds a new plist
    that contains only swap_info entries that are both active and not full.
    As Mel suggested using plists allows removing all the ordering code from
    swap - plists handle ordering automatically. The list naming is also
    clarified now that there are two lists, with the original list changed
    from swap_list_head to swap_active_head and the new list named
    swap_avail_head. A new spinlock is also added for the new list, so
    swap_info entries can be added or removed from the new list immediately as
    they become full or not full.

    This patch (of 4):

    Replace the singly-linked list tracking active, i.e. swapon'ed,
    swap_info_struct entries with a doubly-linked list using struct
    list_heads. Simplify the logic iterating and manipulating the list of
    entries, especially get_swap_page(), by using standard list_head
    functions, and removing the highest priority iteration logic.

    The change fixes the bug:
    https://lkml.org/lkml/2014/2/13/181
    in which different priority swap entries after the highest priority entry
    are incorrectly used equally in pairs. The swap behavior is now as
    advertised, i.e. different priority swap entries are used in order, and
    equal priority swap targets are used concurrently.

    Signed-off-by: Dan Streetman
    Acked-by: Mel Gorman
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Dan Streetman
    Cc: Michal Hocko
    Cc: Christian Ehrhardt
    Cc: Weijie Yang
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Bob Liu
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Paul Gortmaker
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Dan Streetman
     
  • commit 457c1b27ed56ec472d202731b12417bff023594a upstream.

    Currently, I am seeing the following when I `mount -t hugetlbfs /none
    /dev/hugetlbfs`, and then simply do a `ls /dev/hugetlbfs`. I think it's
    related to the fact that hugetlbfs is probably not correctly setting
    itself up in this state:

    Unable to handle kernel paging request for data at address 0x00000031
    Faulting instruction address: 0xc000000000245710
    Oops: Kernel access of bad area, sig: 11 [#1]
    SMP NR_CPUS=2048 NUMA pSeries
    ....

    In KVM guests on Power, in a guest not backed by hugepages, we see the
    following:

    AnonHugePages: 0 kB
    HugePages_Total: 0
    HugePages_Free: 0
    HugePages_Rsvd: 0
    HugePages_Surp: 0
    Hugepagesize: 64 kB

    HPAGE_SHIFT == 0 in this configuration, which indicates that hugepages
    are not supported at boot-time, but this is only checked in
    hugetlb_init(). Extract the check to a helper function, and use it in a
    few relevant places.
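
    A sketch of the extracted helper (the real one may be overridden per
    architecture):

        static inline bool hugepages_supported(void)
        {
            /* HPAGE_SHIFT stays 0 when boot-time setup found no usable
             * hugepage size, as in the KVM guest case above */
            return HPAGE_SHIFT != 0;
        }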

    This does make hugetlbfs not supported (not registered at all) in this
    environment. I believe this is fine, as there are no valid hugepages
    and that won't change at runtime.

    [akpm@linux-foundation.org: use pr_info(), per Mel]
    [akpm@linux-foundation.org: fix build when HPAGE_SHIFT is undefined]
    Signed-off-by: Nishanth Aravamudan
    Reviewed-by: Aneesh Kumar K.V
    Acked-by: Mel Gorman
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Nishanth Aravamudan
     

17 Sep, 2014

7 commits

  • commit db1044d458a287c18c4d413adc4ad12e92e253b5 upstream.

    Commit ee7aed4528f ("RDMA/ucma: Support querying for AF_IB addresses")
    added struct sockaddr_storage to rdma_user_cm.h without also adding an
    include for linux/socket.h to make sure it is defined. Systemtap
    needs the header files to build standalone and cannot rely on other
    files to pre-include other headers, so add linux/socket.h to the list
    of includes in this file.

    Fixes: ee7aed4528f ("RDMA/ucma: Support querying for AF_IB addresses")
    Signed-off-by: Doug Ledford
    Signed-off-by: Roland Dreier
    Signed-off-by: Jiri Slaby

    Doug Ledford
     
  • commit 0213436a2cc5e4a5ca2fabfaa4d3877097f3b13f upstream.

    Some devices don't like REPORT SUPPORTED OPERATION CODES and will
    simply time out, causing sd_mod init to take a very, very long time.
    Introduce a BLIST_NO_RSOC scsi scan flag that stops RSOC from being
    issued. Add it to the Promise Vtrak E610f entry in the scsi scan
    blacklist. Fixes bug #79901 reported at
    https://bugzilla.kernel.org/show_bug.cgi?id=79901

    Fixes: 98dcc2946adb ("SCSI: sd: Update WRITE SAME heuristics")

    Signed-off-by: Janusz Dziemidowicz
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jiri Slaby

    Janusz Dziemidowicz
     
  • commit c1d40a527e885a40bb9ea6c46a1b1145d42b66a0 upstream.

    Despite supporting modern SCSI features some storage devices continue to
    claim conformance to an older version of the SPC spec. This is done for
    compatibility with legacy operating systems.

    Linux by default will not attempt to read VPD pages on devices that
    claim SPC-2 or older. Introduce a blacklist flag that can be used to
    trigger VPD page inquiries on devices that are known to support them.

    Reported-by: KY Srinivasan
    Tested-by: KY Srinivasan
    Reviewed-by: KY Srinivasan
    Signed-off-by: Martin K. Petersen
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jiri Slaby

    Martin K. Petersen
     
  • commit 22ffeb48b7584d6cd50f2a595ed6065d86a87459 upstream.

    Sequential scan for more than 256 LUNs is very fragile as
    LUNs might not be numbered sequentially after that point.

    SAM revisions later than SCSI-3 impose a structure on
    LUNs larger than 256, making LUN numbers between 256
    and 16384 illegal.
    SCSI-3, however, allows for plain 64-bit numbers with
    no internal structure.

    So restrict sequential LUN scan to 256 LUNs and add a
    new blacklist flag 'BLIST_SCSI3LUN' to scan up to
    max_lun devices.

    Signed-off-by: Hannes Reinecke
    Reviewed-by: Ewan Milne
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jiri Slaby

    Hannes Reinecke
     
  • commit 7d8b6c63751cfbbe5eef81a48c22978b3407a3ad upstream.

    This is effectively a revert of 7b9a7ec565505699f503b4fcf61500dceb36e744
    plus fixing it a different way...

    We found, when trying to run an application from an application which
    had dropped privs that the kernel does security checks on undefined
    capability bits. This was ESPECIALLY difficult to debug as those
    undefined bits are hidden from /proc/$PID/status.

    Consider a root application which drops all capabilities from ALL 4
    capability sets. We assume, since the application is going to set
    eff/perm/inh from an array, that it will clear not only the defined caps
    less than CAP_LAST_CAP, but also the higher 28ish bits which are
    undefined future capabilities.

    The BSET gets cleared differently. Instead it is cleared one bit at a
    time. The problem here is that in security/commoncap.c::cap_task_prctl()
    we actually check the validity of a capability being read. So any task
    which attempts to 'read all things set in bset' followed by 'unset all
    things set in bset' will not even attempt to unset the undefined bits
    higher than CAP_LAST_CAP.

    So the 'parent' will look something like:
    CapInh: 0000000000000000
    CapPrm: 0000000000000000
    CapEff: 0000000000000000
    CapBnd: ffffffc000000000

    All of this 'should' be fine. Given that these are undefined bits that
    aren't supposed to have anything to do with permissions. But they do...

    So let's now consider a task which cleared the eff/perm/inh completely
    and cleared all of the valid caps in the bset (but not the invalid caps
    it couldn't read out of the kernel). We know that this is exactly what
    the libcap-ng library does and what the go capabilities library does.
    They both leave you in the above situation if you try to clear all of
    your capabilities from all 4 sets. If that root task calls execve()
    the child task will pick up all caps not blocked by the bset. The bset
    however does not block bits higher than CAP_LAST_CAP. So now the child
    task has bits in eff which are not in the parent. These are
    'meaningless' undefined bits, but still bits which the parent doesn't
    have.

    The problem is now in cred_cap_issubset() (or any operation which does a
    subset test) as the child, while a subset for valid cap bits, is not a
    subset for invalid cap bits! So now we set, during commit_creds(), that
    the child is not dumpable, given it is 'more priv' than its parent. It
    also means the parent cannot ptrace the child and other stupidity.

    The solution here:
    1) stop hiding capability bits in status
    This makes debugging easier!

    2) stop giving any task undefined capability bits. it's simple: if you
    don't put those invalid bits in CAP_FULL_SET you won't get them in init
    and you won't get them in any other task either.
    This fixes the cap_issubset() tests and resulting fallout (which
    made the init task in a docker container untraceable among other
    things)

    3) mask out undefined bits when sys_capset() is called as it might use
    ~0, ~0 to denote 'all capabilities' for backward/forward compatibility.
    This lets 'capsh --caps="all=eip" -- -c /bin/bash' run.

    4) mask out undefined bits when we read a file capability off of disk as
    again likely all bits are set in the xattr for forward/backward
    compatibility.
    This lets 'setcap all+pe /bin/bash; /bin/bash' run

    Signed-off-by: Eric Paris
    Reviewed-by: Kees Cook
    Cc: Andrew Vagin
    Cc: Andrew G. Morgan
    Cc: Serge E. Hallyn
    Cc: Kees Cook
    Cc: Steve Grubb
    Cc: Dan Walsh
    Signed-off-by: James Morris
    Signed-off-by: Jiri Slaby

    Eric Paris
     
  • commit 3c45ddf823d679a820adddd53b52c6699c9a05ac upstream.

    The current code always selects XPRT_TRANSPORT_BC_TCP for the back
    channel, even when the forward channel was not TCP (eg, RDMA). When
    a 4.1 mount is attempted with RDMA, the server panics in the TCP BC
    code when trying to send CB_NULL.

    Instead, construct the transport protocol number from the forward
    channel transport or'd with XPRT_TRANSPORT_BC. Transports that do
    not support bi-directional RPC will not have registered a "BC"
    transport, causing create_backchannel_client() to fail immediately.

    Fixes: https://bugzilla.linux-nfs.org/show_bug.cgi?id=265
    Signed-off-by: Chuck Lever
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Jiri Slaby

    Chuck Lever
     
  • commit db9ee220361de03ee86388f9ea5e529eaad5323c upstream.

    It turns out that there are some serious problems with the on-disk
    format of journal checksum v2. The foremost is that the function to
    calculate descriptor tag size returns sizes that are too big. This
    causes alignment issues on some architectures and is compounded by the
    fact that some parts of jbd2 use the structure size (incorrectly) to
    determine the presence of a 64bit journal instead of checking the
    feature flags.

    Therefore, introduce journal checksum v3, which enlarges the
    descriptor block tag format to allow for full 32-bit checksums of
    journal blocks, fix the journal tag function to return the correct
    sizes, and fix the jbd2 recovery code to use feature flags to
    determine 64bitness.

    Add a few function helpers so we don't have to open-code quite so
    many pieces.

    Switching to a 16-byte block size was found to increase journal size
    overhead by a maximum of 0.1% when converting a 32-bit journal with no
    checksumming to a 32-bit journal with checksum v3 enabled.

    Signed-off-by: Darrick J. Wong
    Reported-by: TR Reardon
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Jiri Slaby

    Darrick J. Wong
     

04 Sep, 2014

3 commits


03 Sep, 2014

1 commit

  • commit c6bde215acfd637708142ae671843b6f0eadbc6d upstream.

    This adds a pci_upstream_bridge() interface to find the PCI-to-PCI bridge
    upstream from a device. This is typically just "dev->bus->self", but in
    the case of a VF on a virtual bus, we have to start from the corresponding
    PF. Returns NULL if there is no upstream PCI bridge, i.e., if the device
    is on a root bus.
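
    The described behaviour, as a sketch (compare with the definition added
    to include/linux/pci.h):

        static inline struct pci_dev *pci_upstream_bridge(struct pci_dev *dev)
        {
            dev = pci_physfn(dev);      /* for a VF, start from the owning PF */
            if (pci_is_root_bus(dev->bus))
                return NULL;            /* no upstream PCI bridge */
            return dev->bus->self;
        }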

    Signed-off-by: Bjorn Helgaas
    Acked-by: Yinghai Lu

    Signed-off-by: Jiri Slaby

    Bjorn Helgaas
     

02 Sep, 2014

1 commit

  • commit cbcd1054a1fd2aa980fc11ff28e436fc4aaa2d54 upstream.

    Commit 08778795 ("block: Fix nr_vecs for inline integrity vectors") from
    Martin introduces the function bip_integrity_vecs() (to get the useful
    vectors) to fix the issue about nr_vecs for inline integrity vectors
    that was reported by David Milburn.

    But it seems that bip_integrity_vecs() will return the wrong number if the
    bio is not based on any bio_set for some reason (bio->bi_pool == NULL),
    because in that case the bip_inline_vecs[0] is malloced directly. So
    here we add bip_max_vcnt to record the count of vector slots, and
    clean up the function bip_integrity_vecs().

    Signed-off-by: Gu Zheng
    Cc: Martin K. Petersen
    Cc: Kent Overstreet
    Signed-off-by: Jens Axboe

    Gu Zheng