11 Feb, 2015

1 commit

  • commit 23aaed6659df9adfabe9c583e67a36b54e21df46 upstream.

    walk_page_range() silently skips vmas that have VM_PFNMAP set, which
    leads to undesirable behaviour for the caller of walk_page_range().
    Userspace applications get wrong data, so the effect ranges from merely
    confusing users (if the applications just display the data) to sometimes
    killing processes (if the applications act on virtual addresses they
    have misinterpreted because of the wrong data).

    For example, for pagemap_read, when no callbacks are called against a
    VM_PFNMAP vma, pagemap_read may prepare pagemap data for the next
    virtual address range at the wrong index.

    Eventually userspace may get wrong pagemap data for a task: for a
    VM_PFNMAP-marked vma region, the kernel may report mappings from
    subsequent vma regions, and userspace in turn may account more pages to
    the task than it really has.
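
    For illustration only (this snippet is not part of the patch), a minimal
    userspace pagemap reader shows why the index matters: /proc/pid/pagemap
    holds exactly one 64-bit entry per virtual page, so the file offset is
    derived directly from the virtual address, and a walker that silently
    skips a range shifts every subsequent entry the caller prepares.

        /* minimal sketch of a pagemap consumer; error handling trimmed */
        #include <fcntl.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
                long psize = sysconf(_SC_PAGESIZE);
                int fd = open("/proc/self/pagemap", O_RDONLY);
                unsigned long vaddr = (unsigned long)&fd;  /* any mapped address */
                /* one uint64_t per virtual page: offset follows from vaddr alone */
                off_t off = (off_t)(vaddr / psize) * sizeof(uint64_t);
                uint64_t entry = 0;

                if (fd < 0 || pread(fd, &entry, sizeof(entry), off) != sizeof(entry))
                        return 1;
                printf("present=%d pfn=%llu\n", (int)((entry >> 63) & 1),
                       (unsigned long long)(entry & ((1ULL << 55) - 1)));
                close(fd);
                return 0;
        }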

    In my case I was using procmem and procrank (Android utilities), which
    use the pagemap interface to account the RSS pages of a task. Due to
    this bug they were giving a wrong picture for vmas with VM_PFNMAP set.

    Fixes: a9ff785e4437 ("mm/pagewalk.c: walk_page_range should avoid VM_PFNMAP areas")
    Signed-off-by: Shiraz Hashim
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Shiraz Hashim
     

30 Jan, 2015

30 commits

  • commit 45f87de57f8fad59302fd263dd81ffa4843b5b24 upstream.

    Commit 2457aec63745 ("mm: non-atomically mark page accessed during page
    cache allocation where possible") has added a separate parameter for
    specifying gfp mask for radix tree allocations.

    Not only is this less than optimal from an API point of view because it
    is error prone, it is also currently buggy: grab_cache_page_write_begin
    uses GFP_KERNEL for the radix tree, and if fgp_flags doesn't contain
    FGP_NOFS (mostly controlled by the fs via the AOP_FLAG_NOFS flag) but
    the mapping_gfp_mask has __GFP_FS cleared, then the radix tree
    allocation doesn't obey the restriction and might recurse into the
    filesystem and cause deadlocks. Unfortunately this is the case for most
    filesystems, because only ext4 and gfs2 use AOP_FLAG_NOFS.

    Let's simply remove the radix_gfp_mask parameter because the allocation
    context is the same for both the page cache and the radix tree. Just
    make sure that the radix tree gets only the sane subset of the mask
    (e.g. do not pass __GFP_WRITE).

    In the long term it is preferable to convert the remaining users of
    AOP_FLAG_NOFS to use mapping_gfp_mask instead and simplify this
    interface even further.
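
    A rough sketch of the idea (illustrative, not the literal diff): a
    single gfp mask derived from the mapping drives both allocations, and
    only a sane subset of it is handed to the radix tree.

        /* illustrative fragment of the page-creation path in pagecache_get_page() */
        if (fgp_flags & FGP_NOFS)
                gfp_mask &= ~__GFP_FS;          /* honour AOP_FLAG_NOFS callers */

        page = __page_cache_alloc(gfp_mask);    /* the page cache page itself */
        if (!page)
                return NULL;

        /* same mask for the radix-tree node, minus page-cache-only bits
         * such as __GFP_WRITE */
        err = add_to_page_cache_lru(page, mapping, offset,
                                    gfp_mask & GFP_RECLAIM_MASK);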

    Reported-by: Dave Chinner
    Signed-off-by: Michal Hocko
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     
  • commit 4ffeaf3560a52b4a69cc7909873d08c0ef5909d4 upstream.

    The fair zone allocation policy round-robins allocations between zones
    within a node to avoid age inversion problems during reclaim. If the
    first allocation fails, the batch counts are reset and a second attempt
    made before entering the slow path.

    One assumption made with this scheme is that batches expire at roughly
    the same time and the resets each time are justified. This assumption
    does not hold when zones reach their low watermark as the batches will
    be consumed at uneven rates. Allocation failure due to watermark
    depletion result in additional zonelist scans for the reset and another
    watermark check before hitting the slowpath.

    On UMA, the benefit is negligible -- around 0.25%. On a 4-socket NUMA
    machine it's variable due to the variability of measuring overhead with
    the vmstat changes. The system CPU overhead comparison looks like

                 3.16.0-rc3   3.16.0-rc3     3.16.0-rc3
                    vanilla    vmstat-v5   lowercost-v5
    User             746.94       774.56         802.00
    System         65336.22     32847.27       40852.33
    Elapsed        27553.52     27415.04       27368.46

    However it is worth noting that the overall benchmark still completed
    faster and intuitively it makes sense to take as few passes as possible
    through the zonelists.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit f7b5d647946aae1647bf5cd26c16b3a793c1ac49 upstream.

    The purpose of numa_zonelist_order=zone is to preserve lower zones for
    use with 32-bit devices. If locality is preferred then the
    numa_zonelist_order=node policy should be used.

    Unfortunately, the fair zone allocation policy overrides this by
    skipping zones on remote nodes until the lower one is found. While this
    makes sense from a page aging and performance perspective, it breaks the
    expected zonelist policy. This patch restores the expected behaviour
    for zone-list ordering.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit bb0b6dffa2ccfbd9747ad0cc87c7459622896e60 upstream.

    When kswapd is awake reclaiming, the per-cpu stat thresholds are lowered
    to get more accurate counts and avoid breaching watermarks. This
    threshold update iterates over all possible CPUs, which is unnecessary.
    Only online CPUs need to be updated. If a new CPU is onlined,
    refresh_zone_stat_thresholds() will set the thresholds correctly.
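
    Sketched as a fragment (not the full function), the per-zone threshold
    update becomes:

        /* lower the thresholds only on online CPUs; a CPU onlined later
         * gets its threshold from refresh_zone_stat_thresholds() */
        threshold = (*calculate_pressure)(zone);
        for_each_online_cpu(cpu)                /* was: for_each_possible_cpu() */
                per_cpu_ptr(zone->pageset, cpu)->stat_threshold = threshold;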

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit 0d5d823ab4e608ec7b52ac4410de4cb74bbe0edd upstream.

    zone->pages_scanned is a write-intensive cache line during page reclaim
    and it's also updated during page free. Move the counter into vmstat to
    take advantage of the per-cpu updates and do not update it in the free
    paths unless necessary.

    On a small UMA machine running tiobench the difference is marginal. On
    a 4-node machine the overhead is more noticeable. Note that automatic
    NUMA balancing was disabled for this test as otherwise the system CPU
    overhead is unpredictable.

                 3.16.0-rc3     3.16.0-rc3   3.16.0-rc3
                    vanilla   rearrange-v5    vmstat-v5
    User             746.94         759.78       774.56
    System         65336.22       58350.98     32847.27
    Elapsed        27553.52       27282.02     27415.04

    Note that the overhead reduction will vary depending on where exactly
    pages are allocated and freed.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit 3484b2de9499df23c4604a513b36f96326ae81ad upstream.

    The arrangement of struct zone has changed over time and now it has
    reached the point where there is some inappropriate sharing going on.
    On x86-64, for example:

    o The zone->node field shares a cache line with the zone lock, and
    zone->node is accessed frequently from the page allocator due to the
    fair zone allocation policy.

    o span_seqlock is almost never used but shares a line with free_area.

    o Some zone statistics share a cache line with the LRU lock, so
    reclaim-intensive and allocator-intensive workloads can bounce the cache
    line on a stat update.

    This patch rearranges struct zone to put read-only and read-mostly
    fields together and then splits the page allocator intensive fields, the
    zone statistics and the page reclaim intensive fields into their own
    cache lines. Note that the type of lowmem_reserve changes due to the
    watermark calculations being signed and avoiding a signed/unsigned
    conversion there.
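
    A heavily abridged sketch of the resulting layout (field list trimmed,
    not the actual definition) illustrates the intent:

        struct zone {
                /* read-mostly fields: watermarks, reserves, node id, ... */
                unsigned long           watermark[NR_WMARK];
                long                    lowmem_reserve[MAX_NR_ZONES];  /* now signed */
                int                     node;

                ZONE_PADDING(_pad1_)
                /* write-intensive fields used from the page allocator */
                spinlock_t              lock;
                struct free_area        free_area[MAX_ORDER];

                ZONE_PADDING(_pad2_)
                /* write-intensive fields used by page reclaim */
                spinlock_t              lru_lock;

                ZONE_PADDING(_pad3_)
                /* zone statistics */
                atomic_long_t           vm_stat[NR_VM_ZONE_STAT_ITEMS];
        };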

    On the test configuration I used, the overall size of struct zone shrank
    by one cache line. On smaller machines, this is not likely to be
    noticeable. However, on a 4-node NUMA machine running tiobench the
    system CPU overhead is reduced by this patch.

               3.16.0-rc3       3.16.0-rc3
                  vanilla   rearrange-v5r9
    User           746.94           759.78
    System       65336.22         58350.98
    Elapsed      27553.52         27282.02

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit 24b7e5819ad5cbef2b7c7376510862aa8319d240 upstream.

    This was formerly the series "Improve sequential read throughput" which
    noted some major differences in performance of tiobench since 3.0.
    While there are a number of factors, two that dominated were the
    introduction of the fair zone allocation policy and changes to CFQ.

    The behaviour of the fair zone allocation policy makes more sense than
    tiobench does as a benchmark, and the CFQ defaults were not changed due
    to insufficient benchmarking.

    This series is what's left. It's one functional fix to the fair zone
    allocation policy when used on NUMA machines and a reduction of overhead
    in general. tiobench was used for the comparison despite its flaws as
    an IO benchmark because in this case we are primarily interested in the
    overhead of page allocator and page reclaim activity.

    On UMA, it makes little difference to overhead

               3.16.0-rc3     3.16.0-rc3
                  vanilla   lowercost-v5
    User           383.61         386.77
    System         403.83         401.74
    Elapsed       5411.50        5413.11

    On a 4-socket NUMA machine it's a bit more noticeable

               3.16.0-rc3     3.16.0-rc3
                  vanilla   lowercost-v5
    User           746.94         802.00
    System       65336.22       40852.33
    Elapsed      27553.52       27368.46

    This patch (of 6):

    The LRU insertion and activate tracepoints take a PFN as a parameter,
    forcing the cost of computing it onto the caller. Move that work into
    the tracepoint's fast-assign method to ensure the cost is only incurred
    when the tracepoint is active.
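
    A sketch of the tracepoint shape after the change (fields abridged; the
    event name is used only as an example):

        TRACE_EVENT(mm_lru_insertion,

                TP_PROTO(struct page *page, int lru),

                TP_ARGS(page, lru),

                TP_STRUCT__entry(
                        __field(struct page *, page)
                        __field(unsigned long, pfn)
                        __field(int,           lru)
                ),

                /* the PFN is derived here, so the conversion only runs
                 * when the tracepoint is actually enabled */
                TP_fast_assign(
                        __entry->page = page;
                        __entry->pfn  = page_to_pfn(page);
                        __entry->lru  = lru;
                ),

                TP_printk("page=%p pfn=%lu lru=%d",
                          __entry->page, __entry->pfn, __entry->lru)
        );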

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit 2ab051e11bfa3cbb7b24177f3d6aaed10a0d743e upstream.

    When memory cgroups are enabled, the code that decides to force scanning
    of anonymous pages in get_scan_count() compares global values (free,
    high_watermark) to a value that is restricted to a memory cgroup (file).
    This makes the code over-eager to force an anon scan.

    For instance, it will force an anon scan when scanning a memcg that is
    mainly populated by anonymous pages, even when there are plenty of file
    pages to get rid of in other memcgs, and even when swappiness == 0. It
    breaks the user's expectation about swappiness and hurts performance.

    This patch makes sure that a forced anon scan only happens when there
    are not enough file pages in the whole zone, not just in one random
    memcg.
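
    Roughly, the check becomes the following (illustrative fragment):

        /* only force SCAN_ANON when reclaim is zone-wide and the whole
         * zone's file plus free pages sit below the high watermark */
        if (global_reclaim(sc)) {
                unsigned long zonefile, zonefree;

                zonefree = zone_page_state(zone, NR_FREE_PAGES);
                zonefile = zone_page_state(zone, NR_ACTIVE_FILE) +
                           zone_page_state(zone, NR_INACTIVE_FILE);

                if (unlikely(zonefile + zonefree <= high_wmark_pages(zone))) {
                        scan_balance = SCAN_ANON;
                        goto out;
                }
        }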

    [hannes@cmpxchg.org: cleanups]
    Signed-off-by: Jerome Marchand
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Jerome Marchand
     
  • commit 474750aba88817c53f39424e5567b8e4acc4b39b upstream.

    Richard Yao reported a month ago that his system had trouble with
    vmap_area_lock contention during performance analysis via /proc/meminfo.
    Andrew asked why his analysis hammers /proc/meminfo so heavily, but he
    didn't answer it.

    https://lkml.org/lkml/2014/4/10/416

    Although I'm not sure whether this is the right usage, there is a
    solution that reduces vmap_area_lock contention with no side-effect:
    simply use the RCU list iterator in get_vmalloc_info().

    RCU can be used in this function because all of the RCU protocol is
    already respected by writers, since Nick Piggin's commit db64fe02258f1
    ("mm: rewrite vmap layer") back in linux-2.6.28.

    Specifically:
    insertions use list_add_rcu(),
    deletions use list_del_rcu() and kfree_rcu().

    Note the rb tree is not used from the RCU reader (it would not be safe);
    only the vmap_area_list has full RCU protection.

    Note that __purge_vmap_area_lazy() already uses this RCU protection:

        rcu_read_lock();
        list_for_each_entry_rcu(va, &vmap_area_list, list) {
                if (va->flags & VM_LAZY_FREE) {
                        if (va->va_start < *start)
                                *start = va->va_start;
                        if (va->va_end > *end)
                                *end = va->va_end;
                        nr += (va->va_end - va->va_start) >> PAGE_SHIFT;
                        list_add_tail(&va->purge_list, &valist);
                        va->flags |= VM_LAZY_FREEING;
                        va->flags &= ~VM_LAZY_FREE;
                }
        }
        rcu_read_unlock();

    Peter:

    : While rcu list traversal over the vmap_area_list is safe, this may
    : arrive at different results than the spinlocked version. The rcu list
    : traversal version will not be a 'snapshot' of a single, valid instant
    : of the entire vmap_area_list, but rather a potential amalgam of
    : different list states.

    Joonsoo:

    : Yes, you are right, but I don't think that we should be strict here.
    : Meminfo is already not a 'snapshot' at specific time. While we try to get
    : certain stats, the other stats can change. And, although we may arrive at
    : different results than the spinlocked version, the difference would not be
    : large and would not make serious side-effect.

    [edumazet@google.com: add more commit description]
    Signed-off-by: Joonsoo Kim
    Reported-by: Richard Yao
    Acked-by: Eric Dumazet
    Cc: Peter Hurley
    Cc: Zhang Yanfei
    Cc: Johannes Weiner
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Joonsoo Kim
     
  • commit 21bda264f4243f61dfcc485174055f12ad0530b4 upstream.

    Commit 71e3aac0724f ("thp: transparent hugepage core") adds
    copy_pte_range prototype to huge_mm.h. I'm not sure why (or if) this
    function have been used outside of memory.c, but it currently isn't.
    This patch makes copy_pte_range() static again.

    Signed-off-by: Jerome Marchand
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Jerome Marchand
     
  • commit 14a4e2141e24304fff2c697be6382ffb83888185 upstream.

    Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target
    node") improved the previous khugepaged logic which allocated a
    transparent hugepages from the node of the first page being collapsed.

    However, it is still possible to collapse pages to remote memory which
    may suffer from additional access latency. With the current policy, it
    is possible that 255 pages (with PAGE_SHIFT == 12) will be collapsed
    remotely if the majority are allocated from that node.

    When zone_reclaim_mode is enabled, it means the VM should make every
    attempt to allocate locally to prevent NUMA performance degradation. In
    this case, we do not want to collapse hugepages to remote nodes that
    would suffer from increased access latency. Thus, when
    zone_reclaim_mode is enabled, only allow collapsing to nodes with
    RECLAIM_DISTANCE or less.

    There is no functional change for systems that disable
    zone_reclaim_mode.

    Signed-off-by: David Rientjes
    Cc: Dave Hansen
    Cc: Andrea Arcangeli
    Acked-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Cc: "Kirill A. Shutemov"
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    David Rientjes
     
  • commit c0d73261f5c1355a35b8b40e871d31578ce0c044 upstream.

    Use ACCESS_ONCE() in handle_pte_fault() when getting the entry or
    orig_pte upon which all subsequent decisions and pte_same() tests will
    be made.

    I have no evidence that its lack is responsible for the mm/filemap.c:202
    BUG_ON(page_mapped(page)) in __delete_from_page_cache() found by
    trinity, and I am not optimistic that it will fix it. But I have found
    no other explanation, and ACCESS_ONCE() here will surely not hurt.

    If gcc does re-access the pte before passing it down, then that would be
    disastrous for correct page fault handling, and certainly could explain
    the page_mapped() BUGs seen (concurrent fault causing page to be mapped
    in a second time on top of itself: mapcount 2 for a single pte).
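
    The change itself is tiny; conceptually (illustrative fragment from the
    top of handle_pte_fault()):

        entry = ACCESS_ONCE(*pte);      /* snapshot the pte exactly once */

        /* ... every later decision and pte_same() test uses the local copy ... */
        if (unlikely(!pte_same(*pte, entry)))
                goto unlock;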

    Signed-off-by: Hugh Dickins
    Cc: Sasha Levin
    Cc: Linus Torvalds
    Cc: "Kirill A. Shutemov"
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Hugh Dickins
     
  • commit 66d2f4d28cd030220e7ea2a628993fcabcb956d1 upstream.

    Under shmem swapping load, I sometimes hit the VM_BUG_ON_PAGE(!PageLRU)
    in isolate_lru_pages() at mm/vmscan.c:1281!

    Commit 2457aec63745 ("mm: non-atomically mark page accessed during page
    cache allocation where possible") looks like interrupted work-in-progress.

    mm/filemap.c's call to init_page_accessed() is fine, but not mm/shmem.c's
    - shmem_write_begin() is clearly wrong to use it after shmem_getpage(),
    when the page is always visible in radix_tree, and often already on LRU.

    Revert change to shmem_write_begin(), and use init_page_accessed() or
    mark_page_accessed() appropriately for SGP_WRITE in shmem_getpage_gfp().

    SGP_WRITE also covers shmem_symlink(), which did not mark_page_accessed()
    before; but since many other filesystems use [__]page_symlink(), which did
    and does mark the page accessed, consider this as rectifying an oversight.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Dave Hansen
    Cc: Prabhakar Lad
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Hugh Dickins
     
  • commit 888cf2db475a256fb0cda042140f73d7881f81fe upstream.

    If a page is marked for immediate reclaim then it is moved to the tail
    of the LRU list. This occurs when the system is under enough memory
    pressure for pages under writeback to reach the end of the LRU, but we
    test for this using atomic operations on every writeback completion.
    This patch uses an optimistic non-atomic test first. It'll miss some
    pages in rare cases but the consequences are not severe enough to
    warrant such a penalty.

    While the function does not dominate profiles during a simple dd test,
    the cost of it is reduced.

    73048  0.7428  vmlinux-3.15.0-rc5-mmotm-20140513  end_page_writeback
    23740  0.2409  vmlinux-3.15.0-rc5-lessatomic      end_page_writeback
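
    In end_page_writeback() the test becomes roughly the following fragment
    (previously a single atomic TestClearPageReclaim() ran on every
    completion):

        /* optimistic non-atomic test first; the atomic clear and the LRU
         * rotation only run when the flag was actually set */
        if (PageReclaim(page)) {
                ClearPageReclaim(page);
                rotate_reclaimable_page(page);
        }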

    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit 2457aec63745e235bcafb7ef312b182d8682f0fc upstream.

    aops->write_begin may allocate a new page and make it visible only to
    have mark_page_accessed called almost immediately after. Once the page
    is visible the atomic operations are necessary, which is noticeable
    overhead when writing to an in-memory filesystem like tmpfs but should
    also be noticeable with fast storage. The objective of the patch is to
    initialise the accessed information with non-atomic operations before
    the page is visible.

    The bulk of filesystems directly or indirectly use
    grab_cache_page_write_begin or find_or_create_page for the initial
    allocation of a page cache page. This patch adds an init_page_accessed()
    helper which behaves like the first call to mark_page_accessed() but may
    be called before the page is visible and can be done non-atomically.

    The primary APIs of concern in this case are the following, and they are
    used by most filesystems.

    find_get_page
    find_lock_page
    find_or_create_page
    grab_cache_page_nowait
    grab_cache_page_write_begin

    All of them are very similar in detail, so the patch creates a core
    helper pagecache_get_page() which takes a flags parameter that affects
    its behaviour, such as whether the page should be marked accessed or
    not. The old API is preserved but is basically a thin wrapper around
    this core function.
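
    As a sketch of the thin-wrapper idea (close to, but not necessarily
    identical to, the merged code), grab_cache_page_write_begin() reduces to
    a flag translation plus a call into the common helper:

        struct page *grab_cache_page_write_begin(struct address_space *mapping,
                                                 pgoff_t index, unsigned flags)
        {
                struct page *page;
                int fgp_flags = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_ACCESSED;

                if (flags & AOP_FLAG_NOFS)
                        fgp_flags |= FGP_NOFS;

                page = pagecache_get_page(mapping, index, fgp_flags,
                                          mapping_gfp_mask(mapping));
                if (page)
                        wait_for_stable_page(page);

                return page;
        }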

    Each of the filesystems is then updated to avoid calling
    mark_page_accessed when it is known that the VM interfaces have already
    done the job. There is a slight snag in that the timing of the
    mark_page_accessed() has now changed, so in rare cases it's possible a
    page gets to the end of the LRU as PageReferenced whereas previously it
    might have been repromoted. This is expected to be rare but it's worth
    the filesystem people thinking about it in case they see a problem with
    the timing change. It is also the case that some filesystems may now be
    marking pages accessed that previously were not, but it makes sense that
    filesystems have consistent behaviour in this regard.

    The test case used to evaluate this is a simple dd of a large file done
    multiple times with the file deleted on each iteration. The size of the
    file is 1/10th of physical memory to avoid dirty page balancing. In the
    async case it is possible that the workload completes without even
    hitting the disk and will have variable results, but it highlights the
    impact of mark_page_accessed for async IO. The sync results are expected
    to be more stable. The exception is tmpfs, where the normal case is for
    the "IO" to not hit the disk.

    The test machine was single socket and UMA to avoid any scheduling or
    NUMA artifacts. Throughput and wall times are presented for sync IO;
    only wall times are shown for async as the granularity reported by dd
    and the variability is unsuitable for comparison. As async results were
    variable due to writeback timings, I'm only reporting the maximum
    figures. The sync results were stable enough to make the mean and stddev
    uninteresting.

    The performance results are reported based on a run with no profiling.
    Profile data is based on a separate run with oprofile running.

    async dd
                              3.15.0-rc3            3.15.0-rc3
                                 vanilla           accessed-v2
    ext3  Max elapsed   13.9900 ( 0.00%)    11.5900 ( 17.16%)
    tmpfs Max elapsed    0.5100 ( 0.00%)     0.4900 (  3.92%)
    btrfs Max elapsed   12.8100 ( 0.00%)    12.7800 (  0.23%)
    ext4  Max elapsed   18.6000 ( 0.00%)    13.3400 ( 28.28%)
    xfs   Max elapsed   12.5600 ( 0.00%)     2.0900 ( 83.36%)

    The XFS figure is a bit strange as it managed to avoid a worst case by
    sheer luck but the average figures looked reasonable.

           samples  percentage
    ext3     86107      0.9783  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
    ext3     23833      0.2710  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed
    ext3      5036      0.0573  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
    ext4     64566      0.8961  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
    ext4      5322      0.0713  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed
    ext4      2869      0.0384  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
    xfs      62126      1.7675  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
    xfs       1904      0.0554  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
    xfs        103      0.0030  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed
    btrfs    10655      0.1338  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
    btrfs     2020      0.0273  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
    btrfs      587      0.0079  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed
    tmpfs    59562      3.2628  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
    tmpfs     1210      0.0696  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
    tmpfs       94      0.0054  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed

    [akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Tested-by: Prabhakar Lad
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit 6fb81a17d21f2a138b8f424af4cf379f2b694060 upstream.

    When adding pages to the LRU we clear the active bit unconditionally.
    As the page could be reachable from other paths we cannot use unlocked
    operations without risk of corruption, such as from a parallel
    mark_page_accessed. This patch tests whether it is necessary to clear
    the active flag before using an atomic operation. This potentially
    opens a tiny race when PageActive is checked, as mark_page_accessed
    could be called after PageActive was checked. The race already exists
    but this patch changes it slightly. The consequence is that a page may
    be promoted to the active list that would previously have been left on
    the inactive list. It's too tiny a race and too marginal a consequence
    to always use atomic operations for.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit e3741b506c5088fa8c911bb5884c430f770fb49d upstream.

    There should be no references to it any more and a parallel mark should
    not be reordered against us. Use the non-locked variant to clear the
    page's active flag.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit 07a427884348d38a6fd56fa4d78249c407196650 upstream.

    shmem_getpage_gfp uses an atomic operation to set the SwapBacked flag
    before the page is even added to the LRU or visible. This is unnecessary
    as what could it possibly race against? Use an unlocked variant.
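
    Concretely, the flag is now set with the unlocked helper (illustrative
    fragment from shmem_getpage_gfp()):

        /* the page is brand new and not yet visible anywhere, so nothing
         * can race with us and the non-atomic variant is sufficient */
        __SetPageSwapBacked(page);      /* was: SetPageSwapBacked(page) */
        __set_page_locked(page);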

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit cfc47a2803db42140167b92d991ef04018e162c7 upstream.

    get_pageblock_migratetype() is called during free with IRQs disabled.
    This is unnecessary and disables IRQs for longer than necessary.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit b745bc85f21ea707e4ea1a91948055fa3e72c77b upstream.

    cold is a bool, make it one. Make the likely case the "if" part of the
    block instead of the else, as according to the optimisation manual this
    is preferred.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit dc4b0caff24d9b2918e9f27bc65499ee63187eba upstream.

    In the free path we calculate page_to_pfn multiple times. Reduce that.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit 7aeb09f9104b760fc53c98cb7d20d06640baf9e6 upstream.

    X86 prefers the use of unsigned types for iterators and there is a
    tendency to mix whether a signed or unsigned type is used for page
    order. This converts a number of sites in mm/page_alloc.c to use
    unsigned int for order where possible.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit 5dab29113ca56335c78be3f98bf5ddf2ef8eb6a6 upstream.

    ALLOC_NO_WATERMARK is set in a few cases: always by kswapd, always for
    __GFP_MEMALLOC, sometimes for swap-over-nfs, tasks etc. Each of these
    cases is relatively rare, but the ALLOC_NO_WATERMARK check is an
    unlikely branch in the fast path. This patch moves the check out of the
    fast path to after it has been determined that the watermarks have not
    been met. This helps the common fast path at the cost of making the slow
    path slower and hitting kswapd with a performance cost. It's a
    reasonable tradeoff.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit a6e21b14f22041382e832d30deda6f26f37b1097 upstream.

    Currently it's calculated once per zone in the zonelist.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit d34c5fa06fade08a689fc171bf756fba2858ae73 upstream.

    A node/zone index is used to check if pages are compatible for merging
    but this happens unconditionally even if the buddy page is not free. Defer
    the calculation as long as possible. Ideally we would check the zone boundary
    but nodes can overlap.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit d8846374a85f4290a473a4e2a64c1ba046c4a0e1 upstream.

    There is no need to calculate zone_idx(preferred_zone) multiple times
    or use the pgdat to figure it out.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit 664eeddeef6539247691197c1ac124d4aa872ab6 upstream.

    If cpusets are not in use then we still check a global variable on every
    page allocation. Use jump labels to avoid the overhead.
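
    The mechanism, sketched (the allocator-side check is abridged and the
    exact helper names may differ):

        /* a static key turns the cpuset check into a patched-out no-op
         * branch until the first non-root cpuset is created */
        struct static_key cpusets_enabled_key = STATIC_KEY_INIT_FALSE;

        static inline bool cpusets_enabled(void)
        {
                return static_key_false(&cpusets_enabled_key);
        }

        /* fast path in get_page_from_freelist(), abridged */
        if (cpusets_enabled() &&
            (alloc_flags & ALLOC_CPUSET) &&
            !cpuset_zone_allowed_softwall(zone, gfp_mask))
                continue;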

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit 800a1e750c7b04c2aa2459afca77e936e01c0029 upstream.

    If a zone cannot be used for a dirty page then it gets marked "full" which
    is cached in the zlc and later potentially skipped by allocation requests
    that have nothing to do with dirty zones.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit 65bb371984d6a2c909244eb749e482bb40b72e36 upstream.

    The zlc is used on NUMA machines to quickly skip over zones that are
    full. However it is always updated, even for the first zone scanned
    when the zlc might not even be active. As it's a write to a bitmap that
    potentially bounces a cache line, it's deceptively expensive, although
    most machines will not care. Only update the zlc if it was active.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit 2329d3751b082b4fd354f334a88662d72abac52d upstream.

    In mm/swap.c, __lru_cache_add() is exported, but actually there are no
    users outside this file.

    This patch unexports __lru_cache_add() and makes it static. It also
    exports lru_cache_add_file(), as it is used by cifs and fuse, which can
    be loaded as modules.

    Signed-off-by: Jianyu Zhan
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: Shaohua Li
    Cc: Bob Liu
    Cc: Seth Jennings
    Cc: Joonsoo Kim
    Cc: Rafael Aquini
    Cc: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Khalid Aziz
    Cc: Christoph Hellwig
    Reviewed-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Jianyu Zhan
     

16 Jan, 2015

3 commits

  • commit 690eac53daff34169a4d74fc7bfbd388c4896abb upstream.

    Commit fee7e49d4514 ("mm: propagate error from stack expansion even for
    guard page") made sure that we return the error properly for stack
    growth conditions. It also theorized that counting the guard page
    towards the stack limit might break something, but also said "Let's see
    if anybody notices".

    Somebody did notice. Apparently android-x86 sets the stack limit very
    close to the limit indeed, and including the guard page in the rlimit
    check causes the android 'zygote' process problems.

    So this adds the (fairly trivial) code to make the stack rlimit check be
    against the actual real stack size, rather than the size of the vma that
    includes the guard page.
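
    The rlimit check in acct_stack_growth() becomes, roughly (illustrative
    fragment):

        /* Stack limit test: do not count the guard page against the rlimit */
        actual_size = size;
        if (size && (vma->vm_flags & (VM_GROWSUP | VM_GROWSDOWN)))
                actual_size -= PAGE_SIZE;
        if (actual_size > ACCESS_ONCE(rlim[RLIMIT_STACK].rlim_cur))
                return -ENOMEM;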

    Reported-and-tested-by: Chih-Wei Huang
    Cc: Jay Foad
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     
  • commit fee7e49d45149fba60156f5b59014f764d3e3728 upstream.

    Jay Foad reports that the address sanitizer test (asan) sometimes gets
    confused by a stack pointer that ends up being outside the stack vma
    that is reported by /proc/maps.

    This happens due to an interaction between RLIMIT_STACK and the guard
    page: when we do the guard page check, we ignore the potential error
    from the stack expansion, which effectively results in a missing guard
    page, since the expected stack expansion won't have been done.

    And since /proc/maps explicitly ignores the guard page (commit
    d7824370e263: "mm: fix up some user-visible effects of the stack guard
    page"), the stack pointer ends up being outside the reported stack area.

    This is the minimal patch: it just propagates the error. It also
    effectively makes the guard page part of the stack limit, which in turn
    means that the actual real stack is one page less than the stack limit.

    Let's see if anybody notices. We could teach acct_stack_growth() to
    allow an extra page for a grow-up/grow-down stack in the rlimit test,
    but I don't want to add more complexity if it isn't needed.

    Reported-and-tested-by: Jay Foad
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     
  • commit 9e5e3661727eaf960d3480213f8e87c8d67b6956 upstream.

    Charles Shirron and Paul Cassella from Cray Inc have reported kswapd
    stuck in a busy loop with nothing left to balance, but
    kswapd_try_to_sleep() failing to sleep. Their analysis found the cause
    to be a combination of several factors:

    1. A process is waiting in throttle_direct_reclaim() on pgdat->pfmemalloc_wait

    2. The process has been killed (by OOM in this case), but has not yet been
    scheduled to remove itself from the waitqueue and die.

    3. kswapd checks for throttled processes in prepare_kswapd_sleep():

        if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
                wake_up(&pgdat->pfmemalloc_wait);
                return false;   /* kswapd will not go to sleep */
        }

    However, for a process that was already killed, wake_up() does not remove
    the process from the waitqueue, since try_to_wake_up() checks its state
    first and returns false when the process is no longer waiting.

    4. kswapd is running on the same CPU as the only CPU that the process is
    allowed to run on (through cpus_allowed, or possibly single-cpu system).

    5. CONFIG_PREEMPT_NONE=y kernel is used. If there's nothing to balance, kswapd
    encounters no voluntary preemption points and repeatedly fails
    prepare_kswapd_sleep(), blocking the process from running and removing
    itself from the waitqueue, which would let kswapd sleep.

    So, the source of the problem is that we prevent kswapd from going to
    sleep while there are processes waiting on the pfmemalloc_wait queue,
    and a process waiting on a queue is guaranteed to be removed from the
    queue only when it gets scheduled. This was done to make sure that no
    process is left sleeping on pfmemalloc_wait when kswapd itself goes to
    sleep.

    However, it isn't necessary to postpone kswapd sleep until the
    pfmemalloc_wait queue actually empties. To prevent processes from being
    left sleeping, it's actually enough to guarantee that all processes
    waiting on pfmemalloc_wait queue have been woken up by the time we put
    kswapd to sleep.

    This patch therefore fixes this issue by substituting 'wake_up' with
    'wake_up_all' and removing 'return false' in the code snippet from
    prepare_kswapd_sleep() above. Note that if any process puts itself on
    the queue after this waitqueue_active() check, or after the wake up
    itself, it means that the process will also wake up kswapd - and since
    we are under prepare_to_wait(), the wake up won't be missed. Also we
    update the comment in prepare_kswapd_sleep() to hopefully describe more
    clearly the races it is preventing.
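
    Roughly, the snippet quoted above therefore becomes:

        /* wake every task throttled on pfmemalloc_wait, but no longer
         * refuse to sleep on their behalf: a task queued after this check
         * wakes kswapd itself, and we are already under prepare_to_wait() */
        if (waitqueue_active(&pgdat->pfmemalloc_wait))
                wake_up_all(&pgdat->pfmemalloc_wait);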

    Fixes: 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage")
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Vladimir Davydov
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     

17 Dec, 2014

4 commits

  • commit c4ea95d7cd08d9ffd7fa75e6c5e0332d596dd11e upstream.

    Andrew Morton noticed that the error return from anon_vma_clone() was
    being dropped and replaced with -ENOMEM (which is not itself a bug
    because the only error return value from anon_vma_clone() is -ENOMEM).

    I did an audit of callers of anon_vma_clone() and discovered an actual
    bug where the error return was being lost. In __split_vma(), between
    Linux 3.11 and 3.12 the code was changed so the err variable is used
    before the call to anon_vma_clone() and the default initial value of
    -ENOMEM is overwritten. So a failure of anon_vma_clone() will return
    success since err at this point is now zero.

    Below is a patch which fixes this bug and also propagates the error
    return value from anon_vma_clone() in all cases.

    Fixes: ef0855d334e1 ("mm: mempolicy: turn vma_set_policy() into vma_dup_policy()")
    Signed-off-by: Daniel Forrest
    Reviewed-by: Michal Hocko
    Cc: Konstantin Khlebnikov
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Tim Hartrick
    Cc: Hugh Dickins
    Cc: Michel Lespinasse
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Daniel Forrest
     
  • commit 2022b4d18a491a578218ce7a4eca8666db895a73 upstream.

    I've been seeing swapoff hangs in recent testing: it's cycling around
    trying unsuccessfully to find an mm for some remaining pages of swap.

    I have been exercising swap and page migration more heavily recently,
    and now notice a long-standing error in copy_one_pte(): it's trying to
    add dst_mm to swapoff's mmlist when it finds a swap entry, but is doing
    so even when it's a migration entry or an hwpoison entry.

    Which wouldn't matter much, except it adds dst_mm next to src_mm,
    assuming src_mm is already on the mmlist: which may not be so. Then if
    pages are later swapped out from dst_mm, swapoff won't be able to find
    where to replace them.

    There's already a !non_swap_entry() test for stats: move that up before
    the swap_duplicate() and the addition to mmlist.
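
    After the move, the relevant part of copy_one_pte() reads roughly as
    follows (illustrative fragment):

        if (likely(!non_swap_entry(entry))) {
                if (swap_duplicate(entry) < 0)
                        return entry.val;

                /* make sure dst_mm is on swapoff's mmlist */
                if (unlikely(list_empty(&dst_mm->mmlist))) {
                        spin_lock(&mmlist_lock);
                        if (list_empty(&dst_mm->mmlist))
                                list_add(&dst_mm->mmlist,
                                         &src_mm->mmlist);
                        spin_unlock(&mmlist_lock);
                }
                rss[MM_SWAPENTS]++;
        } else if (is_migration_entry(entry)) {
                /* migration (and hwpoison) entries never touch the mmlist */
        }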

    Signed-off-by: Hugh Dickins
    Cc: Kelley Nielsen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hugh Dickins
     
  • commit 91b57191cfd152c02ded0745250167d0263084f8 upstream.

    In some android devices, there will be a "divide by zero" exception.
    vmpr->scanned could be zero before spin_lock(&vmpr->sr_lock).

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=88051

    [akpm@linux-foundation.org: neaten]
    Reported-by: ji_ang
    Cc: Anton Vorontsov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrew Morton
     
  • commit fb993fa1a2f669215fa03a09eed7848f2663e336 upstream.

    If a frontswap dup-store fails, it should invalidate the expired page
    in the backend, or it could trigger a data corruption issue, such as:

    1. use zswap as the frontswap backend with the writeback feature
    2. store a swap page (version_1) to entry A, success
    3. dup-store a newer page (version_2) to the same entry A, failure
    4. use __swap_writepage() to write the version_2 page to the swapfile, success
    5. zswap shrinks, writing the version_1 page back to the swapfile
    6. the version_2 page is overwritten by version_1, corrupting the data.

    This patch fixes this issue by invalidating the expired data immediately
    when a dup-store failure is encountered.
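
    Sketched against the store path in mm/frontswap.c (the helper names here
    are illustrative and may not match the tree exactly):

        ret = frontswap_ops->store(type, offset, page);
        if (ret == 0) {
                __frontswap_set(sis, offset);           /* mark page present */
                inc_frontswap_succ_stores();
        } else {
                inc_frontswap_failed_stores();
                if (dup) {
                        /* the copy already in the backend is stale and must
                         * never be written back over the newer data */
                        __frontswap_clear(sis, offset);
                        frontswap_ops->invalidate_page(type, offset);
                }
        }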

    Signed-off-by: Weijie Yang
    Cc: Konrad Rzeszutek Wilk
    Cc: Seth Jennings
    Cc: Dan Streetman
    Cc: Minchan Kim
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Weijie Yang
     

22 Nov, 2014

2 commits

  • commit 5bcc9f86ef09a933255ee66bd899d4601785dad5 upstream.

    For MIGRATE_RESERVE pages, it is important that they do not get
    misplaced on the free_list of another migratetype, otherwise they might
    get allocated prematurely and e.g. fragment the MIGRATE_RESERVE
    pageblocks. While this cannot be avoided completely when allocating new
    MIGRATE_RESERVE pageblocks in the min_free_kbytes sysctl handler, we
    should prevent the misplacement where possible.

    Currently, it is possible for the misplacement to happen when a
    MIGRATE_RESERVE page is allocated on pcplist through rmqueue_bulk() as a
    fallback for other desired migratetype, and then later freed back
    through free_pcppages_bulk() without being actually used. This happens
    because free_pcppages_bulk() uses get_freepage_migratetype() to choose
    the free_list, and rmqueue_bulk() calls set_freepage_migratetype() with
    the *desired* migratetype and not the page's original MIGRATE_RESERVE
    migratetype.

    This patch fixes the problem by moving the call to
    set_freepage_migratetype() from rmqueue_bulk() down to
    __rmqueue_smallest() and __rmqueue_fallback() where the actual page's
    migratetype (e.g. from which free_list the page is taken from) is used.
    Note that this migratetype might be different from the pageblock's
    migratetype due to freepage stealing decisions. This is OK, as page
    stealing never uses MIGRATE_RESERVE as a fallback, and also takes care
    to leave all MIGRATE_CMA pages on the correct freelist.

    Therefore, as an additional benefit, the call to
    get_pageblock_migratetype() from rmqueue_bulk() when CMA is enabled, can
    be removed completely. This relies on the fact that MIGRATE_CMA
    pageblocks are created only during system init, and the above. The
    related is_migrate_isolate() check is also unnecessary, as memory
    isolation has other ways to move pages between freelists, and drain pcp
    lists containing pages that should be isolated. The buffered_rmqueue()
    can also benefit from calling get_freepage_migratetype() instead of
    get_pageblock_migratetype().

    Signed-off-by: Vlastimil Babka
    Reported-by: Yong-Taek Lee
    Reported-by: Bartlomiej Zolnierkiewicz
    Suggested-by: Joonsoo Kim
    Acked-by: Joonsoo Kim
    Suggested-by: Mel Gorman
    Acked-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Marek Szyprowski
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Michal Nazarewicz
    Cc: "Wang, Yalin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     
  • commit 1a501907bbea8e6ebb0b16cf6db9e9cbf1d2c813 upstream.

    Commit "mm: vmscan: obey proportional scanning requirements for kswapd"
    ensured that file/anon lists were scanned proportionally for reclaim from
    kswapd but ignored it for direct reclaim. The intent was to minimse
    direct reclaim latency but Yuanhan Liu pointer out that it substitutes one
    long stall for many small stalls and distorts aging for normal workloads
    like streaming readers/writers. Hugh Dickins pointed out that a
    side-effect of the same commit was that when one LRU list dropped to zero
    that the entirety of the other list was shrunk leading to excessive
    reclaim in memcgs. This patch scans the file/anon lists proportionally
    for direct reclaim to similarly age page whether reclaimed by kswapd or
    direct reclaim but takes care to abort reclaim if one LRU drops to zero
    after reclaiming the requested number of pages.

    Based on ext4 and using the Intel VM scalability test

                                              3.15.0-rc5          3.15.0-rc5
                                                shrinker          proportion
    Unit  lru-file-readonce   elapsed     5.3500 ( 0.00%)    5.4200 ( -1.31%)
    Unit  lru-file-readonce   time_range  0.2700 ( 0.00%)    0.1400 ( 48.15%)
    Unit  lru-file-readonce   time_stddv  0.1148 ( 0.00%)    0.0536 ( 53.33%)
    Unit  lru-file-readtwice  elapsed     8.1700 ( 0.00%)    8.1700 (  0.00%)
    Unit  lru-file-readtwice  time_range  0.4300 ( 0.00%)    0.2300 ( 46.51%)
    Unit  lru-file-readtwice  time_stddv  0.1650 ( 0.00%)    0.0971 ( 41.16%)

    The test cases run multiple dd instances reading sparse files. The
    results are within the noise for the small test machine. The impact of
    the patch is more noticeable from the vmstats:

                                  3.15.0-rc5    3.15.0-rc5
                                    shrinker    proportion
    Minor Faults                       35154         36784
    Major Faults                         611          1305
    Swap Ins                             394          1651
    Swap Outs                           4394          5891
    Allocation stalls                 118616         44781
    Direct pages scanned             4935171       4602313
    Kswapd pages scanned            15921292      16258483
    Kswapd pages reclaimed          15913301      16248305
    Direct pages reclaimed           4933368       4601133
    Kswapd efficiency                    99%           99%
    Kswapd velocity               670088.047    682555.961
    Direct efficiency                    99%           99%
    Direct velocity               207709.217    193212.133
    Percentage direct scans              23%           22%
    Page writes by reclaim          4858.000      6232.000
    Page writes file                     464           341
    Page writes anon                    4394          5891

    Note that there are fewer allocation stalls even though the amount
    of direct reclaim scanning is very approximately the same.

    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Tim Chen
    Cc: Dave Chinner
    Tested-by: Yuanhan Liu
    Cc: Bob Liu
    Cc: Jan Kara
    Cc: Rik van Riel
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman