08 Apr, 2014

40 commits

  • There's only one caller of set_page_dirty_balance() and that will call it
    with page_mkwrite == 0.

    The page_mkwrite argument was unused since commit b827e496c893 "mm: close
    page_mkwrite races".
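
    In effect, the helper loses its dead parameter. A sketch of the before/after
    prototypes, as implied by the description:

    /* before */
    void set_page_dirty_balance(struct page *page, int page_mkwrite);

    /* after: the only caller always passed page_mkwrite == 0 */
    void set_page_dirty_balance(struct page *page);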

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • A BUG_ON(!PageLocked) was triggered in mlock_vma_page() by Sasha Levin
    fuzzing with trinity. The call site try_to_unmap_cluster() does not lock
    the pages other than its check_page parameter (which is already locked).

    The BUG_ON in mlock_vma_page() is not documented and its purpose is
    somewhat unclear, but apparently it serializes against page migration,
    which could otherwise fail to transfer the PG_mlocked flag. This would
    not be fatal, as the page would be eventually encountered again, but
    NR_MLOCK accounting would become distorted nevertheless. This patch adds
    a comment to the BUG_ON in mlock_vma_page() and munlock_vma_page() to that
    effect.

    The call site try_to_unmap_cluster() is fixed so that for page !=
    check_page, trylock_page() is attempted (to avoid possible deadlocks as we
    already have check_page locked) and mlock_vma_page() is performed only
    upon success. If the page lock cannot be obtained, the page is left
    without PG_mlocked, which is again not a problem in the whole unevictable
    memory design.
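
    A sketch of the fixed call site in try_to_unmap_cluster(), following the
    description above (simplified, not the verbatim kernel code):

    if (locked_vma) {
            if (page == check_page) {
                    mlock_vma_page(page);   /* check_page is already locked */
            } else if (trylock_page(page)) {
                    /* only mlock if we can take the lock without risking deadlock */
                    mlock_vma_page(page);
                    unlock_page(page);
            }
            /* if trylock_page() fails, the page is simply left without PG_mlocked */
            continue;       /* don't unmap mlocked pages */
    }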

    Signed-off-by: Vlastimil Babka
    Signed-off-by: Bob Liu
    Reported-by: Sasha Levin
    Cc: Wanpeng Li
    Cc: Michel Lespinasse
    Cc: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • On NUMA systems, a node may start thrashing cache or even swap anonymous
    pages while there are still free pages on remote nodes.

    This is a result of commits 81c0a2bb515f ("mm: page_alloc: fair zone
    allocator policy") and fff4068cba48 ("mm: page_alloc: revert NUMA aspect
    of fair allocation policy").

    Before those changes, the allocator would first try all allowed zones,
    including those on remote nodes, before waking any kswapds. But now,
    the allocator fastpath doubles as the fairness pass, which in turn can
    only consider the local node to prevent remote spilling based on
    exhausted fairness batches alone. Remote nodes are only considered in
    the slowpath, after the kswapds are woken up. But if remote nodes still
    have free memory, kswapd should not be woken to rebalance the local node
    or it may thrash the cache or swap prematurely.

    Fix this by adding one more unfair pass over the zonelist that is
    allowed to spill to remote nodes after the local fairness pass fails but
    before entering the slowpath and waking the kswapds.

    This also gets rid of the GFP_THISNODE exemption from the fairness
    protocol because the unfair pass is no longer tied to kswapd, which
    GFP_THISNODE is not allowed to wake up.

    However, because remote spills can be more frequent now - we prefer them
    over local kswapd reclaim - the allocation batches on remote nodes could
    underflow more heavily. When resetting the batches, use
    atomic_long_read() directly instead of zone_page_state() to calculate the
    delta as the latter filters negative counter values.
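
    A sketch of the batch reset using the raw counter, as described (field and
    helper names follow the changelog and the zone layout of this era; treat
    them as assumptions):

    /*
     * zone_page_state() clamps negative counters to zero, which would hide
     * how far NR_ALLOC_BATCH has underflowed.  Read the raw counter so the
     * reset restores the full batch.
     */
    mod_zone_page_state(zone, NR_ALLOC_BATCH,
            high_wmark_pages(zone) - low_wmark_pages(zone) -
            atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));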

    Signed-off-by: Johannes Weiner
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Cc: [3.12+]

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • mem_cgroup_newpage_charge is used only for charging anonymous memory so
    it is better to rename it to mem_cgroup_charge_anon.

    mem_cgroup_cache_charge is used for file backed memory so rename it to
    mem_cgroup_charge_file.

    Signed-off-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Some callsites pass a memcg directly, some callsites pass an mm that
    then has to be translated to a memcg. This makes for a terrible
    function interface.

    Just push the mm-to-memcg translation into the respective callsites and
    always pass a memcg to mem_cgroup_try_charge().

    [mhocko@suse.cz: add charge mm helper]
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • __mem_cgroup_try_charge duplicates get_mem_cgroup_from_mm for charges
    which came without a memcg. The only reason seems to be a tiny
    optimization when css_tryget is not called if the charge can be consumed
    from the stock. Nevertheless css_tryget is very cheap since it has been
    reworked to use per-cpu counting so this optimization doesn't give us
    anything these days.

    So let's drop the code duplication so that the code is more readable.

    Signed-off-by: Michal Hocko
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Instead of returning NULL from try_get_mem_cgroup_from_mm() when the mm
    owner is exiting, just return root_mem_cgroup. This makes sense for all
    callsites and gets rid of some of them having to fall back manually.
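
    A simplified sketch of the new fallback (reference counting and the RCU
    retry loop are omitted; not the exact kernel code):

    struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
    {
            struct mem_cgroup *memcg;

            rcu_read_lock();
            memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
            if (unlikely(!memcg))
                    memcg = root_mem_cgroup;  /* owner is exiting: charge the root group */
            rcu_read_unlock();
            return memcg;
    }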

    [fengguang.wu@intel.com: fix warnings]
    Signed-off-by: Johannes Weiner
    Signed-off-by: Fengguang Wu
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Users pass either a mm that has been established under task lock, or use
    a verified current->mm, which means the task can't be exiting.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Only page cache charges can happen without an mm context, so push this
    special case out of the inner core and into the cache charge function.

    An ancient comment explains that the mm can also be NULL in case the
    task is currently being migrated, but that is no longer actually the
    case, so just remove it.

    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • mem_cgroup_charge_common() is used by both cache and anon pages, but
    most of its body only applies to anon pages and the remainder is not
    worth having in a separate function.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • It used to disable preemption and run sanity checks but now it's only
    taking a number out of one percpu counter and putting it into another.
    Do this directly in the callsite and save the indirection.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • lock_page_cgroup() already disables preemption, so remove the explicit
    preemption disabling from code paths that hold this lock.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • I tried to use 'dump_page(page, __func__)' for debugging, but it
    triggers a warning:

    warning: passing argument 2 of `dump_page' discards `const' qualifier from pointer target type [enabled by default]

    Let's convert 'reason' to 'const char *' in dump_page() and friends: we
    shouldn't modify it anyway.
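
    The resulting prototype, as described (shown as a sketch):

    /* 'reason' is never modified, so take it as const */
    void dump_page(struct page *page, const char *reason);

    With that, a call like dump_page(page, __func__) compiles cleanly, since
    __func__ has type const char[].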

    Signed-off-by: Kirill A. Shutemov
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • vm_map_ram() has a fragmentation problem: it cannot purge a chunk
    (i.e., a 4M address space) if there is a pinning object in that
    address space, so it can easily consume all of the VMALLOC address
    space.

    We could fix the fragmentation problem by using vmap() instead of
    vm_map_ram(), but vmap() is known to be slow compared to vm_map_ram().
    Minchan said vm_map_ram() is 5 times faster than vmap() in his tests.
    So I thought we should fix the fragmentation problem of vm_map_ram()
    because our proprietary GPU driver uses it heavily.

    On second thought, it's not easy, because we would have to reuse freed
    space to solve the problem, and that could cause more IPIs and bitmap
    operations for searching holes. That would work against the API's goal,
    which is very fast mapping. And the fragmentation problem wouldn't show
    up on 64-bit machines anyway.

    Another option is for the user to separate long-lived and short-lived
    objects, and use vmap() for the long-lived ones but vm_map_ram() for the
    short-lived ones. If we inform users about this characteristic of
    vm_map_ram(), they can choose one according to the page lifetime.

    Let's add some notice messages for users.
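
    An illustration of the suggested split, under the signatures of this era
    (the helper names are hypothetical; vm_map_ram() still took a pgprot_t
    argument at this point):

    /* long-lived mapping: slower, but does not pin a whole vmap block */
    void *map_long_lived(struct page **pages, unsigned int count)
    {
            return vmap(pages, count, VM_MAP, PAGE_KERNEL);   /* free with vunmap() */
    }

    /* short-lived mapping: fast, but a pinned mapping blocks purging its 4M block */
    void *map_short_lived(struct page **pages, unsigned int count)
    {
            return vm_map_ram(pages, count, -1, PAGE_KERNEL); /* free with vm_unmap_ram(addr, count) */
    }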

    [akpm@linux-foundation.org: tweak comment text]
    Signed-off-by: Gioh Kim
    Reviewed-by: Zhang Yanfei
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gioh Kim
     
  • Signed-off-by: Choi Gi-yong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Choi Gi-yong
     
  • Add unlikely and likely hints to the function mempool_free. It lays out
    the code in such a way that the common path is executed straight through,
    which saves a cache line.
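
    The resulting shape of the function is roughly the following (a sketch of
    the pattern rather than the verbatim code):

    void mempool_free(void *element, mempool_t *pool)
    {
            unsigned long flags;

            if (unlikely(element == NULL))
                    return;

            /* uncommon case: the pool is below min_nr, refill it */
            if (unlikely(pool->curr_nr < pool->min_nr)) {
                    spin_lock_irqsave(&pool->lock, flags);
                    if (likely(pool->curr_nr < pool->min_nr)) {
                            add_element(pool, element);
                            spin_unlock_irqrestore(&pool->lock, flags);
                            wake_up(&pool->wait);
                            return;
                    }
                    spin_unlock_irqrestore(&pool->lock, flags);
            }

            /* common case: just hand the element back to the underlying allocator */
            pool->free(element, pool->pool_data);
    }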

    Signed-off-by: Mikulas Patocka
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
     
  • The conditions that control the isolation mode in
    isolate_migratepages_range() do not change during the iteration, so
    extract them out and only define the value once.

    This actually does have an effect, gcc doesn't optimize it itself because
    of cc->sync.
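
    A sketch of the hoisted, loop-invariant computation (flag and field names
    are assumptions based on the compaction code of this era):

    /* computed once before the pfn loop instead of on every page */
    const isolate_mode_t mode = (!cc->sync ? ISOLATE_ASYNC_MIGRATE : 0) |
                                (unevictable ? ISOLATE_UNEVICTABLE : 0);

    for (; low_pfn < end_pfn; low_pfn++) {
            /* ... per-page checks ... */
            if (__isolate_lru_page(page, mode) != 0)
                    continue;
            /* ... isolate the page ... */
    }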

    Signed-off-by: David Rientjes
    Cc: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: Vlastimil Babka
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The res_counter_{charge,uncharge}_locked() variants are not used in the
    kernel outside of the resource counter code itself, so remove the
    interface.

    Signed-off-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Jianguo Wu
    Cc: Tim Hockin
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • PF_MEMPOLICY is an unnecessary optimization for CONFIG_SLAB users.
    There's no significant performance degradation to checking
    current->mempolicy rather than current->flags & PF_MEMPOLICY in the
    allocation path, especially since this is considered unlikely().
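
    Schematically, the allocation-path test becomes a direct pointer check
    (the helper name is hypothetical; the surrounding slab code is simplified):

    /* before: return current->flags & PF_MEMPOLICY; */
    static inline bool slab_needs_policy(void)
    {
            return unlikely(current->mempolicy != NULL);
    }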

    Running TCP_RR with netperf-2.4.5 through localhost on 16 cpu machine with
    64GB of memory and without a mempolicy:

    threads     before      after
         16    1249409    1244487
         32    1281786    1246783
         48    1239175    1239138
         64    1244642    1241841
         80    1244346    1248918
         96    1266436    1254316
        112    1307398    1312135
        128    1327607    1326502

    Per-process flags are a scarce resource so we should free them up whenever
    possible and make them available. We'll be using it shortly for memcg oom
    reserves.

    Signed-off-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Jianguo Wu
    Cc: Tim Hockin
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • slab_node() is actually a mempolicy function, so rename it to
    mempolicy_slab_node() to make it clearer that it is used for processes
    with mempolicies.

    At the same time, cleanup its code by saving numa_mem_id() in a local
    variable (since we require a node with memory, not just any node) and
    remove an obsolete comment that assumes the mempolicy is actually passed
    into the function.

    Signed-off-by: David Rientjes
    Acked-by: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Jianguo Wu
    Cc: Tim Hockin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • copy_flags() does not use its clone_flags formal parameter and can be
    collapsed into copy_process() for cleaner code.

    Signed-off-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Jianguo Wu
    Cc: Tim Hockin
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • To increase compiler portability there is <linux/compiler.h>, which
    provides convenience macros for various gcc constructs, e.g. __weak for
    __attribute__((weak)). I've replaced all instances of gcc attributes
    with the right macro in the memory management (/mm) subsystem.

    [akpm@linux-foundation.org: while-we're-there consistency tweaks]
    Signed-off-by: Gideon Israel Dsouza
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gideon Israel Dsouza
     
  • This patch is a continuation of efforts trying to optimize find_vma(),
    avoiding potentially expensive rbtree walks to locate a vma upon faults.
    The original approach (https://lkml.org/lkml/2013/11/1/410), where the
    largest vma was also cached, ended up being too specific and random,
    thus further comparison with other approaches was needed. There are
    two things to consider when dealing with this: the cache hit rate and
    the latency of find_vma(). Improving the hit rate does not necessarily
    translate into finding the vma any faster, as the overhead of any fancy
    caching schemes can be too high to consider.

    We currently cache the last used vma for the whole address space, which
    provides a nice optimization, reducing the total cycles in find_vma() by
    up to 250%, for workloads with good locality. On the other hand, this
    simple scheme is pretty much useless for workloads with poor locality.
    Analyzing ebizzy runs shows that, no matter how many threads are
    running, the mmap_cache hit rate is less than 2%, and in many situations
    below 1%.

    The proposed approach is to replace this scheme with a small per-thread
    cache, maximizing hit rates at a very low maintenance cost.
    Invalidations are performed by simply bumping up a 32-bit sequence
    number. The only expensive operation is in the rare case of a seq
    number overflow, where all caches that share the same address space are
    flushed. Upon a miss, the proposed replacement policy is based on the
    page number that contains the virtual address in question. Concretely,
    the following results are seen on an 80 core, 8 socket x86-64 box:

    1) System bootup: Most programs are single threaded, so the per-thread
    scheme does improve ~50% hit rate by just adding a few more slots to
    the cache.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       | 50.61%   | 19.90            |
    | patched        | 73.45%   | 13.58            |
    +----------------+----------+------------------+

    2) Kernel build: This one is already pretty good with the current
    approach as we're dealing with good locality.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       | 75.28%   | 11.03            |
    | patched        | 88.09%   | 9.31             |
    +----------------+----------+------------------+

    3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       | 70.66%   | 17.14            |
    | patched        | 91.15%   | 12.57            |
    +----------------+----------+------------------+

    4) Ebizzy: There's a fair amount of variation from run to run, but this
    approach always shows nearly perfect hit rates, while baseline is just
    about non-existent. The cycle counts can fluctuate anywhere from ~60 to
    ~116 billion for the baseline scheme, but this approach reduces them
    considerably. For instance, with 80 threads:

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       | 1.06%    | 91.54            |
    | patched        | 99.97%   | 14.18            |
    +----------------+----------+------------------+
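
    A minimal user-space sketch of the per-thread cache with sequence-number
    invalidation described above (names and sizes are illustrative, not the
    kernel implementation):

    #define VMACACHE_BITS 2
    #define VMACACHE_SIZE (1U << VMACACHE_BITS)
    #define VMACACHE_MASK (VMACACHE_SIZE - 1)

    struct vma;                              /* stands in for vm_area_struct */

    struct mm {                              /* shared address space */
            unsigned int seqnum;             /* bumped on any vma change */
    };

    struct thread_cache {                    /* one per thread */
            unsigned int seqnum;             /* copy of mm->seqnum when filled */
            struct vma *slot[VMACACHE_SIZE];
    };

    /* invalidate every thread's cache: just bump the shared sequence number */
    static void vmacache_invalidate(struct mm *mm)
    {
            mm->seqnum++;                    /* an overflow would require flushing all caches */
    }

    static struct vma *vmacache_find(struct thread_cache *tc, struct mm *mm,
                                     unsigned long addr, unsigned long page_shift)
    {
            unsigned int idx = (addr >> page_shift) & VMACACHE_MASK;

            if (tc->seqnum != mm->seqnum)    /* stale: treat the cache as empty */
                    return NULL;
            return tc->slot[idx];            /* caller still verifies the vma range */
    }

    static void vmacache_update(struct thread_cache *tc, struct mm *mm,
                                unsigned long addr, unsigned long page_shift,
                                struct vma *vma)
    {
            tc->seqnum = mm->seqnum;
            tc->slot[(addr >> page_shift) & VMACACHE_MASK] = vma;
    }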

    [akpm@linux-foundation.org: fix nommu build, per Davidlohr]
    [akpm@linux-foundation.org: document vmacache_valid() logic]
    [akpm@linux-foundation.org: attempt to untangle header files]
    [akpm@linux-foundation.org: add vmacache_find() BUG_ON]
    [hughd@google.com: add vmacache_valid_mm() (from Oleg)]
    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: adjust and enhance comments]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Acked-by: Linus Torvalds
    Reviewed-by: Michel Lespinasse
    Cc: Oleg Nesterov
    Tested-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • In shmem/tmpfs, we also use the generic filemap_map_pages; it seems the
    additional checking is not worth a separate version of map_pages for it.

    Signed-off-by: Ning Qu
    Acked-by: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Dave Hansen
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ning Qu
     
  • Let's allow people to tweak faultaround at runtime.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Kirill A. Shutemov
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Alexander Viro
    Cc: Dave Chinner
    Cc: Ning Qu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Minor cleanups:
    - 'size' variable is now in bytes, not pages;
    - use round_up(): it should be easier to read.

    Signed-off-by: Kirill A. Shutemov
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Alexander Viro
    Cc: Dave Chinner
    Cc: Ning Qu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • filemap_map_pages() is a generic implementation of ->map_pages() for
    filesystems that use the page cache.

    It should be safe to use filemap_map_pages() for ->map_pages() if the
    filesystem uses filemap_fault() for ->fault().
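
    For a filesystem that already uses filemap_fault(), the hookup would look
    something like this (illustrative; the ops-struct name is hypothetical):

    static const struct vm_operations_struct example_file_vm_ops = {
            .fault          = filemap_fault,        /* already using the generic fault path */
            .map_pages      = filemap_map_pages,    /* new: enables fault-around from page cache */
    };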

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Linus Torvalds
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Alexander Viro
    Cc: Dave Chinner
    Cc: Ning Qu
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Here's a new version of the faultaround patchset. It took a while to
    tune it and collect performance data.

    The first patch adds a new callback, ->map_pages, to vm_operations_struct.

    ->map_pages() is called when the VM asks to map easily accessible pages.
    The filesystem should find and map pages associated with offsets from
    "pgoff" till "max_pgoff". ->map_pages() is called with the page table
    locked and must not block. If it's not possible to reach a page without
    blocking, the filesystem should skip it. The filesystem should use
    do_set_pte() to set up the page table entry. A pointer to the entry
    associated with offset "pgoff" is passed in the "pte" field of the
    vm_fault structure. Pointers to entries for other offsets should be
    calculated relative to "pte".
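
    An illustrative skeleton of a ->map_pages() implementation following that
    contract (the vm_fault fields and the do_set_pte() argument order are
    assumptions based on this description; error handling is omitted):

    static void example_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
    {
            struct address_space *mapping = vma->vm_file->f_mapping;
            unsigned long addr = (unsigned long)vmf->virtual_address;
            pte_t *pte = vmf->pte;          /* entry for "pgoff"; others are relative */
            pgoff_t pgoff = vmf->pgoff;

            for (; pgoff <= vmf->max_pgoff; pgoff++, pte++, addr += PAGE_SIZE) {
                    struct page *page = find_get_page(mapping, pgoff);

                    if (!page)
                            continue;               /* cold cache: leave it to ->fault() */
                    if (!trylock_page(page)) {      /* must not block: page table is locked */
                            put_page(page);
                            continue;
                    }
                    if (PageUptodate(page) && pte_none(*pte)) {
                            do_set_pte(vma, addr, page, pte, false, false);
                            unlock_page(page);      /* the new pte now holds the reference */
                            continue;
                    }
                    unlock_page(page);
                    put_page(page);
            }
    }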

    Currently the VM uses ->map_pages only on the read page fault path. We
    try to map FAULT_AROUND_PAGES at a time; FAULT_AROUND_PAGES is 16 for
    now. Performance data for different FAULT_AROUND_ORDER values is below.

    TODO:
    - implement ->map_pages() for shmem/tmpfs;
    - modify get_user_pages() to be able to use ->map_pages() and implement
    mmap(MAP_POPULATE|MAP_NONBLOCK) on top.

    =========================================================================
    Tested on 4-socket machine (120 threads) with 128GiB of RAM.

    A few real-world workloads. The sweet spot for FAULT_AROUND_ORDER here
    is somewhere between 3 and 5. Let's say 4 :)

    Linux build (make -j60)
    FAULT_AROUND_ORDER   Baseline        1               3               4               5               7               9
      minor-faults       283,301,572     247,151,987     212,215,789     204,772,882     199,568,944     194,703,779     193,381,485
      time, seconds      151.227629483   153.920996480   151.356125472   150.863792049   150.879207877   151.150764954   151.450962358

    Linux rebuild (make -j60)
    FAULT_AROUND_ORDER   Baseline        1               3               4               5               7               9
      minor-faults       5,396,854       4,148,444       2,855,286       2,577,282       2,361,957       2,169,573       2,112,643
      time, seconds      27.404543757    27.559725591    27.030057426    26.855045126    26.678618635    26.974523490    26.761320095

    Git test suite (make -j60 test)
    FAULT_AROUND_ORDER   Baseline        1               3               4               5               7               9
      minor-faults       129,591,823     99,200,751      66,106,718      57,606,410      51,510,808      45,776,813      44,085,515
      time, seconds      66.087215026    64.784546905    64.401156567    65.282708668    66.034016829    66.793780811    67.237810413

    Two synthetic tests: access every word in a file in sequential/random
    order. It doesn't improve much after FAULT_AROUND_ORDER == 4.

    Sequential access 16GiB file
    FAULT_AROUND_ORDER   Baseline        1               3               4               5               7               9
    1 thread
      minor-faults       4,195,437       2,098,275       525,068         262,251         131,170         32,856          8,282
      time, seconds      7.250461742     6.461711074     5.493859139     5.488488147     5.707213983     5.898510832     5.109232856
    8 threads
      minor-faults       33,557,540      16,892,728      4,515,848       2,366,999       1,423,382       442,732         142,339
      time, seconds      16.649304881    9.312555263     6.612490639     6.394316732     6.669827501     6.75078944      6.371900528
    32 threads
      minor-faults       134,228,222     67,526,810      17,725,386      9,716,537       4,763,731       1,668,921       537,200
      time, seconds      49.164430543    29.712060103    12.938649729    10.175151004    11.840094583    9.594081325     9.928461797
    60 threads
      minor-faults       251,687,988     126,146,952     32,919,406      18,208,804      10,458,947      2,733,907       928,217
      time, seconds      86.260656897    49.626551828    22.335007632    17.608243696    16.523119035    16.339489186    16.326390902
    120 threads
      minor-faults       503,352,863     252,939,677     67,039,168      35,191,827      19,170,091      4,688,357       1,471,862
      time, seconds      124.589206333   79.757867787    39.508707872    32.167281632    29.972989292    28.729834575    28.042251622

    Random access 1GiB file
    1 thread
      minor-faults       262,636         132,743         34,369          17,299          8,527           3,451           1,222
      time, seconds      15.351890914    16.613802482    16.569227308    15.179220992    16.557356122    16.578247824    15.365266994
    8 threads
      minor-faults       2,098,948       1,061,871       273,690         154,501         87,110          25,663          7,384
      time, seconds      15.040026343    15.096933500    14.474757288    14.289129964    14.411537468    14.296316837    14.395635804
    32 threads
      minor-faults       8,390,734       4,231,023       1,054,432       528,847         269,242         97,746          26,881
      time, seconds      20.430433109    21.585235358    22.115062928    14.872878951    14.880856305    14.883370649    14.821261690
    60 threads
      minor-faults       15,733,258      7,892,809       1,973,393       988,266         594,789         164,994         51,691
      time, seconds      26.577302548    25.692397770    18.728863715    20.153026398    21.619101933    17.745086260    17.613215273
    120 threads
      minor-faults       31,471,111      15,816,616      3,959,209       1,978,685       1,008,299       264,635         96,010
      time, seconds      41.835322703    40.459786095    36.085306105    35.313894834    35.814445675    36.552633793    34.289210594

    Touch only one page in page table in 16GiB file
    FAULT_AROUND_ORDER   Baseline        1               3               4               5               7               9
    1 thread
      minor-faults       8,372           8,324           8,270           8,260           8,249           8,239           8,237
      time, seconds      0.039892712     0.045369149     0.051846126     0.063681685     0.079095975     0.17652406      0.541213386
    8 threads
      minor-faults       65,731          65,681          65,628          65,620          65,608          65,599          65,596
      time, seconds      0.124159196     0.488600638     0.156854426     0.191901957     0.242631486     0.543569456     1.677303984
    32 threads
      minor-faults       262,388         262,341         262,285         262,276         262,266         262,257         263,183
      time, seconds      0.452421421     0.488600638     0.565020946     0.648229739     0.789850823     1.651584361     5.000361559
    60 threads
      minor-faults       491,822         491,792         491,723         491,711         491,701         491,691         491,825
      time, seconds      0.763288616     0.869620515     0.980727360     1.161732354     1.466915814     3.04041448      9.308612938
    120 threads
      minor-faults       983,466         983,655         983,366         983,372         983,363         984,083         984,164
      time, seconds      1.595846553     1.667902182     2.008959376     2.425380942     2.941368804     5.977807890     18.401846125

    This patch (of 2):

    Introduce a new vm_ops callback, ->map_pages(), and use it for mapping
    easily accessible pages around the fault address.

    On a read page fault, if the filesystem provides ->map_pages(), we try to
    map up to FAULT_AROUND_PAGES pages around the fault address in the hope
    of reducing the number of minor page faults.

    We call ->map_pages() first and use ->fault() as a fallback if the page
    at the given offset is not ready to be mapped (cold page cache or
    something).

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Linus Torvalds
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Alexander Viro
    Cc: Dave Chinner
    Cc: Ning Qu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • "mm: introduce vm_ops->map_pages()" wants to export a do_set_pte() from core
    kernel. Rename lguest's do_set_pte() to something more lguest-specific.

    Cc: "Kirill A. Shutemov"
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • After this patch 'page-types' can walk over a file's mappings and
    analyze populated page cache pages, mostly without disturbing their
    state.

    It maps a chunk of the file, marks the VMA as MADV_RANDOM to turn off
    readahead, pokes the VMA via mincore() to determine which pages are
    cached, triggers page faults only for those, and finally gathers
    information via pagemap/kpageflags. Before unmapping, it marks the VMA
    as MADV_SEQUENTIAL so that its reference bits are ignored.
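
    The core of that probing technique, in stand-alone form (a simplified
    user-space sketch with error handling omitted, not the actual tool):

    #include <fcntl.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Touch only the pages of 'path' that are already in the page cache. */
    static void probe_cached(const char *path)
    {
            long psz = sysconf(_SC_PAGESIZE);
            int fd = open(path, O_RDONLY);
            struct stat st;
            volatile char sink;

            fstat(fd, &st);
            size_t len = st.st_size;
            size_t pages = (len + psz - 1) / psz;
            unsigned char *vec = malloc(pages);

            char *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
            madvise(map, len, MADV_RANDOM);         /* keep readahead out of the picture */
            mincore(map, len, vec);                 /* which pages are resident? */

            for (size_t i = 0; i < pages; i++)
                    if (vec[i] & 1)
                            sink = map[i * psz];    /* fault in resident pages only */

            madvise(map, len, MADV_SEQUENTIAL);     /* so our reference bits are ignored */
            munmap(map, len);
            free(vec);
            close(fd);
            (void)sink;
    }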

    usage: page-types -f <path>

    If <path> is a directory, it will analyse all files in all
    subdirectories.

    Symlinks and mount points are not followed. Hardlinks aren't handled;
    they'll be dumped as many times as they are found. The recursive walk
    brings all dentries into the dcache and populates the page cache of
    block devices, aka 'Buffers'.

    It's probably worth adding an ioctl for dumping a file's page cache as an
    array of PFNs as a replacement for this hackish juggling with
    mmap/madvise/mincore/pagemap. Also, the recursive walk could be replaced
    with dumping cached inodes via some ioctl or debugfs interface followed
    by opening them via open_by_handle_at(); that would fix hardlink handling
    and avoid the unneeded population of the dcache and buffers. Such an
    interface might be used as a data source for constructing readahead plans
    and for background optimizations of actively used files.

    collateral changes:
    + fix 64-bit LFS: define _FILE_OFFSET_BITS instead of _LARGEFILE64_SOURCE
    + replace lseek + read with single pread
    + make show_page_range() reusable after flush

    usage example:

    ~/src/linux/tools/vm$ sudo ./page-types -L -f page-types
    foffset offset flags
    page-types Inode: 2229277 Size: 89065 (22 pages)
    Modify: Tue Feb 25 12:00:59 2014 (162 seconds ago)
    Access: Tue Feb 25 12:01:00 2014 (161 seconds ago)
    0 3cbf3b __RU_lA____M________________________
    1 38946a __RU_lA____M________________________
    2 1a3cec __RU_lA____M________________________
    3 1a8321 __RU_lA____M________________________
    4 3af7cc __RU_lA____M________________________
    5 1ed532 __RU_lA_____________________________
    6 2e436a __RU_lA_____________________________
    7 29a35e ___U_lA_____________________________
    8 2de86e ___U_lA_____________________________
    9 3bdfb4 ___U_lA_____________________________
    10 3cd8a3 ___U_lA_____________________________
    11 2afa50 ___U_lA_____________________________
    12 2534c2 ___U_lA_____________________________
    13 1b7a40 ___U_lA_____________________________
    14 17b0be ___U_lA_____________________________
    15 392b0c ___U_lA_____________________________
    16 3ba46a __RU_lA_____________________________
    17 397dc8 ___U_lA_____________________________
    18 1f2a36 ___U_lA_____________________________
    19 21fd30 __RU_lA_____________________________
    20 2c35ba __RU_l______________________________
    21 20f181 __RU_l______________________________

    flags page-count MB symbolic-flags long-symbolic-flags
    0x000000000000002c 2 0 __RU_l______________________________ referenced,uptodate,lru
    0x0000000000000068 11 0 ___U_lA_____________________________ uptodate,lru,active
    0x000000000000006c 4 0 __RU_lA_____________________________ referenced,uptodate,lru,active
    0x000000000000086c 5 0 __RU_lA____M________________________ referenced,uptodate,lru,active,mmap
    total 22 0

    ~/src/linux/tools/vm$ sudo ./page-types -f /
    flags page-count MB symbolic-flags long-symbolic-flags
    0x0000000000000028 21761 85 ___U_l______________________________ uptodate,lru
    0x000000000000002c 127279 497 __RU_l______________________________ referenced,uptodate,lru
    0x0000000000000068 74160 289 ___U_lA_____________________________ uptodate,lru,active
    0x000000000000006c 84469 329 __RU_lA_____________________________ referenced,uptodate,lru,active
    0x000000000000007c 1 0 __RUDlA_____________________________ referenced,uptodate,dirty,lru,active
    0x0000000000000228 370 1 ___U_l___I__________________________ uptodate,lru,reclaim
    0x0000000000000828 49 0 ___U_l_____M________________________ uptodate,lru,mmap
    0x000000000000082c 126 0 __RU_l_____M________________________ referenced,uptodate,lru,mmap
    0x0000000000000868 137 0 ___U_lA____M________________________ uptodate,lru,active,mmap
    0x000000000000086c 12890 50 __RU_lA____M________________________ referenced,uptodate,lru,active,mmap
    total 321242 1254

    Signed-off-by: Konstantin Khlebnikov
    Cc: Arnaldo Carvalho de Melo
    Cc: Fengguang Wu
    Cc: Borislav Petkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • There's no reason to enable the split page table lock if we don't have
    page tables.

    It also triggers build error at least on ARM since we don't define
    pmd_page() for !MMU.

    In file included from arch/arm/kernel/asm-offsets.c:14:0:
    include/linux/mm.h: In function 'pte_lockptr':
    include/linux/mm.h:1392:2: error: implicit declaration of function 'pmd_page' [-Werror=implicit-function-declaration]
    include/linux/mm.h:1392:2: warning: passing argument 1 of 'ptlock_ptr' makes pointer from integer without a cast [enabled by default]
    include/linux/mm.h:1384:27: note: expected 'struct page *' but argument is of type 'int'

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Uwe Kleine-König
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • load_elf_binary() sets current->mm->def_flags = def_flags and def_flags
    is always zero. Not only does this look strange, it is also unnecessary
    because mm_init() has already set ->def_flags = 0.

    Signed-off-by: Alex Thorlton
    Suggested-by: Oleg Nesterov
    Cc: Gerald Schaefer
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Christian Borntraeger
    Cc: Paolo Bonzini
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Alexander Viro
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Paolo Bonzini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Thorlton
     
  • Add VM_INIT_DEF_MASK, to allow us to set the default flags for VMAs.
    Also add a prctl control which allows us to set the THP disable bit in
    mm->def_flags so that VMAs will pick up the setting as they are created.
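
    Once the prctl is in place, a process can opt itself out of THP like this
    (a minimal user-space example):

    #include <stdio.h>
    #include <sys/prctl.h>
    #include <linux/prctl.h>

    int main(void)
    {
            /* disable THP for this mm; inherited by VMAs created afterwards */
            if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0))
                    perror("PR_SET_THP_DISABLE");

            printf("THP disabled: %d\n", prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0));
            return 0;
    }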

    Signed-off-by: Alex Thorlton
    Suggested-by: Oleg Nesterov
    Cc: Gerald Schaefer
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Christian Borntraeger
    Cc: Paolo Bonzini
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Alexander Viro
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Paolo Bonzini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Thorlton
     
  • The main motivation behind this patch is to provide a way to disable THP
    for jobs where the code cannot be modified, and using a malloc hook with
    madvise is not an option (i.e. statically allocated data). This patch
    allows us to do just that, without affecting other jobs running on the
    system.

    We need to do this sort of thing for jobs where THP hurts performance,
    due to the possibility of increased remote memory accesses that can be
    created by situations such as the following:

    When you touch 1 byte of an untouched, contiguous 2MB chunk, a THP will
    be handed out, and the THP will be stuck on whatever node the chunk was
    originally referenced from. If many remote nodes need to do work on
    that same chunk, they'll be making remote accesses.

    With THP disabled, 4K pages can be handed out to separate nodes as
    they're needed, greatly reducing the amount of remote accesses to
    memory.

    This patch is based on some of my work combined with some
    suggestions/patches given by Oleg Nesterov. The main goal here is to
    add a prctl switch to allow us to disable THP on a per-mm_struct
    basis.

    Here's a bit of test data with the new patch in place...

    First with the flag unset:

    # perf stat -a ./prctl_wrapper_mmv3 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g
    Setting thp_disabled for this task...
    thp_disable: 0
    Set thp_disabled state to 0
    Process pid = 18027

    PF/
    MAX MIN TOTCPU/ TOT_PF/ TOT_PF/ WSEC/
    TYPE: CPUS WALL WALL SYS USER TOTCPU CPU WALL_SEC SYS_SEC CPU NODES
    512 1.120 0.060 0.000 0.110 0.110 0.000 28571428864 -9223372036854775808 55803572 23

    Performance counter stats for './prctl_wrapper_mmv3_hack 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g':

    273719072.841402 task-clock # 641.026 CPUs utilized [100.00%]
    1,008,986 context-switches # 0.000 M/sec [100.00%]
    7,717 CPU-migrations # 0.000 M/sec [100.00%]
    1,698,932 page-faults # 0.000 M/sec
    355,222,544,890,379 cycles # 1.298 GHz [100.00%]
    536,445,412,234,588 stalled-cycles-frontend # 151.02% frontend cycles idle [100.00%]
    409,110,531,310,223 stalled-cycles-backend # 115.17% backend cycles idle [100.00%]
    148,286,797,266,411 instructions # 0.42 insns per cycle
    # 3.62 stalled cycles per insn [100.00%]
    27,061,793,159,503 branches # 98.867 M/sec [100.00%]
    1,188,655,196 branch-misses # 0.00% of all branches

    427.001706337 seconds time elapsed

    Now with the flag set:

    # perf stat -a ./prctl_wrapper_mmv3 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g
    Setting thp_disabled for this task...
    thp_disable: 1
    Set thp_disabled state to 1
    Process pid = 144957

    PF/
    MAX MIN TOTCPU/ TOT_PF/ TOT_PF/ WSEC/
    TYPE: CPUS WALL WALL SYS USER TOTCPU CPU WALL_SEC SYS_SEC CPU NODES
    512 0.620 0.260 0.250 0.320 0.570 0.001 51612901376 128000000000 100806448 23

    Performance counter stats for './prctl_wrapper_mmv3_hack 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g':

    138789390.540183 task-clock # 641.959 CPUs utilized [100.00%]
    534,205 context-switches # 0.000 M/sec [100.00%]
    4,595 CPU-migrations # 0.000 M/sec [100.00%]
    63,133,119 page-faults # 0.000 M/sec
    147,977,747,269,768 cycles # 1.066 GHz [100.00%]
    200,524,196,493,108 stalled-cycles-frontend # 135.51% frontend cycles idle [100.00%]
    105,175,163,716,388 stalled-cycles-backend # 71.07% backend cycles idle [100.00%]
    180,916,213,503,160 instructions # 1.22 insns per cycle
    # 1.11 stalled cycles per insn [100.00%]
    26,999,511,005,868 branches # 194.536 M/sec [100.00%]
    714,066,351 branch-misses # 0.00% of all branches

    216.196778807 seconds time elapsed

    As with previous versions of the patch, we're getting about a 2x
    performance increase here. Here's a link to the test case I used, along
    with the little wrapper to activate the flag:

    http://oss.sgi.com/projects/memtests/thp_pthread_mmprctlv3.tar.gz

    This patch (of 3):

    Revert commit 8e72033f2a48 and add in code to fix up any issues caused
    by the revert.

    The revert is necessary because hugepage_madvise would return -EINVAL
    when VM_NOHUGEPAGE is set, which will break subsequent chunks of this
    patch set.

    Here's a snip of an e-mail from Gerald detailing the original purpose of
    this code, and providing justification for the revert:

    "The intent of commit 8e72033f2a48 was to guard against any future
    programming errors that may result in an madvice(MADV_HUGEPAGE) on
    guest mappings, which would crash the kernel.

    Martin suggested adding the bit to arch/s390/mm/pgtable.c, if
    8e72033f2a48 was to be reverted, because that check will also prevent
    a kernel crash in the case described above, it will now send a
    SIGSEGV instead.

    This would now also allow to do the madvise on other parts, if
    needed, so it is a more flexible approach. One could also say that
    it would have been better to do it this way right from the
    beginning..."

    Signed-off-by: Alex Thorlton
    Suggested-by: Oleg Nesterov
    Tested-by: Christian Borntraeger
    Cc: Gerald Schaefer
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Paolo Bonzini
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Johannes Weiner
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Thorlton
     
  • This is just a cleanup to reduce code size and improve readability;
    there is no functional change.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • isolation_suitable() and migrate_async_suitable() are used to make sure
    that a pageblock range is fine to be migrated. There is no need to call
    them on every page. The current code does fine when the pageblock is
    not suitable, but not when it is suitable:

    1) It re-checks isolation_suitable() on each page of a pageblock that was
    already established as suitable.
    2) It re-checks migrate_async_suitable() on each page of a pageblock that
    was not entered through the next_pageblock: label, because
    last_pageblock_nr is not otherwise updated.

    This patch fixes the situation by 1) calling isolation_suitable() only
    once per pageblock and 2) always updating last_pageblock_nr to the
    pageblock that was just checked.

    Additionally, move the PageBuddy() check after the pageblock unit check,
    since the pageblock check is the first thing we should do and this makes
    things simpler.

    [vbabka@suse.cz: rephrase commit description]
    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • It is odd to drop the spinlock when we scan the (SWAP_CLUSTER_MAX - 1)th
    pfn page. This may result in the situation below while isolating
    migratepages.

    1. Try to isolate pfn pages 0x0 ~ 0x200.
    2. When low_pfn is 0x1ff, ((low_pfn+1) % SWAP_CLUSTER_MAX) == 0, so drop
    the spinlock.
    3. Then, to complete the isolation, retry to acquire the lock.

    I think it is better to use the SWAP_CLUSTER_MAX'th pfn as the criterion
    for dropping the lock. This does no harm to pfn 0x0, because at that
    point the 'locked' variable would be false.
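
    The pattern in question, schematically (not the verbatim compaction loop):

    for (; low_pfn < end_pfn; low_pfn++) {
            /*
             * Drop the lru_lock periodically to give irqs a chance.  Checking
             * low_pfn % SWAP_CLUSTER_MAX (instead of (low_pfn + 1) %
             * SWAP_CLUSTER_MAX) means pfns 0x200, 0x400, ... trigger the drop;
             * at pfn 0x0 'locked' is still false, so nothing happens.
             */
            if (!(low_pfn % SWAP_CLUSTER_MAX) && locked) {
                    spin_unlock_irqrestore(&zone->lru_lock, flags);
                    locked = false;
            }
            /* ... isolate the page at low_pfn, re-taking the lock as needed ... */
    }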

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • suitable_migration_target() checks whether a pageblock is a suitable
    migration target. In isolate_freepages_block() it is called on every
    page, which is inefficient, so make it called once per pageblock.

    suitable_migration_target() also checks whether a page is high-order or
    not, but its criterion for high-order is the pageblock order, so calling
    it once within a pageblock range is not a problem.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • The purpose of compaction is to get a high-order page. Currently, if we
    find a high-order page while searching for migration target pages, we
    break it into order-0 pages and use them as migration targets. That is
    contrary to the purpose of compaction, so disallow high-order pages from
    being used as migration targets.

    Additionally, clean up the logic in suitable_migration_target() to
    simplify the code. There are no functional changes from this clean-up.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • We had a report about strange OOM killer strikes on a PPC machine
    although there was a lot of free swap and tons of anonymous memory
    which could have been swapped out. In the end it turned out that the
    OOM was a side effect of zone reclaim, which wasn't unmapping and
    swapping out, and so the system was pushed to OOM. Although this sounds
    like a bug somewhere in the kswapd vs. zone reclaim vs. direct reclaim
    interaction, numactl on the said hardware suggests that zone reclaim
    should not have been set in the first place:

    node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    node 0 size: 0 MB
    node 0 free: 0 MB
    node 2 cpus:
    node 2 size: 7168 MB
    node 2 free: 6019 MB
    node distances:
    node   0   2
      0:  10  40
      2:  40  10

    So all the CPUs are associated with Node0 which doesn't have any memory
    while Node2 contains all the available memory. Node distances cause an
    automatic zone_reclaim_mode enabling.

    Zone reclaim is intended to keep the allocations local, but this doesn't
    make any sense on memoryless nodes. So let's exclude such nodes from
    init_zone_allows_reclaim(), which evaluates zone reclaim behavior and
    suitable reclaim_nodes.

    Signed-off-by: Michal Hocko
    Acked-by: David Rientjes
    Acked-by: Nishanth Aravamudan
    Tested-by: Nishanth Aravamudan
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko