30 May, 2012

20 commits

  • Add a Kconfig option to allow people who don't want cross memory attach to
    not have it included in their build.

    Signed-off-by: Chris Yeoh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christopher Yeoh
     
  • mm->page_table_lock is hotly contested for page fault tests and isn't
    necessary to do mem_cgroup_uncharge_page() in do_huge_pmd_wp_page().

    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Reviewed-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Andrew pointed out that is_mlocked_vma() is misnamed: a function with
    a name like that would be expected to return a bool and have no
    side-effects.

    Since it is called on the fault path for a new page, rename it in this
    patch.

    Signed-off-by: Ying Han
    Reviewed-by: Rik van Riel
    Acked-by: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    [akpm@linux-foundation.org: s/mlock_vma_newpage/mlocked_vma_newpage/, per Minchan]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • The rmap walker checking page table references has historically ignored
    references from VMAs that were not part of the memcg that was being
    reclaimed during memcg hard limit reclaim.

    When transitioning global reclaim to memcg hierarchy reclaim, I missed
    that bit and now references from outside a memcg are ignored even during
    global reclaim.

    Reverting back to the traditional behaviour - count all references
    during global reclaim and only mind references of the memcg being
    reclaimed during limit reclaim - would be one option.

    However, the more generic idea is to ignore references exactly then when
    they are outside the hierarchy that is currently under reclaim; because
    only then will their reclamation be of any use to help the pressure
    situation. It makes no sense to ignore references from a sibling memcg
    and then evict a page that will be immediately refaulted by that sibling
    which contributes to the same usage of the common ancestor under
    reclaim.

    The solution: make the rmap walker ignore references from VMAs that are
    not part of the hierarchy that is being reclaimed.

    Flat limit reclaim will stay the same, hierarchical limit reclaim will
    mind the references only to pages that the hierarchy owns. Global
    reclaim, since it reclaims from all memcgs, will be fixed to regard all
    references.
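
    A minimal sketch of the resulting check, assuming the rmap walkers pass
    down the memcg under reclaim (the helper and its name are illustrative,
    not the exact upstream code):

        #include <linux/mm_types.h>
        #include <linux/memcontrol.h>

        /* Count a VMA's references unless its owning memcg lies outside
         * the hierarchy that is currently being reclaimed. */
        static bool vma_counts_for_reclaim(struct vm_area_struct *vma,
                                           struct mem_cgroup *target)
        {
                if (!target)            /* global reclaim: count everything */
                        return true;
                /* limit reclaim: the target memcg or any of its descendants */
                return mm_match_cgroup(vma->vm_mm, target);
        }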

    [akpm@linux-foundation.org: name the args in the declaration]
    Signed-off-by: Johannes Weiner
    Reported-by: Konstantin Khlebnikov
    Acked-by: Konstantin Khlebnikov
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Li Zefan
    Cc: Li Zefan
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Library functions should not grab locks when the callsites can do it,
    even if the lock nests like the rcu read-side lock does.

    Push the rcu_read_lock() from css_is_ancestor() to its single user,
    mem_cgroup_same_or_subtree() in preparation for another user that may
    already hold the rcu read-side lock.
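
    A rough sketch of the resulting pattern, with the caller taking the
    lock around the now-unlocked helper (simplified; the hierarchy checks
    in mm/memcontrol.c are more involved):

        /* css_is_ancestor() no longer takes rcu_read_lock() itself ... */
        static bool mem_cgroup_same_or_subtree(const struct mem_cgroup *root,
                                               struct mem_cgroup *memcg)
        {
                bool ret;

                /* ... its caller holds it around the ancestry walk instead */
                rcu_read_lock();
                ret = (memcg == root) ||
                      css_is_ancestor(&memcg->css, &root->css);
                rcu_read_unlock();
                return ret;
        }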

    Signed-off-by: Johannes Weiner
    Cc: Konstantin Khlebnikov
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Acked-by: Li Zefan
    Cc: Li Zefan
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • s/from_nodes/from/ and s/to_nodes/to/. The "_nodes" is redundant - it
    duplicates the argument's type.

    Done in a fit of irritation over 80-col issues :(

    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • While running an application that moves tasks from one cpuset to another
    I noticed that it takes much longer and moves many more pages than
    expected.

    The reason for this is do_migrate_pages() does its best to preserve the
    relative node differential from the first node of the cpuset because the
    application may have been written with that in mind. If memory was
    interleaved on the nodes of the source cpuset by an application
    do_migrate_pages() will try its best to maintain that interleaving on
    the nodes of the destination cpuset. This means copying the memory from
    all source nodes to the destination nodes even if the source and
    destination nodes overlap.

    This is a problem for userspace NUMA placement tools. The amount of
    time spent doing extra memory moves cancels out some of the NUMA
    performance improvements. Furthermore, if the number of source and
    destination nodes differs, it is impossible to maintain the previous
    interleaving layout anyway.

    This patch changes do_migrate_pages() to only preserve the relative
    layout inside the program if the number of NUMA nodes in the source and
    destination mask are the same. If the number is different, we do a much
    more efficient migration by not touching memory that is in an allowed
    node.

    This preserves the old behaviour for programs that want it, while
    allowing a userspace NUMA placement tool to use the new, faster
    migration. This improves performance in our tests by up to a factor of
    7.
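
    A condensed sketch of the new rule in do_migrate_pages() (local names
    are assumptions; the real source-node selection loop is more involved):

        #include <linux/nodemask.h>

        /* Decide whether memory on source node 's' should be left alone. */
        static bool skip_source_node(int s, const nodemask_t *from,
                                     const nodemask_t *to)
        {
                /* Equal node counts: preserve the relative interleaving
                 * layout, so never skip a source node. */
                if (nodes_weight(*from) == nodes_weight(*to))
                        return false;
                /* Different counts: the layout cannot be preserved anyway,
                 * so do not touch memory already on an allowed node. */
                return node_isset(s, *to);
        }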

    Without this change, when migrating tasks from a cpuset containing
    nodes 0-7 to a cpuset containing nodes 3-4, we migrate from ALL the
    nodes, even if they are in both the source and destination nodesets:

    Migrating 7 to 4
    Migrating 6 to 3
    Migrating 5 to 4
    Migrating 4 to 3
    Migrating 1 to 4
    Migrating 3 to 4
    Migrating 0 to 3
    Migrating 2 to 3

    With this change we only migrate from nodes that are not in the
    destination nodesets:

    Migrating 7 to 4
    Migrating 6 to 3
    Migrating 5 to 4
    Migrating 2 to 3
    Migrating 1 to 4
    Migrating 0 to 3

    Yet if we move from a cpuset containing nodes 2,3,4 to a cpuset
    containing 3,4,5 we still do move everything so that we preserve the
    desired NUMA offsets:

    Migrating 4 to 5
    Migrating 3 to 4
    Migrating 2 to 3

    As far as performance is concerned this simple patch improves the time
    it takes to move 14, 20 and 26 large tasks from a cpuset containing
    nodes 0-7 to a cpuset containing nodes 1 & 3 by up to a factor of 7.
    Here are the timings with and without the patch:

    BEFORE PATCH -- Move times: 59, 140, 651 seconds
    ============

    Moving 14 tasks from nodes (0-7) to nodes (1,3)
    numad(8780) do_migrate_pages (mm=0xffff88081d414400
    from_nodes=0xffff880818c81d28 to_nodes=0xffff880818c81ce8 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x7 dest=0x3 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x6 dest=0x1 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x5 dest=0x3 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x4 dest=0x1 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x2 dest=0x1 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x1 dest=0x3 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x0 dest=0x1 flags=0x4)
    (Above moves repeated for each of the 14 tasks...)
    PID 8890 moved to node(s) 1,3 in 59.2 seconds

    Moving 20 tasks from nodes (0-7) to nodes (1,4-5)
    numad(8780) do_migrate_pages (mm=0xffff88081d88c700
    from_nodes=0xffff880818c81d28 to_nodes=0xffff880818c81ce8 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x7 dest=0x4 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x6 dest=0x1 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x3 dest=0x1 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x2 dest=0x5 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x1 dest=0x4 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x0 dest=0x1 flags=0x4)
    (Above moves repeated for each of the 20 tasks...)
    PID 8962 moved to node(s) 1,4-5 in 139.88 seconds

    Moving 26 tasks from nodes (0-7) to nodes (1-3,5)
    numad(8780) do_migrate_pages (mm=0xffff88081d5bc740
    from_nodes=0xffff880818c81d28 to_nodes=0xffff880818c81ce8 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x7 dest=0x5 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x6 dest=0x3 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x5 dest=0x2 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x3 dest=0x5 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x2 dest=0x3 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x1 dest=0x2 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x0 dest=0x1 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x4 dest=0x1 flags=0x4)
    (Above moves repeated for each of the 26 tasks...)
    PID 9058 moved to node(s) 1-3,5 in 651.45 seconds

    AFTER PATCH -- Move times: 42, 56, 93 seconds
    ===========

    Moving 14 tasks from nodes (0-7) to nodes (5,7)
    numad(33209) do_migrate_pages (mm=0xffff88101d5ff140
    from_nodes=0xffff88101e7b5d28 to_nodes=0xffff88101e7b5ce8 flags=0x4)
    numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x6 dest=0x5 flags=0x4)
    numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x4 dest=0x5 flags=0x4)
    numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x3 dest=0x7 flags=0x4)
    numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x2 dest=0x5 flags=0x4)
    numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x1 dest=0x7 flags=0x4)
    numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x0 dest=0x5 flags=0x4)
    (Above moves repeated for each of the 14 tasks...)
    PID 33221 moved to node(s) 5,7 in 41.67 seconds

    Moving 20 tasks from nodes (0-7) to nodes (1,3,5)
    numad(33209) do_migrate_pages (mm=0xffff88101d6c37c0
    from_nodes=0xffff88101e7b5d28 to_nodes=0xffff88101e7b5ce8 flags=0x4)
    numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x7 dest=0x3 flags=0x4)
    numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x6 dest=0x1 flags=0x4)
    numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x4 dest=0x3 flags=0x4)
    numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x2 dest=0x5 flags=0x4)
    numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x0 dest=0x1 flags=0x4)
    (Above moves repeated for each of the 20 tasks...)
    PID 33289 moved to node(s) 1,3,5 in 56.3 seconds

    Moving 26 tasks from nodes (0-7) to nodes (1,3,5,7)
    numad(33209) do_migrate_pages (mm=0xffff88101d924400
    from_nodes=0xffff88101e7b5d28 to_nodes=0xffff88101e7b5ce8 flags=0x4)
    numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x6 dest=0x5 flags=0x4)
    numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x4 dest=0x1 flags=0x4)
    numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x2 dest=0x5 flags=0x4)
    numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x0 dest=0x1 flags=0x4)
    (Above moves repeated for each of the 26 tasks...)
    PID 33372 moved to node(s) 1,3,5,7 in 92.67 seconds

    [akpm@linux-foundation.org: clean up comment layout]
    Signed-off-by: Larry Woodman
    Cc: KAMEZAWA Hiroyuki
    Acked-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Larry Woodman
     
  • On COW, a new hugepage is allocated and charged to the memcg. If the
    system is oom or the charge to the memcg fails, however, the fault
    handler will return VM_FAULT_OOM which results in an oom kill.

    Instead, it's possible to fall back to splitting the hugepage so that
    the COW results only in an order-0 page being allocated and charged to
    the memcg, which has a higher likelihood of succeeding. This is expensive
    because the hugepage must be split in the page fault handler, but it is
    much better than unnecessarily oom killing a process.
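
    A heavily simplified sketch of the fallback; the real change is in the
    huge-page COW handler in mm/huge_memory.c, and the helpers below are
    hypothetical:

        static int thp_cow(struct vm_area_struct *vma, struct page *hpage)
        {
                /* hypothetical: allocate and memcg-charge the huge copy */
                struct page *new_hpage = thp_alloc_and_charge(vma);

                if (likely(new_hpage))
                        return thp_copy_and_map(vma, hpage, new_hpage); /* hypothetical */

                /* Allocation or charge failed.  Instead of returning
                 * VM_FAULT_OOM (which invokes the oom killer), split the
                 * original hugepage so the COW is redone with order-0
                 * pages, which is far more likely to succeed. */
                split_huge_page(hpage);
                return 0;       /* assumed: the fault is retried at pte level */
        }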

    Signed-off-by: David Rientjes
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Remove debugfs files and directory on failure. Since no one is using
    "extfrag_debug_root" dentry outside of extfrag_debug_init(), make it
    local to the function.

    Signed-off-by: Sasikantha babu <sasikanth.v19@gmail.com>
    Acked-by: David Rientjes <rientjes@google.com>
    Acked-by: Mel Gorman <mel@csn.ul.ie>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Sasikantha babu
     
  • The "if (mm)" check is not required in find_vma, as the kernel code
    calls find_vma only when it is absolutely sure that the mm_struct arg to
    it is non-NULL.

    Remove the if (mm) check and add a WARN_ONCE(!mm) for now. This will
    serve the purpose of mandating that the execution context
    (user-mode/kernel-mode) be known before find_vma is called. Also fix 2
    checkpatch.pl errors in the declaration of the rb_node and vma_tmp
    local variables.

    I was browsing through the internet and read a discussion at
    https://lkml.org/lkml/2012/3/27/342 which discusses removal of the
    validation check within find_vma. Since no-one responded, I decided to
    send this patch with Andrew's suggestions.
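
    A simplified sketch of the result (the rbtree walk is elided, and
    whether the WARN_ONCE() also bails out early is an assumption here):

        struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
        {
                struct vm_area_struct *vma;

                /* Remove me later: callers must always pass a valid mm. */
                if (WARN_ONCE(!mm, "find_vma() called with no mm"))
                        return NULL;

                /* Check the per-mm cache first. */
                vma = mm->mmap_cache;
                if (vma && vma->vm_end > addr && vma->vm_start <= addr)
                        return vma;

                /* ... otherwise walk mm->mm_rb and refresh mmap_cache ... */
                return vma;
        }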

    [akpm@linux-foundation.org: add remove-me comment]
    Signed-off-by: Rajman Mekaco
    Cc: Kautuk Consul
    Cc: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rajman Mekaco
     
  • The advantage of kcalloc is that it will prevent integer overflows that
    could result from the multiplication of the number of elements and the
    element size, and it is also a bit nicer to read.

    The semantic patch that makes this change is available in
    https://lkml.org/lkml/2011/11/25/107
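
    An illustrative before/after of the call sites this rewrites (the
    struct name is made up):

        struct item *buf;

        /* before: the size multiplication can silently overflow */
        buf = kzalloc(count * sizeof(*buf), GFP_KERNEL);

        /* after: kcalloc checks count * sizeof(*buf) for overflow and
         * returns NULL rather than allocating an undersized buffer */
        buf = kcalloc(count, sizeof(*buf), GFP_KERNEL);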

    Signed-off-by: Thomas Meyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Meyer
     
  • There is little motivation for reclaim_mode_t once RECLAIM_MODE_[A]SYNC
    and lumpy reclaim have been removed. This patch gets rid of
    reclaim_mode_t as well and improves the documentation about what
    reclaim/compaction is and when it is triggered.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: KOSAKI Motohiro
    Cc: Konstantin Khlebnikov
    Cc: Hugh Dickins
    Cc: Ying Han
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patch stops reclaim/compaction entering sync reclaim as this was
    only intended for lumpy reclaim and an oversight. Page migration has
    its own logic for stalling on writeback pages if necessary and memory
    compaction is already using it.

    Waiting on page writeback is bad for a number of reasons but the primary
    one is that waiting on writeback to a slow device like USB can take a
    considerable length of time. Page reclaim instead uses
    wait_iff_congested() to throttle if too many dirty pages are being
    scanned.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: KOSAKI Motohiro
    Cc: Konstantin Khlebnikov
    Cc: Hugh Dickins
    Cc: Ying Han
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This series removes lumpy reclaim and some stalling logic that was
    unintentionally being used by memory compaction. The end result is that
    stalling on dirty pages during page reclaim now depends on
    wait_iff_congested().

    Four kernels were compared

    3.3.0 vanilla
    3.4.0-rc2 vanilla
    3.4.0-rc2 lumpyremove-v2 is patch one from this series
    3.4.0-rc2 nosync-v2r3 is the full series

    Removing lumpy reclaim saves almost 900 bytes of text whereas the full
    series removes 1200 bytes.

    text data bss dec hex filename
    6740375 1927944 2260992 10929311 a6c49f vmlinux-3.4.0-rc2-vanilla
    6739479 1927944 2260992 10928415 a6c11f vmlinux-3.4.0-rc2-lumpyremove-v2
    6739159 1927944 2260992 10928095 a6bfdf vmlinux-3.4.0-rc2-nosync-v2

    There are behaviour changes in the series and so tests were run with
    monitoring of ftrace events. This disrupts results so the performance
    results are distorted but the new behaviour should be clearer.

    fs-mark running in a threaded configuration showed little of interest as
    it did not push reclaim aggressively

    FS-Mark Multi Threaded
    3.3.0-vanilla rc2-vanilla lumpyremove-v2r3 nosync-v2r3
    Files/s min 3.20 ( 0.00%) 3.20 ( 0.00%) 3.20 ( 0.00%) 3.20 ( 0.00%)
    Files/s mean 3.20 ( 0.00%) 3.20 ( 0.00%) 3.20 ( 0.00%) 3.20 ( 0.00%)
    Files/s stddev 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
    Files/s max 3.20 ( 0.00%) 3.20 ( 0.00%) 3.20 ( 0.00%) 3.20 ( 0.00%)
    Overhead min 508667.00 ( 0.00%) 521350.00 (-2.49%) 544292.00 (-7.00%) 547168.00 (-7.57%)
    Overhead mean 551185.00 ( 0.00%) 652690.73 (-18.42%) 991208.40 (-79.83%) 570130.53 (-3.44%)
    Overhead stddev 18200.69 ( 0.00%) 331958.29 (-1723.88%) 1579579.43 (-8578.68%) 9576.81 (47.38%)
    Overhead max 576775.00 ( 0.00%) 1846634.00 (-220.17%) 6901055.00 (-1096.49%) 585675.00 (-1.54%)
    MMTests Statistics: duration
    Sys Time Running Test (seconds) 309.90 300.95 307.33 298.95
    User+Sys Time Running Test (seconds) 319.32 309.67 315.69 307.51
    Total Elapsed Time (seconds) 1187.85 1193.09 1191.98 1193.73

    MMTests Statistics: vmstat
    Page Ins 80532 82212 81420 79480
    Page Outs 111434984 111456240 111437376 111582628
    Swap Ins 0 0 0 0
    Swap Outs 0 0 0 0
    Direct pages scanned 44881 27889 27453 34843
    Kswapd pages scanned 25841428 25860774 25861233 25843212
    Kswapd pages reclaimed 25841393 25860741 25861199 25843179
    Direct pages reclaimed 44881 27889 27453 34843
    Kswapd efficiency 99% 99% 99% 99%
    Kswapd velocity 21754.791 21675.460 21696.029 21649.127
    Direct efficiency 100% 100% 100% 100%
    Direct velocity 37.783 23.375 23.031 29.188
    Percentage direct scans 0% 0% 0% 0%

    ftrace showed that there was no stalling on writeback or pages submitted
    for IO from reclaim context.

    postmark was similar and while it was more interesting, it also did not
    push reclaim heavily.

    POSTMARK
    3.3.0-vanilla rc2-vanilla lumpyremove-v2r3 nosync-v2r3
    Transactions per second: 16.00 ( 0.00%) 20.00 (25.00%) 18.00 (12.50%) 17.00 ( 6.25%)
    Data megabytes read per second: 18.80 ( 0.00%) 24.27 (29.10%) 22.26 (18.40%) 20.54 ( 9.26%)
    Data megabytes written per second: 35.83 ( 0.00%) 46.25 (29.08%) 42.42 (18.39%) 39.14 ( 9.24%)
    Files created alone per second: 28.00 ( 0.00%) 38.00 (35.71%) 34.00 (21.43%) 30.00 ( 7.14%)
    Files create/transact per second: 8.00 ( 0.00%) 10.00 (25.00%) 9.00 (12.50%) 8.00 ( 0.00%)
    Files deleted alone per second: 556.00 ( 0.00%) 1224.00 (120.14%) 3062.00 (450.72%) 6124.00 (1001.44%)
    Files delete/transact per second: 8.00 ( 0.00%) 10.00 (25.00%) 9.00 (12.50%) 8.00 ( 0.00%)

    MMTests Statistics: duration
    Sys Time Running Test (seconds) 113.34 107.99 109.73 108.72
    User+Sys Time Running Test (seconds) 145.51 139.81 143.32 143.55
    Total Elapsed Time (seconds) 1159.16 899.23 980.17 1062.27

    MMTests Statistics: vmstat
    Page Ins 13710192 13729032 13727944 13760136
    Page Outs 43071140 42987228 42733684 42931624
    Swap Ins 0 0 0 0
    Swap Outs 0 0 0 0
    Direct pages scanned 0 0 0 0
    Kswapd pages scanned 9941613 9937443 9939085 9929154
    Kswapd pages reclaimed 9940926 9936751 9938397 9928465
    Direct pages reclaimed 0 0 0 0
    Kswapd efficiency 99% 99% 99% 99%
    Kswapd velocity 8576.567 11051.058 10140.164 9347.109
    Direct efficiency 100% 100% 100% 100%
    Direct velocity 0.000 0.000 0.000 0.000

    It looks like here that the full series regresses performance but as
    ftrace showed no usage of wait_iff_congested() or sync reclaim I am
    assuming it's a disruption due to monitoring. Other data such as memory
    usage, page IO, swap IO all looked similar.

    Running a benchmark with a plain DD showed nothing very interesting.
    The full series stalled in wait_iff_congested() slightly less but stall
    times on vanilla kernels were marginal.

    Running a benchmark that hammered on file-backed mappings showed stalls
    due to congestion but not in sync writebacks

    MICRO
    3.3.0-vanilla rc2-vanilla lumpyremove-v2r3 nosync-v2r3
    MMTests Statistics: duration
    Sys Time Running Test (seconds) 308.13 294.50 298.75 299.53
    User+Sys Time Running Test (seconds) 330.45 316.28 318.93 320.79
    Total Elapsed Time (seconds) 1814.90 1833.88 1821.14 1832.91

    MMTests Statistics: vmstat
    Page Ins 108712 120708 97224 110344
    Page Outs 155514576 156017404 155813676 156193256
    Swap Ins 0 0 0 0
    Swap Outs 0 0 0 0
    Direct pages scanned 2599253 1550480 2512822 2414760
    Kswapd pages scanned 69742364 71150694 68839041 69692533
    Kswapd pages reclaimed 34824488 34773341 34796602 34799396
    Direct pages reclaimed 53693 94750 61792 75205
    Kswapd efficiency 49% 48% 50% 49%
    Kswapd velocity 38427.662 38797.901 37799.972 38022.889
    Direct efficiency 2% 6% 2% 3%
    Direct velocity 1432.174 845.464 1379.807 1317.446
    Percentage direct scans 3% 2% 3% 3%
    Page writes by reclaim 0 0 0 0
    Page writes file 0 0 0 0
    Page writes anon 0 0 0 0
    Page reclaim immediate 0 0 0 1218
    Page rescued immediate 0 0 0 0
    Slabs scanned 15360 16384 13312 16384
    Direct inode steals 0 0 0 0
    Kswapd inode steals 4340 4327 1630 4323

    FTrace Reclaim Statistics: congestion_wait
    Direct number congest waited 0 0 0 0
    Direct time congest waited 0ms 0ms 0ms 0ms
    Direct full congest waited 0 0 0 0
    Direct number conditional waited 900 870 754 789
    Direct time conditional waited 0ms 0ms 0ms 20ms
    Direct full conditional waited 0 0 0 0
    KSwapd number congest waited 2106 2308 2116 1915
    KSwapd time congest waited 139924ms 157832ms 125652ms 132516ms
    KSwapd full congest waited 1346 1530 1202 1278
    KSwapd number conditional waited 12922 16320 10943 14670
    KSwapd time conditional waited 0ms 0ms 0ms 0ms
    KSwapd full conditional waited 0 0 0 0

    Reclaim statistics are not radically changed. The stall times in kswapd
    are massive but it is clear that it is due to calls to congestion_wait()
    and that is almost certainly the call in balance_pgdat(). Otherwise
    stalls due to dirty pages are non-existent.

    I ran a benchmark that stressed high-order allocation. This is a very
    artificial load but was used in the past to evaluate lumpy reclaim and
    compaction. Generally I look at allocation success rates and latency
    figures.

    STRESS-HIGHALLOC
    3.3.0-vanilla rc2-vanilla lumpyremove-v2r3 nosync-v2r3
    Pass 1 81.00 ( 0.00%) 28.00 (-53.00%) 24.00 (-57.00%) 28.00 (-53.00%)
    Pass 2 82.00 ( 0.00%) 39.00 (-43.00%) 38.00 (-44.00%) 43.00 (-39.00%)
    while Rested 88.00 ( 0.00%) 87.00 (-1.00%) 88.00 ( 0.00%) 88.00 ( 0.00%)

    MMTests Statistics: duration
    Sys Time Running Test (seconds) 740.93 681.42 685.14 684.87
    User+Sys Time Running Test (seconds) 2922.65 3269.52 3281.35 3279.44
    Total Elapsed Time (seconds) 1161.73 1152.49 1159.55 1161.44

    MMTests Statistics: vmstat
    Page Ins 4486020 2807256 2855944 2876244
    Page Outs 7261600 7973688 7975320 7986120
    Swap Ins 31694 0 0 0
    Swap Outs 98179 0 0 0
    Direct pages scanned 53494 57731 34406 113015
    Kswapd pages scanned 6271173 1287481 1278174 1219095
    Kswapd pages reclaimed 2029240 1281025 1260708 1201583
    Direct pages reclaimed 1468 14564 16649 92456
    Kswapd efficiency 32% 99% 98% 98%
    Kswapd velocity 5398.133 1117.130 1102.302 1049.641
    Direct efficiency 2% 25% 48% 81%
    Direct velocity 46.047 50.092 29.672 97.306
    Percentage direct scans 0% 4% 2% 8%
    Page writes by reclaim 1616049 0 0 0
    Page writes file 1517870 0 0 0
    Page writes anon 98179 0 0 0
    Page reclaim immediate 103778 27339 9796 17831
    Page rescued immediate 0 0 0 0
    Slabs scanned 1096704 986112 980992 998400
    Direct inode steals 223 215040 216736 247881
    Kswapd inode steals 175331 61548 68444 63066
    Kswapd skipped wait 21991 0 1 0
    THP fault alloc 1 135 125 134
    THP collapse alloc 393 311 228 236
    THP splits 25 13 7 8
    THP fault fallback 0 0 0 0
    THP collapse fail 3 5 7 7
    Compaction stalls 865 1270 1422 1518
    Compaction success 370 401 353 383
    Compaction failures 495 869 1069 1135
    Compaction pages moved 870155 3828868 4036106 4423626
    Compaction move failure 26429 23865 29742 27514

    Success rates are completely hosed for 3.4-rc2 which is almost certainly
    due to commit fe2c2a106663 ("vmscan: reclaim at order 0 when compaction
    is enabled"). I expected this would happen for kswapd and impair
    allocation success rates (https://lkml.org/lkml/2012/1/25/166) but I did
    not anticipate this much of a difference: 80% less scanning, 37% less
    reclaim by kswapd.

    In comparison, reclaim/compaction is not aggressive and gives up easily
    which is the intended behaviour. hugetlbfs uses __GFP_REPEAT and would
    be much more aggressive about reclaim/compaction than THP allocations
    are. The stress test above is allocating like neither THP nor hugetlbfs
    but is much closer to THP.

    Mainline is now impaired in terms of high order allocation under heavy
    load although I do not know to what degree as I did not test with
    __GFP_REPEAT. Keep this in mind for bugs related to hugepage pool
    resizing, THP allocation and high order atomic allocation failures from
    network devices.

    In terms of congestion throttling, I see the following for this test

    FTrace Reclaim Statistics: congestion_wait
    Direct number congest waited 3 0 0 0
    Direct time congest waited 0ms 0ms 0ms 0ms
    Direct full congest waited 0 0 0 0
    Direct number conditional waited 957 512 1081 1075
    Direct time conditional waited 0ms 0ms 0ms 0ms
    Direct full conditional waited 0 0 0 0
    KSwapd number congest waited 36 4 3 5
    KSwapd time congest waited 3148ms 400ms 300ms 500ms
    KSwapd full congest waited 30 4 3 5
    KSwapd number conditional waited 88514 197 332 542
    KSwapd time conditional waited 4980ms 0ms 0ms 0ms
    KSwapd full conditional waited 49 0 0 0

    The "conditional waited" times are the most interesting as this is
    directly impacted by the number of dirty pages encountered during scan.
    As lumpy reclaim is no longer scanning contiguous ranges, it is finding
    fewer dirty pages. This brings wait times from about 5 seconds to 0.
    kswapd itself is still calling congestion_wait() so it'll still stall but
    it's a lot less.

    In terms of the type of IO we were doing, I see this

    FTrace Reclaim Statistics: mm_vmscan_writepage
    Direct writes anon sync 0 0 0 0
    Direct writes anon async 0 0 0 0
    Direct writes file sync 0 0 0 0
    Direct writes file async 0 0 0 0
    Direct writes mixed sync 0 0 0 0
    Direct writes mixed async 0 0 0 0
    KSwapd writes anon sync 0 0 0 0
    KSwapd writes anon async 91682 0 0 0
    KSwapd writes file sync 0 0 0 0
    KSwapd writes file async 822629 0 0 0
    KSwapd writes mixed sync 0 0 0 0
    KSwapd writes mixed async 0 0 0 0

    In 3.2, kswapd was doing a bunch of async writes of pages but
    reclaim/compaction was never reaching a point where it was doing sync
    IO. This does not guarantee that reclaim/compaction was not calling
    wait_on_page_writeback() but I would consider it unlikely. It indicates
    that merging patches 2 and 3 to stop reclaim/compaction calling
    wait_on_page_writeback() should be safe.

    This patch:

    Lumpy reclaim had a purpose but in the mind of some, it was to kick the
    system so hard it thrashed. For others the purpose was to complicate
    vmscan.c. Over time it was giving softer shoes and a nicer attitude but
    memory compaction needs to step up and replace it so this patch sends
    lumpy reclaim to the farm.

    The tracepoint format changes for isolating LRU pages with this patch
    applied. Furthermore reclaim/compaction can no longer queue dirty pages
    in pageout() if the underlying BDI is congested. Lumpy reclaim used
    this logic and reclaim/compaction was using it in error.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: KOSAKI Motohiro
    Cc: Konstantin Khlebnikov
    Cc: Hugh Dickins
    Cc: Ying Han
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The swap token code no longer fits in with the current VM model. It
    does not play well with cgroups or the better NUMA placement code in
    development, since we have only one swap token globally.

    It also has the potential to mess with scalability of the system, by
    increasing the number of non-reclaimable pages on the active and
    inactive anon LRU lists.

    Last but not least, the swap token code has been broken for a year
    without complaints, as reported by Konstantin Khlebnikov. This suggests
    we no longer have much use for it.

    The days of sub-1G memory systems with heavy use of swap are over. If
    we ever need thrashing reducing code in the future, we will have to
    implement something that does scale.

    Signed-off-by: Rik van Riel
    Cc: Konstantin Khlebnikov
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Acked-by: Bob Picco
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • The transparent hugepages feature is careful to not invoke the oom
    killer when a hugepage cannot be allocated.

    pte_alloc_one() failing in __do_huge_pmd_anonymous_page(), however,
    currently results in VM_FAULT_OOM which invokes the pagefault oom killer
    to kill a memory-hogging task.

    This is unnecessary since it's possible to drop the reference to the
    hugepage and fall back to allocating a small page.

    Signed-off-by: David Rientjes
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The "ret" variable is unnecessary in __do_huge_pmd_anonymous_page(), so
    remove it.

    Signed-off-by: David Rientjes
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The arguments f & t and fields from & to of struct file_region are
    defined as long. So use long instead of int to type the temp vars.

    Signed-off-by: Wang Sheng-Hui
    Acked-by: David Rientjes
    Acked-by: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Sheng-Hui
     
  • We have enum definition in mempolicy.h: MPOL_REBIND_ONCE. It should
    replace the magic number 0 for step comparison in function
    mpol_rebind_policy.

    Signed-off-by: Wang Sheng-Hui
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Sheng-Hui
     
  • These things tend to get out of sync with time so let the compiler
    automatically enter the current function name using __func__.

    No functional change.
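
    An illustrative example of the conversion (the message text is made up):

        /* before: the hard-coded name goes stale if the function is renamed */
        pr_err("setup_thing: allocation failed\n");

        /* after: the compiler substitutes the enclosing function's name */
        pr_err("%s: allocation failed\n", __func__);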

    Signed-off-by: Borislav Petkov
    Acked-by: Andi Kleen
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Borislav Petkov
     

29 May, 2012

1 commit

  • Pull writeback tree from Wu Fengguang:
    "Mainly from Jan Kara to avoid iput() in the flusher threads."

    * tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: Avoid iput() from flusher thread
    vfs: Rename end_writeback() to clear_inode()
    vfs: Move waiting for inode writeback from end_writeback() to evict_inode()
    writeback: Refactor writeback_single_inode()
    writeback: Remove wb->list_lock from writeback_single_inode()
    writeback: Separate inode requeueing after writeback
    writeback: Move I_DIRTY_PAGES handling
    writeback: Move requeueing when I_SYNC set to writeback_sb_inodes()
    writeback: Move clearing of I_SYNC into inode_sync_complete()
    writeback: initialize global_dirty_limit
    fs: remove 8 bytes of padding from struct writeback_control on 64 bit builds
    mm: page-writeback.c: local functions should not be exposed globally

    Linus Torvalds
     

26 May, 2012

4 commits

  • Pull tile updates from Chris Metcalf:
    "These changes cover a range of new arch/tile features and
    optimizations. They've been through LKML review and on linux-next for
    a month or so. There's also one bug-fix that just missed 3.4, which
    I've marked for stable."

    Fixed up trivial conflict in arch/tile/Kconfig (new added tile Kconfig
    entries clashing with the generic timer/clockevents changes).

    * git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile:
    tile: default to tilegx_defconfig for ARCH=tile
    tile: fix bug where fls(0) was not returning 0
    arch/tile: mark TILEGX as not EXPERIMENTAL
    tile/mm/fault.c: Port OOM changes to handle_page_fault
    arch/tile: add descriptive text if the kernel reports a bad trap
    arch/tile: allow querying cpu module information from the hypervisor
    arch/tile: fix hardwall for tilegx and generalize for idn and ipi
    arch/tile: support multiple huge page sizes dynamically
    mm: add new arch_make_huge_pte() method for tile support
    arch/tile: support kexec() for tilegx
    arch/tile: support header for cacheflush() syscall
    arch/tile: Allow tilegx to build with either 16K or 64K page size
    arch/tile: optimize get_user/put_user and friends
    arch/tile: support building big-endian kernel
    arch/tile: allow building Linux with transparent huge pages enabled
    arch/tile: use interrupt critical sections less

    Linus Torvalds
     
  • The tile support for multiple-size huge pages requires tagging
    the hugetlb PTE with a "super" bit for PTEs that are multiples of
    the basic size of a pagetable span. To set that bit properly
    we need to tweak the PTE in make_huge_pte() based on the vma.

    This change provides the API for a subsequent tile-specific
    change to use.
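
    A sketch of the hook as described, assuming a no-op generic default that
    architectures override (the call sits in make_huge_pte() in
    mm/hugetlb.c):

        #ifndef arch_make_huge_pte
        static inline pte_t arch_make_huge_pte(pte_t entry,
                                               struct vm_area_struct *vma,
                                               struct page *page, int writable)
        {
                return entry;   /* generic default: leave the PTE untouched */
        }
        #endif

        /* ... and at the end of make_huge_pte(): */
        entry = arch_make_huge_pte(entry, vma, page, writable);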

    Reviewed-by: Hillf Danton
    Signed-off-by: Chris Metcalf

    Chris Metcalf
     
  • The change adds some infrastructure for managing tile pmd's more generally,
    using pte_pmd() and pmd_pte() methods to translate pmd values to and
    from ptes, since on TILEPro a pmd is really just a nested structure
    holding a pgd (aka pte). Several existing pmd methods are moved into
    this framework, and a whole raft of additional pmd accessors are defined
    that are used by the transparent hugepage framework.

    The tile PTE now has a "client2" bit. The bit is used to indicate a
    transparent huge page is in the process of being split into subpages.

    This change also fixes a generic bug where the return value of the
    generic pmdp_splitting_flush() was incorrect.

    Signed-off-by: Chris Metcalf

    Chris Metcalf
     
  • Pull CMA and ARM DMA-mapping updates from Marek Szyprowski:
    "These patches contain two major updates for DMA mapping subsystem
    (mainly for ARM architecture). First one is Contiguous Memory
    Allocator (CMA) which makes it possible for device drivers to allocate
    big contiguous chunks of memory after the system has booted.

    The main difference from similar frameworks is the fact that CMA
    allows the memory region reserved for the big chunk allocation to be
    transparently reused as system memory, so no memory is wasted when no
    big chunk is allocated. Once the alloc request is issued, the
    framework migrates system pages to create space for the required big
    chunk of physically contiguous memory.

    For more information one can refer to nice LWN articles:

    - 'A reworked contiguous memory allocator':
    http://lwn.net/Articles/447405/

    - 'CMA and ARM':
    http://lwn.net/Articles/450286/

    - 'A deep dive into CMA':
    http://lwn.net/Articles/486301/

    - and the following thread with the patches and links to all previous
    versions:
    https://lkml.org/lkml/2012/4/3/204

    The main client for this new framework is ARM DMA-mapping subsystem.

    The second part provides a complete redesign in ARM DMA-mapping
    subsystem. The core implementation has been changed to use common
    struct dma_map_ops based infrastructure with the recent updates for
    new dma attributes merged in v3.4-rc2. This allows the use of more than
    one implementation of dma-mapping calls and changing/selecting them on a
    per-struct-device basis. The first client of this new infrastructure is
    dmabounce implementation which has been completely cut out of the
    core, common code.

    The last patch of this redesign update introduces a new, experimental
    implementation of dma-mapping calls on top of generic IOMMU framework.
    This lets ARM sub-platforms transparently use an IOMMU for DMA-mapping
    calls if one provides required IOMMU hardware.

    For more information please refer to the following thread:
    http://www.spinics.net/lists/arm-kernel/msg175729.html

    The last patch merges changes from both updates and provides a
    resolution for the conflicts which cannot be avoided when patches have
    been applied on the same files (mainly arch/arm/mm/dma-mapping.c)."

    Acked by Andrew Morton:
    "Yup, this one please. It's had much work, plenty of review and I
    think even Russell is happy with it."

    * 'for-linus' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping: (28 commits)
    ARM: dma-mapping: use PMD size for section unmap
    cma: fix migration mode
    ARM: integrate CMA with DMA-mapping subsystem
    X86: integrate CMA with DMA-mapping subsystem
    drivers: add Contiguous Memory Allocator
    mm: trigger page reclaim in alloc_contig_range() to stabilise watermarks
    mm: extract reclaim code from __alloc_pages_direct_reclaim()
    mm: Serialize access to min_free_kbytes
    mm: page_isolation: MIGRATE_CMA isolation functions added
    mm: mmzone: MIGRATE_CMA migration type added
    mm: page_alloc: change fallbacks array handling
    mm: page_alloc: introduce alloc_contig_range()
    mm: compaction: export some of the functions
    mm: compaction: introduce isolate_freepages_range()
    mm: compaction: introduce map_pages()
    mm: compaction: introduce isolate_migratepages_range()
    mm: page_alloc: remove trailing whitespace
    ARM: dma-mapping: add support for IOMMU mapper
    ARM: dma-mapping: use alloc, mmap, free from dma_ops
    ARM: dma-mapping: remove redundant code and do the cleanup
    ...

    Conflicts:
    arch/x86/include/asm/dma-mapping.h

    Linus Torvalds
     

25 May, 2012

2 commits

  • Pull more networking updates from David Miller:
    "Ok, everything from here on out will be bug fixes."

    1) One final sync of wireless and bluetooth stuff from John Linville.
    These changes have all been in his tree for more than a week, and
    therefore have had the necessary -next exposure. John was just away
    on a trip and didn't have a chance to send the pull request until a
    day or two ago.

    2) Put back some defines in user exposed header file areas that were
    removed during the tokenring purge. From Stephen Hemminger and Paul
    Gortmaker.

    3) A bug fix for UDP hash table allocation got lost in the pile due to
    one of those "you got it.. no I've got it.." situations. :-)

    From Tim Bird.

    4) SKB coalescing in TCP needs to have stricter checks, otherwise we'll
    try to coalesce overlapping frags and crash. Fix from Eric Dumazet.

    5) RCU routing table lookups can race with free_fib_info(), causing
    crashes when we deref the device pointers in the route. Fix by
    releasing the net device in the RCU callback. From Yanmin Zhang.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (293 commits)
    tcp: take care of overlaps in tcp_try_coalesce()
    ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
    mm: add a low limit to alloc_large_system_hash
    ipx: restore token ring define to include/linux/ipx.h
    if: restore token ring ARP type to header
    xen: do not disable netfront in dom0
    phy/micrel: Fix ID of KSZ9021
    mISDN: Add X-Tensions USB ISDN TA XC-525
    gianfar:don't add FCB length to hard_header_len
    Bluetooth: Report proper error number in disconnection
    Bluetooth: Create flags for bt_sk()
    Bluetooth: report the right security level in getsockopt
    Bluetooth: Lock the L2CAP channel when sending
    Bluetooth: Restore locking semantics when looking up L2CAP channels
    Bluetooth: Fix a redundant and problematic incoming MTU check
    Bluetooth: Add support for Foxconn/Hon Hai AR5BBU22 0489:E03C
    Bluetooth: Fix EIR data generation for mgmt_device_found
    Bluetooth: Fix Inquiry with RSSI event mask
    Bluetooth: improve readability of l2cap_seq_list code
    Bluetooth: Fix skb length calculation
    ...

    Linus Torvalds
     
  • Pull user-space probe instrumentation from Ingo Molnar:
    "The uprobes code originates from SystemTap and has been used for years
    in Fedora and RHEL kernels. This version is much rewritten, reviews
    from PeterZ, Oleg and myself shaped the end result.

    This tree includes uprobes support in 'perf probe' - but SystemTap
    (and other tools) can take advantage of user probe points as well.

    Sample usage of uprobes via perf, for example to profile malloc()
    calls without modifying user-space binaries.

    First boot a new kernel with CONFIG_UPROBE_EVENT=y enabled.

    If you don't know which function you want to probe you can pick one
    from 'perf top' or can get a list all functions that can be probed
    within libc (binaries can be specified as well):

    $ perf probe -F -x /lib/libc.so.6

    To probe libc's malloc():

    $ perf probe -x /lib64/libc.so.6 malloc
    Added new event:
    probe_libc:malloc (on 0x7eac0)

    You can now use it in all perf tools, such as:

    perf record -e probe_libc:malloc -aR sleep 1

    Make use of it to create a call graph (as the flat profile is going to
    look very boring):

    $ perf record -e probe_libc:malloc -gR make
    [ perf record: Woken up 173 times to write data ]
    [ perf record: Captured and wrote 44.190 MB perf.data (~1930712

    $ perf report | less

    32.03% git libc-2.15.so [.] malloc
    |
    --- malloc

    29.49% cc1 libc-2.15.so [.] malloc
    |
    --- malloc
    |
    |--0.95%-- 0x208eb1000000000
    |
    |--0.63%-- htab_traverse_noresize

    11.04% as libc-2.15.so [.] malloc
    |
    --- malloc
    |

    7.15% ld libc-2.15.so [.] malloc
    |
    --- malloc
    |

    5.07% sh libc-2.15.so [.] malloc
    |
    --- malloc
    |
    4.99% python-config libc-2.15.so [.] malloc
    |
    --- malloc
    |
    4.54% make libc-2.15.so [.] malloc
    |
    --- malloc
    |
    |--7.34%-- glob
    | |
    | |--93.18%-- 0x41588f
    | |
    | --6.82%-- glob
    | 0x41588f

    ...

    Or:

    $ perf report -g flat | less

    # Overhead Command Shared Object Symbol
    # ........ ............. ............. ..........
    #
    32.03% git libc-2.15.so [.] malloc
    27.19%
    malloc

    29.49% cc1 libc-2.15.so [.] malloc
    24.77%
    malloc

    11.04% as libc-2.15.so [.] malloc
    11.02%
    malloc

    7.15% ld libc-2.15.so [.] malloc
    6.57%
    malloc

    ...

    The core uprobes design is fairly straightforward: uprobes probe
    points register themselves at (inode:offset) addresses of
    libraries/binaries, after which all existing (or new) vmas that map
    that address will have a software breakpoint injected at that address.
    vmas are COW-ed to preserve original content. The probe points are
    kept in an rbtree.

    If user-space executes the probed inode:offset instruction address
    then an event is generated which can be recovered from the regular
    perf event channels and mmap-ed ring-buffer.

    Multiple probes at the same address are supported, they create a
    dynamic callback list of event consumers.

    The basic model is further complicated by the XOL speedup: the
    original instruction that is probed is copied (in an architecture
    specific fashion) and executed out of line when the probe triggers.
    The XOL area is a single vma per process, with a fixed number of
    entries (which limits probe execution parallelism).

    The API: uprobes are installed/removed via
    /sys/kernel/debug/tracing/uprobe_events, the API is integrated to
    align with the kprobes interface as much as possible, but is separate
    to it.

    Injecting a probe point is a privileged operation, which can be relaxed
    by setting perf_paranoid to -1.

    You can use multiple probes as well and mix them with kprobes and
    regular PMU events or tracepoints, when instrumenting a task."

    Fix up trivial conflicts in mm/memory.c due to previous cleanup of
    unmap_single_vma().

    * 'perf-uprobes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
    perf probe: Detect probe target when m/x options are absent
    perf probe: Provide perf interface for uprobes
    tracing: Fix kconfig warning due to a typo
    tracing: Provide trace events interface for uprobes
    tracing: Extract out common code for kprobes/uprobes trace events
    tracing: Modify is_delete, is_return from int to bool
    uprobes/core: Decrement uprobe count before the pages are unmapped
    uprobes/core: Make background page replacement logic account for rss_stat counters
    uprobes/core: Optimize probe hits with the help of a counter
    uprobes/core: Allocate XOL slots for uprobes use
    uprobes/core: Handle breakpoint and singlestep exceptions
    uprobes/core: Rename bkpt to swbp
    uprobes/core: Make order of function parameters consistent across functions
    uprobes/core: Make macro names consistent
    uprobes: Update copyright notices
    uprobes/core: Move insn to arch specific structure
    uprobes/core: Remove uprobe_opcode_sz
    uprobes/core: Make instruction tables volatile
    uprobes: Move to kernel/events/
    uprobes/core: Clean up, refactor and improve the code
    ...

    Linus Torvalds
     

24 May, 2012

3 commits

  • UDP stack needs a minimum hash size value for proper operation and also
    uses alloc_large_system_hash() for proper NUMA distribution of its hash
    tables and automatic sizing depending on available system memory.

    In some low memory situations, udp_table_init() must ignore the
    alloc_large_system_hash() result and reallocate a bigger memory area.

    As we cannot easily free old hash table, we leak it and kmemleak can
    issue a warning.

    This patch adds a low limit parameter to alloc_large_system_hash() to
    solve this problem.

    We then specify UDP_HTABLE_SIZE_MIN for UDP/UDPLite hash table
    allocation.
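
    A small sketch of the sizing rule this introduces (variable names are
    assumptions): the automatically computed entry count is now clamped
    from below as well as from above, so udp_table_init() no longer has to
    throw away an undersized table and reallocate one itself:

        static unsigned long clamp_hash_entries(unsigned long numentries,
                                                unsigned long low_limit,
                                                unsigned long high_limit)
        {
                if (low_limit && numentries < low_limit)
                        numentries = low_limit;
                if (high_limit && numentries > high_limit)
                        numentries = high_limit;
                return numentries;
        }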

    Reported-by: Mark Asselstine
    Reported-by: Tim Bird
    Signed-off-by: Eric Dumazet
    Cc: Paul Gortmaker
    Signed-off-by: David S. Miller

    Tim Bird
     
  • Dave Jones' system call fuzz testing tool "trinity" triggered the
    following bug error with slab debugging enabled

    =============================================================================
    BUG numa_policy (Not tainted): Poison overwritten
    -----------------------------------------------------------------------------

    INFO: 0xffff880146498250-0xffff880146498250. First byte 0x6a instead of 0x6b
    INFO: Allocated in mpol_new+0xa3/0x140 age=46310 cpu=6 pid=32154
    __slab_alloc+0x3d3/0x445
    kmem_cache_alloc+0x29d/0x2b0
    mpol_new+0xa3/0x140
    sys_mbind+0x142/0x620
    system_call_fastpath+0x16/0x1b
    INFO: Freed in __mpol_put+0x27/0x30 age=46268 cpu=6 pid=32154
    __slab_free+0x2e/0x1de
    kmem_cache_free+0x25a/0x260
    __mpol_put+0x27/0x30
    remove_vma+0x68/0x90
    exit_mmap+0x118/0x140
    mmput+0x73/0x110
    exit_mm+0x108/0x130
    do_exit+0x162/0xb90
    do_group_exit+0x4f/0xc0
    sys_exit_group+0x17/0x20
    system_call_fastpath+0x16/0x1b
    INFO: Slab 0xffffea0005192600 objects=27 used=27 fp=0x (null) flags=0x20000000004080
    INFO: Object 0xffff880146498250 @offset=592 fp=0xffff88014649b9d0

    This implied a reference counting bug and the problem happened during
    mbind().

    mbind() applies a new memory policy to a range and uses mbind_range() to
    merge existing VMAs or split them as necessary. In the event of splits,
    mpol_dup() will allocate a new struct mempolicy and maintain existing
    reference counts whose rules are documented in
    Documentation/vm/numa_memory_policy.txt .

    The problem occurs with shared memory policies. The vm_op->set_policy
    increments the reference count if necessary and split_vma() and
    vma_merge() have already handled the existing reference counts.
    However, policy_vma() screws it up by replacing an existing
    vma->vm_policy with one that potentially has the wrong reference count
    leading to a premature free. This patch removes the damage caused by
    policy_vma().

    With this patch applied Dave's trinity tool runs an mbind test for 5
    minutes without error. /proc/slabinfo reported that there are no
    numa_policy or shared_policy_node objects allocated after the test
    completed and the shared memory region was deleted.

    Signed-off-by: Mel Gorman
    Cc: Dave Jones
    Cc: KOSAKI Motohiro
    Cc: Stephen Wilson
    Cc: Christoph Lameter
    Cc: Andrew Morton
    Cc:
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Pull user namespace enhancements from Eric Biederman:
    "This is a course correction for the user namespace, so that we can
    reach an inexpensive, maintainable, and reasonably complete
    implementation.

    Highlights:
    - Config guards make it impossible to enable the user namespace and
    code that has not been converted to be user namespace safe.

    - Use of the new kuid_t type ensures that if you somehow get past the
    config guards the kernel will encounter type errors if you enable
    user namespaces and attempt to compile in code whose permission
    checks have not been updated to be user namespace safe.

    - All uids from child user namespaces are mapped into the initial
    user namespace before they are processed. Removing the need to add
    an additional check to see if the user namespace of the compared
    uids remains the same.

    - With the user namespaces compiled out the performance is as good or
    better than it is today.

    - For most operations absolutely nothing changes performance or
    operationally with the user namespace enabled.

    - The worst case performance I could come up with was timing 1
    billion cache cold stat operations with the user namespace code
    enabled. This went from 156s to 164s on my laptop (or 156ns to
    164ns per stat operation).

    - (uid_t)-1 and (gid_t)-1 are reserved as an internal error value.
    Most uid/gid setting system calls treat these value specially
    anyway so attempting to use -1 as a uid would likely cause
    entertaining failures in userspace.

    - If setuid is called with a uid that can not be mapped setuid fails.
    I have looked at sendmail, login, ssh and every other program I
    could think of that would call setuid and they all check for and
    handle the case where setuid fails.

    - If stat or a similar system call is called from a context in which
    we can not map a uid we lie and return overflowuid. The LFS
    experience suggests not lying and returning an error code might be
    better, but the historical precedent with uids is different and I
    can not think of anything that would break by lying about a uid we
    can't map.

    - Capabilities are localized to the current user namespace making it
    safe to give the initial user in a user namespace all capabilities.

    My git tree covers all of the modifications needed to convert the core
    kernel and enough changes to make a system bootable to runlevel 1."

    Fix up trivial conflicts due to nearby independent changes in fs/stat.c

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits)
    userns: Silence silly gcc warning.
    cred: use correct cred accessor with regards to rcu read lock
    userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq
    userns: Convert cgroup permission checks to use uid_eq
    userns: Convert tmpfs to use kuid and kgid where appropriate
    userns: Convert sysfs to use kgid/kuid where appropriate
    userns: Convert sysctl permission checks to use kuid and kgids.
    userns: Convert proc to use kuid/kgid where appropriate
    userns: Convert ext4 to user kuid/kgid where appropriate
    userns: Convert ext3 to use kuid/kgid where appropriate
    userns: Convert ext2 to use kuid/kgid where appropriate.
    userns: Convert devpts to use kuid/kgid where appropriate
    userns: Convert binary formats to use kuid/kgid where appropriate
    userns: Add negative depends on entries to avoid building code that is userns unsafe
    userns: signal remove unnecessary map_cred_ns
    userns: Teach inode_capable to understand inodes whose uids map to other namespaces.
    userns: Fail exec for suid and sgid binaries with ids outside our user namespace.
    userns: Convert stat to return values mapped from kuids and kgids
    userns: Convert user specfied uids and gids in chown into kuids and kgid
    userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs
    ...

    Linus Torvalds
     

23 May, 2012

2 commits

  • Pull cgroup updates from Tejun Heo:
    "cgroup file type addition / removal is updated so that file types are
    added and removed instead of individual files so that dynamic file
    type addition / removal can be implemented by cgroup and used by
    controllers. blkio controller changes which will come through block
    tree are dependent on this. Other changes include res_counter cleanup
    and disallowing kthread / PF_THREAD_BOUND threads to be attached to
    non-root cgroups.

    There's a reported bug with the file type addition / removal handling
    which can lead to oops on cgroup umount. The issue is being looked
    into. It shouldn't cause problems for most setups and isn't a
    security concern."

    Fix up trivial conflict in Documentation/feature-removal-schedule.txt

    * 'for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
    res_counter: Account max_usage when calling res_counter_charge_nofail()
    res_counter: Merge res_counter_charge and res_counter_charge_nofail
    cgroups: disallow attaching kthreadd or PF_THREAD_BOUND threads
    cgroup: remove cgroup_subsys->populate()
    cgroup: get rid of populate for memcg
    cgroup: pass struct mem_cgroup instead of struct cgroup to socket memcg
    cgroup: make css->refcnt clearing on cgroup removal optional
    cgroup: use negative bias on css->refcnt to block css_tryget()
    cgroup: implement cgroup_rm_cftypes()
    cgroup: introduce struct cfent
    cgroup: relocate __d_cgrp() and __d_cft()
    cgroup: remove cgroup_add_file[s]()
    cgroup: convert memcg controller to the new cftype interface
    memcg: always create memsw files if CONFIG_CGROUP_MEM_RES_CTLR_SWAP
    cgroup: convert all non-memcg controllers to the new cftype interface
    cgroup: relocate cftype and cgroup_subsys definitions in controllers
    cgroup: merge cft_release_agent cftype array into the base files array
    cgroup: implement cgroup_add_cftypes() and friends
    cgroup: build list of all cgroups under a given cgroupfs_root
    cgroup: move cgroup_clear_directory() call out of cgroup_populate_dir()
    ...

    Linus Torvalds
     
  • Pull driver core updates from Greg Kroah-Hartman:
    "Here's the driver core, and other driver subsystems, pull request for
    the 3.5-rc1 merge window.

    Outside of a few minor driver core changes, we ended up with the
    following different subsystem and core changes as well, due to
    interdependancies on the driver core:
    - hyperv driver updates
    - drivers/memory being created and some drivers moved into it
    - extcon driver subsystem created out of the old Android staging
    switch driver code
    - dynamic debug updates
    - printk rework, and /dev/kmsg changes

    All of this has been tested in the linux-next releases for a few weeks
    with no reported problems.

    Signed-off-by: Greg Kroah-Hartman "

    Fix up conflicts in drivers/extcon/extcon-max8997.c where git noticed
    that a patch to the deleted drivers/misc/max8997-muic.c driver needs to
    be applied to this one.

    * tag 'driver-core-3.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (90 commits)
    uio_pdrv_genirq: get irq through platform resource if not set otherwise
    memory: tegra{20,30}-mc: Remove empty *_remove()
    printk() - isolate KERN_CONT users from ordinary complete lines
    sysfs: get rid of some lockdep false positives
    Drivers: hv: util: Properly handle version negotiations.
    Drivers: hv: Get rid of an unnecessary check in vmbus_prep_negotiate_resp()
    memory: tegra{20,30}-mc: Use dev_err_ratelimited()
    driver core: Add dev_*_ratelimited() family
    Driver Core: don't oops with unregistered driver in driver_find_device()
    printk() - restore prefix/timestamp printing for multi-newline strings
    printk: add stub for prepend_timestamp()
    ARM: tegra30: Make MC optional in Kconfig
    ARM: tegra20: Make MC optional in Kconfig
    ARM: tegra30: MC: Remove unnecessary BUG*()
    ARM: tegra20: MC: Remove unnecessary BUG*()
    printk: correctly align __log_buf
    ARM: tegra30: Add Tegra Memory Controller(MC) driver
    ARM: tegra20: Add Tegra Memory Controller(MC) driver
    printk() - restore timestamp printing at console output
    printk() - do not merge continuation lines of different threads
    ...

    Linus Torvalds
     

21 May, 2012

8 commits

  • This series sanitizes the interface to unmap_vma(). The crazy interface
    annoyed me no end when I was looking at unmap_single_vma(), which we can
    spend quite a lot of time in (especially with loads that have a lot of
    small fork/exec's: shell scripts etc).

    Moving the nr_accounted calculations to where they belong at least
    clarifies things a little. I hope to come back to look at the
    performance of this later, but if/when I get back to it I at least don't
    have to see the crazy interfaces any more.

    * vm-cleanups:
    vm: remove 'nr_accounted' calculations from the unmap_vmas() interfaces
    vm: simplify unmap_vmas() calling convention

    Linus Torvalds
     
  • __alloc_contig_migrate_range() calls migrate_pages() with the wrong
    argument for migrate_mode. Fix it.

    Cc: Marek Szyprowski
    Signed-off-by: Minchan Kim
    Acked-by: Michal Nazarewicz
    Signed-off-by: Marek Szyprowski

    Minchan Kim
     
  • alloc_contig_range() performs memory allocation, so it also needs to
    keep the memory watermarks at the correct level. This commit adds
    a call to *_slowpath style reclaim to grab enough pages to make sure that
    the final collection of contiguous pages from freelists will not starve
    the system.

    Signed-off-by: Marek Szyprowski
    Signed-off-by: Kyungmin Park
    CC: Michal Nazarewicz
    Tested-by: Rob Clark
    Tested-by: Ohad Ben-Cohen
    Tested-by: Benjamin Gaignard
    Tested-by: Robert Nelson
    Tested-by: Barry Song

    Marek Szyprowski
     
  • This patch extracts common reclaim code from __alloc_pages_direct_reclaim()
    into a separate function, __perform_reclaim(), which can later be used
    by alloc_contig_range().

    Signed-off-by: Marek Szyprowski
    Signed-off-by: Kyungmin Park
    Cc: Michal Nazarewicz
    Acked-by: Mel Gorman
    Tested-by: Rob Clark
    Tested-by: Ohad Ben-Cohen
    Tested-by: Benjamin Gaignard
    Tested-by: Robert Nelson
    Tested-by: Barry Song

    Marek Szyprowski
     
  • There is a race between the min_free_kbytes sysctl, memory hotplug
    and transparent hugepage support enablement. Memory hotplug uses a
    zonelists_mutex to avoid a race when building zonelists. Reuse it to
    serialise watermark updates.

    [a.p.zijlstra@chello.nl: Older patch fixed the race with spinlock]
    Signed-off-by: Mel Gorman
    Signed-off-by: Marek Szyprowski
    Reviewed-by: KAMEZAWA Hiroyuki
    Tested-by: Barry Song

    Mel Gorman
     
  • This commit changes various functions that change the migrate type of
    pages and pageblocks between MIGRATE_ISOLATE and MIGRATE_MOVABLE in
    such a way as to allow them to work with the MIGRATE_CMA migrate type.

    Signed-off-by: Michal Nazarewicz
    Signed-off-by: Marek Szyprowski
    Reviewed-by: KAMEZAWA Hiroyuki
    Tested-by: Rob Clark
    Tested-by: Ohad Ben-Cohen
    Tested-by: Benjamin Gaignard
    Tested-by: Robert Nelson
    Tested-by: Barry Song

    Michal Nazarewicz
     
  • The MIGRATE_CMA migration type has two main characteristics:
    (i) only movable pages can be allocated from MIGRATE_CMA
    pageblocks and (ii) page allocator will never change migration
    type of MIGRATE_CMA pageblocks.

    This guarantees (to some degree) that a page in a MIGRATE_CMA
    pageblock can always be migrated somewhere else (unless there's no
    memory left in the system).

    It is designed to be used for allocating big chunks (eg. 10MiB) of
    physically contiguous memory. Once a driver requests contiguous
    memory, pages from MIGRATE_CMA pageblocks may be migrated away to
    create a contiguous block.

    To minimise the number of migrations, the MIGRATE_CMA migration type
    is the last type tried when the page allocator falls back to other
    migration types.

    Signed-off-by: Michal Nazarewicz
    Signed-off-by: Marek Szyprowski
    Signed-off-by: Kyungmin Park
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Tested-by: Rob Clark
    Tested-by: Ohad Ben-Cohen
    Tested-by: Benjamin Gaignard
    Tested-by: Robert Nelson
    Tested-by: Barry Song

    Michal Nazarewicz
     
  • This commit adds a row for the MIGRATE_ISOLATE type to the fallbacks
    array, which was missing from it. It also changes the array traversal
    logic a little, making MIGRATE_RESERVE an end marker. The latter change
    removes the implicit MIGRATE_UNMOVABLE from the end of each row, which
    was read by the __rmqueue_fallback() function.
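
    A sketch of the reworked table in mm/page_alloc.c (the exact rows
    differ upstream, and the MIGRATE_CMA entries arrive with a later patch
    in this series):

        static int fallbacks[MIGRATE_TYPES][4] = {
                [MIGRATE_UNMOVABLE]   = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE,   MIGRATE_RESERVE },
                [MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE,   MIGRATE_MOVABLE,   MIGRATE_RESERVE },
                [MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_RESERVE },
                [MIGRATE_RESERVE]     = { MIGRATE_RESERVE },   /* never used */
                [MIGRATE_ISOLATE]     = { MIGRATE_RESERVE },   /* never used */
        };

        /* __rmqueue_fallback() now stops at MIGRATE_RESERVE instead of
         * implicitly falling back to MIGRATE_UNMOVABLE past the row end. */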

    Signed-off-by: Michal Nazarewicz
    Signed-off-by: Marek Szyprowski
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Tested-by: Rob Clark
    Tested-by: Ohad Ben-Cohen
    Tested-by: Benjamin Gaignard
    Tested-by: Robert Nelson
    Tested-by: Barry Song

    Michal Nazarewicz