08 Apr, 2014

5 commits

  • There's only one caller of set_page_dirty_balance() and that will call it
    with page_mkwrite == 0.

    The page_mkwrite argument has been unused since commit b827e496c893 "mm:
    close page_mkwrite races".

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • mem_cgroup_newpage_charge is used only for charging anonymous memory so
    it is better to rename it to mem_cgroup_charge_anon.

    mem_cgroup_cache_charge is used for file backed memory so rename it to
    mem_cgroup_charge_file.

    Signed-off-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Let's allow people to tweak faultaround at runtime.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Kirill A. Shutemov
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Alexander Viro
    Cc: Dave Chinner
    Cc: Ning Qu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Here's a new version of the faultaround patchset. It took a while to tune
    it and collect performance data.

    The first patch adds a new callback, ->map_pages, to vm_operations_struct.

    ->map_pages() is called when the VM asks to map easily accessible pages.
    The filesystem should find and map pages associated with offsets from
    "pgoff" up to "max_pgoff". ->map_pages() is called with the page table
    locked and must not block. If it's not possible to reach a page without
    blocking, the filesystem should skip it. The filesystem should use
    do_set_pte() to set up the page table entry. A pointer to the entry
    associated with offset "pgoff" is passed in the "pte" field of the
    vm_fault structure. Pointers to entries for other offsets should be
    calculated relative to "pte".
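
    As an illustration, a minimal, hypothetical ->map_pages() built from the
    contract above might look like the sketch below. The example_* names are
    made up, and the vm_fault fields ("pte", "max_pgoff") and the do_set_pte()
    arguments follow this description, so details may differ from the final
    in-tree API:

    #include <linux/mm.h>
    #include <linux/pagemap.h>

    static void example_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
    {
            struct address_space *mapping = vma->vm_file->f_mapping;
            unsigned long addr = (unsigned long)vmf->virtual_address;
            pte_t *pte = vmf->pte;          /* entry for vmf->pgoff */
            pgoff_t pgoff = vmf->pgoff;
            struct page *page;

            for (; pgoff <= vmf->max_pgoff; pgoff++, pte++, addr += PAGE_SIZE) {
                    if (!pte_none(*pte))    /* already populated by someone else */
                            continue;
                    page = find_get_page(mapping, pgoff);
                    if (!page)              /* cold page cache: leave it to ->fault() */
                            continue;
                    /* We hold the page table lock, so never sleep here. */
                    if (!PageUptodate(page) || !trylock_page(page)) {
                            put_page(page);
                            continue;
                    }
                    /* Install a read-only file pte; the reference taken
                     * above now backs the new mapping. */
                    do_set_pte(vma, addr, page, pte, false, false);
                    unlock_page(page);
            }
    }

    static const struct vm_operations_struct example_vm_ops = {
            .fault          = filemap_fault,
            .map_pages      = example_map_pages,
    };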

    Currently the VM uses ->map_pages() only on the read page fault path. We
    try to map FAULT_AROUND_PAGES at a time; FAULT_AROUND_PAGES is 16 for now.
    Performance data for different FAULT_AROUND_ORDER values is below.

    TODO:
    - implement ->map_pages() for shmem/tmpfs;
    - modify get_user_pages() to be able to use ->map_pages() and implement
    mmap(MAP_POPULATE|MAP_NONBLOCK) on top.

    =========================================================================
    Tested on 4-socket machine (120 threads) with 128GiB of RAM.

    A few real-world workloads. The sweet spot for FAULT_AROUND_ORDER here is
    somewhere between 3 and 5. Let's say 4 :)

    Linux build (make -j60)
    FAULT_AROUND_ORDER Baseline 1 3 4 5 7 9
    minor-faults 283,301,572 247,151,987 212,215,789 204,772,882 199,568,944 194,703,779 193,381,485
    time, seconds 151.227629483 153.920996480 151.356125472 150.863792049 150.879207877 151.150764954 151.450962358
    Linux rebuild (make -j60)
    FAULT_AROUND_ORDER Baseline 1 3 4 5 7 9
    minor-faults 5,396,854 4,148,444 2,855,286 2,577,282 2,361,957 2,169,573 2,112,643
    time, seconds 27.404543757 27.559725591 27.030057426 26.855045126 26.678618635 26.974523490 26.761320095
    Git test suite (make -j60 test)
    FAULT_AROUND_ORDER Baseline 1 3 4 5 7 9
    minor-faults 129,591,823 99,200,751 66,106,718 57,606,410 51,510,808 45,776,813 44,085,515
    time, seconds 66.087215026 64.784546905 64.401156567 65.282708668 66.034016829 66.793780811 67.237810413

    Two synthetic tests: access every word in a file in sequential/random
    order. Performance doesn't improve much beyond FAULT_AROUND_ORDER == 4.

    Sequential access 16GiB file
    FAULT_AROUND_ORDER Baseline 1 3 4 5 7 9
    1 thread
    minor-faults 4,195,437 2,098,275 525,068 262,251 131,170 32,856 8,282
    time, seconds 7.250461742 6.461711074 5.493859139 5.488488147 5.707213983 5.898510832 5.109232856
    8 threads
    minor-faults 33,557,540 16,892,728 4,515,848 2,366,999 1,423,382 442,732 142,339
    time, seconds 16.649304881 9.312555263 6.612490639 6.394316732 6.669827501 6.75078944 6.371900528
    32 threads
    minor-faults 134,228,222 67,526,810 17,725,386 9,716,537 4,763,731 1,668,921 537,200
    time, seconds 49.164430543 29.712060103 12.938649729 10.175151004 11.840094583 9.594081325 9.928461797
    60 threads
    minor-faults 251,687,988 126,146,952 32,919,406 18,208,804 10,458,947 2,733,907 928,217
    time, seconds 86.260656897 49.626551828 22.335007632 17.608243696 16.523119035 16.339489186 16.326390902
    120 threads
    minor-faults 503,352,863 252,939,677 67,039,168 35,191,827 19,170,091 4,688,357 1,471,862
    time, seconds 124.589206333 79.757867787 39.508707872 32.167281632 29.972989292 28.729834575 28.042251622
    Random access 1GiB file
    1 thread
    minor-faults 262,636 132,743 34,369 17,299 8,527 3,451 1,222
    time, seconds 15.351890914 16.613802482 16.569227308 15.179220992 16.557356122 16.578247824 15.365266994
    8 threads
    minor-faults 2,098,948 1,061,871 273,690 154,501 87,110 25,663 7,384
    time, seconds 15.040026343 15.096933500 14.474757288 14.289129964 14.411537468 14.296316837 14.395635804
    32 threads
    minor-faults 8,390,734 4,231,023 1,054,432 528,847 269,242 97,746 26,881
    time, seconds 20.430433109 21.585235358 22.115062928 14.872878951 14.880856305 14.883370649 14.821261690
    60 threads
    minor-faults 15,733,258 7,892,809 1,973,393 988,266 594,789 164,994 51,691
    time, seconds 26.577302548 25.692397770 18.728863715 20.153026398 21.619101933 17.745086260 17.613215273
    120 threads
    minor-faults 31,471,111 15,816,616 3,959,209 1,978,685 1,008,299 264,635 96,010
    time, seconds 41.835322703 40.459786095 36.085306105 35.313894834 35.814445675 36.552633793 34.289210594

    Touch only one page in page table in 16GiB file
    FAULT_AROUND_ORDER Baseline 1 3 4 5 7 9
    1 thread
    minor-faults 8,372 8,324 8,270 8,260 8,249 8,239 8,237
    time, seconds 0.039892712 0.045369149 0.051846126 0.063681685 0.079095975 0.17652406 0.541213386
    8 threads
    minor-faults 65,731 65,681 65,628 65,620 65,608 65,599 65,596
    time, seconds 0.124159196 0.488600638 0.156854426 0.191901957 0.242631486 0.543569456 1.677303984
    32 threads
    minor-faults 262,388 262,341 262,285 262,276 262,266 262,257 263,183
    time, seconds 0.452421421 0.488600638 0.565020946 0.648229739 0.789850823 1.651584361 5.000361559
    60 threads
    minor-faults 491,822 491,792 491,723 491,711 491,701 491,691 491,825
    time, seconds 0.763288616 0.869620515 0.980727360 1.161732354 1.466915814 3.04041448 9.308612938
    120 threads
    minor-faults 983,466 983,655 983,366 983,372 983,363 984,083 984,164
    time, seconds 1.595846553 1.667902182 2.008959376 2.425380942 2.941368804 5.977807890 18.401846125

    This patch (of 2):

    Introduce a new vm_ops callback, ->map_pages(), and use it to map easily
    accessible pages around the fault address.

    On a read page fault, if the filesystem provides ->map_pages(), we try to
    map up to FAULT_AROUND_PAGES pages around the page fault address, in the
    hope of reducing the number of minor page faults.

    We call ->map_pages() first and use ->fault() as a fallback if the page at
    the offset is not ready to be mapped (cold page cache or something).
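
    A condensed outline of that ordering (the function name here is made up
    and error handling is omitted; the real code also rechecks the page table
    entry under the page table lock and skips ->fault() when fault-around has
    already installed it):

    static int handle_read_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
    {
            /* Cheap, non-blocking attempt: map pages that are already
             * uptodate in the page cache around the faulting offset. */
            if (vma->vm_ops->map_pages)
                    vma->vm_ops->map_pages(vma, vmf);

            /* Fallback: may sleep and do I/O to bring the page in. */
            return vma->vm_ops->fault(vma, vmf);
    }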

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Linus Torvalds
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Alexander Viro
    Cc: Dave Chinner
    Cc: Ning Qu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The described issue now occurs inside mmap_region(), and unfortunately it
    is still valid.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

05 Apr, 2014

1 commit

  • get_user_pages(write=1, force=1) has always had odd behaviour on write-
    protected shared mappings: although it demands FMODE_WRITE-access to the
    underlying object (do_mmap_pgoff sets neither VM_SHARED nor VM_MAYWRITE
    without that), it ends up with do_wp_page substituting private anonymous
    Copied-On-Write pages for the shared file pages in the area.

    That was long ago intentional, as a safety measure to prevent ptrace
    setting a breakpoint (or POKETEXT or POKEDATA) from inadvertently
    corrupting the underlying executable. Yet exec and dynamic loaders open
    the file read-only, and use MAP_PRIVATE rather than MAP_SHARED.

    The traditional odd behaviour still causes surprises and bugs in mm, and
    is probably not what any caller wants - even the comment on the flag
    says "You do not want this" (although it's undoubtedly necessary for
    overriding userspace protections in some contexts, and good when !write).

    Let's stop doing that. But it would be dangerous to remove the long-
    standing safety at this stage, so just make get_user_pages(write,force)
    fail with EFAULT when applied to a write-protected shared area.
    Infiniband may in future want to force write through to underlying
    object: we can add another FOLL_flag later to enable that if required.
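
    An illustrative form of the new VMA permission check (the helper name and
    placement are assumptions, not the exact gup code):

    static bool write_force_permitted(const struct vm_area_struct *vma)
    {
            vm_flags_t flags = vma->vm_flags;

            if (flags & VM_WRITE)
                    return true;            /* plainly writable */

            /* FOLL_FORCE may still override protections on private
             * mappings, where COW gives the task its own copy... */
            if ((flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE)
                    return true;

            /* ...but write-protected shared areas are now rejected:
             * the caller fails with -EFAULT instead of COWing. */
            return false;
    }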

    Odd though the old behaviour was, there is no doubt that we may turn out
    to break userspace with this change, and have to revert it quickly.
    Issue a WARN_ON_ONCE to help debug the changed case (easily triggered by
    userspace, so only once to prevent spamming the logs); and delay a few
    associated cleanups until this change is proved.

    get_user_pages callers who might see trouble from this change:
    ptrace poking, or writing to /proc/<pid>/mem
    drivers/infiniband/
    drivers/media/v4l2-core/
    drivers/gpu/drm/exynos/exynos_drm_gem.c
    drivers/staging/tidspbridge/core/tiomap3430.c
    if they ever apply get_user_pages to write-protected shared mappings
    of an object which was opened for writing.

    I went to apply the same change to mm/nommu.c, but retreated. NOMMU has
    no place for COW, and its VM_flags conventions are not the same: I'd be
    more likely to screw up NOMMU than make an improvement there.

    Suggested-by: Linus Torvalds
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

04 Apr, 2014

8 commits

  • Extract and consolidate code to setup pte from do_read_fault(),
    do_cow_fault() and do_shared_fault().

    Signed-off-by: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • There are two functions which need to call vm_ops->page_mkwrite():
    do_shared_fault() and do_wp_page(). We can consolidate preparation
    code.
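
    The consolidated helper ends up along these lines (a sketch; the real
    version handles a few more corner cases):

    static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
                               unsigned long address)
    {
            struct vm_fault vmf;
            int ret;

            vmf.virtual_address = (void __user *)(address & PAGE_MASK);
            vmf.pgoff = page->index;
            vmf.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE;
            vmf.page = page;

            /* Ask the filesystem to make the page writable. */
            ret = vma->vm_ops->page_mkwrite(vma, &vmf);
            if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
                    return ret;

            /* Return with the page locked on behalf of the caller. */
            if (unlikely(!(ret & VM_FAULT_LOCKED))) {
                    lock_page(page);
                    if (!page->mapping) {   /* truncated under us: retry fault */
                            unlock_page(page);
                            return 0;
                    }
                    ret |= VM_FAULT_LOCKED;
            }
            return ret;
    }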

    Signed-off-by: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Introduce do_shared_fault(). The function does what do_fault() does for
    write faults to shared mappings.

    Unlike do_fault(), do_shared_fault() is relatively clean and
    straight-forward.

    Old do_fault() is not needed anymore. Let it die.

    [lliubbo@gmail.com: fix NULL pointer dereference]
    Signed-off-by: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Naoya Horiguchi
    Signed-off-by: Bob Liu
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Introduce do_cow_fault(). The function does what do_fault() does for
    write page faults to private mappings.

    Unlike do_fault(), do_cow_fault() is relatively clean and
    straight-forward.

    Signed-off-by: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Introduce do_read_fault(). The function does what do_fault() does for
    read page faults.

    Unlike do_fault(), do_read_fault() is pretty clean and straightforward.

    Signed-off-by: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Extract the code that calls vm_ops->fault() and the basic error handling
    into a separate function. The code will be reused.

    Signed-off-by: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Current __do_fault() is awful and unmaintainable. These patches try to
    sort it out by splitting __do_fault() into three distinct codepaths:

    - to handle read page fault;
    - to handle write page fault to private mappings;
    - to handle write page fault to shared mappings;

    I also found a page refcount leak in the PageHWPoison() path of
    __do_fault().

    This patch (of 7):

    do_fault() is unused: no reason for underscores.

    Signed-off-by: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Mark functions as static in mm/memory.c because they are not used outside
    this file.

    This eliminates the following warnings in mm/memory.c:

    mm/memory.c:3530:5: warning: no previous prototype for `numa_migrate_prep' [-Wmissing-prototypes]
    mm/memory.c:3545:5: warning: no previous prototype for `do_numa_page' [-Wmissing-prototypes]

    Signed-off-by: Rashika Kheria
    Reviewed-by: Josh Triplett
    Reviewed-by: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rashika Kheria
     

26 Feb, 2014

2 commits

  • Masayoshi Mizuma reported a bug where an application hangs under the memcg
    limit. It happens on a write-protection fault to the huge zero page.

    If we successfully allocate a huge page to replace the zero page but hit
    the memcg limit, we need to split the zero page with split_huge_page_pmd()
    and fall back to small pages.

    The other part of the problem is that VM_FAULT_OOM has a special meaning
    in the do_huge_pmd_wp_page() context. __handle_mm_fault() expects the page
    to be split if it sees VM_FAULT_OOM and it will retry page fault
    handling. This causes an infinite loop if the page was not split.

    do_huge_pmd_wp_zero_page_fallback() can return VM_FAULT_OOM if it failed
    to allocate one small page, so falling back to small pages will not help.

    The solution for this part is to replace VM_FAULT_OOM with
    VM_FAULT_FALLBACK if fallback is required.
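
    A sketch of that part of the fix, reduced to the charge-failure branch of
    the THP write-protect fault path (the wrapper function is illustrative;
    mem_cgroup_newpage_charge() and split_huge_page_pmd() are the helpers of
    that era):

    static int charge_new_huge_page(struct mm_struct *mm, struct vm_area_struct *vma,
                                    unsigned long address, pmd_t *pmd,
                                    struct page *new_page)
    {
            if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
                    put_page(new_page);
                    /* Split the huge zero page mapping so the fault can be
                     * retried with small pages... */
                    split_huge_page_pmd(vma, address, pmd);
                    /* ...and report that, instead of VM_FAULT_OOM, which the
                     * caller would treat as "already split" and loop on. */
                    return VM_FAULT_FALLBACK;
            }
            return 0;
    }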

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Masayoshi Mizuma
    Reviewed-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • It seems we forgot to release the page after detecting a HW error.

    Signed-off-by: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

24 Jan, 2014

2 commits

  • Most of the VM_BUG_ON assertions are performed on a page. Usually, when
    one of these assertions fails we'll get a BUG_ON with a call stack and
    the registers.

    I've recently noticed, based on requests to add a small piece of code that
    dumps the page at various VM_BUG_ON sites, that the page dump is quite
    useful to people debugging issues in mm.

    This patch adds a VM_BUG_ON_PAGE(cond, page) which, beyond doing what
    VM_BUG_ON() does, also dumps the page before executing the actual
    BUG_ON.
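
    A minimal sketch of such a macro (the in-tree version differs in detail,
    e.g. it is only active with CONFIG_DEBUG_VM, and dump_page() later also
    takes a reason string, see the next entry):

    #ifdef CONFIG_DEBUG_VM
    #define MY_VM_BUG_ON_PAGE(cond, page)                                   \
            do {                                                            \
                    if (unlikely(cond)) {                                   \
                            /* flags, mapping, counts of the bad page */    \
                            dump_page(page);                                \
                            BUG();                                          \
                    }                                                       \
            } while (0)
    #else
    #define MY_VM_BUG_ON_PAGE(cond, page) do { } while (0)
    #endif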

    [akpm@linux-foundation.org: fix up includes]
    Signed-off-by: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     
  • bad_page() is cool in that it prints out a bunch of data about the page.
    But, I can never remember which page flags are good and which are bad,
    or whether ->index or ->mapping is required to be NULL.

    This patch allows bad/dump_page() callers to specify a string about why
    they are dumping the page and adds explanation strings to a number of
    places. It also adds a 'bad_flags' argument to bad_page(), which it
    then dumps out separately from the flags which are actually set.

    This way, the messages will show specifically why the page was bad and
    *specifically* which flags it is complaining about, if a bad page-flag
    combination was the problem.
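
    A hypothetical caller, to show the shape of the interface (the helper
    name is made up; bad_page() itself lives in mm/page_alloc.c):

    static void example_check_new_page(struct page *page)
    {
            /* Say why the page is bad... */
            if (unlikely(page_mapcount(page)))
                    bad_page(page, "nonzero mapcount", 0);

            /* ...and, when flags are the problem, which flag bits to
             * highlight separately from everything else that is set. */
            if (unlikely(page->flags & PAGE_FLAGS_CHECK_AT_PREP))
                    bad_page(page, "PAGE_FLAGS_CHECK_AT_PREP flag(s) set",
                             page->flags & PAGE_FLAGS_CHECK_AT_PREP);
    }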

    [akpm@linux-foundation.org: switch to pr_alert]
    Signed-off-by: Dave Hansen
    Reviewed-by: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

22 Jan, 2014

2 commits

  • If DEBUG_SPINLOCK and DEBUG_LOCK_ALLOC are enabled, spinlock_t on x86_64
    is 72 bytes. For page->ptl they will be allocated from the kmalloc-96
    slab, so we lose 24 bytes on each. An average system can easily allocate
    a few tens of thousands of page->ptl locks, and the overhead is
    significant.

    Let's create a separate slab for page->ptl allocation to solve this.

    To make sure that it really works this time, some numbers from my test
    machine (just booted, no load):

    Before:
    # grep '^\(kmalloc-96\|page->ptl\)' /proc/slabinfo
    kmalloc-96 31987 32190 128 30 1 : tunables 120 60 8 : slabdata 1073 1073 92
    After:
    # grep '^\(kmalloc-96\|page->ptl\)' /proc/slabinfo
    page->ptl 27516 28143 72 53 1 : tunables 120 60 8 : slabdata 531 531 9
    kmalloc-96 3853 5280 128 30 1 : tunables 120 60 8 : slabdata 176 176 0

    Note that the patch is useful not only for the debug case, but also for
    PREEMPT_RT, where spinlock_t is always bloated.
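
    The approach boils down to a dedicated cache sized exactly for spinlock_t
    instead of a larger generic kmalloc bucket (sketch; the in-tree helpers
    may differ in detail):

    static struct kmem_cache *page_ptl_cachep;

    void __init ptlock_cache_init(void)
    {
            page_ptl_cachep = kmem_cache_create("page->ptl", sizeof(spinlock_t),
                                                0, SLAB_PANIC, NULL);
    }

    bool ptlock_alloc(struct page *page)
    {
            spinlock_t *ptl;

            ptl = kmem_cache_alloc(page_ptl_cachep, GFP_KERNEL);
            if (!ptl)
                    return false;
            page->ptl = ptl;
            return true;
    }

    void ptlock_free(struct page *page)
    {
            kmem_cache_free(page_ptl_cachep, page->ptl);
    }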

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Record actively mapped pages and provide an API for asserting that a given
    page is DMA-inactive before execution proceeds. Placing
    debug_dma_assert_idle() in cow_user_page() flagged the violation of the
    DMA API in the NET_DMA implementation (see commit 77873803363c "net_dma:
    mark broken").

    The implementation includes the capability to count, in a limited way,
    repeat mappings of the same page that occur without an intervening
    unmap. This 'overlap' counter is limited to the few bits of tag space
    in a radix tree. This mechanism is added to mitigate false negative
    cases where, for example, a page is dma mapped twice and
    debug_dma_assert_idle() is called after the page is un-mapped once.
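
    Usage on the COW path is a single call before the copy (the wrapper name
    is hypothetical; debug_dma_assert_idle() is a no-op unless
    CONFIG_DMA_API_DEBUG is enabled):

    #include <linux/dma-debug.h>
    #include <linux/highmem.h>

    static void cow_copy_page_checked(struct page *dst, struct page *src,
                                      unsigned long vaddr,
                                      struct vm_area_struct *vma)
    {
            /* Complain loudly if a DMA mapping is still live against the
             * page we are about to replace with a private copy. */
            debug_dma_assert_idle(src);
            copy_user_highpage(dst, src, vaddr, vma);
    }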

    Signed-off-by: Dan Williams
    Cc: Joerg Roedel
    Cc: Vinod Koul
    Cc: Russell King
    Cc: James Bottomley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

21 Dec, 2013

2 commits

  • Commit 597d795a2a78 ('mm: do not allocate page->ptl dynamically, if
    spinlock_t fits to long') restructures some allocators that are compiled
    even if USE_SPLIT_PTLOCKS isn't used. It results in a compilation
    failure:

    mm/memory.c:4282:6: error: 'struct page' has no member named 'ptl'
    mm/memory.c:4288:12: error: 'struct page' has no member named 'ptl'

    Add in the missing ifdef.
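
    The shape of the fix, as a sketch (the exact config symbol is whatever
    gates the dynamically-allocated ptl; the point is the guard around code
    that dereferences page->ptl as a pointer, and at this point the helpers
    still used plain kmalloc()):

    #if ALLOC_SPLIT_PTLOCKS
    bool ptlock_alloc(struct page *page)
    {
            spinlock_t *ptl;

            ptl = kmalloc(sizeof(spinlock_t), GFP_KERNEL);
            if (!ptl)
                    return false;
            page->ptl = ptl;        /* member only exists in this config */
            return true;
    }

    void ptlock_free(struct page *page)
    {
            kfree(page->ptl);
    }
    #endif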

    Fixes: 597d795a2a78 ('mm: do not allocate page->ptl dynamically, if spinlock_t fits to long')
    Signed-off-by: Olof Johansson
    Cc: Kirill A. Shutemov
    Cc: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Olof Johansson
     
  • In struct page we have enough space to fit a long-sized page->ptl there,
    but we use a dynamically-allocated page->ptl if sizeof(spinlock_t) is
    larger than sizeof(int).

    It hurts 64-bit architectures with CONFIG_GENERIC_LOCKBREAK, where
    sizeof(spinlock_t) == 8, but it easily fits into struct page.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

21 Nov, 2013

1 commit

  • This reverts commit ea1e7ed33708c7a760419ff9ded0a6cb90586a50.

    Al points out that while the commit *does* actually create a separate
    slab for the page->ptl allocation, that slab is never actually used, and
    the code continues to use kmalloc/kfree.

    Damien Wyart points out that the original patch did have the conversion
    to use kmem_cache_alloc/free, so it got lost somewhere on its way to me.

    Revert the half-arsed attempt that didn't do anything. If we really do
    want the special slab (remember: this is all relevant just for debug
    builds, so it's not necessarily all that critical) we might as well redo
    the patch fully.

    Reported-by: Al Viro
    Acked-by: Andrew Morton
    Cc: Kirill A Shutemov
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

15 Nov, 2013

5 commits

  • If DEBUG_SPINLOCK and DEBUG_LOCK_ALLOC are enabled, spinlock_t on x86_64
    is 72 bytes. For page->ptl they will be allocated from the kmalloc-96
    slab, so we lose 24 bytes on each. An average system can easily allocate
    a few tens of thousands of page->ptl locks, and the overhead is
    significant.

    Let's create a separate slab for page->ptl allocation to solve this.

    Signed-off-by: Kirill A. Shutemov
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Use kernel/bounds.c to convert build-time spinlock_t size check into a
    preprocessor symbol and apply that to properly separate the page::ptl
    situation.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Kirill A. Shutemov
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • If the split page table lock is in use, we embed the lock into the struct
    page of the table's page. We have to disable the split lock if spinlock_t
    is too big to be embedded, as when DEBUG_SPINLOCK or DEBUG_LOCK_ALLOC is
    enabled.

    This patch adds support for dynamic allocation of the split page table
    lock if we can't embed it in struct page.

    page->ptl is unsigned long now and we use it as a spinlock_t if
    sizeof(spinlock_t) <= sizeof(long); otherwise we allocate the spinlock_t
    dynamically and store a pointer to it in page->ptl.
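
    In other words, something along these lines (simplified; the real helpers
    in include/linux/mm.h are split by config macros rather than by a runtime
    sizeof() test):

    static inline spinlock_t *ptlock_ptr(struct page *page)
    {
            if (sizeof(spinlock_t) <= sizeof(page->ptl))
                    /* lock embedded directly in struct page */
                    return (spinlock_t *)&page->ptl;
            /* lock allocated elsewhere, page->ptl holds the pointer */
            return (spinlock_t *)page->ptl;
    }

    static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
    {
            return ptlock_ptr(pmd_page(*pmd));
    }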

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Only trivial cases left. Let's convert them altogether.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • With split page table lock for PMD level we can't hold mm->page_table_lock
    while updating nr_ptes.

    Let's convert it to atomic_long_t to avoid races.
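
    The conversion itself is mechanical; an illustrative before/after of the
    accounting sites (function names made up):

    static void account_pte_page(struct mm_struct *mm)
    {
            /* was:  mm->nr_ptes++;   done under mm->page_table_lock */
            atomic_long_inc(&mm->nr_ptes);
    }

    static unsigned long pte_pages(struct mm_struct *mm)
    {
            /* was:  return mm->nr_ptes; */
            return atomic_long_read(&mm->nr_ptes);
    }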

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: Naoya Horiguchi
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

13 Nov, 2013

3 commits

  • Merge first patch-bomb from Andrew Morton:
    "Quite a lot of other stuff is banked up awaiting further
    next->mainline merging, but this batch contains:

    - Lots of random misc patches
    - OCFS2
    - Most of MM
    - backlight updates
    - lib/ updates
    - printk updates
    - checkpatch updates
    - epoll tweaking
    - rtc updates
    - hfs
    - hfsplus
    - documentation
    - procfs
    - update gcov to gcc-4.7 format
    - IPC"

    * emailed patches from Andrew Morton : (269 commits)
    ipc, msg: fix message length check for negative values
    ipc/util.c: remove unnecessary work pending test
    devpts: plug the memory leak in kill_sb
    ./Makefile: export initial ramdisk compression config option
    init/Kconfig: add option to disable kernel compression
    drivers: w1: make w1_slave::flags long to avoid memory corruption
    drivers/w1/masters/ds1wm.cuse dev_get_platdata()
    drivers/memstick/core/ms_block.c: fix unreachable state in h_msb_read_page()
    drivers/memstick/core/mspro_block.c: fix attributes array allocation
    drivers/pps/clients/pps-gpio.c: remove redundant of_match_ptr
    kernel/panic.c: reduce 1 byte usage for print tainted buffer
    gcov: reuse kbasename helper
    kernel/gcov/fs.c: use pr_warn()
    kernel/module.c: use pr_foo()
    gcov: compile specific gcov implementation based on gcc version
    gcov: add support for gcc 4.7 gcov format
    gcov: move gcov structs definitions to a gcc version specific file
    kernel/taskstats.c: return -ENOMEM when alloc memory fails in add_del_listener()
    kernel/taskstats.c: add nla_nest_cancel() for failure processing between nla_nest_start() and nla_nest_end()
    kernel/sysctl_binary.c: use scnprintf() instead of snprintf()
    ...

    Linus Torvalds
     
  • Pull vfs updates from Al Viro:
    "All kinds of stuff this time around; some more notable parts:

    - RCU'd vfsmounts handling
    - new primitives for coredump handling
    - files_lock is gone
    - Bruce's delegations handling series
    - exportfs fixes

    plus misc stuff all over the place"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (101 commits)
    ecryptfs: ->f_op is never NULL
    locks: break delegations on any attribute modification
    locks: break delegations on link
    locks: break delegations on rename
    locks: helper functions for delegation breaking
    locks: break delegations on unlink
    namei: minor vfs_unlink cleanup
    locks: implement delegations
    locks: introduce new FL_DELEG lock flag
    vfs: take i_mutex on renamed file
    vfs: rename I_MUTEX_QUOTA now that it's not used for quotas
    vfs: don't use PARENT/CHILD lock classes for non-directories
    vfs: pull ext4's double-i_mutex-locking into common code
    exportfs: fix quadratic behavior in filehandle lookup
    exportfs: better variable name
    exportfs: move most of reconnect_path to helper function
    exportfs: eliminate unused "noprogress" counter
    exportfs: stop retrying once we race with rename/remove
    exportfs: clear DISCONNECTED on all parents sooner
    exportfs: more detailed comment for path_reconnect
    ...

    Linus Torvalds
     
  • The callers of free_pgd_range() and hugetlb_free_pgd_range() don't hold
    page table locks. The comments seem to be obsolete, so let's remove
    them.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

01 Nov, 2013

1 commit

  • Resolve cherry-picking conflicts:

    Conflicts:
    mm/huge_memory.c
    mm/memory.c
    mm/mprotect.c

    See this upstream merge commit for more details:

    52469b4fcd4f Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

29 Oct, 2013

1 commit

  • There are three callers of task_numa_fault():

    - do_huge_pmd_numa_page():
    Accounts against the current node, not the node where the
    page resides, unless we migrated, in which case it accounts
    against the node we migrated to.

    - do_numa_page():
    Accounts against the current node, not the node where the
    page resides, unless we migrated, in which case it accounts
    against the node we migrated to.

    - do_pmd_numa_page():
    Accounts not at all when the page isn't migrated, otherwise
    accounts against the node we migrated towards.

    This seems wrong to me; all three sites should have the same semantics.
    Furthermore, we should account against where the page really is; we
    already know where the task is.

    So modify all three sites to always account; we did after all receive
    the fault; and always account to where the page is after migration,
    regardless of success.

    They all still differ on when they clear the PTE/PMD; ideally that
    would get sorted too.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Cc:
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-8-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     

17 Oct, 2013

2 commits

  • Commit 3812c8c8f395 ("mm: memcg: do not trap chargers with full
    callstack on OOM") assumed that only a few places that can trigger a
    memcg OOM situation do not return VM_FAULT_OOM, like optional page cache
    readahead. But there are many more and it's impractical to annotate
    them all.

    First of all, we don't want to invoke the OOM killer when the failed
    allocation is gracefully handled, so defer the actual kill to the end of
    the fault handling as well. This simplifies the code quite a bit as an
    added bonus.

    Second, since a failed allocation might not be the abrupt end of the
    fault, the memcg OOM handler needs to be re-entrant until the fault
    finishes for subsequent allocation attempts. If an allocation is
    attempted after the task already OOMed, allow it to bypass the limit so
    that it can quickly finish the fault and invoke the OOM killer.

    Reported-by: azurIt
    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • If page migration is turned on in the config and the page is migrating, we
    may lose the soft dirty bit. If fork and mprotect are called on migrating
    pages (once migration is complete), the pages do not obtain the soft dirty
    bit in the corresponding pte entries. Fix it by adding an appropriate test
    on swap entries.
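
    A sketch of the kind of test added, reduced to the mprotect case (the
    helper name is illustrative):

    static pte_t downgrade_migration_entry(pte_t oldpte)
    {
            swp_entry_t entry = pte_to_swp_entry(oldpte);
            pte_t newpte = oldpte;

            if (is_write_migration_entry(entry)) {
                    /* mprotect: make the migration entry read-only... */
                    make_migration_entry_read(&entry);
                    newpte = swp_entry_to_pte(entry);
                    /* ...without losing the soft dirty bit, which lives in
                     * the swap pte while the page is migrating. */
                    if (pte_swp_soft_dirty(oldpte))
                            newpte = pte_swp_mksoft_dirty(newpte);
            }
            return newpte;
    }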

    Signed-off-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Andy Lutomirski
    Cc: Matt Mackall
    Cc: Xiao Guangrong
    Cc: Marcelo Tosatti
    Cc: KOSAKI Motohiro
    Cc: Stephen Rothwell
    Cc: Peter Zijlstra
    Cc: "Aneesh Kumar K.V"
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     

09 Oct, 2013

4 commits

  • Adjust numa_scan_period in task_numa_placement, depending on how much
    useful work the numa code can do. The more local faults there are in a
    given scan window, the longer the period (and hence the slower the scan
    rate) during the next window. If there are excessive shared faults then
    the scan period will decrease, with the amount of scaling depending on the
    ratio of shared/private faults. If the preferred node changes then the
    scan rate is reset to recheck if the task is properly placed.

    Signed-off-by: Rik van Riel
    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-59-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • Due to the way the pid is truncated, and tasks are moved between CPUs by
    the scheduler, it is possible for the current task_numa_fault to group
    together tasks that do not actually share memory.

    This patch adds a few easy sanity checks to task_numa_fault, joining
    tasks together if they share the same tsk->mm, or if the fault was on
    a page with an elevated mapcount, in a shared VMA.

    Signed-off-by: Rik van Riel
    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-57-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • With the THP migration races closed it is still possible to occasionally
    see corruption. The problem is related to handling PMD pages in batch.
    When a page fault is handled it can be assumed that the page being
    faulted will also be flushed from the TLB. The same flushing does not
    happen when handling PMD pages in batch. Fixing it is straightforward, but
    there are a number of reasons not to:

    1. Multiple TLB flushes may have to be sent depending on what pages get
    migrated
    2. The handling of PMDs in batch means that faults get accounted to
    the task that is handling the fault. While care is taken to only
    mark PMDs where the last CPU and PID match it can still have problems
    due to PID truncation when matching PIDs.
    3. Batching on the PMD level may reduce faults but setting pmd_numa
    requires taking a heavy lock that can contend with THP migration
    and handling the fault requires the release/acquisition of the PTL
    for every page migrated. It's still pretty heavy.

    PMD batch handling is not something that people have ever been happy
    with. This patch removes it, and later patches will deal with the
    additional fault overhead using more intelligent migrate rate adaptation.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-48-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
  • And here's a little something to make sure not the whole world ends up
    in a single group.

    As while we don't migrate shared executable pages, we do scan/fault on
    them. And since everybody links to libc, everybody ends up in the same
    group.

    Suggested-by: Rik van Riel
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Link: http://lkml.kernel.org/r/1381141781-10992-47-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Peter Zijlstra