03 Jul, 2018

1 commit

  • commit 1105a2fc022f3c7482e32faf516e8bc44095f778 upstream.

    On our armv8a server (QDF2400), I noticed lots of WARN_ONs caused by
    rmap_item->address not being PAGE_SIZE aligned under memory pressure
    tests (start 20 guests and run memhog in the host).

    WARNING: CPU: 4 PID: 4641 at virt/kvm/arm/mmu.c:1826 kvm_age_hva_handler+0xc0/0xc8
    CPU: 4 PID: 4641 Comm: memhog Tainted: G W 4.17.0-rc3+ #8
    Call trace:
    kvm_age_hva_handler+0xc0/0xc8
    handle_hva_to_gpa+0xa8/0xe0
    kvm_age_hva+0x4c/0xe8
    kvm_mmu_notifier_clear_flush_young+0x54/0x98
    __mmu_notifier_clear_flush_young+0x6c/0xa0
    page_referenced_one+0x154/0x1d8
    rmap_walk_ksm+0x12c/0x1d0
    rmap_walk+0x94/0xa0
    page_referenced+0x194/0x1b0
    shrink_page_list+0x674/0xc28
    shrink_inactive_list+0x26c/0x5b8
    shrink_node_memcg+0x35c/0x620
    shrink_node+0x100/0x430
    do_try_to_free_pages+0xe0/0x3a8
    try_to_free_pages+0xe4/0x230
    __alloc_pages_nodemask+0x564/0xdc0
    alloc_pages_vma+0x90/0x228
    do_anonymous_page+0xc8/0x4d0
    __handle_mm_fault+0x4a0/0x508
    handle_mm_fault+0xf8/0x1b0
    do_page_fault+0x218/0x4b8
    do_translation_fault+0x90/0xa0
    do_mem_abort+0x68/0xf0
    el0_da+0x24/0x28

    In rmap_walk_ksm, rmap_item->address might still have the STABLE_FLAG
    set, so the start and end passed to handle_hva_to_gpa might not be
    PAGE_SIZE aligned. Thus it causes the exceptions in handle_hva_to_gpa
    on arm64.

    This patch fixes it by ignoring (not removing) the low bits of address
    when doing rmap_walk_ksm.
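
    A minimal sketch of the idea (not the exact upstream hunk), assuming the
    KSM flag bits all live below PAGE_SHIFT so masking with PAGE_MASK drops
    them:

        /*
         * rmap_item->address carries KSM flags (e.g. STABLE_FLAG) in its
         * low bits; use a masked copy wherever a PAGE_SIZE aligned address
         * is required, leaving the stored value untouched.
         */
        unsigned long addr = rmap_item->address & PAGE_MASK;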

    IMO, it should be backported to the stable tree. The storm of WARN_ONs
    is very easy for me to reproduce. More than that, I watched a panic
    (not reproducible) as follows:

    page:ffff7fe003742d80 count:-4871 mapcount:-2126053375 mapping: (null) index:0x0
    flags: 0x1fffc00000000000()
    raw: 1fffc00000000000 0000000000000000 0000000000000000 ffffecf981470000
    raw: dead000000000100 dead000000000200 ffff8017c001c000 0000000000000000
    page dumped because: nonzero _refcount
    CPU: 29 PID: 18323 Comm: qemu-kvm Tainted: G W 4.14.15-5.hxt.aarch64 #1
    Hardware name:
    Call trace:
    dump_backtrace+0x0/0x22c
    show_stack+0x24/0x2c
    dump_stack+0x8c/0xb0
    bad_page+0xf4/0x154
    free_pages_check_bad+0x90/0x9c
    free_pcppages_bulk+0x464/0x518
    free_hot_cold_page+0x22c/0x300
    __put_page+0x54/0x60
    unmap_stage2_range+0x170/0x2b4
    kvm_unmap_hva_handler+0x30/0x40
    handle_hva_to_gpa+0xb0/0xec
    kvm_unmap_hva_range+0x5c/0xd0

    I even injected a fault on purpose in kvm_unmap_hva_range by setting
    size=size-0x200; the call trace is similar to the above. So I think the
    panic has the same root cause as the WARN_ONs.

    Andrea said:

    : It looks a straightforward safe fix, on x86 hva_to_gfn_memslot would
    : zap those bits and hide the misalignment caused by the low metadata
    : bits being erroneously left set in the address, but the arm code
    : notices when that's the last page in the memslot and the hva_end is
    : getting aligned and the size is below one page.
    :
    : I think the problem triggers in the addr += PAGE_SIZE of
    : unmap_stage2_ptes that never matches end because end is aligned but
    : addr is not.
    :
    : } while (pte++, addr += PAGE_SIZE, addr != end);
    :
    : x86 again only works on hva_start/hva_end after converting it to
    : gfn_start/end and that being in pfn units the bits are zapped before
    : they risk to cause trouble.

    Jia He said:

    : I've tested it myself on an arm64 server (QDF2400, 46 cpus, 96G mem).
    : Without this patch, the WARN_ON is very easy to reproduce. After this
    : patch, I have run the same benchmark for a whole day without any
    : WARN_ONs.

    Link: http://lkml.kernel.org/r/1525403506-6750-1-git-send-email-hejianet@gmail.com
    Signed-off-by: Jia He
    Reviewed-by: Andrea Arcangeli
    Tested-by: Jia He
    Cc: Suzuki K Poulose
    Cc: Minchan Kim
    Cc: Claudio Imbrenda
    Cc: Arvind Yadav
    Cc: Mike Rapoport
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jia He
     

30 May, 2018

1 commit

  • [ Upstream commit 77da2ba0648a4fd52e5ff97b8b2b8dd312aec4b0 ]

    This patch fixes a corner case for KSM. When two pages belong or
    belonged to the same transparent hugepage, and they should be merged,
    KSM fails to split the page, and therefore no merging happens.

    This bug can be reproduced by:
    * making sure ksm is running (disabling ksmtuned if necessary)
    * enabling transparent hugepages
    * allocating a THP-aligned 1-THP-sized buffer,
    e.g. on amd64: posix_memalign(&p, 1<<21, 1<<21) (see the sketch below)
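
    A minimal userspace sketch of such a reproducer, assuming a 2MB THP size
    on amd64 and that KSM and transparent hugepages are already enabled:

        #include <stdlib.h>
        #include <string.h>
        #include <unistd.h>
        #include <sys/mman.h>

        int main(void)
        {
                size_t thp = 1UL << 21;            /* assumed 2MB THP size */
                void *p;

                if (posix_memalign(&p, thp, thp))  /* THP-aligned, 1-THP-sized */
                        return 1;
                memset(p, 0x42, thp);              /* identical content throughout */
                madvise(p, thp, MADV_MERGEABLE);   /* hand the range to KSM */
                pause();                           /* let ksmd perform a few scans */
                return 0;
        }
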
    Co-authored-by: Gerald Schaefer
    Reviewed-by: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Claudio Imbrenda
     

24 Apr, 2018

1 commit

  • commit a38c015f3156895b07e71d4e4414289f8a3b2745 upstream.

    When using KSM with use_zero_pages, we replace anonymous pages
    containing only zeroes with actual zero pages, which are not anonymous.
    We need to do proper accounting of the mm counters, otherwise we will
    get wrong values in /proc and a BUG message in dmesg when tearing down
    the mm.

    Link: http://lkml.kernel.org/r/1522931274-15552-1-git-send-email-imbrenda@linux.vnet.ibm.com
    Fixes: e86c59b1b1 ("mm/ksm: improve deduplication of zero pages with colouring")
    Signed-off-by: Claudio Imbrenda
    Reviewed-by: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Christian Borntraeger
    Cc: Gerald Schaefer
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Claudio Imbrenda
     

04 Oct, 2017

1 commit

  • At this point the mm is unlocked, so the vmas or the list may change.
    Take mmap_sem for read to protect them from modification.
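
    Roughly (a sketch, not the exact hunk):

        down_read(&mm->mmap_sem);       /* keep the vma list stable */
        /* ... look up / walk the vmas as before ... */
        up_read(&mm->mmap_sem);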

    Link: http://lkml.kernel.org/r/150512788393.10691.8868381099691121308.stgit@localhost.localdomain
    Fixes: e86c59b1b12d ("mm/ksm: improve deduplication of zero pages with colouring")
    Signed-off-by: Kirill Tkhai
    Acked-by: Michal Hocko
    Reviewed-by: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: zhong jiang
    Cc: Ingo Molnar
    Cc: Claudio Imbrenda
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     

07 Sep, 2017

1 commit

  • attribute_group structures are not supposed to change at runtime. All
    functions working with attribute_group provided by <linux/sysfs.h> work
    with const attribute_group. So mark the non-const structs as const.

    Link: http://lkml.kernel.org/r/1501157167-3706-2-git-send-email-arvind.yadav.cs@gmail.com
    Signed-off-by: Arvind Yadav
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arvind Yadav
     

11 Aug, 2017

1 commit

  • Nadav reported that KSM can corrupt user data through a TLB batching
    race[1]. That means data the user has written can be lost.

    Quote from Nadav Amit:
    "For this race we need 4 CPUs:

    CPU0: Caches a writable and dirty PTE entry, and uses the stale value
    for write later.

    CPU1: Runs madvise_free on the range that includes the PTE. It would
    clear the dirty-bit. It batches TLB flushes.

    CPU2: Writes 4 to /proc/PID/clear_refs , clearing the PTEs soft-dirty.
    We care about the fact that it clears the PTE write-bit, and of
    course, batches TLB flushes.

    CPU3: Runs KSM. Our purpose is to pass the following test in
    write_protect_page():

    if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
    (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)))

    Since it will avoid TLB flush. And we want to do it while the PTE is
    stale. Later, and before replacing the page, we would be able to
    change the page.

    Note that all the operations CPU1-3 perform can happen in parallel
    since they only acquire mmap_sem for read.

    We start with two identical pages. Everything below regards the same
    page/PTE.

    CPU0: Write the same value on page
          [cache PTE as dirty in TLB]

    CPU1: MADV_FREE
          pte_mkclean()

    CPU2: 4 > clear_refs
          pte_wrprotect()

    CPU3: write_protect_page()
          [ success, no flush ]
          pages_identical()
          [ ok ]

    CPU0: Write to page different value
          [Ok, using stale PTE]

    CPU3: replace_page()

    Later, CPU1, CPU2 and CPU3 would flush the TLB, but that is too late.
    CPU0 already wrote on the page, but KSM ignored this write, and it got
    lost"

    In the above scenario, MADV_FREE is fixed by changing the TLB batching
    API, including [set|clear]_tlb_flush_pending. What remains is the
    soft-dirty part.

    This patch changes the soft-dirty code to use the TLB batching API
    instead of flush_tlb_mm, and makes KSM check for a pending TLB flush via
    mm_tlb_flush_pending, so that it flushes the TLB and avoids losing data
    when other parallel threads have a TLB flush pending.
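
    On the KSM side this amounts to one extra disjunct in the "must flush"
    test of write_protect_page(), roughly (a sketch, not the exact hunk):

        if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
            (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)) ||
            mm_tlb_flush_pending(mm)) {
                /* take the flushing path (ptep_clear_flush) instead of
                 * the flush-less fast path */
        }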

    [1] http://lkml.kernel.org/r/BD3A0EBE-ECF4-41D4-87FA-C755EA9AB6BD@gmail.com

    Link: http://lkml.kernel.org/r/20170802000818.4760-8-namit@vmware.com
    Signed-off-by: Minchan Kim
    Signed-off-by: Nadav Amit
    Reported-by: Nadav Amit
    Tested-by: Nadav Amit
    Reviewed-by: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: "David S. Miller"
    Cc: Andy Lutomirski
    Cc: Heiko Carstens
    Cc: Ingo Molnar
    Cc: Jeff Dike
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Nadav Amit
    Cc: Rik van Riel
    Cc: Russell King
    Cc: Sergey Senozhatsky
    Cc: Tony Luck
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

07 Jul, 2017

5 commits

  • If a candidate stable_node_dup has been found and it can accept further
    merges, it can be refiled to the head of the list to speed up the next
    searches, without altering which dup is found and how the dups
    accumulate in the chain.

    We already refiled it back to the head in the prune_stale_stable_nodes
    case, but we didn't refile it if not pruning (which is more common).
    We also refiled it when it was already at the head, which is
    unnecessary (in the prune_stale_stable_nodes case, nr > 1 means there's
    more than one dup in the chain; it doesn't mean it's not already at the
    head of the chain).

    The stable_node_chain list is single threaded and there's no SMP locking
    contention so it should be faster to refile it to the head of the list
    also if prune_stale_stable_nodes is false.

    Profiling shows the refile happens 1.9% of the time when a dup is found
    with a max_page_sharing limit setting of 3 (with max_page_sharing of 2
    the refile never happens of course as there's never space for one more
    merge) which is reasonably low. At higher max_page_sharing values it
    should be much less frequent.

    This is just an optimization.
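
    The refile itself boils down to two hlist operations, sketched here with
    the field names used in the chain/dup description (hlist_dup on the dup,
    hlist on the chain):

        /* move the dup that still accepts merges to the front of the
         * chain, so the next search tries it first */
        hlist_del(&found->hlist_dup);
        hlist_add_head(&found->hlist_dup, &stable_node->hlist);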

    Link: http://lkml.kernel.org/r/20170518173721.22316-4-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Cc: Evgheni Dereveanchin
    Cc: Andrey Ryabinin
    Cc: Petr Holasek
    Cc: Hugh Dickins
    Cc: Arjan van de Ven
    Cc: Davidlohr Bueso
    Cc: Gavin Guo
    Cc: Jay Vosburgh
    Cc: Mel Gorman
    Cc: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Some static checker complains if chain/chain_prune returns a potentially
    stale pointer.

    There are two output parameters to chain/chain_prune, one is tree_page
    the other is stable_node_dup. Like in get_ksm_page the caller has to
    check tree_page is NULL before touching the stable_node. Similarly in
    chain/chain_prune the caller has to check tree_page before touching the
    stable_node_dup returned or the original stable_node passed as
    parameter.

    Because the tree_page is never returned as a stale pointer, it may be
    more intuitive to return tree_page and to pass stable_node_dup for
    reference instead of the reverse.

    This patch purely swaps the two output parameters of chain/chain_prune
    as a cleanup for the static checker and to mimic the get_ksm_page
    behavior more closely. There's no change to the caller at all except
    the swap; it's purely a cleanup and a no-op from the caller's point of
    view.

    Link: http://lkml.kernel.org/r/20170518173721.22316-3-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Reported-by: Dan Carpenter
    Tested-by: Dan Carpenter
    Cc: Evgheni Dereveanchin
    Cc: Andrey Ryabinin
    Cc: Petr Holasek
    Cc: Hugh Dickins
    Cc: Arjan van de Ven
    Cc: Davidlohr Bueso
    Cc: Gavin Guo
    Cc: Jay Vosburgh
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Patch series "KSMscale cleanup/optimizations".

    There are no fixes here; it's just minor cleanups and optimizations.

    1/3 makes the "fix" for the stale stable_node fall into the standard
    case without introducing new cases. Setting stable_node to NULL was
    marginally safer, but the stale pointer is still wiped from the caller,
    and this looks cleaner.

    2/3 should fix the false positive from Dan's static checker.

    3/3 is a micro-optimization that applies the refile of future merge
    candidate dups to the head of the chain in all cases, and skips it in
    the one case where we did it but it was a noop (it used to avoid
    checking whether the dup was already at the head, but now we have to
    check that anyway, so the special case got optimized away).

    This patch (of 3):

    When the stable_node chain is collapsed we can as well set the caller
    stable_node to match the returned stable_node_dup in chain_prune().

    This way the collapse case becomes indistinguishable from the regular
    stable_node case and we can remove two branches from the KSM page
    migration handling slow paths.

    While it was all correct, this looks cleaner (and faster) as the caller
    has to deal with fewer special cases.

    Link: http://lkml.kernel.org/r/20170518173721.22316-2-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Cc: Evgheni Dereveanchin
    Cc: Andrey Ryabinin
    Cc: Petr Holasek
    Cc: Hugh Dickins
    Cc: Arjan van de Ven
    Cc: Davidlohr Bueso
    Cc: Gavin Guo
    Cc: Jay Vosburgh
    Cc: Mel Gorman
    Cc: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • If merge_across_nodes was manually set to 0 (not the default value) by
    the admin or a tuned profile on NUMA systems triggering cross-NODE page
    migrations, a stable_node use after free could materialize.

    If the chain is collapsed stable_node would point to the old chain that
    was already freed. stable_node_dup would be the stable_node dup now
    converted to a regular stable_node and indexed in the rbtree in
    replacement of the freed stable_node chain (not anymore a dup).

    This special case where the chain is collapsed in the NUMA replacement
    path, is now detected by setting stable_node to NULL by the chain_prune
    callee if it decides to collapse the chain. This tells the NUMA
    replacement code that even if stable_node and stable_node_dup are
    different, this is not a chain if stable_node is NULL, as the
    stable_node_dup was converted to a regular stable_node and the chain was
    collapsed.

    It is generally safer for the callee to force the caller's stable_node
    to NULL the moment it becomes stale, so any other mistake like this
    would result in an instant Oops, easier to debug than a use-after-free.

    Otherwise the replace logic would act as if stable_node were a valid
    chain, when in fact it was freed. Notably
    stable_node_chain_add_dup(page_node, stable_node) would run on a stale
    stable_node.
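
    The defensive part of the fix is tiny; a sketch, assuming _stable_node
    is the name of the caller-provided pointer parameter:

        /* chain collapsed into a regular stable_node: wipe the caller's
         * pointer so any stale use oopses on NULL instead of silently
         * touching freed memory */
        *_stable_node = NULL;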

    Andrey Ryabinin found the source of the use after free in chain_prune().

    Link: http://lkml.kernel.org/r/20170512193805.8807-2-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Reported-by: Andrey Ryabinin
    Reported-by: Evgheni Dereveanchin
    Tested-by: Andrey Ryabinin
    Cc: Petr Holasek
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Arjan van de Ven
    Cc: Gavin Guo
    Cc: Jay Vosburgh
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Without a max deduplication limit for each KSM page, the list of the
    rmap_items associated to each stable_node can grow infinitely large.

    During the rmap walk each entry can take up to ~10usec to process
    because of IPIs for the TLB flushing (both for the primary MMU and the
    secondary MMUs with the MMU notifier). With only 16GB of address space
    shared in the same KSM page, that would amount to dozens of seconds of
    kernel runtime.

    A ~256 max deduplication factor will reduce the latencies of the rmap
    walks on KSM pages to order of a few msec. Just doing the
    cond_resched() during the rmap walks is not enough, the list size must
    have a limit too, otherwise the caller could get blocked in (schedule
    friendly) kernel computations for seconds, unexpectedly.

    There's room for optimization to significantly reduce the IPI delivery
    cost during the page_referenced(), but at least for page_migration in
    the KSM case (used by hard NUMA bindings, compaction and NUMA balancing)
    it may be inevitable to send lots of IPIs if each rmap_item->mm is
    active on a different CPU and there are lots of CPUs. Even if we ignore
    the IPI delivery cost, we still have to walk the whole KSM rmap list, so
    we can't allow millions or billions (i.e. an unlimited number) of
    entries in the KSM stable_node rmap_item lists.

    The limit is enforced efficiently by adding a second dimension to the
    stable rbtree. So there are three types of stable_nodes: the regular
    ones (identical as before, living in the first flat dimension of the
    stable rbtree), the "chains" and the "dups".

    Every "chain" and all "dups" linked into a "chain" enforce the invariant
    that they represent the same write protected memory content, even if
    each "dup" will be pointed by a different KSM page copy of that content.
    This way the stable rbtree lookup computational complexity is unaffected
    if compared to an unlimited max_sharing_limit. It is still enforced
    that there cannot be KSM page content duplicates in the stable rbtree
    itself.

    Adding the second dimension to the stable rbtree only after the
    max_page_sharing limit hits, provides for a zero memory footprint
    increase on 64bit archs. The memory overhead of the per-KSM page
    stable_tree and per virtual mapping rmap_item is unchanged. Only after
    the max_page_sharing limit hits, we need to allocate a stable_tree
    "chain" and rb_replace() the "regular" stable_node with the newly
    allocated stable_node "chain". After that we simply add the "regular"
    stable_node to the chain as a stable_node "dup" by linking hlist_dup in
    the stable_node_chain->hlist. This way the "regular" (flat) stable_node
    is converted to a stable_node "dup" living in the second dimension of
    the stable rbtree.

    During stable rbtree lookups the stable_node "chain" is identified as
    stable_node->rmap_hlist_len == STABLE_NODE_CHAIN (aka
    is_stable_node_chain()).

    When dropping stable_nodes, the stable_node "dup" is identified as
    stable_node->head == STABLE_NODE_DUP_HEAD (aka is_stable_node_dup()).

    The STABLE_NODE_DUP_HEAD must be a unique valid pointer never used
    elsewhere in any stable_node->head/node, to avoid clashes with the
    stable_node->node.rb_parent_color pointer, and it must be different
    from &migrate_nodes. So the second field of &migrate_nodes is picked
    and verified as always safe with a BUILD_BUG_ON in case the list_head
    implementation changes in the future.

    STABLE_NODE_CHAIN is picked as a random negative value stored in
    stable_node->rmap_hlist_len. rmap_hlist_len cannot become negative when
    it's a "regular" stable_node or a stable_node "dup".

    The stable_node_chain->nid is irrelevant. The stable_node_chain->kpfn
    is aliased in a union with a time field used to rate limit the
    stable_node_chain->hlist prunes.

    The garbage collection of the stable_node_chain happens lazily during
    stable rbtree lookups (as for all other kind of stable_nodes), or while
    disabling KSM with "echo 2 >/sys/kernel/mm/ksm/run" while collecting the
    entire stable rbtree.

    While the "regular" stable_nodes and the stable_node "dups" must wait
    for their underlying tree_page to be freed before they can be freed
    themselves, the stable_node "chains" can be freed immediately if the
    stable_node->hlist turns empty. This is because the "chains" are never
    pointed by any page->mapping and they're effectively stable rbtree KSM
    self contained metadata.

    [akpm@linux-foundation.org: fix non-NUMA build]
    Signed-off-by: Andrea Arcangeli
    Tested-by: Petr Holasek
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Arjan van de Ven
    Cc: Evgheni Dereveanchin
    Cc: Andrey Ryabinin
    Cc: Gavin Guo
    Cc: Jay Vosburgh
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

03 Jun, 2017

1 commit

  • "err" needs to be left set to -EFAULT if split_huge_page succeeds.
    Otherwise if "err" gets clobbered with zero and write_protect_page
    fails, try_to_merge_one_page() will succeed instead of returning -EFAULT
    and then try_to_merge_with_ksm_page() will continue thinking kpage is a
    PageKsm when in fact it's still an anonymous page. Eventually it'll
    crash in page_add_anon_rmap.
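
    A sketch of the fixed pattern: test split_huge_page()'s return value
    directly, so a successful split leaves "err" at -EFAULT:

        if (PageTransCompound(page)) {
                if (split_huge_page(page))   /* don't clobber err on success */
                        goto out_unlock;
        }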

    This has been reproduced on a Fedora 25 kernel, but I can reproduce it
    with upstream too.

    The bug was introduced by commit f765f540598a ("ksm: prepare to new THP
    semantics") in v4.5.

    page:fffff67546ce1cc0 count:4 mapcount:2 mapping:ffffa094551e36e1 index:0x7f0f46673
    flags: 0x2ffffc0004007c(referenced|uptodate|dirty|lru|active|swapbacked)
    page dumped because: VM_BUG_ON_PAGE(!PageLocked(page))
    page->mem_cgroup:ffffa09674bf0000
    ------------[ cut here ]------------
    kernel BUG at mm/rmap.c:1222!
    CPU: 1 PID: 76 Comm: ksmd Not tainted 4.9.3-200.fc25.x86_64 #1
    RIP: do_page_add_anon_rmap+0x1c4/0x240
    Call Trace:
    page_add_anon_rmap+0x18/0x20
    try_to_merge_with_ksm_page+0x50b/0x780
    ksm_scan_thread+0x1211/0x1410
    ? prepare_to_wait_event+0x100/0x100
    ? try_to_merge_with_ksm_page+0x780/0x780
    kthread+0xd9/0xf0
    ? kthread_park+0x60/0x60
    ret_from_fork+0x25/0x30

    Fixes: f765f54059 ("ksm: prepare to new THP semantics")
    Link: http://lkml.kernel.org/r/20170513131040.21732-1-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Reported-by: Federico Simoncelli
    Acked-by: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

04 May, 2017

2 commits

  • rmap_one's return value controls whether rmap_walk should continue to
    scan other ptes or not, so it's a target for changing to boolean.
    Return true if the scan should be continued. Otherwise, return false to
    stop the scanning.

    This patch makes rmap_one's return value boolean.

    Link: http://lkml.kernel.org/r/1489555493-14659-10-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Anshuman Khandual
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • There is no user of the return value from rmap_walk() and friends so
    this patch makes them void-returning functions.

    Link: http://lkml.kernel.org/r/1489555493-14659-9-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Anshuman Khandual
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

02 Mar, 2017

2 commits

  • We are going to split <linux/sched/coredump.h> out of <linux/sched.h>,
    which will have to be picked up from other headers and a couple of .c
    files.

    Create a trivial placeholder <linux/sched/coredump.h> file that just
    maps to <linux/sched.h> to make this patch obviously correct and
    bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder <linux/sched/mm.h> file that just maps to
    <linux/sched.h> to make this patch obviously correct and bisectable.

    The APIs that are going to be moved first are:

    mm_alloc()
    __mmdrop()
    mmdrop()
    mmdrop_async_fn()
    mmdrop_async()
    mmget_not_zero()
    mmput()
    mmput_async()
    get_task_mm()
    mm_access()
    mm_release()

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

28 Feb, 2017

1 commit

  • Apart from adding the helper function itself, the rest of the kernel is
    converted mechanically using:

    git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)->mm_count);/mmgrab\(\1\);/'
    git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)\.mm_count);/mmgrab\(\&\1\);/'

    This is needed for a later patch that hooks into the helper, but might
    be a worthwhile cleanup on its own.

    (Michal Hocko provided most of the kerneldoc comment.)
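
    The helper itself is essentially a named wrapper (a sketch, kerneldoc
    omitted):

        static inline void mmgrab(struct mm_struct *mm)
        {
                atomic_inc(&mm->mm_count);
        }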

    Link: http://lkml.kernel.org/r/20161218123229.22952-1-vegard.nossum@oracle.com
    Signed-off-by: Vegard Nossum
    Acked-by: Michal Hocko
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vegard Nossum
     

25 Feb, 2017

3 commits

  • Without this, KSM will consider the page write-protected, but a NUMA
    fault can later mark the page writable. This can result in memory
    corruption.

    Link: http://lkml.kernel.org/r/1487498625-10891-3-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • For consistency, it is worth converting all page_check_address() callers
    to page_vma_mapped_walk(), so we can drop the former.
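
    The conversion pattern, sketched (the walk state replaces the pte/ptl
    values page_check_address() used to return):

        struct page_vma_mapped_walk pvmw = {
                .page = page,
                .vma = vma,
                .address = addr,
        };

        if (!page_vma_mapped_walk(&pvmw))
                goto out;               /* page is not mapped here */
        /* use pvmw.pte and pvmw.ptl from here on */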

    Link: http://lkml.kernel.org/r/20170129173858.45174-9-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Srikar Dronamraju
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Some architectures have a set of zero pages (coloured zero pages)
    instead of only one zero page, in order to improve the cache
    performance. In those cases, the kernel samepage merger (KSM) would
    merge all the allocated pages that happen to be filled with zeroes to
    the same deduplicated page, thus losing all the advantages of coloured
    zero pages.

    This behaviour is noticeable when a process accesses large arrays of
    allocated pages containing zeroes. A test I conducted on s390 shows
    that there is a speed penalty when KSM merges such pages, compared to
    not merging them or using actual zero pages from the start without
    breaking the COW.

    This patch fixes this behaviour. When coloured zero pages are present,
    the checksum of a zero page is calculated during initialisation, and
    compared with the checksum of the current candidate during merging. In
    case of a match, the normal merging routine is used to merge the page
    with the correct coloured zero page, which ensures the candidate page is
    checked to be equal to the target zero page.
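
    A sketch of the merging step, assuming zero_checksum was computed once
    at init time from a zero page and ksm_use_zero_pages is the sysfs
    toggle:

        if (ksm_use_zero_pages && checksum == zero_checksum) {
                /* merge against the colored zero page for this address,
                 * going through the normal merge path so the contents
                 * really are verified to be all zeroes */
                err = try_to_merge_one_page(vma, page,
                                            ZERO_PAGE(rmap_item->address));
        }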

    A sysfs entry is also added to toggle this behaviour, since it can
    potentially introduce performance regressions, especially on
    architectures without coloured zero pages. The default value is
    disabled, for backwards compatibility.

    With this patch, the performance with KSM is the same as with non
    COW-broken actual zero pages, which is also the same as without KSM.

    [akpm@linux-foundation.org: make zero_checksum and ksm_use_zero_pages __read_mostly, per Andrea]
    [imbrenda@linux.vnet.ibm.com: documentation for coloured zero pages deduplication]
    Link: http://lkml.kernel.org/r/1484927522-1964-1-git-send-email-imbrenda@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/1484850953-23941-1-git-send-email-imbrenda@linux.vnet.ibm.com
    Signed-off-by: Claudio Imbrenda
    Cc: Christian Borntraeger
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Claudio Imbrenda
     

08 Oct, 2016

1 commit

  • According to Hugh's suggestion, alloc_stable_node() with GFP_KERNEL can
    in rare cases cause a hung task warning.

    At present, if alloc_stable_node() allocation fails, two break_cows may
    want to allocate a couple of pages, and the issue will come up when free
    memory is under pressure.

    We fix it by adding __GFP_HIGH to GFP, to grant access to memory
    reserves, increasing the likelihood of allocation success.
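
    Roughly what the change amounts to (a sketch, assuming the usual
    stable_node_cache kmem cache):

        static struct stable_node *alloc_stable_node(void)
        {
                /* __GFP_HIGH grants access to memory reserves so this
                 * small metadata allocation rarely fails under pressure */
                return kmem_cache_alloc(stable_node_cache,
                                        GFP_KERNEL | __GFP_HIGH);
        }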

    [akpm@linux-foundation.org: tweak comment]
    Link: http://lkml.kernel.org/r/1474354484-58233-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang
    Suggested-by: Hugh Dickins
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
     

29 Sep, 2016

1 commit

  • I hit the following hung task when running an OOM LTP test case with a
    4.1 kernel.

    Call trace:
    [] __switch_to+0x74/0x8c
    [] __schedule+0x23c/0x7bc
    [] schedule+0x3c/0x94
    [] rwsem_down_write_failed+0x214/0x350
    [] down_write+0x64/0x80
    [] __ksm_exit+0x90/0x19c
    [] mmput+0x118/0x11c
    [] do_exit+0x2dc/0xa74
    [] do_group_exit+0x4c/0xe4
    [] get_signal+0x444/0x5e0
    [] do_signal+0x1d8/0x450
    [] do_notify_resume+0x70/0x78

    The oom victim cannot terminate because it needs to take mmap_sem for
    write while the lock is held by ksmd for read, and ksmd loops in the
    page allocator:

    ksm_do_scan
    scan_get_next_rmap_item
    down_read
    get_next_rmap_item
    alloc_rmap_item #ksmd will loop permanently.

    There is no way forward because the oom victim cannot release any memory
    in a 4.1-based kernel. Since 4.6 we have the oom reaper, which would
    solve this problem because it would release the memory asynchronously.
    Nevertheless we can relax the alloc_rmap_item requirements and use
    __GFP_NORETRY, because the allocation failure is acceptable: ksm_do_scan
    would just retry later after the lock got dropped.

    Such a patch would also be easy to backport to older stable kernels
    which do not have the oom reaper.

    While we are at it, add __GFP_NOWARN so the admin doesn't have to be
    alarmed by the allocation failure.
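
    A sketch of the resulting allocation (assuming the usual rmap_item_cache
    kmem cache):

        static struct rmap_item *alloc_rmap_item(void)
        {
                /* failure is fine: ksm_do_scan() simply retries on a later
                 * pass, so don't retry hard and don't warn about it */
                return kmem_cache_zalloc(rmap_item_cache, GFP_KERNEL |
                                         __GFP_NORETRY | __GFP_NOWARN);
        }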

    Link: http://lkml.kernel.org/r/1474165570-44398-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang
    Suggested-by: Hugh Dickins
    Suggested-by: Michal Hocko
    Acked-by: Michal Hocko
    Acked-by: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
     

27 Jul, 2016

2 commits

  • We always have vma->vm_mm around.

    Link: http://lkml.kernel.org/r/1466021202-61880-8-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We have allowed migration for only LRU pages until now, and it was
    enough to make high-order pages. But recently, embedded systems (e.g.,
    webOS, Android) use lots of non-movable pages (e.g., zram, GPU memory),
    so we have seen several reports about troubles with small high-order
    allocations. To fix the problem, there were several efforts (e.g.,
    enhancing the compaction algorithm, SLUB fallback to 0-order pages,
    reserved memory, vmalloc and so on), but if there are lots of
    non-movable pages in the system, those solutions are void in the long
    run.

    So, this patch adds a facility to turn non-movable pages into movable
    ones. For the feature, it introduces migration-related functions in
    address_space_operations as well as some page flags.

    If a driver wants to make its own pages movable, it should define three
    functions, which are function pointers of struct
    address_space_operations (see the sketch after this list).

    1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);

    What the VM expects from a driver's isolate_page function is to return
    *true* if the driver isolates the page successfully. On returning true,
    the VM marks the page as PG_isolated, so concurrent isolation on several
    CPUs skips the page. If a driver cannot isolate the page, it should
    return *false*.

    Once the page is successfully isolated, the VM uses the page.lru fields,
    so the driver shouldn't expect those fields to be preserved.

    2. int (*migratepage) (struct address_space *mapping,
    struct page *newpage, struct page *oldpage, enum migrate_mode);

    After isolation, the VM calls the driver's migratepage with the isolated
    page. The job of migratepage is to move the content of the old page to
    the new page and set up the fields of struct page newpage. Keep in mind
    that you should indicate to the VM that the oldpage is no longer movable
    via __ClearPageMovable() under page_lock if you migrated the oldpage
    successfully and return 0. If the driver cannot migrate the page at the
    moment, it can return -EAGAIN. On -EAGAIN, the VM will retry page
    migration in a short time because the VM interprets -EAGAIN as
    "temporary migration failure". On returning any error other than
    -EAGAIN, the VM will give up the page migration without retrying.

    The driver shouldn't touch the page.lru field, which the VM uses, in
    these functions.

    3. void (*putback_page)(struct page *);

    If migration fails on an isolated page, the VM should return the
    isolated page to the driver, so the VM calls the driver's putback_page
    with the migration-failed page. In this function, the driver should put
    the isolated page back into its own data structure.

    4. non-lru movable page flags

    There are two page flags for supporting non-lru movable page.

    * PG_movable

    The driver should use the function below to make a page movable, under
    page_lock.

    void __SetPageMovable(struct page *page, struct address_space *mapping)

    It takes an address_space argument for registering the family of
    migration functions which will be called by the VM. Strictly speaking,
    PG_movable is not a real flag of struct page. Rather, the VM reuses the
    lower bits of page->mapping to represent it:

    #define PAGE_MAPPING_MOVABLE 0x2
    page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;

    so the driver shouldn't access page->mapping directly. Instead, the
    driver should use page_mapping, which masks off the low two bits of
    page->mapping so it can get the right struct address_space.

    For testing whether a page is a non-lru movable page, the VM provides
    the __PageMovable function. However, it doesn't guarantee identifying a
    non-lru movable page because the page->mapping field is unified with
    other variables in struct page. Also, if the driver releases the page
    after isolation by the VM, page->mapping doesn't have a stable value
    although it has PAGE_MAPPING_MOVABLE set (look at __ClearPageMovable).
    But __PageMovable is a cheap way to tell whether a page is LRU or
    non-lru movable once the page has been isolated, because LRU pages can
    never have PAGE_MAPPING_MOVABLE in page->mapping. It is also good for
    just peeking to test for non-lru movable pages before the more expensive
    check with lock_page during pfn scanning to select a victim.

    To identify a non-lru movable page with certainty, the VM provides the
    PageMovable function. Unlike __PageMovable, PageMovable validates
    page->mapping and mapping->a_ops->isolate_page under lock_page. The
    lock_page prevents sudden destruction of page->mapping.

    A driver using __SetPageMovable should clear the flag via
    __ClearPageMovable under page_lock before releasing the page.

    * PG_isolated

    To prevent concurrent isolation among several CPUs, the VM marks an
    isolated page as PG_isolated under lock_page. So if a CPU encounters a
    PG_isolated non-lru movable page, it can skip it. The driver doesn't
    need to manipulate the flag because the VM will set/clear it
    automatically. Keep in mind that if the driver sees a PG_isolated page,
    it means the page has been isolated by the VM, so it shouldn't touch the
    page.lru field. PG_isolated is an alias of the PG_reclaim flag, so the
    driver shouldn't use the flag for its own purposes.
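
    A sketch of how a driver wires up the three callbacks and marks its
    pages movable (the mydrv_* names are made up for illustration):

        static const struct address_space_operations mydrv_aops = {
                .isolate_page   = mydrv_isolate_page,
                .migratepage    = mydrv_migratepage,
                .putback_page   = mydrv_putback_page,
        };

        /* when a freshly allocated driver page becomes movable */
        __SetPageMovable(page, mapping);   /* mapping->a_ops == &mydrv_aops */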

    [opensource.ganesh@gmail.com: mm/compaction: remove local variable is_lru]
    Link: http://lkml.kernel.org/r/20160618014841.GA7422@leo-test
    Link: http://lkml.kernel.org/r/1464736881-24886-3-git-send-email-minchan@kernel.org
    Signed-off-by: Gioh Kim
    Signed-off-by: Minchan Kim
    Signed-off-by: Ganesh Mahendran
    Acked-by: Vlastimil Babka
    Cc: Sergey Senozhatsky
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Rafael Aquini
    Cc: Jonathan Corbet
    Cc: John Einar Reitan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

13 May, 2016

1 commit

  • There is a concurrency issue in KSM in the function
    scan_get_next_rmap_item:

    task A (ksmd): |task B (the mm's task):
    |
    mm = slot->mm; |
    down_read(&mm->mmap_sem); |
    |
    ... |
    |
    spin_lock(&ksm_mmlist_lock); |
    |
    ksm_scan.mm_slot go to the next slot; |
    |
    spin_unlock(&ksm_mmlist_lock); |
    |mmput() ->
    | ksm_exit():
    |
    |spin_lock(&ksm_mmlist_lock);
    |if (mm_slot && ksm_scan.mm_slot != mm_slot) {
    | if (!mm_slot->rmap_list) {
    | easy_to_free = 1;
    | ...
    |
    |if (easy_to_free) {
    | mmdrop(mm);
    | ...
    |
    |So this mm_struct may be freed in the mmput().
    |
    up_read(&mm->mmap_sem); |

    As we can see above, the ksmd thread may access a mm_struct that has
    already been freed to the kmem_cache. Suppose a fork then gets this
    mm_struct from the kmem_cache; when the ksmd thread calls
    up_read(&mm->mmap_sem), it will cause mmap_sem.count to become -1.

    As suggested by Andrea Arcangeli, unmerge_and_remove_all_rmap_items has
    the same SMP race condition, so fix it too. My previous fix in
    scan_get_next_rmap_item would introduce a different SMP race condition,
    so just invert the up_read/spin_unlock order as Andrea Arcangeli
    suggested.
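
    The inverted ordering, sketched:

        spin_lock(&ksm_mmlist_lock);
        /* ... advance ksm_scan.mm_slot ... */
        up_read(&mm->mmap_sem);         /* drop mmap_sem while still holding
                                           ksm_mmlist_lock, so a concurrent
                                           ksm_exit()/mmput() cannot free the
                                           mm underneath us */
        spin_unlock(&ksm_mmlist_lock);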

    Link: http://lkml.kernel.org/r/1462708815-31301-1-git-send-email-zhouchengming1@huawei.com
    Signed-off-by: Zhou Chengming
    Suggested-by: Andrea Arcangeli
    Reviewed-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Vlastimil Babka
    Cc: Geliang Tang
    Cc: Minchan Kim
    Cc: Hanjun Guo
    Cc: Ding Tianhong
    Cc: Li Bin
    Cc: Zhen Lei
    Cc: Xishi Qiu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhou Chengming
     

19 Feb, 2016

1 commit

  • We try to enforce protection keys in software the same way that we
    do in hardware. (See long example below).

    But, we only want to do this when accessing our *own* process's
    memory. If GDB set PKRU[6].AD=1 (disable access to PKEY 6), then
    tried to PTRACE_POKE a target process which just happened to have
    some mprotect_pkey(pkey=6) memory, we do *not* want to deny the
    debugger access to that memory. PKRU is fundamentally a
    thread-local structure and we do not want to enforce it on access
    to _another_ thread's data.

    This gets especially tricky when we have workqueues or other
    delayed-work mechanisms that might run in a random process's context.
    We can check that we only enforce pkeys when operating on our *own* mm,
    but delayed work gets performed when a random user context is active.
    We might end up with a situation where a delayed-work gup fails when
    running randomly under its "own" task but succeeds when running under
    another process. We want to avoid that.

    To avoid that, we use the new GUP flag: FOLL_REMOTE and add a
    fault flag: FAULT_FLAG_REMOTE. They indicate that we are
    walking an mm which is not guaranteed to be the same as
    current->mm and should not be subject to protection key
    enforcement.

    Thanks to Jerome Glisse for pointing out this scenario.

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Alexey Kardashevskiy
    Cc: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Boaz Harrosh
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: David Gibson
    Cc: Denys Vlasenko
    Cc: Dominik Dingel
    Cc: Dominik Vogt
    Cc: Eric B Munson
    Cc: Geliang Tang
    Cc: Guan Xuetao
    Cc: H. Peter Anvin
    Cc: Heiko Carstens
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: Jason Low
    Cc: Jerome Marchand
    Cc: Joerg Roedel
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Laurent Dufour
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michael Ellerman
    Cc: Michal Hocko
    Cc: Mikulas Patocka
    Cc: Minchan Kim
    Cc: Oleg Nesterov
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Sasha Levin
    Cc: Shachar Raindel
    Cc: Vlastimil Babka
    Cc: Xie XiuQi
    Cc: iommu@lists.linux-foundation.org
    Cc: linux-arch@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: linux-s390@vger.kernel.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Signed-off-by: Ingo Molnar

    Dave Hansen
     

16 Feb, 2016

1 commit

  • We will soon modify the vanilla get_user_pages() so it can no
    longer be used on mm/tasks other than 'current/current->mm',
    which is by far the most common way it is called. For now,
    we allow the old-style calls, but warn when they are used.
    (implemented in previous patch)

    This patch switches all callers of:

    get_user_pages()
    get_user_pages_unlocked()
    get_user_pages_locked()

    to stop passing tsk/mm so they will no longer see the warnings.

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Kirill A. Shutemov
    Cc: Linus Torvalds
    Cc: Naoya Horiguchi
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Srikar Dronamraju
    Cc: Vlastimil Babka
    Cc: jack@suse.cz
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160212210156.113E9407@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     

16 Jan, 2016

4 commits

  • The MADV_FREE patchset changes page reclaim to simply free a clean
    anonymous page with no dirty ptes, instead of swapping it out; but KSM
    uses clean write-protected ptes to reference the stable ksm page. So be
    sure to mark that page dirty, so it's never mistakenly discarded.
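
    The fix boils down to a couple of lines when the page becomes the
    stable ksm page, sketched:

        /* Page reclaim just frees a clean page with no dirty ptes:
         * make sure the stable ksm page is dirty so it gets swapped
         * out rather than discarded. */
        if (!PageDirty(page))
                SetPageDirty(page);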

    [hughd@google.com: adjusted comments]
    Signed-off-by: Minchan Kim
    Acked-by: Hugh Dickins
    Cc: "James E.J. Bottomley"
    Cc: "Kirill A. Shutemov"
    Cc: Shaohua Li
    Cc:
    Cc: Andrea Arcangeli
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Chen Gang
    Cc: Chris Zankel
    Cc: Daniel Micay
    Cc: Darrick J. Wong
    Cc: David S. Miller
    Cc: Helge Deller
    Cc: Ivan Kokshaysky
    Cc: Jason Evans
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: Kirill A. Shutemov
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Michal Hocko
    Cc: Mika Penttil
    Cc: Ralf Baechle
    Cc: Richard Henderson
    Cc: Rik van Riel
    Cc: Roland Dreier
    Cc: Russell King
    Cc: Shaohua Li
    Cc: Will Deacon
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • We don't need special code to stabilize THP. If you've got a reference
    to any subpage of a THP, it will not be split under you.

    The new split_huge_page() also accepts tail pages: no need for special
    code to get a reference to the head page.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We're going to allow mapping of individual 4k pages of a THP compound
    page. It means we cannot rely on the PageTransHuge() check to decide
    whether to map/unmap a small page or a THP.

    The patch adds a new argument to the rmap functions to indicate whether
    we want to operate on the whole compound page or only the small page.
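
    For example (a sketch of the new calling convention):

        /* 'compound' says whether the whole THP or a single small page
         * is being mapped / unmapped */
        page_add_anon_rmap(page, vma, address, compound);
        page_remove_rmap(page, compound);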

    [n-horiguchi@ah.jp.nec.com: fix mapcount mismatch in hugepage migration]
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • lock_page() must operate on the whole compound page. It doesn't make
    much sense to lock part of a compound page. Change the code to use the
    head page's PG_locked if a tail page is passed.

    This patch also gets rid of the custom helper functions --
    __set_page_locked() and __clear_page_locked(). They are replaced with
    helpers generated by __SETPAGEFLAG/__CLEARPAGEFLAG. Passing tail pages
    to these helpers would trigger VM_BUG_ON().

    SLUB uses PG_locked as a bit spin lock. IIUC, tail pages should never
    appear there. VM_BUG_ON() is added to make sure that this assumption is
    correct.

    [akpm@linux-foundation.org: fix fs/cifs/file.c]
    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Cc: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

15 Jan, 2016

1 commit


06 Nov, 2015

5 commits

  • get_mergeable_page() can only return NULL (also in case of errors) or
    the pinned mergeable page. It can't return an error other than NULL.
    This optimizes away the now-unnecessary error check.

    Add a return after the "out:" label in the callee to make it more
    readable.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Cc: Petr Holasek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Doing the VM_MERGEABLE check after the page == kpage check won't provide
    any meaningful benefit. The !vma->anon_vma check of find_mergeable_vma is
    the only superfluous bit in using find_mergeable_vma because the !PageAnon
    check of try_to_merge_one_page() implicitly checks for that, but it still
    looks cleaner to share the same find_mergeable_vma().

    Signed-off-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Cc: Petr Holasek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This just uses the helper function to cleanup the assumption on the
    hlist_node internals.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Cc: Petr Holasek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • The stable_nodes can become stale at any time if the underlying pages get
    freed. The stable_node gets collected and removed from the stable rbtree
    if that is detected during the rbtree lookups.

    Don't fail the lookup if we run into stale stable_nodes; just restart
    the lookup after collecting them. Otherwise the CPU spent in the
    preparation stage is wasted and the lookup must be repeated at the next
    loop, potentially failing a second time on a second stale stable_node.

    If we don't prune aggressively we delay the merging of the unstable node
    candidates and at the same time we delay the freeing of the stale
    stable_nodes. Keeping stale stable_nodes around wastes memory and it
    can't provide any benefit.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Cc: Petr Holasek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • While at it, add it to the file and anon walks too.
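
    The addition is a cond_resched() in the long rmap walk loops (compare
    the max_page_sharing discussion above); a sketch for the KSM walk:

        hlist_for_each_entry(rmap_item, &stable_node->hlist, hlist) {
                cond_resched();         /* the hlist can be long */
                /* ... walk this rmap_item's anon_vma as before ... */
        }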

    Signed-off-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Cc: Petr Holasek
    Acked-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

16 Apr, 2015

1 commit

  • We converted some of the usages of ACCESS_ONCE to READ_ONCE in the mm/
    tree since it doesn't work reliably on non-scalar types.

    This patch removes the rest of the usages of ACCESS_ONCE and uses the
    new READ_ONCE API for the read accesses. This makes things cleaner,
    instead of using separate/multiple sets of APIs.

    Signed-off-by: Jason Low
    Acked-by: Michal Hocko
    Acked-by: Davidlohr Bueso
    Acked-by: Rik van Riel
    Reviewed-by: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Low
     

11 Feb, 2015

1 commit


30 Jan, 2015

1 commit

  • The core VM already knows about VM_FAULT_SIGBUS, but cannot return a
    "you should SIGSEGV" error, because the SIGSEGV case was generally
    handled by the caller - usually the architecture fault handler.

    That results in lots of duplication - all the architecture fault
    handlers end up doing very similar "look up vma, check permissions, do
    retries etc" - but it generally works. However, there are cases where
    the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV.

    In particular, when accessing the stack guard page, libsigsegv expects a
    SIGSEGV. And it usually got one, because the stack growth is handled by
    that duplicated architecture fault handler.

    However, when the generic VM layer started propagating the error return
    from the stack expansion in commit fee7e49d4514 ("mm: propagate error
    from stack expansion even for guard page"), that now exposed the
    existing VM_FAULT_SIGBUS result to user space. And user space really
    expected SIGSEGV, not SIGBUS.

    To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all those
    duplicate architecture fault handlers about it. They all already have
    the code to handle SIGSEGV, so it's about just tying that new return
    value to the existing code, but it's all a bit annoying.

    This is the mindless minimal patch to do this. A more extensive patch
    would be to try to gather up the mostly shared fault handling logic into
    one generic helper routine, and long-term we really should do that
    cleanup.

    Just from this patch, you can generally see that most architectures just
    copied (directly or indirectly) the old x86 way of doing things, but in
    the meantime that original x86 model has been improved to hold the VM
    semaphore for shorter times etc and to handle VM_FAULT_RETRY and other
    "newer" things, so it would be a good idea to bring all those
    improvements to the generic case and teach other architectures about
    them too.

    Reported-and-tested-by: Takashi Iwai
    Tested-by: Jan Engelhardt
    Acked-by: Heiko Carstens # "s390 still compiles and boots"
    Cc: linux-arch@vger.kernel.org
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds