07 Jun, 2017

1 commit

  • commit a7306c3436e9c8e584a4b9fad5f3dc91be2a6076 upstream.

    "err" needs to be left set to -EFAULT if split_huge_page succeeds.
    Otherwise if "err" gets clobbered with zero and write_protect_page
    fails, try_to_merge_one_page() will succeed instead of returning -EFAULT
    and then try_to_merge_with_ksm_page() will continue thinking kpage is a
    PageKsm when in fact it's still an anonymous page. Eventually it'll
    crash in page_add_anon_rmap.

    This has been reproduced on a Fedora 25 kernel, but it can be reproduced
    with upstream too.

    The bug was introduced by commit f765f540598a ("ksm: prepare to new THP
    semantics"), which went into v4.5.

    page:fffff67546ce1cc0 count:4 mapcount:2 mapping:ffffa094551e36e1 index:0x7f0f46673
    flags: 0x2ffffc0004007c(referenced|uptodate|dirty|lru|active|swapbacked)
    page dumped because: VM_BUG_ON_PAGE(!PageLocked(page))
    page->mem_cgroup:ffffa09674bf0000
    ------------[ cut here ]------------
    kernel BUG at mm/rmap.c:1222!
    CPU: 1 PID: 76 Comm: ksmd Not tainted 4.9.3-200.fc25.x86_64 #1
    RIP: do_page_add_anon_rmap+0x1c4/0x240
    Call Trace:
    page_add_anon_rmap+0x18/0x20
    try_to_merge_with_ksm_page+0x50b/0x780
    ksm_scan_thread+0x1211/0x1410
    ? prepare_to_wait_event+0x100/0x100
    ? try_to_merge_with_ksm_page+0x780/0x780
    kthread+0xd9/0xf0
    ? kthread_park+0x60/0x60
    ret_from_fork+0x25/0x30
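
    A minimal, hedged sketch of the error-handling pattern the fix restores
    (a simplification, not the exact try_to_merge_one_page() code):

        int err = -EFAULT;      /* must survive until the merge fully succeeds */

        if (PageTransCompound(page)) {
                /* Buggy form was: err = split_huge_page(page);
                 * a successful split clobbered err with 0. */
                if (split_huge_page(page))      /* fixed: leave err untouched */
                        goto out_unlock;
        }
        /* if write_protect_page() fails later, err is still -EFAULT */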

    Fixes: f765f54059 ("ksm: prepare to new THP semantics")
    Link: http://lkml.kernel.org/r/20170513131040.21732-1-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Reported-by: Federico Simoncelli
    Acked-by: Kirill A. Shutemov
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrea Arcangeli
     

08 Oct, 2016

1 commit

  • According to Hugh's suggestion, alloc_stable_node() with GFP_KERNEL can
    in rare cases cause a hung task warning.

    At present, if the alloc_stable_node() allocation fails, two break_cow()s
    may each want to allocate a couple of pages, and the issue comes up when
    free memory is under pressure.

    We fix it by adding __GFP_HIGH to the gfp mask, granting access to memory
    reserves and increasing the likelihood that the allocation succeeds.
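
    A hedged sketch of the resulting allocation, assuming alloc_stable_node()
    is a thin wrapper around the stable_node slab cache:

        static inline struct stable_node *alloc_stable_node(void)
        {
                /* __GFP_HIGH lets this small but important allocation dip into
                 * memory reserves instead of triggering hung-task warnings */
                return kmem_cache_alloc(stable_node_cache,
                                        GFP_KERNEL | __GFP_HIGH);
        }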

    [akpm@linux-foundation.org: tweak comment]
    Link: http://lkml.kernel.org/r/1474354484-58233-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang
    Suggested-by: Hugh Dickins
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
     

29 Sep, 2016

1 commit

  • I hit the following hung task when running an OOM LTP test case on a
    4.1 kernel.

    Call trace:
    [] __switch_to+0x74/0x8c
    [] __schedule+0x23c/0x7bc
    [] schedule+0x3c/0x94
    [] rwsem_down_write_failed+0x214/0x350
    [] down_write+0x64/0x80
    [] __ksm_exit+0x90/0x19c
    [] mmput+0x118/0x11c
    [] do_exit+0x2dc/0xa74
    [] do_group_exit+0x4c/0xe4
    [] get_signal+0x444/0x5e0
    [] do_signal+0x1d8/0x450
    [] do_notify_resume+0x70/0x78

    The OOM victim cannot terminate because it needs to take mmap_sem for
    write while the lock is held for read by ksmd, which loops in the page
    allocator:

    ksm_do_scan
      scan_get_next_rmap_item
        down_read
        get_next_rmap_item
          alloc_rmap_item  # ksmd will loop permanently.

    There is no way forward because the OOM victim cannot release any memory
    in a 4.1-based kernel. Since 4.6 we have the OOM reaper, which would
    solve this problem because it releases the memory asynchronously.
    Nevertheless we can relax alloc_rmap_item()'s allocation requirements and
    use __GFP_NORETRY, because the allocation failure is acceptable:
    ksm_do_scan would just retry later, after the lock has been dropped.

    Such a patch would also be easy to backport to older stable kernels which
    do not have the OOM reaper.

    While we are at it, add __GFP_NOWARN so the admin doesn't have to be
    alarmed by the allocation failure.
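
    A hedged sketch of the relaxed allocation, assuming alloc_rmap_item()
    allocates from a dedicated slab cache:

        static struct rmap_item *alloc_rmap_item(void)
        {
                struct rmap_item *rmap_item;

                /* __GFP_NORETRY: failing is fine, ksm_do_scan() retries later;
                 * __GFP_NOWARN: don't alarm the admin about that failure. */
                rmap_item = kmem_cache_zalloc(rmap_item_cache,
                                              GFP_KERNEL | __GFP_NORETRY |
                                              __GFP_NOWARN);
                if (rmap_item)
                        ksm_rmap_items++;
                return rmap_item;
        }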

    Link: http://lkml.kernel.org/r/1474165570-44398-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang
    Suggested-by: Hugh Dickins
    Suggested-by: Michal Hocko
    Acked-by: Michal Hocko
    Acked-by: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
     

27 Jul, 2016

2 commits

  • We always have vma->vm_mm around.

    Link: http://lkml.kernel.org/r/1466021202-61880-8-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We have allowed migration for only LRU pages until now, and it was
    enough to make high-order pages. But recently, embedded systems (e.g.,
    webOS, Android) use lots of non-movable pages (e.g., zram, GPU memory),
    so we have seen several reports about trouble with small high-order
    allocations. To fix the problem, there have been several efforts (e.g.,
    enhancing the compaction algorithm, SLUB fallback to 0-order pages,
    reserved memory, vmalloc and so on), but if there are lots of non-movable
    pages in the system, those solutions are void in the long run.

    So, this patch adds a facility to turn non-movable pages into movable
    ones. For the feature, it introduces migration-related functions in
    address_space_operations as well as some page flags.

    If a driver wants to make its own pages movable, it should define three
    functions, which are function pointers in struct address_space_operations
    (a sketch of a hypothetical driver wiring these up follows the flag
    descriptions below).

    1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);

    What the VM expects from the driver's isolate_page function is to return
    *true* if the driver isolates the page successfully. On returning true,
    the VM marks the page as PG_isolated so that concurrent isolation on
    other CPUs skips the page. If the driver cannot isolate the page, it
    should return *false*.

    Once a page is successfully isolated, the VM uses the page.lru fields,
    so the driver shouldn't expect the values in those fields to be
    preserved.

    2. int (*migratepage) (struct address_space *mapping,
    struct page *newpage, struct page *oldpage, enum migrate_mode);

    After isolation, the VM calls the driver's migratepage with the isolated
    page. The job of migratepage is to move the content of the old page to
    the new page and to set up the fields of struct page for newpage. Keep
    in mind that you should indicate to the VM that the oldpage is no longer
    movable, via __ClearPageMovable() under page_lock, if you migrated the
    oldpage successfully and return 0. If the driver cannot migrate the page
    at the moment, it can return -EAGAIN. On -EAGAIN, the VM will retry page
    migration after a short time, because it interprets -EAGAIN as a
    "temporary migration failure". On returning any error other than
    -EAGAIN, the VM will give up the page migration without retrying.

    The driver shouldn't touch the page.lru field the VM is using in these
    functions.

    3. void (*putback_page)(struct page *);

    If migration fails on the isolated page, the VM should return the
    isolated page to the driver, so the VM calls the driver's putback_page
    with the migration-failed page. In this function, the driver should put
    the isolated page back into its own data structure.

    4. non-lru movable page flags

    There are two page flags for supporting non-lru movable page.

    * PG_movable

    The driver should use the function below to make a page movable, under
    page_lock.

    void __SetPageMovable(struct page *page, struct address_space *mapping)

    It takes an address_space argument in order to register the migration
    family of functions that will be called by the VM. Strictly speaking,
    PG_movable is not a real flag of struct page. Rather, the VM reuses the
    lower bits of page->mapping to represent it.

    #define PAGE_MAPPING_MOVABLE 0x2
    page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;

    so the driver shouldn't access page->mapping directly. Instead, it
    should use page_mapping(), which masks off the low two bits of
    page->mapping so it can get the right struct address_space.

    For testing a non-lru movable page, the VM provides the __PageMovable
    function. However, it doesn't guarantee identifying a non-lru movable
    page, because the page->mapping field is unioned with other variables in
    struct page. Also, if the driver releases the page after isolation by
    the VM, page->mapping no longer has a stable value even though it has
    PAGE_MAPPING_MOVABLE set (look at __ClearPageMovable). But __PageMovable
    is a cheap way to tell whether a page is LRU or non-lru movable once the
    page has been isolated, because LRU pages can never have
    PAGE_MAPPING_MOVABLE in page->mapping. It is also good for just peeking,
    to test for non-lru movable pages before the more expensive check with
    lock_page during pfn scanning to select a victim.

    To guarantee a non-lru movable page, the VM provides the PageMovable
    function. Unlike __PageMovable, PageMovable validates page->mapping and
    mapping->a_ops->isolate_page under lock_page. The lock_page prevents
    page->mapping from being destroyed suddenly.

    A driver using __SetPageMovable should clear the flag via
    __ClearPageMovable under page_lock before releasing the page.

    * PG_isolated

    To prevent concurrent isolation among several CPUs, the VM marks an
    isolated page as PG_isolated under lock_page. So if a CPU encounters a
    PG_isolated non-lru movable page, it can skip it. The driver doesn't
    need to manipulate the flag because the VM will set/clear it
    automatically. Keep in mind that if the driver sees a PG_isolated page,
    it means the page has been isolated by the VM, so it shouldn't touch the
    page.lru field. PG_isolated aliases the PG_reclaim flag, so the driver
    shouldn't use that flag for its own purposes.
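
    As mentioned above, a hedged sketch of how a hypothetical driver might
    wire this up. The driver name mydrv, its helpers mydrv_detach(),
    mydrv_copy() and mydrv_attach(), and the mapping it registers are
    illustrative assumptions; the three callbacks and the
    __SetPageMovable()/__ClearPageMovable() calls are the interface
    described above:

        /* hypothetical driver glue, heavily simplified for illustration */
        static bool mydrv_isolate_page(struct page *page, isolate_mode_t mode)
        {
                /* detach the page from the driver's own lists; return true
                 * only on success so the VM may mark it PG_isolated */
                return mydrv_detach(page);
        }

        static int mydrv_migratepage(struct address_space *mapping,
                                     struct page *newpage, struct page *oldpage,
                                     enum migrate_mode mode)
        {
                if (!mydrv_copy(newpage, oldpage))
                        return -EAGAIN;         /* temporary failure, VM retries */
                __ClearPageMovable(oldpage);    /* oldpage is no longer movable */
                return 0;
        }

        static void mydrv_putback_page(struct page *page)
        {
                mydrv_attach(page);             /* migration failed: take it back */
        }

        static const struct address_space_operations mydrv_aops = {
                .isolate_page   = mydrv_isolate_page,
                .migratepage    = mydrv_migratepage,
                .putback_page   = mydrv_putback_page,
        };

        /* when the driver allocates a page it wants the VM to migrate:
         *      __SetPageMovable(page, mydrv_mapping);   (under page_lock) */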

    [opensource.ganesh@gmail.com: mm/compaction: remove local variable is_lru]
    Link: http://lkml.kernel.org/r/20160618014841.GA7422@leo-test
    Link: http://lkml.kernel.org/r/1464736881-24886-3-git-send-email-minchan@kernel.org
    Signed-off-by: Gioh Kim
    Signed-off-by: Minchan Kim
    Signed-off-by: Ganesh Mahendran
    Acked-by: Vlastimil Babka
    Cc: Sergey Senozhatsky
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Rafael Aquini
    Cc: Jonathan Corbet
    Cc: John Einar Reitan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

13 May, 2016

1 commit

  • There is a concurrency issue in KSM, in the function
    scan_get_next_rmap_item():

    task A (ksmd):                            |task B (the mm's task):
                                              |
    mm = slot->mm;                            |
    down_read(&mm->mmap_sem);                 |
                                              |
    ...                                       |
                                              |
    spin_lock(&ksm_mmlist_lock);              |
                                              |
    ksm_scan.mm_slot go to the next slot;     |
                                              |
    spin_unlock(&ksm_mmlist_lock);            |
                                              |mmput() ->
                                              |  ksm_exit():
                                              |
                                              |spin_lock(&ksm_mmlist_lock);
                                              |if (mm_slot && ksm_scan.mm_slot != mm_slot) {
                                              |  if (!mm_slot->rmap_list) {
                                              |    easy_to_free = 1;
                                              |    ...
                                              |
                                              |if (easy_to_free) {
                                              |  mmdrop(mm);
                                              |  ...
                                              |
                                              |So this mm_struct may be freed in the mmput().
                                              |
    up_read(&mm->mmap_sem);                   |

    As we can see above, the ksmd thread may access a mm_struct that has
    already been freed to the kmem_cache. Suppose a fork then gets this
    mm_struct from the kmem_cache; when the ksmd thread calls
    up_read(&mm->mmap_sem), it will cause mmap_sem.count to become -1.

    As suggested by Andrea Arcangeli, unmerge_and_remove_all_rmap_items has
    the same SMP race condition, so fix it too. My previous fix in
    scan_get_next_rmap_item would introduce a different SMP race condition,
    so just invert the up_read/spin_unlock order, as Andrea Arcangeli said.
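
    A hedged sketch of the inverted ordering in scan_get_next_rmap_item()
    (heavily simplified; only the ordering is the point):

        spin_lock(&ksm_mmlist_lock);
        /* ... advance ksm_scan.mm_slot past this slot ... */

        /* Release mmap_sem *before* dropping ksm_mmlist_lock, so that
         * ksm_exit() (which must take ksm_mmlist_lock first) cannot free
         * the mm_struct while we still hold its mmap_sem for read. */
        up_read(&mm->mmap_sem);
        spin_unlock(&ksm_mmlist_lock);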

    Link: http://lkml.kernel.org/r/1462708815-31301-1-git-send-email-zhouchengming1@huawei.com
    Signed-off-by: Zhou Chengming
    Suggested-by: Andrea Arcangeli
    Reviewed-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Vlastimil Babka
    Cc: Geliang Tang
    Cc: Minchan Kim
    Cc: Hanjun Guo
    Cc: Ding Tianhong
    Cc: Li Bin
    Cc: Zhen Lei
    Cc: Xishi Qiu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhou Chengming
     

19 Feb, 2016

1 commit

  • We try to enforce protection keys in software the same way that we
    do in hardware. (See long example below).

    But, we only want to do this when accessing our *own* process's
    memory. If GDB set PKRU[6].AD=1 (disable access to PKEY 6), then
    tried to PTRACE_POKE a target process which just happened to have
    some mprotect_pkey(pkey=6) memory, we do *not* want to deny the
    debugger access to that memory. PKRU is fundamentally a
    thread-local structure and we do not want to enforce it on access
    to _another_ thread's data.

    This gets especially tricky when we have workqueues or other
    delayed-work mechanisms that might run in a random process's context.
    We can check that we only enforce pkeys when operating on our *own* mm,
    but delayed work gets performed when a random user context is active.
    We might end up with a situation where a delayed-work gup fails when
    running randomly under its "own" task but succeeds when running under
    another process. We want to avoid that.

    To avoid that, we use the new GUP flag: FOLL_REMOTE, and add a
    fault flag: FAULT_FLAG_REMOTE. They indicate that we are
    walking an mm which is not guaranteed to be the same as
    current->mm and should not be subject to protection key
    enforcement.
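
    A minimal, hedged sketch of the decision the flag encodes; the helper
    name is hypothetical, not the kernel's actual function:

        /* hypothetical helper: enforce PKRU only for faults on our own mm */
        static bool enforce_pkeys(struct mm_struct *mm, unsigned int fault_flags)
        {
                if (fault_flags & FAULT_FLAG_REMOTE)
                        return false;   /* gup/fault on another mm: skip pkeys */
                return mm == current->mm;
        }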

    Thanks to Jerome Glisse for pointing out this scenario.

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Alexey Kardashevskiy
    Cc: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Boaz Harrosh
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: David Gibson
    Cc: Denys Vlasenko
    Cc: Dominik Dingel
    Cc: Dominik Vogt
    Cc: Eric B Munson
    Cc: Geliang Tang
    Cc: Guan Xuetao
    Cc: H. Peter Anvin
    Cc: Heiko Carstens
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: Jason Low
    Cc: Jerome Marchand
    Cc: Joerg Roedel
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Laurent Dufour
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michael Ellerman
    Cc: Michal Hocko
    Cc: Mikulas Patocka
    Cc: Minchan Kim
    Cc: Oleg Nesterov
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Sasha Levin
    Cc: Shachar Raindel
    Cc: Vlastimil Babka
    Cc: Xie XiuQi
    Cc: iommu@lists.linux-foundation.org
    Cc: linux-arch@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: linux-s390@vger.kernel.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Signed-off-by: Ingo Molnar

    Dave Hansen
     

16 Feb, 2016

1 commit

  • We will soon modify the vanilla get_user_pages() so it can no
    longer be used on mm/tasks other than 'current/current->mm',
    which is by far the most common way it is called. For now,
    we allow the old-style calls, but warn when they are used.
    (implemented in previous patch)

    This patch switches all callers of:

    get_user_pages()
    get_user_pages_unlocked()
    get_user_pages_locked()

    to stop passing tsk/mm so they will no longer see the warnings.
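
    A hedged before/after sketch of a typical call-site conversion; the
    trailing write/force arguments are shown as separate ints, which is an
    assumption about the get_user_pages() prototype of this era:

        /* old style (now warns): explicit tsk/mm */
        get_user_pages(current, current->mm, addr, 1, 1 /*write*/, 0 /*force*/,
                       &page, NULL);

        /* new style: implicitly current/current->mm */
        get_user_pages(addr, 1, 1 /*write*/, 0 /*force*/, &page, NULL);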

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Kirill A. Shutemov
    Cc: Linus Torvalds
    Cc: Naoya Horiguchi
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Srikar Dronamraju
    Cc: Vlastimil Babka
    Cc: jack@suse.cz
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160212210156.113E9407@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     

16 Jan, 2016

4 commits

  • The MADV_FREE patchset changes page reclaim to simply free a clean
    anonymous page with no dirty ptes, instead of swapping it out; but KSM
    uses clean write-protected ptes to reference the stable ksm page. So be
    sure to mark that page dirty, so it's never mistakenly discarded.
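
    A hedged sketch of the essence of the change; a natural place for it
    (assumed here) is where KSM write-protects a pte while merging:

        /* if the pte carried the dirty bit, move that information onto the
         * page itself before write-protecting, so MADV_FREE-style reclaim
         * never sees the stable KSM page as clean and discardable */
        if (pte_dirty(entry))
                set_page_dirty(page);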

    [hughd@google.com: adjusted comments]
    Signed-off-by: Minchan Kim
    Acked-by: Hugh Dickins
    Cc: "James E.J. Bottomley"
    Cc: "Kirill A. Shutemov"
    Cc: Shaohua Li
    Cc:
    Cc: Andrea Arcangeli
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Chen Gang
    Cc: Chris Zankel
    Cc: Daniel Micay
    Cc: Darrick J. Wong
    Cc: David S. Miller
    Cc: Helge Deller
    Cc: Ivan Kokshaysky
    Cc: Jason Evans
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: Kirill A. Shutemov
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Michal Hocko
    Cc: Mika Penttil
    Cc: Ralf Baechle
    Cc: Richard Henderson
    Cc: Rik van Riel
    Cc: Roland Dreier
    Cc: Russell King
    Cc: Shaohua Li
    Cc: Will Deacon
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • We don't need special code to stabilize THP. If you've got a reference
    to any subpage of a THP, it will not be split under you.

    The new split_huge_page() also accepts tail pages: there is no need for
    special code to get a reference to the head page.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We're going to allow mapping of individual 4k pages of a THP compound
    page. It means we cannot rely on a PageTransHuge() check to decide
    whether to map/unmap a small page or a THP.

    The patch adds a new argument to the rmap functions to indicate whether
    we want to operate on the whole compound page or only the small page.
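
    A hedged sketch of the resulting calling convention, using the anon-rmap
    helper as the example (exact prototypes may differ across the affected
    functions):

        /* map only this 4k subpage */
        page_add_anon_rmap(page, vma, address, false);

        /* map the whole compound (THP) page */
        page_add_anon_rmap(page, vma, address, true);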

    [n-horiguchi@ah.jp.nec.com: fix mapcount mismatch in hugepage migration]
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • lock_page() must operate on the whole compound page. It doesn't make
    much sense to lock part of a compound page. Change the code to use the
    head page's PG_locked if a tail page is passed.

    This patch also gets rid of the custom helper functions
    __set_page_locked() and __clear_page_locked(). They are replaced with
    helpers generated by __SETPAGEFLAG/__CLEARPAGEFLAG. Passing a tail page
    to these helpers would trigger a VM_BUG_ON().

    SLUB uses PG_locked as a bit spinlock. IIUC, tail pages should never
    appear there. A VM_BUG_ON() is added to make sure that this assumption
    is correct.
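
    A hedged sketch of the redirection described above; the function name is
    hypothetical and this is not the exact pagemap.h code:

        static inline void lock_page_any(struct page *page)
        {
                page = compound_head(page);     /* tail pages lock the head */
                might_sleep();
                if (!trylock_page(page))
                        __lock_page(page);
        }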

    [akpm@linux-foundation.org: fix fs/cifs/file.c]
    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Cc: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

15 Jan, 2016

1 commit


06 Nov, 2015

5 commits

  • get_mergeable_page() can only return NULL (also in case of errors) or
    the pinned mergeable page; it never returns an error value distinct from
    NULL. This optimizes away the unnecessary error check.

    Add a return after the "out:" label in the callee to make it more
    readable.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Cc: Petr Holasek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Doing the VM_MERGEABLE check after the page == kpage check won't provide
    any meaningful benefit. The !vma->anon_vma check of find_mergeable_vma()
    is the only superfluous bit in using find_mergeable_vma(), because the
    !PageAnon check of try_to_merge_one_page() implicitly checks for that;
    but it still looks cleaner to share the same find_mergeable_vma().

    Signed-off-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Cc: Petr Holasek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This just uses the helper function to cleanup the assumption on the
    hlist_node internals.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Cc: Petr Holasek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • The stable_nodes can become stale at any time if the underlying pages
    get freed. The stable_node gets collected and removed from the stable
    rbtree if that is detected during the rbtree lookups.

    Don't fail the lookup if running into stale stable_nodes; just restart
    the lookup after collecting the stale stable_nodes. Otherwise the CPU
    spent in the preparation stage is wasted and the lookup must be repeated
    at the next loop, potentially failing a second time on a second stale
    stable_node.

    If we don't prune aggressively we delay the merging of the unstable node
    candidates and at the same time we delay the freeing of the stale
    stable_nodes. Keeping stale stable_nodes around wastes memory and it
    can't provide any benefit.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Cc: Petr Holasek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • While at it add it to the file and anon walks too.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Cc: Petr Holasek
    Acked-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

16 Apr, 2015

1 commit

  • We converted some of the usages of ACCESS_ONCE to READ_ONCE in the mm/
    tree since it doesn't work reliably on non-scalar types.

    This patch removes the rest of the usages of ACCESS_ONCE, and uses the
    new READ_ONCE API for the read accesses. This makes things cleaner,
    instead of using separate/multiple sets of APIs.

    Signed-off-by: Jason Low
    Acked-by: Michal Hocko
    Acked-by: Davidlohr Bueso
    Acked-by: Rik van Riel
    Reviewed-by: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Low
     

11 Feb, 2015

1 commit


30 Jan, 2015

1 commit

  • The core VM already knows about VM_FAULT_SIGBUS, but cannot return a
    "you should SIGSEGV" error, because the SIGSEGV case was generally
    handled by the caller - usually the architecture fault handler.

    That results in lots of duplication - all the architecture fault
    handlers end up doing very similar "look up vma, check permissions, do
    retries etc" - but it generally works. However, there are cases where
    the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV.

    In particular, when accessing the stack guard page, libsigsegv expects a
    SIGSEGV. And it usually got one, because the stack growth is handled by
    that duplicated architecture fault handler.

    However, when the generic VM layer started propagating the error return
    from the stack expansion in commit fee7e49d4514 ("mm: propagate error
    from stack expansion even for guard page"), that now exposed the
    existing VM_FAULT_SIGBUS result to user space. And user space really
    expected SIGSEGV, not SIGBUS.

    To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all those
    duplicate architecture fault handlers about it. They all already have
    the code to handle SIGSEGV, so it's about just tying that new return
    value to the existing code, but it's all a bit annoying.

    This is the mindless minimal patch to do this. A more extensive patch
    would be to try to gather up the mostly shared fault handling logic into
    one generic helper routine, and long-term we really should do that
    cleanup.

    Just from this patch, you can generally see that most architectures just
    copied (directly or indirectly) the old x86 way of doing things, but in
    the meantime that original x86 model has been improved to hold the VM
    semaphore for shorter times etc and to handle VM_FAULT_RETRY and other
    "newer" things, so it would be a good idea to bring all those
    improvements to the generic case and teach other architectures about
    them too.
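
    A hedged sketch of the per-architecture handling being wired up; the
    label names follow the typical arch fault-handler pattern rather than
    any one architecture's exact code:

        if (unlikely(fault & VM_FAULT_ERROR)) {
                if (fault & VM_FAULT_OOM)
                        goto out_of_memory;
                else if (fault & VM_FAULT_SIGSEGV)
                        goto bad_area;          /* new: deliver SIGSEGV */
                else if (fault & VM_FAULT_SIGBUS)
                        goto do_sigbus;
                BUG();
        }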

    Reported-and-tested-by: Takashi Iwai
    Tested-by: Jan Engelhardt
    Acked-by: Heiko Carstens # "s390 still compiles and boots"
    Cc: linux-arch@vger.kernel.org
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

13 Nov, 2014

1 commit

  • Add calls to the new mmu_notifier_invalidate_range() function to all
    places in the VMM that need it.

    Signed-off-by: Joerg Roedel
    Reviewed-by: Andrea Arcangeli
    Reviewed-by: Jérôme Glisse
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Jay Cornwall
    Cc: Oded Gabbay
    Cc: Suravee Suthikulpanit
    Cc: Jesse Barnes
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Oded Gabbay

    Joerg Roedel
     

10 Oct, 2014

1 commit


16 Jul, 2014

1 commit

  • The current "wait_on_bit" interface requires an 'action'
    function to be provided which does the actual waiting.
    There are over 20 such functions, many of them identical.
    Most cases can be satisfied by one of just two functions, one
    which uses io_schedule() and one which just uses schedule().

    So:
    Rename wait_on_bit and wait_on_bit_lock to
    wait_on_bit_action and wait_on_bit_lock_action
    to make it explicit that they need an action function.

    Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
    which are *not* given an action function but implicitly use
    a standard one.
    The decision to error-out if a signal is pending is now made
    based on the 'mode' argument rather than being encoded in the action
    function.

    All instances of the old wait_on_bit and wait_on_bit_lock which
    can use the new version have been changed accordingly and their
    action functions have been discarded.
    wait_on_bit{_lock} does not return any specific error code in the
    event of a signal so the caller must check for non-zero and
    interpolate their own error code as appropriate.
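
    A hedged sketch of the new-style usage described above; the flags word
    and bit number are illustrative:

        /* no action function: the mode alone decides whether signals abort */
        err = wait_on_bit(&obj->flags, MYOBJ_BUSY_BIT, TASK_INTERRUPTIBLE);
        if (err)
                return -EINTR;  /* caller interpolates its own error code */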

    The wait_on_bit() call in __fscache_wait_on_invalidate() was
    ambiguous as it specified TASK_UNINTERRUPTIBLE but used
    fscache_wait_bit_interruptible as an action function.
    David Howells confirms this should be uniformly
    "uninterruptible"

    The main remaining user of wait_on_bit{,_lock}_action is NFS
    which needs to use a freezer-aware schedule() call.

    A comment in fs/gfs2/glock.c notes that having multiple 'action'
    functions is useful as they display differently in the 'wchan'
    field of 'ps'. (and /proc/$PID/wchan).
    As the new bit_wait{,_io} functions are tagged "__sched", they
    will not show up at all, but something higher in the stack. So
    the distinction will still be visible, only with different
    function names (gfs2_glock_wait versus gfs2_glock_dq_wait in the
    gfs2/glock.c case).

    Since the first version of this patch (against 3.15) two new action
    functions appeared, one in NFS and one in CIFS. CIFS also now
    uses an action function that makes the same freezer-aware
    schedule call as NFS.

    Signed-off-by: NeilBrown
    Acked-by: David Howells (fscache, keys)
    Acked-by: Steven Whitehouse (gfs2)
    Acked-by: Peter Zijlstra
    Cc: Oleg Nesterov
    Cc: Steve French
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
    Signed-off-by: Ingo Molnar

    NeilBrown
     

24 Jun, 2014

1 commit

  • Trinity has reported:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
    IP: __lock_acquire (kernel/locking/lockdep.c:3070 (discriminator 1))
    CPU: 6 PID: 16173 Comm: trinity-c364 Tainted: G W
    3.15.0-rc1-next-20140415-sasha-00020-gaa90d09 #398
    lock_acquire (arch/x86/include/asm/current.h:14
    kernel/locking/lockdep.c:3602)
    _raw_spin_lock (include/linux/spinlock_api_smp.h:143
    kernel/locking/spinlock.c:151)
    remove_migration_pte (mm/migrate.c:137)
    rmap_walk (mm/rmap.c:1628 mm/rmap.c:1699)
    remove_migration_ptes (mm/migrate.c:224)
    migrate_pages (mm/migrate.c:922 mm/migrate.c:960 mm/migrate.c:1126)
    migrate_misplaced_page (mm/migrate.c:1733)
    __handle_mm_fault (mm/memory.c:3762 mm/memory.c:3812 mm/memory.c:3925)
    handle_mm_fault (mm/memory.c:3948)
    __get_user_pages (mm/memory.c:1851)
    __mlock_vma_pages_range (mm/mlock.c:255)
    __mm_populate (mm/mlock.c:711)
    SyS_mlockall (include/linux/mm.h:1799 mm/mlock.c:817 mm/mlock.c:791)

    I believe this comes about because, whereas collapsing and splitting THP
    functions take anon_vma lock in write mode (which excludes concurrent
    rmap walks), faulting THP functions (write protection and misplaced
    NUMA) do not - and mostly they do not need to.

    But they do use a pmdp_clear_flush(), set_pmd_at() sequence which, for
    an instant (indeed, for a long instant, given the inter-CPU TLB flush in
    there), leaves *pmd neither present nor trans_huge.

    Which can confuse a concurrent rmap walk, as when removing migration
    ptes, seen in the dumped trace. Although that rmap walk has a 4k page
    to insert, anon_vmas containing THPs are in no way segregated from
    4k-page anon_vmas, so the 4k-intent mm_find_pmd() does need to cope with
    that instant when a trans_huge pmd is temporarily absent.

    I don't think we need to strengthen the locking at the THP end: it's
    easily handled with an ACCESS_ONCE() before testing both conditions.

    And since mm_find_pmd() had only one caller who wanted a THP rather than
    a pmd, let's slightly repurpose it to fail when it hits a THP or
    non-present pmd, and open code split_huge_page_address() again.
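
    A hedged sketch of the repurposed check in mm_find_pmd() (simplified):

        pmd_t pmde = ACCESS_ONCE(*pmd); /* snapshot: *pmd may change under us */

        /* treat a temporarily-absent or trans_huge pmd as "no pmd": the
         * 4k-intent callers only want a genuine page-table pmd */
        if (!pmd_present(pmde) || pmd_trans_huge(pmde))
                return NULL;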

    Signed-off-by: Hugh Dickins
    Reported-by: Sasha Levin
    Acked-by: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Mel Gorman
    Cc: Bob Liu
    Cc: Christoph Lameter
    Cc: Dave Jones
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

04 Mar, 2014

1 commit

  • Commit bf6bddf1924e ("mm: introduce compaction and migration for
    ballooned pages") introduces page_count(page) into memory compaction
    which dereferences page->first_page if PageTail(page).

    This results in a very rare NULL pointer dereference on the
    aforementioned page_count(page). Indeed, anything that does
    compound_head(), including page_count() is susceptible to racing with
    prep_compound_page() and seeing a NULL or dangling page->first_page
    pointer.

    This patch uses Andrea's implementation of compound_trans_head() that
    deals with such a race and makes it the default compound_head()
    implementation. This includes a read memory barrier that ensures that
    if PageTail(head) is true that we return a head page that is neither
    NULL nor dangling. The patch then adds a store memory barrier to
    prep_compound_page() to ensure page->first_page is set.

    This is the safest way to ensure we see the head page that we are
    expecting, PageTail(page) is already in the unlikely() path and the
    memory barriers are unfortunately required.

    Hugetlbfs is the exception, we don't enforce a store memory barrier
    during init since no race is possible.
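
    A hedged sketch of the race-safe pattern described above (close to, but
    not necessarily identical with, the resulting compound_head()):

        static inline struct page *compound_head_safe(struct page *page)
        {
                if (unlikely(PageTail(page))) {
                        struct page *head = page->first_page;

                        /* pairs with the store barrier in prep_compound_page():
                         * recheck PageTail so a stale first_page is never used */
                        smp_rmb();
                        if (likely(PageTail(page)))
                                return head;
                }
                return page;
        }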

    Signed-off-by: David Rientjes
    Cc: Holger Kiehl
    Cc: Christoph Lameter
    Cc: Rafael Aquini
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: "Kirill A. Shutemov"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

24 Jan, 2014

2 commits

  • Code that is obj-y (always built-in) or dependent on a bool Kconfig
    (built-in or absent) can never be modular. So using module_init as an
    alias for __initcall can be somewhat misleading.

    Fix these up now, so that we can relocate module_init from init.h into
    module.h in the future. If we don't do this, we'd have to add module.h
    to obviously non-modular code, and that would be a worse thing.

    The audit targets the following module_init users for change:
    mm/ksm.c bool KSM
    mm/mmap.c bool MMU
    mm/huge_memory.c bool TRANSPARENT_HUGEPAGE
    mm/mmu_notifier.c bool MMU_NOTIFIER

    Note that direct use of __initcall is discouraged, vs. one of the
    priority categorized subgroups. As __initcall gets mapped onto
    device_initcall, our use of subsys_initcall (which makes sense for these
    files) will thus change this registration from level 6-device to level
    4-subsys (i.e. slightly earlier).

    However no observable impact of that difference has been observed during
    testing.

    One might think that core_initcall (l2) or postcore_initcall (l3) would
    be more appropriate for anything in mm/ but if we look at some actual
    init functions themselves, we see things like:

    mm/huge_memory.c --> hugepage_init --> hugepage_init_sysfs
    mm/mmap.c --> init_user_reserve --> sysctl_user_reserve_kbytes
    mm/ksm.c --> ksm_init --> sysfs_create_group

    and hence the choice of subsys_initcall (l4) seems reasonable, and at
    the same time minimizes the risk of changing the priority too
    drastically all at once. We can adjust further in the future.

    Also, several instances of missing ";" at EOL are fixed.
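
    A hedged sketch of the kind of change this makes in each file, with
    mm/ksm.c as the example:

        /* before: misleading for always-built-in code */
        module_init(ksm_init);

        /* after: explicit initcall level, no modular implication */
        subsys_initcall(ksm_init);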

    Signed-off-by: Paul Gortmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Gortmaker
     
  • Most of the VM_BUG_ON assertions are performed on a page. Usually, when
    one of these assertions fails we'll get a BUG_ON with a call stack and
    the registers.

    Based on recent requests to add a small piece of code that dumps the
    page at various VM_BUG_ON sites, I've noticed that the page dump is
    quite useful to people debugging issues in mm.

    This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what
    VM_BUG_ON() does, also dumps the page before executing the actual
    BUG_ON.
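
    A hedged sketch of such a macro; the exact dump_page() arguments are an
    assumption, though the "page dumped because: VM_BUG_ON_PAGE(...)" line
    in the crash report quoted at the top of this log suggests the condition
    is passed along as the reason string:

        #define VM_BUG_ON_PAGE(cond, page)                              \
        do {                                                            \
                if (unlikely(cond)) {                                   \
                        dump_page(page, "VM_BUG_ON_PAGE(" #cond ")");   \
                        BUG();                                          \
                }                                                       \
        } while (0)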

    [akpm@linux-foundation.org: fix up includes]
    Signed-off-by: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

22 Jan, 2014

5 commits

  • Now we have an infrastructure in rmap_walk() to handle the differences
    among the variants of the rmap traversal functions.

    So, just use it in page_referenced().

    In this patch, I change the following things:

    1. remove some variants of rmap traversing functions.
    cf> page_referenced_ksm, page_referenced_anon,
    page_referenced_file

    2. introduce a new struct page_referenced_arg and pass it to
    page_referenced_one(), the main function for rmap_walk, in order to
    count references, store vm_flags and check the finish condition.

    3. mechanical change to use rmap_walk() in page_referenced().

    [liwanp@linux.vnet.ibm.com: fix BUG at rmap_walk]
    Signed-off-by: Joonsoo Kim
    Reviewed-by: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Hillf Danton
    Signed-off-by: Wanpeng Li
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Now we have an infrastructure in rmap_walk() to handle the differences
    among the variants of the rmap traversal functions.

    So, just use it in try_to_munlock().

    In this patch, I change the following things:

    1. remove some variants of rmap traversing functions.
    cf> try_to_unmap_ksm, try_to_unmap_anon, try_to_unmap_file
    2. mechanical change to use rmap_walk() in try_to_munlock().
    3. copy and paste comments.

    Signed-off-by: Joonsoo Kim
    Reviewed-by: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Now we have an infrastructure in rmap_walk() to handle the differences
    among the variants of the rmap traversal functions.

    So, just use it in try_to_unmap().

    In this patch, I change the following things:

    1. enable rmap_walk() if !CONFIG_MIGRATION.
    2. mechanical change to use rmap_walk() in try_to_unmap().

    Signed-off-by: Joonsoo Kim
    Reviewed-by: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • There are a lot of common parts in the traversal functions, but there
    are also a few uncommon parts. By assigning the proper function pointers
    in each rmap_walk_control, we can handle these differences correctly.

    The following are the differences we should handle:

    1. difference of lock function in the anon mapping case
    2. nonlinear handling in the file mapping case
    3. prechecked conditions:
    checking memcg in page_referenced(),
    checking VM_SHARED in page_mkclean(),
    checking temporary vma in try_to_unmap()
    4. exit condition:
    checking page_mapped() in try_to_unmap()

    So, in this patch, I introduce 4 function pointers to handle the above
    differences; a sketch of the resulting structure follows.
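
    A hedged sketch of the resulting control structure; field names and
    signatures are approximate, not necessarily the exact kernel
    definitions:

        struct rmap_walk_control {
                void *arg;      /* private data, e.g. a page_referenced_arg */

                /* per-vma work (the "rmap_one" step) */
                int (*rmap_one)(struct page *page, struct vm_area_struct *vma,
                                unsigned long address, void *arg);

                /* exit condition, e.g. page no longer mapped */
                int (*done)(struct page *page);

                /* anon case: how to take the anon_vma lock */
                struct anon_vma *(*anon_lock)(struct page *page);

                /* precheck: skip vmas that cannot be relevant */
                bool (*invalid_vma)(struct vm_area_struct *vma, void *arg);
        };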

    Signed-off-by: Joonsoo Kim
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • In each rmap traversal case there are some differences, so we need
    function pointers, and arguments to pass to them, in order to handle
    those differences.

    For this purpose, struct rmap_walk_control is introduced in this patch,
    and will be extended in a following patch. Introducing and extending are
    kept separate because that clarifies the changes.

    Signed-off-by: Joonsoo Kim
    Reviewed-by: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

13 Nov, 2013

1 commit


12 Sep, 2013

1 commit


09 Mar, 2013

1 commit

  • A CONFIG_DISCONTIGMEM=y m68k config gave

    mm/ksm.c: In function `get_kpfn_nid':
    mm/ksm.c:492: error: implicit declaration of function `pfn_to_nid'

    linux/mmzone.h declares it for CONFIG_SPARSEMEM and CONFIG_FLATMEM, but
    expects the arch's asm/mmzone.h to declare it for CONFIG_DISCONTIGMEM
    (see arch/mips/include/asm/mmzone.h for example).

    Or perhaps it is only expected when CONFIG_NUMA=y: too much of a maze,
    and m68k got away without it so far, so fix the build in mm/ksm.c.

    Signed-off-by: Hugh Dickins
    Reported-by: Geert Uytterhoeven
    Cc: Petr Holasek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

28 Feb, 2013

1 commit

  • I'm not sure why, but the hlist for-each-entry iterators were conceived
    differently from the list ones, which simply take:

    list_for_each_entry(pos, head, member)

    The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

    Why did they need an extra pos parameter? I'm not quite sure. Not only
    they don't really need it, it also prevents the iterator from looking
    exactly like the list iterator, which is unfortunate.
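
    A hedged before/after sketch of the conversion this series performs;
    the structure, member and helper names are illustrative:

        /* before: a separate struct hlist_node *pos cursor was required */
        struct hlist_node *pos;
        struct item *it;
        hlist_for_each_entry(it, pos, &bucket, hash_node)
                process(it);

        /* after: iterate directly over the containing objects */
        hlist_for_each_entry(it, &bucket, hash_node)
                process(it);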

    Besides the semantic patch, there was some manual work required:

    - Fix up the actual hlist iterators in linux/list.h
    - Fix up the declaration of other iterators based on the hlist ones.
    - A very small amount of places were using the 'node' parameter, this
    was modified to use 'obj->member' instead.
    - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
    properly, so those had to be fixed up manually.

    The semantic patch which is mostly the work of Peter Senna Tschudin is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;

    [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
    [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: fix warnings]
    [akpm@linux-foudnation.org: redo intrusive kvm changes]
    Tested-by: Peter Senna Tschudin
    Acked-by: Paul E. McKenney
    Signed-off-by: Sasha Levin
    Cc: Wu Fengguang
    Cc: Marcelo Tosatti
    Cc: Gleb Natapov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

24 Feb, 2013

3 commits

  • It is a pity to have MAX_NUMNODES+MAX_NUMNODES tree roots statically
    allocated, particularly when very few users will ever actually tune
    merge_across_nodes 0 to use more than 1+1 of those trees. Not a big
    deal (only 16kB wasted on each machine with CONFIG_MAXSMP), but a pity.

    Start off with 1+1 statically allocated, then if merge_across_nodes is
    ever tuned, allocate for nr_node_ids+nr_node_ids. Do not attempt to
    free up the extra if it's tuned back, that would be a waste of effort.

    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Petr Holasek
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • In "ksm: remove old stable nodes more thoroughly" I said that I'd never
    seen its WARN_ON_ONCE(page_mapped(page)). True at the time of writing,
    but it soon appeared once I tried fuller tests on the whole series.

    It turned out to be due to the KSM page migration itself:
    unmerge_and_remove_all_rmap_items() failed to locate and replace all the
    KSM pages, because of that hiatus in page migration when the old pte has
    been replaced by a migration entry, but not yet by the new pte.
    follow_page() finds no page at that instant, but a KSM page reappears
    shortly after, without a fault.

    Add FOLL_MIGRATION flag, so follow_page() can do migration_entry_wait()
    for KSM's break_cow(). I'd have preferred to avoid another flag, and do
    it every time, in case someone else makes the same easy mistake; but did
    not find another transgressor (the common get_user_pages() is of course
    safe), and cannot be sure that every follow_page() caller is prepared to
    sleep - ia64's xencomm_vtop()? Now, THP's wait_split_huge_page() can
    already sleep there, since anon_vma locking was changed to mutex, but
    maybe that's somehow excluded.
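
    A hedged sketch of the behaviour FOLL_MIGRATION asks for inside
    follow_page() (simplified; not the exact control flow, and the local
    variable and label names are assumptions):

        if (!pte_present(pte)) {
                swp_entry_t entry = pte_to_swp_entry(pte);

                /* with FOLL_MIGRATION, wait for the migration entry to be
                 * resolved and retry, instead of reporting "no page" */
                if ((flags & FOLL_MIGRATION) && is_migration_entry(entry)) {
                        pte_unmap_unlock(ptep, ptl);
                        migration_entry_wait(mm, pmd, address);
                        goto retry;
                }
                goto no_page;
        }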

    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Petr Holasek
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Think of struct rmap_item as an extension of struct page (restricted to
    MADV_MERGEABLE areas): there may be a lot of them, we need to keep them
    small, especially on 32-bit architectures of limited lowmem.

    Siting "int nid" after "unsigned int checksum" works nicely on 64-bit,
    making no change to its 64-byte struct rmap_item; but bloats the 32-bit
    struct rmap_item from (nicely cache-aligned) 32 bytes to 36 bytes, which
    rounds up to 40 bytes once allocated from slab. We'd better avoid that.

    Hey, I only just remembered that the anon_vma pointer in struct
    rmap_item has no purpose until the rmap_item is hung from a stable tree
    node (which has its own nid field); and rmap_item's nid field has no
    purpose other than to say which tree root to tell rb_erase() when
    unlinking from an unstable tree.

    Double them up in a union. There's just one place where we set anon_vma
    early (when we already hold mmap_sem): now we must remove tree_rmap_item
    from its unstable tree there, before overwriting nid. No need to
    spatter BUG()s around: we'd be seeing oopses if this were wrong.
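
    A hedged sketch of the union described above (heavily trimmed; the other
    fields of struct rmap_item, and any config-dependence of nid, are
    elided):

        struct rmap_item {
                /* ... */
                union {
                        struct anon_vma *anon_vma; /* when hung from stable tree */
                        int nid;                   /* root of its unstable tree */
                };
                /* ... */
        };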

    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Petr Holasek
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins