11 Feb, 2020

1 commit

  • commit 5984fabb6e82d9ab4e6305cb99694c85d46de8ae upstream.

    Since commit a49bd4d71637 ("mm, numa: rework do_pages_move"), the
    semantics of move_pages() have changed to return the number of
    non-migrated pages when the failures were the result of non-fatal
    reasons (usually a busy page).

    This was an unintentional change that hasn't been noticed except for LTP
    tests which checked for the documented behavior.

    There are two ways to handle this change. We could go back to the
    original behavior and return -EAGAIN whenever migrate_pages is not able
    to migrate pages due to non-fatal reasons. The other option is to keep
    the changed semantics and extend the move_pages documentation to
    clarify that it returns either -errno on invalid input or when
    migration simply cannot succeed (e.g. -ENOMEM, -EBUSY), or the number
    of pages that couldn't be migrated due to ephemeral reasons (e.g. the
    page is pinned or locked for other reasons).

    This patch implements the second option because this behavior has been
    in place for some time without anybody complaining, and new users may
    already depend on it. It also allows slightly easier error handling,
    as the caller knows that it is worth retrying when err > 0.

    But since under the new semantics migration is aborted immediately when
    it fails for ephemeral reasons, the number of non-attempted pages needs
    to be included in the return value too.
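
    A minimal userspace sketch of the resulting error handling, assuming a
    libnuma/numaif.h environment (the wrapper name and retry policy are
    illustrative, not part of the patch):

    #include <numaif.h>

    /* Retry while move_pages() reports that some pages could not be
     * migrated for ephemeral reasons (return value > 0). */
    static long move_pages_retry(int pid, unsigned long count, void **pages,
                                 const int *nodes, int *status, int retries)
    {
            long ret;

            do {
                    ret = move_pages(pid, count, pages, nodes, status,
                                     MPOL_MF_MOVE);
            } while (ret > 0 && retries-- > 0);

            return ret;     /* 0, a leftover page count, or -errno */
    }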

    Link: http://lkml.kernel.org/r/1580160527-109104-1-git-send-email-yang.shi@linux.alibaba.com
    Fixes: a49bd4d71637 ("mm, numa: rework do_pages_move")
    Signed-off-by: Yang Shi
    Suggested-by: Michal Hocko
    Acked-by: Michal Hocko
    Reviewed-by: Wei Yang
    Cc: [4.17+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Yang Shi
     

06 Feb, 2020

1 commit

  • [ Upstream commit dfe9aa23cab7880a794db9eb2d176c00ed064eb6 ]

    If we get here after successfully adding a page to the list, err is 1,
    indicating the page is queued in the list.

    The current code has two problems:

    * on success, 0 is not returned
    * on error, if add_page_for_migration() returns 1 and the subsequent
    err1 from do_move_pages_to_node() is set, err1 is not returned because
    err is 1

    Both of these behaviors break the user interface.
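
    A rough sketch of the corrected tail of do_pages_move() (variable names
    follow the description above; this is not the literal mainline diff):

    out_flush:
            /* flush anything still queued on the pagelist */
            err1 = do_move_pages_to_node(mm, &pagelist, current_node);
            if (!err1)
                    err1 = store_status(status, start, current_node, i - start);
            /* a "queued" marker (err == 1) must not hide err1 or the final 0 */
            if (err >= 0)
                    err = err1;
    out:
            return err;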

    Link: http://lkml.kernel.org/r/20200119065753.21694-1-richardw.yang@linux.intel.com
    Fixes: e0153fc2c760 ("mm: move_pages: return valid node id in status if the page is already on the target node").
    Signed-off-by: Wei Yang
    Acked-by: Yang Shi
    Cc: John Hubbard
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Wei Yang
     

09 Jan, 2020

1 commit

  • commit e0153fc2c7606f101392b682e720a7a456d6c766 upstream.

    Felix Abecassis reports that move_pages() returns random status values
    when the pages are already on the target node, using the test program
    below:

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <numaif.h>

    int main(void)
    {
            const long node_id = 1;
            const long page_size = sysconf(_SC_PAGESIZE);
            const int64_t num_pages = 8;

            unsigned long nodemask = 1 << node_id;
            long ret = set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask));
            if (ret < 0)
                    return (EXIT_FAILURE);

            void **pages = malloc(sizeof(void*) * num_pages);
            for (int i = 0; i < num_pages; ++i) {
                    pages[i] = mmap(NULL, page_size, PROT_WRITE | PROT_READ,
                                    MAP_PRIVATE | MAP_POPULATE | MAP_ANONYMOUS,
                                    -1, 0);
                    if (pages[i] == MAP_FAILED)
                            return (EXIT_FAILURE);
            }

            ret = set_mempolicy(MPOL_DEFAULT, NULL, 0);
            if (ret < 0)
                    return (EXIT_FAILURE);

            int *nodes = malloc(sizeof(int) * num_pages);
            int *status = malloc(sizeof(int) * num_pages);
            for (int i = 0; i < num_pages; ++i) {
                    nodes[i] = node_id;
                    status[i] = 0xd0;       /* simulate garbage values */
            }

            ret = move_pages(0, num_pages, pages, nodes, status, MPOL_MF_MOVE);
            printf("move_pages: %ld\n", ret);
            for (int i = 0; i < num_pages; ++i)
                    printf("status[%d] = %d\n", i, status[i]);
    }

    Running the program then prints nonsense status values:

    $ ./move_pages_bug
    move_pages: 0
    status[0] = 208
    status[1] = 208
    status[2] = 208
    status[3] = 208
    status[4] = 208
    status[5] = 208
    status[6] = 208
    status[7] = 208

    This is because the status is not set if the page is already on the
    target node, but move_pages() should return a valid status as long as
    it succeeds. The valid status may be an errno or a node id.

    We can't simply initialize the status array to zero since the pages may
    not be on node 0. Fix it by updating the status with the node id that
    the page is already on.
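
    A minimal sketch of the idea in do_pages_move(), using the helper names
    from mm/migrate.c (simplified, not the literal mainline diff):

            err = add_page_for_migration(mm, addr, current_node,
                                         &pagelist, flags & MPOL_MF_MOVE_ALL);
            if (err > 0) {
                    /* the page was successfully queued for migration */
                    continue;
            }

            /*
             * If the page is already on the target node (err == 0), store
             * that node id in status[] instead of leaving garbage behind;
             * otherwise store the error.
             */
            err = store_status(status, i, err ? : current_node, 1);
            if (err)
                    goto out_flush;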

    Link: http://lkml.kernel.org/r/1575584353-125392-1-git-send-email-yang.shi@linux.alibaba.com
    Fixes: a49bd4d71637 ("mm, numa: rework do_pages_move")
    Signed-off-by: Yang Shi
    Reported-by: Felix Abecassis
    Tested-by: Felix Abecassis
    Suggested-by: Michal Hocko
    Reviewed-by: John Hubbard
    Acked-by: Christoph Lameter
    Acked-by: Michal Hocko
    Reviewed-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: [4.17+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Yang Shi
     

26 Sep, 2019

1 commit

    This patch is part of a series that extends the kernel ABI to allow
    passing tagged user pointers (with the top byte set to something other
    than 0x00) as syscall arguments.

    This patch allows tagged pointers to be passed to the following memory
    syscalls: get_mempolicy, madvise, mbind, mincore, mlock, mlock2, mprotect,
    mremap, msync, munlock, move_pages.

    The mmap and mremap syscalls do not currently accept tagged addresses.
    Architectures may interpret the tag as a background colour for the
    corresponding vma.
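
    For illustration, a userspace sketch of what this permits on an arm64
    kernel with the tagged address ABI enabled (the helper below is made up
    for the example; the series itself only changes the kernel side):

    #include <stdint.h>
    #include <sys/mman.h>

    /* Put an arbitrary tag in the top byte of a pointer (arm64 TBI). */
    static inline void *tag_ptr(void *p, uint8_t tag)
    {
            return (void *)(((uintptr_t)p & ~(0xffULL << 56)) |
                            ((uintptr_t)tag << 56));
    }

    int drop_range(void *buf, size_t len)
    {
            /* with this series, the tagged pointer is accepted by madvise() */
            return madvise(tag_ptr(buf, 0x2a), len, MADV_DONTNEED);
    }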

    Link: http://lkml.kernel.org/r/aaf0c0969d46b2feb9017f3e1b3ef3970b633d91.1563904656.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Reviewed-by: Khalid Aziz
    Reviewed-by: Vincenzo Frascino
    Reviewed-by: Catalin Marinas
    Reviewed-by: Kees Cook
    Cc: Al Viro
    Cc: Dave Hansen
    Cc: Eric Auger
    Cc: Felix Kuehling
    Cc: Jens Wiklander
    Cc: Mauro Carvalho Chehab
    Cc: Mike Rapoport
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     

25 Sep, 2019

3 commits

  • Remove unused 'pfn' variable.

    Link: http://lkml.kernel.org/r/1565167272-21453-1-git-send-email-kernelfans@gmail.com
    Signed-off-by: Pingfan Liu
    Reviewed-by: Andrew Morton
    Reviewed-by: Ralph Campbell
    Cc: "Jérôme Glisse"
    Cc: Mel Gorman
    Cc: Jan Kara
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Andrea Arcangeli
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pingfan Liu
     
  • Transparent Huge Pages are currently stored in i_pages as pointers to
    consecutive subpages. This patch changes that to storing consecutive
    pointers to the head page in preparation for storing huge pages more
    efficiently in i_pages.

    Large parts of this are "inspired" by Kirill's patch
    https://lore.kernel.org/lkml/20170126115819.58875-2-kirill.shutemov@linux.intel.com/

    Kirill and Huang Ying contributed several fixes.

    [willy@infradead.org: use compound_nr, squish uninit-var warning]
    Link: http://lkml.kernel.org/r/20190731210400.7419-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Jan Kara
    Reviewed-by: Kirill Shutemov
    Reviewed-by: Song Liu
    Tested-by: Song Liu
    Tested-by: William Kucharski
    Reviewed-by: William Kucharski
    Tested-by: Qian Cai
    Tested-by: Mikhail Gavrilov
    Cc: Hugh Dickins
    Cc: Chris Wilson
    Cc: Song Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Replace 1 << compound_order(page) with compound_nr(page). Minor
    improvements in readability.
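
    For illustration, the kind of substitution this makes (compound_nr()
    returns the number of base pages in a potentially compound page):

    -       int nr_pages = 1 << compound_order(page);
    +       int nr_pages = compound_nr(page);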

    Link: http://lkml.kernel.org/r/20190721104612.19120-4-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Andrew Morton
    Reviewed-by: Ira Weiny
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

07 Sep, 2019

2 commits

    The mm_walk structure currently mixes data and code. Split out the
    operations vectors into a new mm_walk_ops structure, and while we are
    changing the API also declare the mm_walk structure inside the
    walk_page_range and walk_page_vma functions.

    Based on patch from Linus Torvalds.
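
    A minimal sketch of a caller after the split (the callback, counter and
    ops names are made up for illustration; the caller holds mmap_sem):

    #include <linux/pagewalk.h>

    static int count_pmd(pmd_t *pmd, unsigned long addr, unsigned long next,
                         struct mm_walk *walk)
    {
            (*(unsigned long *)walk->private)++;
            return 0;
    }

    static const struct mm_walk_ops count_ops = {
            .pmd_entry = count_pmd,
    };

    static unsigned long count_pmds(struct mm_struct *mm,
                                    unsigned long start, unsigned long end)
    {
            unsigned long nr = 0;

            /* ops are const and shared; struct mm_walk is built internally */
            walk_page_range(mm, start, end, &count_ops, &nr);
            return nr;
    }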

    Link: https://lore.kernel.org/r/20190828141955.22210-3-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Thomas Hellstrom
    Reviewed-by: Steven Price
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
    Add a new header for the handful of users of the walk_page_range /
    walk_page_vma interface instead of polluting all users of mm.h with it.

    Link: https://lore.kernel.org/r/20190828141955.22210-2-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Thomas Hellstrom
    Reviewed-by: Steven Price
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     

22 Aug, 2019

1 commit

  • From rdma.git

    Jason Gunthorpe says:

    ====================
    This is a collection of general cleanups for ODP to clarify some of the
    flows around umem creation and use of the interval tree.
    ====================

    The branch is based on v5.3-rc5 due to dependencies, and is being taken
    into hmm.git due to dependencies in the next patches.

    * odp_fixes:
    RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr
    RDMA/mlx5: Use ib_umem_start instead of umem.address
    RDMA/core: Make invalidate_range a device operation
    RDMA/odp: Use kvcalloc for the dma_list and page_list
    RDMA/odp: Check for overflow when computing the umem_odp end
    RDMA/odp: Provide ib_umem_odp_release() to undo the allocs
    RDMA/odp: Split creating a umem_odp from ib_umem_get
    RDMA/odp: Make the three ways to create a umem_odp clear
    RMDA/odp: Consolidate umem_odp initialization
    RDMA/odp: Make it clearer when a umem is an implicit ODP umem
    RDMA/odp: Iterate over the whole rbtree directly
    RDMA/odp: Use the common interval tree library instead of generic
    RDMA/mlx5: Fix MR npages calculation for IB_ACCESS_HUGETLB

    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     

20 Aug, 2019

3 commits

    CONFIG_MIGRATE_VMA_HELPER guards helpers that are required for proper
    device private memory support. Remove the option and just check for
    CONFIG_DEVICE_PRIVATE instead.

    Link: https://lore.kernel.org/r/20190814075928.23766-11-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    Tested-by: Ralph Campbell
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • No one ever checks this flag, and we could easily get that information
    from the page if needed.

    Link: https://lore.kernel.org/r/20190814075928.23766-10-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ralph Campbell
    Reviewed-by: Jason Gunthorpe
    Tested-by: Ralph Campbell
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • There isn't any good reason to pass callbacks to migrate_vma. Instead
    we can just export the three steps done by this function to drivers and
    let them sequence the operation without callbacks. This removes a lot
    of boilerplate code as-is, and will allow the drivers to drastically
    improve code flow and error handling further on.
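
    A rough sketch of the resulting driver-side flow with the exported
    steps (destination allocation and the actual copy are elided):

            struct migrate_vma args = {
                    .vma    = vma,
                    .start  = start,
                    .end    = end,
                    .src    = src_pfns,
                    .dst    = dst_pfns,
            };

            if (migrate_vma_setup(&args))
                    return -EINVAL;

            /* allocate destination pages, copy the data, fill args.dst[] */

            migrate_vma_pages(&args);
            migrate_vma_finalize(&args);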

    Link: https://lore.kernel.org/r/20190814075928.23766-2-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ralph Campbell
    Tested-by: Ralph Campbell
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     

03 Aug, 2019

2 commits

    When CONFIG_MIGRATE_VMA_HELPER is enabled, migrate_vma() calls
    migrate_vma_collect(), which initializes a struct mm_walk but doesn't
    initialize mm_walk.pud_entry (found by code inspection). Use C
    structure initialization to make sure it is set to NULL.
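
    The general pattern the fix relies on (field values shown for
    illustration): a designated initializer zeroes every member that is not
    named explicitly, so pud_entry can never be left as stack garbage.

            struct mm_walk mm_walk = {
                    .pmd_entry      = migrate_vma_collect_pmd,
                    .pte_hole       = migrate_vma_collect_hole,
                    .vma            = vma,
                    .mm             = vma->vm_mm,
                    .private        = migrate,
            };      /* .pud_entry and all unnamed fields are implicitly NULL */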

    Link: http://lkml.kernel.org/r/20190719233225.12243-1-rcampbell@nvidia.com
    Fixes: 8763cb45ab967 ("mm/migrate: new memory migration helper for use with device memory")
    Signed-off-by: Ralph Campbell
    Reviewed-by: John Hubbard
    Reviewed-by: Andrew Morton
    Cc: "Jérôme Glisse"
    Cc: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ralph Campbell
     
  • buffer_migrate_page_norefs() can race with bh users in the following
    way:

    CPU1                                    CPU2
    buffer_migrate_page_norefs()
      buffer_migrate_lock_buffers()
      checks bh refs
      spin_unlock(&mapping->private_lock)
                                            __find_get_block()
                                              spin_lock(&mapping->private_lock)
                                              grab bh ref
                                              spin_unlock(&mapping->private_lock)
      move page
                                            do bh work

    This can result in various issues like lost updates to buffers (i.e.
    metadata corruption) or use after free issues for the old page.

    This patch closes the race by holding mapping->private_lock while the
    mapping is being moved to a new page. Ordinarily, a reference can be
    taken outside of the private_lock using the per-cpu BH LRU but the
    references are checked and the LRU invalidated if necessary. The
    private_lock is held once the references are known so the buffer lookup
    slow path will spin on the private_lock. Between the page lock and
    private_lock, it should be impossible for other references to be
    acquired and updates to happen during the migration.

    A user had reported data corruption issues on a distribution kernel with
    a similar page migration implementation as mainline. The data
    corruption could not be reproduced with this patch applied. A small
    number of migration-intensive tests were run and no performance problems
    were noted.

    [mgorman@techsingularity.net: Changelog, removed tracing]
    Link: http://lkml.kernel.org/r/20190718090238.GF24383@techsingularity.net
    Fixes: 89cb0888ca14 "mm: migrate: provide buffer_migrate_page_norefs()"
    Signed-off-by: Jan Kara
    Signed-off-by: Mel Gorman
    Cc: [5.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

19 Jul, 2019

1 commit

  • migrate_page_move_mapping() doesn't use the mode argument. Remove it
    and update callers accordingly.
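
    Roughly, the resulting signature change (a sketch, not the exact diff):

    -int migrate_page_move_mapping(struct address_space *mapping,
    -               struct page *newpage, struct page *page,
    -               enum migrate_mode mode, int extra_count);
    +int migrate_page_move_mapping(struct address_space *mapping,
    +               struct page *newpage, struct page *page, int extra_count);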

    Link: http://lkml.kernel.org/r/20190508210301.8472-1-keith.busch@intel.com
    Signed-off-by: Keith Busch
    Reviewed-by: Zi Yan
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Keith Busch
     

15 Jul, 2019

1 commit

  • Pull HMM updates from Jason Gunthorpe:
    "Improvements and bug fixes for the hmm interface in the kernel:

    - Improve clarity, locking and APIs related to the 'hmm mirror'
    feature merged last cycle. In linux-next we now see AMDGPU and
    nouveau to be using this API.

    - Remove old or transitional hmm APIs. These are hold overs from the
    past with no users, or APIs that existed only to manage cross tree
    conflicts. There are still a few more of these cleanups that didn't
    make the merge window cut off.

    - Improve some core mm APIs:
    - export alloc_pages_vma() for driver use
    - refactor into devm_request_free_mem_region() to manage
    DEVICE_PRIVATE resource reservations
    - refactor duplicative driver code into the core dev_pagemap
    struct

    - Remove hmm wrappers of improved core mm APIs, instead have drivers
    use the simplified API directly

    - Remove DEVICE_PUBLIC

    - Simplify the kconfig flow for the hmm users and core code"

    * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (42 commits)
    mm: don't select MIGRATE_VMA_HELPER from HMM_MIRROR
    mm: remove the HMM config option
    mm: sort out the DEVICE_PRIVATE Kconfig mess
    mm: simplify ZONE_DEVICE page private data
    mm: remove hmm_devmem_add
    mm: remove hmm_vma_alloc_locked_page
    nouveau: use devm_memremap_pages directly
    nouveau: use alloc_page_vma directly
    PCI/P2PDMA: use the dev_pagemap internal refcount
    device-dax: use the dev_pagemap internal refcount
    memremap: provide an optional internal refcount in struct dev_pagemap
    memremap: replace the altmap_valid field with a PGMAP_ALTMAP_VALID flag
    memremap: remove the data field in struct dev_pagemap
    memremap: add a migrate_to_ram method to struct dev_pagemap_ops
    memremap: lift the devmap_enable manipulation into devm_memremap_pages
    memremap: pass a struct dev_pagemap to ->kill and ->cleanup
    memremap: move dev_pagemap callbacks into a separate structure
    memremap: validate the pagemap type passed to devm_memremap_pages
    mm: factor out a devm_request_free_mem_region helper
    mm: export alloc_pages_vma
    ...

    Linus Torvalds
     

06 Jul, 2019

1 commit

  • This reverts commit 5fd4ca2d84b249f0858ce28cf637cf25b61a398f.

    Mikhail Gavrilov reports that it causes the VM_BUG_ON_PAGE() in
    __delete_from_swap_cache() to trigger:

    page:ffffd6d34dff0000 refcount:1 mapcount:1 mapping:ffff97812323a689 index:0xfecec363
    anon
    flags: 0x17fffe00080034(uptodate|lru|active|swapbacked)
    raw: 0017fffe00080034 ffffd6d34c67c508 ffffd6d3504b8d48 ffff97812323a689
    raw: 00000000fecec363 0000000000000000 0000000100000000 ffff978433ace000
    page dumped because: VM_BUG_ON_PAGE(entry != page)
    page->mem_cgroup:ffff978433ace000
    ------------[ cut here ]------------
    kernel BUG at mm/swap_state.c:170!
    invalid opcode: 0000 [#1] SMP NOPTI
    CPU: 1 PID: 221 Comm: kswapd0 Not tainted 5.2.0-0.rc2.git0.1.fc31.x86_64 #1
    Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 2202 04/11/2019
    RIP: 0010:__delete_from_swap_cache+0x20d/0x240
    Code: 30 65 48 33 04 25 28 00 00 00 75 4a 48 83 c4 38 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 c7 c6 2f dc 0f 8a 48 89 c7 e8 93 1b fd ff 0b 48 c7 c6 a8 74 0f 8a e8 85 1b fd ff 0f 0b 48 c7 c6 a8 7d 0f
    RSP: 0018:ffffa982036e7980 EFLAGS: 00010046
    RAX: 0000000000000021 RBX: 0000000000000040 RCX: 0000000000000006
    RDX: 0000000000000000 RSI: 0000000000000086 RDI: ffff97843d657900
    RBP: 0000000000000001 R08: ffffa982036e7835 R09: 0000000000000535
    R10: ffff97845e21a46c R11: ffffa982036e7835 R12: ffff978426387120
    R13: 0000000000000000 R14: ffffd6d34dff0040 R15: ffffd6d34dff0000
    FS: 0000000000000000(0000) GS:ffff97843d640000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00002cba88ef5000 CR3: 000000078a97c000 CR4: 00000000003406e0
    Call Trace:
    delete_from_swap_cache+0x46/0xa0
    try_to_free_swap+0xbc/0x110
    swap_writepage+0x13/0x70
    pageout.isra.0+0x13c/0x350
    shrink_page_list+0xc14/0xdf0
    shrink_inactive_list+0x1e5/0x3c0
    shrink_node_memcg+0x202/0x760
    shrink_node+0xe0/0x470
    balance_pgdat+0x2d1/0x510
    kswapd+0x220/0x420
    kthread+0xfb/0x130
    ret_from_fork+0x22/0x40

    and it's not immediately obvious why it happens. It's too late in the
    rc cycle to do anything but revert for now.

    Link: https://lore.kernel.org/lkml/CABXGCsN9mYmBD-4GaaeW_NrDu+FDXLzr_6x+XNxfmFV6QkYCDg@mail.gmail.com/
    Reported-and-bisected-by: Mikhail Gavrilov
    Suggested-by: Jan Kara
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Matthew Wilcox
    Cc: Kirill Shutemov
    Cc: William Kucharski
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

03 Jul, 2019

1 commit

  • The code hasn't been used since it was added to the tree, and doesn't
    appear to actually be usable.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    Acked-by: Michal Hocko
    Reviewed-by: Dan Williams
    Tested-by: Dan Williams
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     

15 May, 2019

3 commits

    This updates each existing invalidation to use the correct mmu notifier
    event that represents what is happening to the CPU page table. See the
    patch which introduced the events for the rationale behind this.
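
    For example, a migration-time invalidation can now pass an explicit
    event instead of the catch-all default (a sketch of the interface as of
    this series):

            struct mmu_notifier_range range;

            mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma,
                                    vma->vm_mm, start, end);
            mmu_notifier_invalidate_range_start(&range);
            /* ... clear/replace the page table entries ... */
            mmu_notifier_invalidate_range_end(&range);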

    Link: http://lkml.kernel.org/r/20190326164747.24405-7-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Ralph Campbell
    Reviewed-by: Ira Weiny
    Cc: Christian König
    Cc: Joonas Lahtinen
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Peter Xu
    Cc: Felix Kuehling
    Cc: Jason Gunthorpe
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: John Hubbard
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
    CPU page table updates can happen for many reasons, not only as a
    result of a syscall (munmap(), mprotect(), mremap(), madvise(), ...)
    but also as a result of kernel activities (memory compression, reclaim,
    migration, ...).

    Users of the mmu notifier API track changes to the CPU page table and
    take specific action for them. However, the current API only provides
    the range of virtual addresses affected by the change, not why the
    change is happening.

    This patchset does the initial mechanical conversion of all the places
    that call mmu_notifier_range_init to also provide the default
    MMU_NOTIFY_UNMAP event as well as the vma if it is known (most
    invalidation happens against a given vma). Passing down the vma allows
    the users of mmu notifier to inspect the new vma page protection.

    MMU_NOTIFY_UNMAP is always the safe default, as users of mmu notifier
    should assume that everything mapped in the range is going away when
    that event happens. A later patch converts the mm call paths to more
    appropriate events for each call.

    This is done as two patches so that no call site is forgotten,
    especially as it uses the following coccinelle patch:

    %vm_mm, E3, E4)
    ...>

    @@
    expression E1, E2, E3, E4;
    identifier FN, VMA;
    @@
    FN(..., struct vm_area_struct *VMA, ...) {
    }

    @@
    expression E1, E2, E3, E4;
    identifier FN, VMA;
    @@
    FN(...) {
    struct vm_area_struct *VMA;
    }

    @@
    expression E1, E2, E3, E4;
    identifier FN;
    @@
    FN(...) {
    }
    ---------------------------------------------------------------------->%

    Applied with:
    spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
    spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
    spatch --sp-file mmu-notifier.spatch --dir mm --in-place

    Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Ralph Campbell
    Reviewed-by: Ira Weiny
    Cc: Christian König
    Cc: Joonas Lahtinen
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Peter Xu
    Cc: Felix Kuehling
    Cc: Jason Gunthorpe
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: John Hubbard
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • Transparent Huge Pages are currently stored in i_pages as pointers to
    consecutive subpages. This patch changes that to storing consecutive
    pointers to the head page in preparation for storing huge pages more
    efficiently in i_pages.

    Large parts of this are "inspired" by Kirill's patch
    https://lore.kernel.org/lkml/20170126115819.58875-2-kirill.shutemov@linux.intel.com/

    [willy@infradead.org: fix swapcache pages]
    Link: http://lkml.kernel.org/r/20190324155441.GF10344@bombadil.infradead.org
    [kirill@shutemov.name: hugetlb stores pages in page cache differently]
    Link: http://lkml.kernel.org/r/20190404134553.vuvhgmghlkiw2hgl@kshutemo-mobl1
    Link: http://lkml.kernel.org/r/20190307153051.18815-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Jan Kara
    Reviewed-by: Kirill Shutemov
    Reviewed-and-tested-by: Song Liu
    Tested-by: William Kucharski
    Reviewed-by: William Kucharski
    Tested-by: Qian Cai
    Cc: Hugh Dickins
    Cc: Song Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

30 Mar, 2019

1 commit

  • Our MIPS 1004Kc SoCs were seeing random userspace crashes with SIGILL
    and SIGSEGV that could not be traced back to a userspace code bug. They
    had all the magic signs of an I/D cache coherency issue.

    Now recently we noticed that the /proc/sys/vm/compact_memory interface
    was quite efficient at provoking this class of userspace crashes.

    Studying the code in mm/migrate.c there is a distinction made between
    migrating a page that is mapped at the instant of migration and one that
    is not mapped. Our problem turned out to be the non-mapped pages.

    For the non-mapped page the code performs a copy of the page content and
    all relevant meta-data of the page without doing the required D-cache
    maintenance. This leaves dirty data in the D-cache of the CPU and on
    the 1004K cores this data is not visible to the I-cache. A subsequent
    page-fault that triggers a mapping of the page will happily serve the
    process with potentially stale code.

    What about ARM then, this bug should have seen greater exposure? Well
    ARM became immune to this flaw back in 2010, see commit c01778001a4f
    ("ARM: 6379/1: Assume new page cache pages have dirty D-cache").

    My proposed fix moves the D-cache maintenance inside move_to_new_page to
    make it common for both cases.

    Link: http://lkml.kernel.org/r/20190315083502.11849-1-larper@axis.com
    Fixes: 97ee0524614 ("flush cache before installing new page at migraton")
    Signed-off-by: Lars Persson
    Reviewed-by: Paul Burton
    Acked-by: Mel Gorman
    Cc: Ralf Baechle
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lars Persson
     

06 Mar, 2019

4 commits

  • Andrea has noted that page migration code propagates page_mapping(page)
    through the whole migration stack down to migrate_page() function so it
    seems stupid to then use page_mapping(page) in expected_page_refs()
    instead of passed down 'mapping' argument. I agree so let's make
    expected_page_refs() more in line with the rest of the migration stack.

    Link: http://lkml.kernel.org/r/20190207112314.24872-1-jack@suse.cz
    Signed-off-by: Jan Kara
    Suggested-by: Andrea Arcangeli
    Reviewed-by: Andrea Arcangeli
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • No functional change.

    Link: http://lkml.kernel.org/r/20190118235123.27843-1-richard.weiyang@gmail.com
    Signed-off-by: Wei Yang
    Reviewed-by: Pekka Enberg
    Acked-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • Pages with no migration handler use a fallback handler which sometimes
    works and sometimes persistently retries. A historical example was
    blockdev pages but there are others such as odd refcounting when
    page->private is used. These are retried multiple times which is
    wasteful during compaction so this patch will fail migration faster
    unless the caller specifies MIGRATE_SYNC.

    This is not expected to help THP allocation success rates but it did
    reduce latencies very slightly in some cases.

    1-socket thpfioscale
                                         4.20.0                 4.20.0
                               noreserved-v2r15         failfast-v2r15
    Amean     fault-both-1         0.00 (   0.00%)        0.00 *   0.00%*
    Amean     fault-both-3      3839.67 (   0.00%)     3833.72 (   0.15%)
    Amean     fault-both-5      5177.47 (   0.00%)     4967.15 (   4.06%)
    Amean     fault-both-7      7245.03 (   0.00%)     7139.19 (   1.46%)
    Amean     fault-both-12    11534.89 (   0.00%)    11326.30 (   1.81%)
    Amean     fault-both-18    16241.10 (   0.00%)    16270.70 (  -0.18%)
    Amean     fault-both-24    19075.91 (   0.00%)    19839.65 (  -4.00%)
    Amean     fault-both-30    22712.11 (   0.00%)    21707.05 (   4.43%)
    Amean     fault-both-32    21692.92 (   0.00%)    21968.16 (  -1.27%)

    The 2-socket results are not materially different. Scan rates are
    similar as expected.
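
    A minimal sketch of the fail-fast idea described above, in the fallback
    migration path (not necessarily the exact mainline hunk):

            /* pages with private data and no migration handler: only keep
             * retrying for MIGRATE_SYNC callers, otherwise give up fast */
            if (page_has_private(page) &&
                !try_to_release_page(page, GFP_KERNEL))
                    return mode == MIGRATE_SYNC ? -EAGAIN : -EBUSY;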

    Link: http://lkml.kernel.org/r/20190118175136.31341-7-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Dan Carpenter
    Cc: David Rientjes
    Cc: YueHaibing
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Patch series "arm64/mm: Enable HugeTLB migration", v4.

    This patch series enables HugeTLB migration support for all supported
    huge page sizes at all levels, including the contiguous bit
    implementation. The following HugeTLB migration support matrix has been
    enabled with this patch series. All permutations have been tested
    except for 16GB.

          CONT PTE    PMD    CONT PMD    PUD
          --------    ---    --------    ---
    4K:      64K      2M        32M      1G
    16K:      2M     32M         1G
    64K:      2M    512M        16G

    First the series adds migration support for PUD based huge pages. It
    then adds a platform-specific hook so an architecture can be queried on
    whether a given huge page size is supported for migration, while also
    providing a default fallback that preserves the existing semantics of
    just checking the (PMD|PUD|PGDIR)_SHIFT macros. The last two patches
    enable HugeTLB migration on arm64 and subscribe to this new
    platform-specific hook by defining an override.

    The second patch differentiates between movability and migratability
    aspects of huge pages and implements hugepage_movable_supported() which
    can then be used during allocation to decide whether to place the huge
    page in movable zone or not.

    This patch (of 5):

    During huge page allocation, its migratability is checked to determine
    if it should be placed under movable zones with GFP_HIGHUSER_MOVABLE.
    But the movability of the huge page could depend on factors other than
    just migratability. Movability in itself is a distinct property which
    should not be tied to migratability alone.

    This differentiates these two and implements an enhanced movability check
    which also considers huge page size to determine if it is feasible to be
    placed under a movable zone. At present it just checks for gigantic pages
    but going forward it can incorporate other enhanced checks.
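
    A sketch of the resulting helper (close to, but not necessarily
    identical to, the include/linux/hugetlb.h version):

    static inline bool hugepage_movable_supported(struct hstate *h)
    {
            if (!hugepage_migration_supported(h))
                    return false;

            /*
             * Gigantic pages cannot easily be migrated, so keep them out of
             * ZONE_MOVABLE to keep that zone reliably offlinable.
             */
            if (hstate_is_gigantic(h))
                    return false;

            return true;
    }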

    Link: http://lkml.kernel.org/r/1545121450-1663-2-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Anshuman Khandual
    Reviewed-by: Steve Capper
    Reviewed-by: Naoya Horiguchi
    Suggested-by: Michal Hocko
    Acked-by: Michal Hocko
    Cc: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     

02 Mar, 2019

1 commit

  • hugetlb pages should only be migrated if they are 'active'. The
    routines set/clear_page_huge_active() modify the active state of hugetlb
    pages.

    When a new hugetlb page is allocated at fault time, set_page_huge_active
    is called before the page is locked. Therefore, another thread could
    race and migrate the page while it is being added to page table by the
    fault code. This race is somewhat hard to trigger, but can be seen by
    strategically adding udelay to simulate worst case scheduling behavior.
    Depending on 'how' the code races, various BUG()s could be triggered.

    To address this issue, simply delay the set_page_huge_active call until
    after the page is successfully added to the page table.

    Hugetlb pages can also be leaked at migration time if the pages are
    associated with a file in an explicitly mounted hugetlbfs filesystem.
    For example, consider a two node system with 4GB worth of huge pages
    available. A program mmaps a 2G file in a hugetlbfs filesystem. It
    then migrates the pages associated with the file from one node to
    another. When the program exits, huge page counts are as follows:

    node0
    1024 free_hugepages
    1024 nr_hugepages

    node1
    0 free_hugepages
    1024 nr_hugepages

    Filesystem Size Used Avail Use% Mounted on
    nodev 4.0G 2.0G 2.0G 50% /var/opt/hugepool

    That is as expected. 2G of huge pages are taken from the free_hugepages
    counts, and 2G is the size of the file in the explicitly mounted
    filesystem. If the file is then removed, the counts become:

    node0
    1024 free_hugepages
    1024 nr_hugepages

    node1
    1024 free_hugepages
    1024 nr_hugepages

    Filesystem Size Used Avail Use% Mounted on
    nodev 4.0G 2.0G 2.0G 50% /var/opt/hugepool

    Note that the filesystem still shows 2G of pages used, while there
    actually are no huge pages in use. The only way to 'fix' the filesystem
    accounting is to unmount the filesystem.

    If a hugetlb page is associated with an explicitly mounted filesystem,
    this information is contained in the page_private field. At migration
    time, this information is not preserved. To fix, simply transfer
    page_private from the old to the new page at migration time if necessary.

    There is a related race with removing a huge page from a file and
    migration. When a huge page is removed from the pagecache, the
    page_mapping() field is cleared, yet page_private remains set until the
    page is actually freed by free_huge_page(). A page could be migrated
    while in this state. However, since page_mapping() is not set the
    hugetlbfs specific routine to transfer page_private is not called and we
    leak the page count in the filesystem.

    To fix that, check for this condition before migrating a huge page. If
    the condition is detected, return -EBUSY for the page.
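
    A sketch of the two pieces of the fix, simplified from the hugetlb
    migration paths described above (not the exact mainline hunks):

            /* unmap_and_move_huge_page(): the page still carries hugetlbfs
             * state in page_private but has already lost its mapping, so
             * migrating it would leak the filesystem accounting. */
            if (page_private(hpage) && !page_mapping(hpage)) {
                    rc = -EBUSY;
                    goto out_unlock;
            }

            /* hugetlbfs migration callback: carry page_private (the subpool
             * information) over to the new page. */
            if (page_private(page)) {
                    set_page_private(newpage, page_private(page));
                    set_page_private(page, 0);
            }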

    Link: http://lkml.kernel.org/r/74510272-7319-7372-9ea6-ec914734c179@oracle.com
    Link: http://lkml.kernel.org/r/20190212221400.3512-1-mike.kravetz@oracle.com
    Fixes: bcc54222309c ("mm: hugetlb: introduce page_huge_active")
    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Mel Gorman
    Cc: Davidlohr Bueso
    Cc:
    [mike.kravetz@oracle.com: v2]
    Link: http://lkml.kernel.org/r/7534d322-d782-8ac6-1c8d-a8dc380eb3ab@oracle.com
    [mike.kravetz@oracle.com: update comment and changelog]
    Link: http://lkml.kernel.org/r/420bcfd6-158b-38e4-98da-26d0cd85bd01@oracle.com
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

02 Feb, 2019

2 commits

  • We had a race in the old balloon compaction code before b1123ea6d3b3
    ("mm: balloon: use general non-lru movable page feature") refactored it
    that became visible after backporting 195a8c43e93d ("virtio-balloon:
    deflate via a page list") without the refactoring.

    The bug existed from commit d6d86c0a7f8d ("mm/balloon_compaction:
    redesign ballooned pages management") till b1123ea6d3b3 ("mm: balloon:
    use general non-lru movable page feature"). d6d86c0a7f8d
    ("mm/balloon_compaction: redesign ballooned pages management") was
    backported to 3.12, so the broken kernels are stable kernels [3.12 -
    4.7].

    There was a subtle race between dropping the page lock of the newpage in
    __unmap_and_move() and checking for __is_movable_balloon_page(newpage).

    Just after dropping this page lock, virtio-balloon could go ahead and
    deflate the newpage, effectively dequeueing it and clearing PageBalloon,
    in turn making __is_movable_balloon_page(newpage) fail.

    This resulted in dropping the reference of the newpage via
    putback_lru_page(newpage) instead of put_page(newpage), leading to
    page->lru getting modified and a !LRU page ending up in the LRU lists.
    With 195a8c43e93d ("virtio-balloon: deflate via a page list")
    backported, one would suddenly get corrupted lists in
    release_pages_balloon():

    - WARNING: CPU: 13 PID: 6586 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0
    - list_del corruption. prev->next should be ffffe253961090a0, but was dead000000000100

    Nowadays this race is no longer possible, but it is hidden behind very
    ugly handling of __ClearPageMovable() and __PageMovable().

    __ClearPageMovable() will not make __PageMovable() fail, only
    PageMovable(). So the new check (__PageMovable(newpage)) will still
    hold even after newpage was dequeued by virtio-balloon.

    If anybody would ever change that special handling, the BUG would be
    introduced again. So instead, make it explicit and use the information
    of the original isolated page before migration.

    This patch can be backported fairly easily to stable kernels (in
    contrast to the refactoring).
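
    A sketch of the approach: record whether the isolated page was a
    non-LRU movable page before migration and use that, instead of testing
    newpage after it has been unlocked (simplified from __unmap_and_move()):

            bool is_lru = !__PageMovable(page);     /* decided up front */

            /* ... migration, newpage unlocked, balloon may deflate it ... */

            if (unlikely(!is_lru))
                    put_page(newpage);              /* drop our reference */
            else
                    putback_lru_page(newpage);      /* only for real LRU pages */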

    Link: http://lkml.kernel.org/r/20190129233217.10747-1-david@redhat.com
    Fixes: d6d86c0a7f8d ("mm/balloon_compaction: redesign ballooned pages management")
    Signed-off-by: David Hildenbrand
    Reported-by: Vratislav Bendel
    Acked-by: Michal Hocko
    Acked-by: Rafael Aquini
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Dominik Brodowski
    Cc: Matthew Wilcox
    Cc: Vratislav Bendel
    Cc: Rafael Aquini
    Cc: Konstantin Khlebnikov
    Cc: Minchan Kim
    Cc: [3.12 - 4.7]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
    Currently, buffer_migrate_page_norefs() was constantly failing because
    buffer_migrate_lock_buffers() grabbed a reference on each buffer. In
    fact, there's no reason for buffer_migrate_lock_buffers() to grab any
    buffer references, as the page is locked during all our operations and
    thus nobody can reclaim buffers from the page.

    So remove grabbing of buffer references which also makes
    buffer_migrate_page_norefs() succeed.

    Link: http://lkml.kernel.org/r/20190116131217.7226-1-jack@suse.cz
    Fixes: 89cb0888ca14 "mm: migrate: provide buffer_migrate_page_norefs()"
    Signed-off-by: Jan Kara
    Cc: Sergey Senozhatsky
    Cc: Pavel Machek
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Michal Hocko
    Cc: Zi Yan
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

09 Jan, 2019

1 commit

    This reverts commit b43a9990055958e70347c56f90ea2ae32c67334c.

    The reverted commit caused issues with migration and poisoning of anon
    huge pages. The LTP move_pages12 test triggers an "unable to handle
    kernel NULL pointer" BUG with a stack similar to:

    RIP: 0010:down_write+0x1b/0x40
    Call Trace:
    migrate_pages+0x81f/0xb90
    __ia32_compat_sys_migrate_pages+0x190/0x190
    do_move_pages_to_node.isra.53.part.54+0x2a/0x50
    kernel_move_pages+0x566/0x7b0
    __x64_sys_move_pages+0x24/0x30
    do_syscall_64+0x5b/0x180
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    The purpose of the reverted patch was to fix some long existing races
    with huge pmd sharing. It used i_mmap_rwsem for this purpose with the
    idea that this could also be used to address truncate/page fault races
    with another patch. Further analysis has determined that i_mmap_rwsem
    can not be used to address all these hugetlbfs synchronization issues.
    Therefore, revert this patch while working on another approach to the
    underlying issues.

    Link: http://lkml.kernel.org/r/20190103235452.29335-2-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reported-by: Jan Stancek
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K . V"
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Davidlohr Bueso
    Cc: Prakash Sangappa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

05 Jan, 2019

1 commit

  • Patch series "Add support for fast mremap".

    This series speeds up the mremap(2) syscall by copying page tables at
    the PMD level even for non-THP systems. There is concern that the extra
    'address' argument that mremap passes to pte_alloc may do something
    subtle and architecture-related in the future that may make the scheme
    not work. Also, we find that there is no point in passing the 'address'
    to pte_alloc since it is unused. This patch therefore removes this
    argument tree-wide, resulting in a nice negative diff as well, while
    also ensuring along the way that the enabled architectures do not do
    anything funky with the 'address' argument that goes unnoticed by the
    optimization.

    Build and boot tested on x86-64. Build tested on arm64. The config
    enablement patch for arm64 will be posted in the future after more
    testing.

    The changes were obtained by applying the following Coccinelle script.
    (thanks Julia for answering all Coccinelle questions!).
    Following fix ups were done manually:
    * Removal of address argument from pte_fragment_alloc
    * Removal of pte_alloc_one_fast definitions from m68k and microblaze.

    // Options: --include-headers --no-includes
    // Note: I split the 'identifier fn' line, so if you are manually
    // running it, please unsplit it so it runs for you.

    virtual patch

    @pte_alloc_func_def depends on patch exists@
    identifier E2;
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    type T2;
    @@

    fn(...
    - , T2 E2
    )
    { ... }

    @pte_alloc_func_proto_noarg depends on patch exists@
    type T1, T2, T3, T4;
    identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    @@

    (
    - T3 fn(T1, T2);
    + T3 fn(T1);
    |
    - T3 fn(T1, T2, T4);
    + T3 fn(T1, T2);
    )

    @pte_alloc_func_proto depends on patch exists@
    identifier E1, E2, E4;
    type T1, T2, T3, T4;
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    @@

    (
    - T3 fn(T1 E1, T2 E2);
    + T3 fn(T1 E1);
    |
    - T3 fn(T1 E1, T2 E2, T4 E4);
    + T3 fn(T1 E1, T2 E2);
    )

    @pte_alloc_func_call depends on patch exists@
    expression E2;
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    @@

    fn(...
    -, E2
    )

    @pte_alloc_macro depends on patch exists@
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    identifier a, b, c;
    expression e;
    position p;
    @@

    (
    - #define fn(a, b, c) e
    + #define fn(a, b) e
    |
    - #define fn(a, b) e
    + #define fn(a) e
    )

    Link: http://lkml.kernel.org/r/20181108181201.88826-2-joelaf@google.com
    Signed-off-by: Joel Fernandes (Google)
    Suggested-by: Kirill A. Shutemov
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Julia Lawall
    Cc: Kirill A. Shutemov
    Cc: William Kucharski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joel Fernandes (Google)
     

29 Dec, 2018

8 commits

  • While looking at BUGs associated with invalid huge page map counts, it was
    discovered and observed that a huge pte pointer could become 'invalid' and
    point to another task's page table. Consider the following:

    A task takes a page fault on a shared hugetlbfs file and calls
    huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
    shared pmd.

    Now, another task truncates the hugetlbfs file. As part of truncation, it
    unmaps everyone who has the file mapped. If the range being truncated is
    covered by a shared pmd, huge_pmd_unshare will be called. For all but the
    last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
    to the pmd. If the task in the middle of the page fault is not the last
    user, the ptep returned by huge_pte_alloc now points to another task's
    page table or worse. This leads to bad things such as incorrect page
    map/reference counts or invalid memory references.

    To fix, expand the use of i_mmap_rwsem as follows (see the sketch below
    the list):

    - i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
    huge_pmd_share is only called via huge_pte_alloc, so callers of
    huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers
    of huge_pte_alloc continue to hold the semaphore until finished with the
    ptep.

    - i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is
    called.
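
    A sketch of the resulting locking around a hugetlb fault that may end
    up in huge_pmd_share(), using the interfaces as they exist in this
    series (simplified):

            mapping = vma->vm_file->f_mapping;

            i_mmap_lock_read(mapping);
            ptep = huge_pte_alloc(mm, address, huge_page_size(h));
            if (!ptep) {
                    i_mmap_unlock_read(mapping);
                    return VM_FAULT_OOM;
            }

            /* ... handle the fault through ptep ... */

            i_mmap_unlock_read(mapping);    /* done with the ptep */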

    [mike.kravetz@oracle.com: add explicit check for mapping != null]
    Link: http://lkml.kernel.org/r/20181218223557.5202-2-mike.kravetz@oracle.com
    Fixes: 39dde65c9940 ("shared page table for hugetlb page")
    Signed-off-by: Mike Kravetz
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K . V"
    Cc: Andrea Arcangeli
    Cc: Davidlohr Bueso
    Cc: Prakash Sangappa
    Cc: Colin Ian King
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • All callers of migrate_page_move_mapping() now pass NULL for 'head'
    argument. Drop it.

    Link: http://lkml.kernel.org/r/20181211172143.7358-7-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Mel Gorman
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Provide a variant of buffer_migrate_page() that also checks whether there
    are no unexpected references to buffer heads. This function will then be
    safe to use for block device pages.

    [akpm@linux-foundation.org: remove EXPORT_SYMBOL(buffer_migrate_page_norefs)]
    Link: http://lkml.kernel.org/r/20181211172143.7358-5-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Mel Gorman
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
    buffer_migrate_page() is the only caller of migrate_page_lock_buffers(),
    so move it close to it and also drop the now unused stub for
    !CONFIG_BLOCK.

    Link: http://lkml.kernel.org/r/20181211172143.7358-4-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Mel Gorman
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Lock buffers before calling into migrate_page_move_mapping() so that that
    function doesn't have to know about buffers (which is somewhat unexpected
    anyway) and all the buffer head logic is in buffer_migrate_page().

    Link: http://lkml.kernel.org/r/20181211172143.7358-3-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Mel Gorman
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Patch series "mm: migrate: Fix page migration stalls for blkdev pages".

    This patchset deals with page migration stalls that were reported by our
    customer due to a block device page that had a bufferhead that was in the
    bh LRU cache.

    The patchset modifies the page migration code so that bufferheads are
    completely handled inside buffer_migrate_page() and then provides a new
    migration helper for pages with buffer heads that is safe to use even for
    block device pages and that also deals with bh lrus.

    This patch (of 6):

    Factor out a function to compute the number of expected page references
    in migrate_page_move_mapping(). Note that we move the hpage_nr_pages()
    and page_has_private() checks from under xas_lock_irq(); however, this
    is safe since we hold the page lock.
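
    A sketch of the factored-out helper (simplified; the mainline version
    also accounts for ZONE_DEVICE pages):

    static int expected_page_refs(struct page *page)
    {
            int expected_count = 1;         /* the caller's reference */

            if (page_mapping(page))
                    expected_count += hpage_nr_pages(page) +
                                      page_has_private(page);

            return expected_count;
    }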

    [jack@suse.cz: fix expected_page_refs()]
    Link: http://lkml.kernel.org/r/20181217131710.GB8611@quack2.suse.cz
    Link: http://lkml.kernel.org/r/20181211172143.7358-2-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Mel Gorman
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
    To avoid having to change many call sites every time we want to add a
    parameter, use a structure to group all parameters for the mmu_notifier
    invalidate_range_start/end calls. No functional changes with this patch.
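
    A sketch of the converted calling convention as introduced here (before
    later patches added an event type and flags to the init helper):

            struct mmu_notifier_range range;

            mmu_notifier_range_init(&range, mm, start, end);
            mmu_notifier_invalidate_range_start(&range);
            /* ... modify the page table ... */
            mmu_notifier_invalidate_range_end(&range);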

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Acked-by: Christian König
    Acked-by: Jan Kara
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Felix Kuehling
    Cc: Ralph Campbell
    Cc: John Hubbard
    From: Jérôme Glisse
    Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3

    fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n

    Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • Waiting on a page migration entry has used wait_on_page_locked() all along
    since 2006: but you cannot safely wait_on_page_locked() without holding a
    reference to the page, and that extra reference is enough to make
    migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
    on the entry before migrate_page_move_mapping() gets there.

    And that failure is retried nine times, amplifying the pain when trying to
    migrate a popular page. With a single persistent faulter, migration
    sometimes succeeds; with two or three concurrent faulters, success becomes
    much less likely (and the more the page was mapped, the worse the overhead
    of unmapping and remapping it on each try).

    This is especially a problem for memory offlining, where the outer level
    retries forever (or until terminated from userspace), because a heavy
    refault workload can trigger an endless loop of migration failures.
    wait_on_page_locked() is the wrong tool for the job.

    David Herrmann (but was he the first?) noticed this issue in 2014:
    https://marc.info/?l=linux-mm&m=140110465608116&w=2

    Tim Chen started a thread in August 2017 which appears relevant:
    https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
    on to implicate __migration_entry_wait():
    https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
    up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake
    list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
    wake_up_page_bit")

    Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
    https://marc.info/?l=linux-mm&m=154217936431300&w=2

    We have all assumed that it is essential to hold a page reference while
    waiting on a page lock: partly to guarantee that there is still a struct
    page when MEMORY_HOTREMOVE is configured, but also to protect against
    reuse of the struct page going to someone who then holds the page locked
    indefinitely, when the waiter can reasonably expect timely unlocking.

    But in fact, so long as wait_on_page_bit_common() does the put_page(), and
    is careful not to rely on struct page contents thereafter, there is no
    need to hold a reference to the page while waiting on it. That does mean
    that this case cannot go back through the loop: but that's fine for the
    page migration case, and even if used more widely, is limited by the "Stop
    walking if it's locked" optimization in wake_page_function().

    Add interface put_and_wait_on_page_locked() to do this, using "behavior"
    enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
    No interruptible or killable variant needed yet, but they might follow: I
    have a vague notion that reporting -EINTR should take precedence over
    return from wait_on_page_bit_common() without knowing the page state, so
    arrange it accordingly - but that may be nothing but pedantic.

    __migration_entry_wait() still has to take a brief reference to the page,
    prior to calling put_and_wait_on_page_locked(): but now that it is dropped
    before waiting, the chance of impeding page migration is very much
    reduced. Should we perhaps disable preemption across this?
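
    A sketch of the resulting wait pattern in __migration_entry_wait()
    (simplified):

            page = migration_entry_to_page(entry);

            /* a brief reference keeps the struct page stable ... */
            if (!get_page_unless_zero(page))
                    goto out;
            pte_unmap_unlock(ptep, ptl);

            /* ... and is dropped inside the wait, so a waiter no longer
             * inflates the refcount that migrate_page_move_mapping() checks */
            put_and_wait_on_page_locked(page);
            return;
    out:
            pte_unmap_unlock(ptep, ptl);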

    shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
    survived a lot of testing before that showed up. PageWaiters may have
    been set by wait_on_page_bit_common(), and the reference dropped, just
    before shrink_page_list() succeeds in freezing its last page reference: in
    such a case, unlock_page() must be used. Follow the suggestion from
    Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
    that optimization predates PageWaiters, and won't buy much these days; but
    we can reinstate it for the !PageWaiters case if anyone notices.

    It does raise the question: should vmscan.c's is_page_cache_freeable() and
    __remove_mapping() now treat a PageWaiters page as if an extra reference
    were held? Perhaps, but I don't think it matters much, since
    shrink_page_list() already had to win its trylock_page(), so waiters are
    not very common there: I noticed no difference when trying the bigger
    change, and it's surely not needed while put_and_wait_on_page_locked() is
    only used for page migration.

    [willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
    Signed-off-by: Hugh Dickins
    Reported-by: Baoquan He
    Tested-by: Baoquan He
    Reviewed-by: Andrea Arcangeli
    Acked-by: Michal Hocko
    Acked-by: Linus Torvalds
    Acked-by: Vlastimil Babka
    Cc: Matthew Wilcox
    Cc: Baoquan He
    Cc: David Hildenbrand
    Cc: Mel Gorman
    Cc: David Herrmann
    Cc: Tim Chen
    Cc: Kan Liang
    Cc: Andi Kleen
    Cc: Davidlohr Bueso
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins