19 Oct, 2019

1 commit

  • Include <linux/huge_mm.h> for the definition of is_vma_temporary_stack
    to fix the following sparse warning:

    mm/rmap.c:1673:6: warning: symbol 'is_vma_temporary_stack' was not declared. Should it be static?

    Link: http://lkml.kernel.org/r/20191009151155.27763-1-ben.dooks@codethink.co.uk
    Signed-off-by: Ben Dooks
    Reviewed-by: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Dooks
     

25 Sep, 2019

4 commits

  • This patch is (hopefully) the first step to enable THP for non-shmem
    filesystems.

    This patch enables an application to put part of its text sections into
    THP via madvise, for example:

    madvise((void *)0x600000, 0x200000, MADV_HUGEPAGE);

    We tried to reuse the logic for THP on tmpfs.

    Currently, write is not supported for non-shmem THP. khugepaged will only
    process vma with VM_DENYWRITE. sys_mmap() ignores VM_DENYWRITE requests
    (see ksys_mmap_pgoff). The only way to create vma with VM_DENYWRITE is
    execve(). This requirement limits non-shmem THP to text sections.

    The next patch will handle writes, which would only happen when all the
    vmas with VM_DENYWRITE are unmapped.

    An EXPERIMENTAL config, READ_ONLY_THP_FOR_FS, is added to gate this
    feature.
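
    As an illustration of the kind of call above, a minimal userspace sketch;
    the address, length and 2MB alignment are assumptions for the example and
    would normally be derived from the binary's actual text mapping:

        #include <stdint.h>
        #include <sys/mman.h>

        /*
         * Hint the kernel to back part of the text mapping with THPs.
         * start/len are hypothetical; a real program would look them up
         * (e.g. in /proc/self/maps) and round to 2MB boundaries.
         */
        static void hint_text_hugepages(void)
        {
                uintptr_t start = 0x600000;     /* assumed 2MB-aligned text start */
                size_t len = 0x200000;          /* one 2MB chunk */

                madvise((void *)start, len, MADV_HUGEPAGE);
        }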

    [songliubraving@fb.com: fix build without CONFIG_SHMEM]
    Link: http://lkml.kernel.org/r/F53407FB-96CC-42E8-9862-105C92CC2B98@fb.com
    [songliubraving@fb.com: fix double unlock in collapse_file()]
    Link: http://lkml.kernel.org/r/B960CBFA-8EFC-4DA4-ABC5-1977FFF2CA57@fb.com
    Link: http://lkml.kernel.org/r/20190801184244.3169074-7-songliubraving@fb.com
    Signed-off-by: Song Liu
    Acked-by: Rik van Riel
    Acked-by: Kirill A. Shutemov
    Acked-by: Johannes Weiner
    Cc: Stephen Rothwell
    Cc: Dan Carpenter
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: William Kucharski
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     
  • Replace 1 << compound_order(page) with compound_nr(page). Minor
    improvements in readability.
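
    As a sketch of the conversion pattern (illustrative, not a specific hunk
    from the series):

        /* before: open-coded number of pages in a potentially compound page */
        nr_pages = 1 << compound_order(page);

        /* after: the helper reads the same information */
        nr_pages = compound_nr(page);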

    Link: http://lkml.kernel.org/r/20190721104612.19120-4-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Andrew Morton
    Reviewed-by: Ira Weiny
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Patch series "Make working with compound pages easier", v2.

    These three patches add three helpers and convert the appropriate
    places to use them.

    This patch (of 3):

    It's unnecessarily hard to find out the size of a potentially huge page.
    Replace 'PAGE_SIZE << compound_order(page)' with page_size(page).
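
    A sketch of the conversion for page sizes (illustrative, not a specific
    hunk from the series):

        /* before: size in bytes of a potentially huge page */
        len = PAGE_SIZE << compound_order(page);

        /* after: the new helper expresses the intent directly */
        len = page_size(page);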

    Link: http://lkml.kernel.org/r/20190721104612.19120-2-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Reviewed-by: Ira Weiny
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Fixes gcc '-Wunused-but-set-variable' warning:

    mm/rmap.c: In function page_mkclean_one:
    mm/rmap.c:906:17: warning: variable cstart set but not used [-Wunused-but-set-variable]

    It is not used any more since
    commit cdb07bdea28e ("mm/rmap.c: remove redundant variable cend")

    Link: http://lkml.kernel.org/r/20190724141453.38536-1-yuehaibing@huawei.com
    Signed-off-by: YueHaibing
    Reported-by: Hulk Robot
    Reviewed-by: Mike Kravetz
    Reviewed-by: Kirill Tkhai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    YueHaibing
     

14 Aug, 2019

1 commit

  • When migrating an anonymous private page to a ZONE_DEVICE private page,
    the source page->mapping and page->index fields are copied to the
    destination ZONE_DEVICE struct page and the page_mapcount() is
    increased. This is so rmap_walk() can be used to unmap and migrate the
    page back to system memory.

    However, try_to_unmap_one() computes the subpage pointer from a swap pte,
    which yields an invalid page pointer, and a kernel panic results, such
    as:

    BUG: unable to handle page fault for address: ffffea1fffffffc8

    Currently, only single pages can be migrated to device private memory so
    no subpage computation is needed and it can be set to "page".
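
    A rough sketch of the idea in try_to_unmap_one(); the surrounding code is
    omitted and the exact branch structure is an assumption based on the
    description above:

        /* device-private migration branch in try_to_unmap_one() (sketch) */
        if (is_zone_device_page(page)) {
                /*
                 * Only single pages are currently migrated to device
                 * private memory, so do not derive subpage from the swap
                 * pte (that yields a bogus pointer); the subpage is simply
                 * the page itself.
                 */
                subpage = page;
        }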

    [rcampbell@nvidia.com: add comment]
    Link: http://lkml.kernel.org/r/20190724232700.23327-4-rcampbell@nvidia.com
    Link: http://lkml.kernel.org/r/20190719192955.30462-4-rcampbell@nvidia.com
    Fixes: a5430dda8a3a1c ("mm/migrate: support un-addressable ZONE_DEVICE page in migration")
    Signed-off-by: Ralph Campbell
    Cc: "Jérôme Glisse"
    Cc: "Kirill A. Shutemov"
    Cc: Mike Kravetz
    Cc: Christoph Hellwig
    Cc: Jason Gunthorpe
    Cc: John Hubbard
    Cc: Andrea Arcangeli
    Cc: Andrey Ryabinin
    Cc: Christoph Lameter
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: Ira Weiny
    Cc: Jan Kara
    Cc: Lai Jiangshan
    Cc: Logan Gunthorpe
    Cc: Martin Schwidefsky
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Randy Dunlap
    Cc: Vlastimil Babka
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ralph Campbell
     

15 May, 2019

4 commits

  • We already have pra.mapcount, so there is no need to call page_mapped(),
    which may do some complicated computation for a compound page.
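
    The change boils down to something like the following in page_referenced()
    (a sketch based on the description, not the verbatim diff); pra.mapcount
    is initialised from total_mapcount(page), so a zero value already tells us
    the page is not mapped:

        struct page_referenced_arg pra = {
                .mapcount = total_mapcount(page),
                .memcg = memcg,
        };

        /* was: if (!page_mapped(page)) return 0; */
        if (!pra.mapcount)
                return 0;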

    Link: http://lkml.kernel.org/r/20190404054828.2731-1-sjhuang@iluvatar.ai
    Signed-off-by: Huang Shijie
    Acked-by: Kirill A. Shutemov
    Cc: Mike Kravetz
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie
     
  • This updates each existing invalidation to use the correct mmu notifier
    event that represents what is happening to the CPU page table. See the
    patch which introduced the events for the rationale behind this.
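
    After this series a call site picks the event that matches the operation,
    roughly like this (a sketch; the exact arguments depend on the call site):

        struct mmu_notifier_range range;

        /* e.g. clearing page table entries for a vma range */
        mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma,
                                vma->vm_mm, start, end);
        mmu_notifier_invalidate_range_start(&range);
        /* ... modify the page table entries ... */
        mmu_notifier_invalidate_range_end(&range);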

    Link: http://lkml.kernel.org/r/20190326164747.24405-7-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Ralph Campbell
    Reviewed-by: Ira Weiny
    Cc: Christian König
    Cc: Joonas Lahtinen
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Peter Xu
    Cc: Felix Kuehling
    Cc: Jason Gunthorpe
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: John Hubbard
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • CPU page table updates can happen for many reasons, not only as a result
    of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
    a result of kernel activities (memory compression, reclaim, migration,
    ...).

    Users of the mmu notifier API track changes to the CPU page table and take
    specific action for them. However, the current API only provides the range
    of virtual addresses affected by the change, not why the change is
    happening.

    This patchset does the initial mechanical conversion of all the places that
    call mmu_notifier_range_init to also provide the default MMU_NOTIFY_UNMAP
    event as well as the vma if it is known (most invalidations happen against
    a given vma). Passing down the vma allows the users of mmu notifier to
    inspect the new vma page protection.

    The MMU_NOTIFY_UNMAP is always the safe default, as users of mmu notifier
    should assume that every mapping for the range is going away when that
    event happens. A later patch converts the mm call paths to use more
    appropriate events for each call.

    This is done as 2 patches so that no call site is forgotten, especially
    as it uses the following coccinelle patch:

    %<----------------------------------------------------------------------
    @@
    expression E1, E3, E4;
    identifier VMA;
    @@
    <...
    mmu_notifier_range_init(E1,
    +MMU_NOTIFY_UNMAP, 0, VMA,
    VMA->vm_mm, E3, E4)
    ...>

    @@
    expression E1, E2, E3, E4;
    identifier FN, VMA;
    @@
    FN(..., struct vm_area_struct *VMA, ...) {
    <...
    mmu_notifier_range_init(E1,
    +MMU_NOTIFY_UNMAP, 0, VMA,
    E2, E3, E4)
    ...>
    }

    @@
    expression E1, E2, E3, E4;
    identifier FN, VMA;
    @@
    FN(...) {
    struct vm_area_struct *VMA;
    <...
    mmu_notifier_range_init(E1,
    +MMU_NOTIFY_UNMAP, 0, VMA,
    E2, E3, E4)
    ...>
    }

    @@
    expression E1, E2, E3, E4;
    identifier FN;
    @@
    FN(...) {
    <...
    mmu_notifier_range_init(E1,
    +MMU_NOTIFY_UNMAP, 0, NULL,
    E2, E3, E4)
    ...>
    }
    ---------------------------------------------------------------------->%

    Applied with:
    spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
    spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
    spatch --sp-file mmu-notifier.spatch --dir mm --in-place

    Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Ralph Campbell
    Reviewed-by: Ira Weiny
    Cc: Christian König
    Cc: Joonas Lahtinen
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Peter Xu
    Cc: Felix Kuehling
    Cc: Jason Gunthorpe
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: John Hubbard
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • MADV_DONTNEED is handled with mmap_sem taken in read mode. We call
    page_mkclean without holding mmap_sem.

    MADV_DONTNEED implies that pages in the region are unmapped and subsequent
    access to the pages in that range is handled as a new page fault. This
    implies that if we don't have parallel access to the region when
    MADV_DONTNEED is run, we expect that range to be unallocated.

    w.r.t. page_mkclean() we need to make sure that we don't break the
    MADV_DONTNEED semantics. MADV_DONTNEED checks for pmd_none without holding
    the pmd lock. This implies we skip the pmd if we temporarily mark the pmd
    none. Avoid doing that while marking the page clean.

    Keep the sequence the same for dax too, even though we don't support
    MADV_DONTNEED for dax mappings.

    The bug was noticed by code review and I didn't observe any failures w.r.t.
    test runs. This is similar to

    commit 58ceeb6bec86d9140f9d91d71a710e963523d063
    Author: Kirill A. Shutemov
    Date: Thu Apr 13 14:56:26 2017 -0700

    thp: fix MADV_DONTNEED vs. MADV_FREE race

    commit ced108037c2aa542b3ed8b7afd1576064ad1362a
    Author: Kirill A. Shutemov
    Date: Thu Apr 13 14:56:20 2017 -0700

    thp: fix MADV_DONTNEED vs. numa balancing race

    Link: http://lkml.kernel.org/r/20190321040610.14226-1-aneesh.kumar@linux.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Reviewed-by: Andrew Morton
    Cc: Dan Williams
    Cc:"Kirill A . Shutemov"
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     

06 Mar, 2019

1 commit

  • We have a common pattern to access the lru_lock from a page pointer:
    zone_lru_lock(page_zone(page))

    Which is silly, because it unfolds to this:
    &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)].zone_pgdat->lru_lock
    while we can simply do
    &NODE_DATA(page_to_nid(page))->lru_lock

    Remove the zone_lru_lock() function, since it only complicates things. Use
    the 'page_pgdat(page)->lru_lock' pattern instead.
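
    At a typical call site the conversion looks roughly like this
    (illustrative):

        /* before */
        spin_lock_irq(zone_lru_lock(page_zone(page)));

        /* after */
        spin_lock_irq(&page_pgdat(page)->lru_lock);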

    [aryabinin@virtuozzo.com: a slightly better version of __split_huge_page()]
    Link: http://lkml.kernel.org/r/20190301121651.7741-1-aryabinin@virtuozzo.com
    Link: http://lkml.kernel.org/r/20190228083329.31892-2-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Acked-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: William Kucharski
    Cc: John Hubbard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

10 Jan, 2019

1 commit

  • The conversion to use a structure for mmu_notifier_invalidate_range_*()
    unintentionally changed the usage in try_to_unmap_one() to init the
    'struct mmu_notifier_range' with vma->vm_start instead of @address,
    i.e. it invalidates the wrong address range. Revert to the correct
    address range.

    Manifests as KVM use-after-free WARNINGs and subsequent "BUG: Bad page
    state in process X" errors when reclaiming from a KVM guest due to KVM
    removing the wrong pages from its own mappings.
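
    The fix amounts to initialising the range from the address being unmapped
    again, roughly as follows (a sketch of the corrected call in
    try_to_unmap_one(), using the range-init signature of that kernel
    version):

        /* was initialised with vma->vm_start, invalidating the wrong range */
        mmu_notifier_range_init(&range, vma->vm_mm, address,
                                min(vma->vm_end,
                                    address + (PAGE_SIZE << compound_order(page))));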

    Reported-by: leozinho29_eu@hotmail.com
    Reported-by: Mike Galbraith
    Reported-and-tested-by: Adam Borowski
    Reviewed-by: Jérôme Glisse
    Reviewed-by: Pankaj gupta
    Cc: Christian König
    Cc: Jan Kara
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krčmář
    Cc: Michal Hocko
    Cc: Felix Kuehling
    Cc: Ralph Campbell
    Cc: John Hubbard
    Cc: Andrew Morton
    Fixes: ac46d4f3c432 ("mm/mmu_notifier: use structure for invalidate_range_start/end calls v2")
    Signed-off-by: Sean Christopherson
    Signed-off-by: Linus Torvalds

    Sean Christopherson
     

09 Jan, 2019

1 commit

  • This reverts b43a9990055958e70347c56f90ea2ae32c67334c

    The reverted commit caused issues with migration and poisoning of anon
    huge pages. With it applied, the LTP move_pages12 test triggers an "unable
    to handle kernel NULL pointer" BUG with a stack similar to:

    RIP: 0010:down_write+0x1b/0x40
    Call Trace:
    migrate_pages+0x81f/0xb90
    __ia32_compat_sys_migrate_pages+0x190/0x190
    do_move_pages_to_node.isra.53.part.54+0x2a/0x50
    kernel_move_pages+0x566/0x7b0
    __x64_sys_move_pages+0x24/0x30
    do_syscall_64+0x5b/0x180
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    The purpose of the reverted patch was to fix some long existing races
    with huge pmd sharing. It used i_mmap_rwsem for this purpose with the
    idea that this could also be used to address truncate/page fault races
    with another patch. Further analysis has determined that i_mmap_rwsem
    can not be used to address all these hugetlbfs synchronization issues.
    Therefore, revert this patch while working on another approach to the
    underlying issues.

    Link: http://lkml.kernel.org/r/20190103235452.29335-2-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reported-by: Jan Stancek
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K . V"
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Davidlohr Bueso
    Cc: Prakash Sangappa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

29 Dec, 2018

3 commits

  • While looking at BUGs associated with invalid huge page map counts, it was
    discovered and observed that a huge pte pointer could become 'invalid' and
    point to another task's page table. Consider the following:

    A task takes a page fault on a shared hugetlbfs file and calls
    huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
    shared pmd.

    Now, another task truncates the hugetlbfs file. As part of truncation, it
    unmaps everyone who has the file mapped. If the range being truncated is
    covered by a shared pmd, huge_pmd_unshare will be called. For all but the
    last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
    to the pmd. If the task in the middle of the page fault is not the last
    user, the ptep returned by huge_pte_alloc now points to another task's
    page table or worse. This leads to bad things such as incorrect page
    map/reference counts or invalid memory references.

    To fix, expand the use of i_mmap_rwsem as follows:

    - i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
    huge_pmd_share is only called via huge_pte_alloc, so callers of
    huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers
    of huge_pte_alloc continue to hold the semaphore until finished with the
    ptep.

    - i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is
    called.
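
    Schematically, the locking protocol described above looks like this (a
    sketch, not the literal diff; 'sz' stands for the huge page size):

        /* page fault path: a shared pmd may be set up */
        i_mmap_lock_read(mapping);
        ptep = huge_pte_alloc(mm, address, sz);
        /* ... keep using ptep while still holding the semaphore ... */
        i_mmap_unlock_read(mapping);

        /* truncate/unmap path: a shared pmd may be torn down */
        i_mmap_lock_write(mapping);
        huge_pmd_unshare(mm, &address, ptep);
        i_mmap_unlock_write(mapping);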

    [mike.kravetz@oracle.com: add explicit check for mapping != null]
    Link: http://lkml.kernel.org/r/20181218223557.5202-2-mike.kravetz@oracle.com
    Fixes: 39dde65c9940 ("shared page table for hugetlb page")
    Signed-off-by: Mike Kravetz
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K . V"
    Cc: Andrea Arcangeli
    Cc: Davidlohr Bueso
    Cc: Prakash Sangappa
    Cc: Colin Ian King
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • This function has been identical to __page_set_anon_rmap() since the time
    it was introduced (8 years ago). The patch removes the function and makes
    its users call __page_set_anon_rmap() instead.

    Link: http://lkml.kernel.org/r/154504875359.30235.6237926369392564851.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Cc: Jerome Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • To avoid having to change many call sites every time we want to add a
    parameter, use a structure to group all parameters for the mmu_notifier
    invalidate_range_start/end calls. No functional changes with this patch.
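
    After this change a caller builds a range object once and passes it to
    both calls, roughly like this (a sketch; the init signature shown is the
    one from this series and grew extra arguments in later releases, see the
    15 May 2019 entries above):

        struct mmu_notifier_range range;

        mmu_notifier_range_init(&range, mm, start, end);
        mmu_notifier_invalidate_range_start(&range);
        /* ... modify the page tables for [start, end) ... */
        mmu_notifier_invalidate_range_end(&range);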

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Acked-by: Christian König
    Acked-by: Jan Kara
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Felix Kuehling
    Cc: Ralph Campbell
    Cc: John Hubbard
    From: Jérôme Glisse
    Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3

    fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n

    Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

01 Dec, 2018

1 commit

  • The term "freeze" is used in several ways in the kernel, and in mm it
    has the particular meaning of forcing page refcount temporarily to 0.
    freeze_page() is just too confusing a name for a function that unmaps a
    page: rename it unmap_page(), and rename unfreeze_page() to remap_page().

    Went to change the mention of freeze_page() added later in mm/rmap.c,
    but found it to be incorrect: ordinary page reclaim reaches there too;
    but the substance of the comment still seems correct, so edit it down.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261514080.2275@eggly.anvils
    Fixes: e9b61f19858a5 ("thp: reintroduce split_huge_page()")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Jerome Glisse
    Cc: Konstantin Khlebnikov
    Cc: Matthew Wilcox
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

06 Oct, 2018

1 commit

  • The page migration code employs try_to_unmap() to try and unmap the source
    page. This is accomplished by using rmap_walk to find all vmas where the
    page is mapped. This search stops when page mapcount is zero. For shared
    PMD huge pages, the page map count is always 1 no matter the number of
    mappings. Shared mappings are tracked via the reference count of the PMD
    page. Therefore, try_to_unmap stops prematurely and does not completely
    unmap all mappings of the source page.

    This problem can result in data corruption as writes to the original
    source page can happen after contents of the page are copied to the target
    page. Hence, data is lost.

    This problem was originally seen as DB corruption of shared global areas
    after a huge page was soft offlined due to ECC memory errors. DB
    developers noticed they could reproduce the issue by (hotplug) offlining
    memory used to back huge pages. A simple testcase can reproduce the
    problem by creating a shared PMD mapping (note that this must be at least
    PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using
    migrate_pages() to migrate process pages between nodes while continually
    writing to the huge pages being migrated.

    To fix, have the try_to_unmap_one routine check for huge PMD sharing by
    calling huge_pmd_unshare for hugetlbfs huge pages. If it is a shared
    mapping it will be 'unshared' which removes the page table entry and drops
    the reference on the PMD page. After this, flush caches and TLB.

    mmu notifiers are called before locking page tables, but we can not be
    sure of PMD sharing until page tables are locked. Therefore, check for
    the possibility of PMD sharing before locking so that notifiers can
    prepare for the worst possible case.
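
    The core of the change in try_to_unmap_one() is roughly the following
    (a sketch based on the description; 'start' and 'end' stand for the
    PUD-sized range covered by the shared pmd):

        if (PageHuge(page) && huge_pmd_unshare(mm, &address, pvmw.pte)) {
                /*
                 * huge_pmd_unshare() cleared the pud and dropped the
                 * reference on the shared pmd page; flush caches and TLB
                 * for the whole range it covered.
                 */
                flush_cache_range(vma, start, end);
                flush_tlb_range(vma, start, end);
        }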

    Link: http://lkml.kernel.org/r/20180823205917.16297-2-mike.kravetz@oracle.com
    [mike.kravetz@oracle.com: make _range_in_vma() a static inline]
    Link: http://lkml.kernel.org/r/6063f215-a5c8-2f0c-465a-2c515ddc952d@oracle.com
    Fixes: 39dde65c9940 ("shared page table for hugetlb page")
    Signed-off-by: Mike Kravetz
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Naoya Horiguchi
    Acked-by: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Jerome Glisse
    Cc: Mike Kravetz
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Mike Kravetz
     

15 Jul, 2018

1 commit

  • KVM guests on s390 can notify the host of unused pages. This can result
    in pte_unused callbacks to be true for KVM guest memory.

    If a page is unused (checked with pte_unused) we might drop this page
    instead of paging it. This can have side-effects on userfaultfd, when
    the page in question was already migrated:

    The next access of that page will trigger a fault and a user fault
    instead of faulting in a new and empty zero page. As QEMU does not
    expect a userfault on an already migrated page this migration will fail.

    The most straightforward solution is to ignore the pte_unused hint if a
    userfault context is active for this VMA.
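
    In try_to_unmap_one() the change is essentially one extra condition
    (sketch):

        /* before */
        } else if (pte_unused(pteval)) {

        /* after: respect an armed userfaultfd context on this vma */
        } else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {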

    Link: http://lkml.kernel.org/r/20180703171854.63981-1-borntraeger@de.ibm.com
    Signed-off-by: Christian Borntraeger
    Cc: Martin Schwidefsky
    Cc: Andrea Arcangeli
    Cc: Mike Rapoport
    Cc: Janosch Frank
    Cc: David Hildenbrand
    Cc: Cornelia Huck
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christian Borntraeger
     

28 Apr, 2018

1 commit


21 Apr, 2018

1 commit

  • My testing for the latest kernel supporting thp migration showed an
    infinite loop in offlining the memory block that is filled with shmem
    thps. We can get out of the loop with a signal, but the kernel should
    return with failure in this case.

    What happens in the loop is that scan_movable_pages() repeats returning
    the same pfn without any progress. That's because page migration always
    fails for shmem thps.

    In memory offline code, memory blocks containing unmovable pages should be
    prevented from being offline targets by has_unmovable_pages() inside
    start_isolate_page_range(). So it's possible to change migratability for
    non-anonymous thps to avoid the issue, but it introduces more complex and
    thp-specific handling in migration code, so it might not be good.

    So this patch suggests fixing the issue by enabling thp migration for
    shmem thp. Both anon and shmem thps are migratable, so we don't need a
    precheck about the type of thp.

    Link: http://lkml.kernel.org/r/20180406030706.GA2434@hori1.linux.bs1.fc.nec.co.jp
    Fixes: commit 72b39cfc4d75 ("mm, memory_hotplug: do not fail offlining too early")
    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Cc: Zi Yan
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

17 Apr, 2018

2 commits

  • Mike Rapoport says:

    These patches convert files in Documentation/vm to ReST format, add an
    initial index and link it to the top level documentation.

    There are no content changes in the documentation, except a few spelling
    fixes. The relatively large diffstat stems from the indentation and
    paragraph wrapping changes.

    I've tried to keep the formatting as consistent as possible, but I could
    miss some places that needed markup and add some markup where it was not
    necessary.

    [jc: significant conflicts in vm/hmm.rst]

    Jonathan Corbet
     
  • Signed-off-by: Mike Rapoport
    Signed-off-by: Jonathan Corbet

    Mike Rapoport
     

12 Apr, 2018

1 commit

  • Remove the address_space ->tree_lock and use the xa_lock newly added to
    the radix_tree_root. Rename the address_space ->page_tree to ->i_pages,
    since we don't really care that it's a tree.
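
    At call sites the conversion looks roughly like this (illustrative, as of
    this change the tree is still a radix tree underneath):

        /* before */
        spin_lock_irq(&mapping->tree_lock);
        page = radix_tree_lookup(&mapping->page_tree, index);
        spin_unlock_irq(&mapping->tree_lock);

        /* after */
        xa_lock_irq(&mapping->i_pages);
        page = radix_tree_lookup(&mapping->i_pages, index);
        xa_unlock_irq(&mapping->i_pages);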

    [willy@infradead.org: fix nds32, fs/dax.c]
    Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.org
    Link: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Jeff Layton
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Ryusuke Konishi
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

06 Apr, 2018

1 commit


18 Mar, 2018

1 commit

  • If a processor supports special metadata for a page, for example ADI
    version tags on SPARC M7, this metadata must be saved when the page is
    swapped out. The same metadata must be restored when the page is swapped
    back in. This patch adds two new architecture specific functions -
    arch_do_swap_page() to be called when a page is swapped in, and
    arch_unmap_one() to be called when a page is being unmapped for swap
    out. These architecture hooks allow page metadata to be saved if the
    architecture supports it.
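
    The generic fallbacks are no-ops; their shape is roughly as follows (a
    sketch reconstructed from the description, the guard macro names are an
    assumption):

        #ifndef __HAVE_ARCH_DO_SWAP_PAGE
        static inline void arch_do_swap_page(struct mm_struct *mm,
                                             struct vm_area_struct *vma,
                                             unsigned long addr,
                                             pte_t pte, pte_t oldpte)
        {
                /* restore architecture-specific metadata (e.g. ADI tags) */
        }
        #endif

        #ifndef __HAVE_ARCH_UNMAP_ONE
        static inline int arch_unmap_one(struct mm_struct *mm,
                                         struct vm_area_struct *vma,
                                         unsigned long addr, pte_t orig_pte)
        {
                /* save architecture-specific metadata before swap-out */
                return 0;
        }
        #endif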

    Signed-off-by: Khalid Aziz
    Cc: Khalid Aziz
    Acked-by: Jerome Marchand
    Reviewed-by: Anthony Yznaga
    Acked-by: Andrew Morton
    Signed-off-by: David S. Miller

    Khalid Aziz
     

16 Nov, 2017

3 commits

  • Most callers of free_hot_cold_page claim the pages being released
    are cache hot. The exception is the page reclaim paths where it is
    likely that enough pages will be freed in the near future that the
    per-cpu lists are going to be recycled and the cache hotness information
    is lost. As no one really cares about the hotness of pages being
    released to the allocator, just ditch the parameter.

    The APIs are renamed to indicate that it's no longer about hot/cold
    pages. It should also be less confusing as there are subtle differences
    between them. __free_pages drops a reference and frees a page when the
    refcount reaches zero. free_hot_cold_page handled pages whose refcount
    was already zero which is non-obvious from the name. free_unref_page
    should be more obvious.

    No performance impact is expected as the overhead is marginal. The
    parameter is removed simply because it is a bit stupid to have a useless
    parameter copied everywhere.
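
    The rename at call sites is mechanical (sketch):

        /* before: the 'cold' hint was rarely meaningful */
        free_hot_cold_page(page, false);
        free_hot_cold_page_list(&page_list, true);

        /* after */
        free_unref_page(page);
        free_unref_page_list(&page_list);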

    [mgorman@techsingularity.net: add pages to head, not tail]
    Link: http://lkml.kernel.org/r/20171019154321.qtpzaeftoyyw4iey@techsingularity.net
    Link: http://lkml.kernel.org/r/20171018075952.10627-8-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Jan Kara
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Variable cend is set but never read, hence it is redundant and can be
    removed.

    Cleans up clang build warning: Value stored to 'cend' is never read

    Link: http://lkml.kernel.org/r/20171011174942.1372-1-colin.king@canonical.com
    Fixes: 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2")
    Signed-off-by: Colin Ian King
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Colin Ian King
     
  • This patch only affects users of mmu_notifier->invalidate_range callback
    which are device drivers related to ATS/PASID, CAPI, IOMMUv2, SVM ...
    and it is an optimization for those users. Everyone else is unaffected
    by it.

    When clearing a pte/pmd we are given a choice to notify the event under
    the page table lock (notify version of *_clear_flush helpers do call the
    mmu_notifier_invalidate_range). But that notification is not necessary
    in all cases.

    This patch removes almost all cases where it is useless to have a call
    to mmu_notifier_invalidate_range before
    mmu_notifier_invalidate_range_end. It also adds documentation in all
    those cases explaining why.

    Below is a more in-depth analysis of why it is fine to do this:

    For a secondary TLB (non-CPU TLB) like an IOMMU TLB or a device TLB (when
    the device uses things like ATS/PASID to get the IOMMU to walk the CPU
    page table to access a process virtual address space), there are only 2
    cases when you need to notify those secondary TLBs while holding the page
    table lock when clearing a pte/pmd:

    A) page backing address is free before mmu_notifier_invalidate_range_end
    B) a page table entry is updated to point to a new page (COW, write fault
    on zero page, __replace_page(), ...)

    Case A is obvious: you do not want to take the risk for the device to
    write to a page that might now be used by something completely different.

    Case B is more subtle. For correctness it requires the following sequence
    to happen:
    - take page table lock
    - clear page table entry and notify (pmd/pte_huge_clear_flush_notify())
    - set page table entry to point to new page

    If clearing the page table entry is not followed by a notify before
    setting the new pte/pmd value, then you can break memory models like C11
    or C++11 for the device.
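
    The required ordering in case B corresponds to the pattern used by e.g.
    the COW path, roughly (sketch, under the page table lock):

        ptep_clear_flush_notify(vma, address, ptep);    /* clear + notify secondary TLBs */
        set_pte_at_notify(mm, address, ptep, new_pte);  /* only then install the new page */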

    Consider the following scenario (device use a feature similar to ATS/
    PASID):

    Two addresses addrA and addrB such that |addrA - addrB| >= PAGE_SIZE; we
    assume they are write protected for COW (the other cases of B apply too).

    [Time N] -----------------------------------------------------------------
    CPU-thread-0 {try to write to addrA}
    CPU-thread-1 {try to write to addrB}
    CPU-thread-2 {}
    CPU-thread-3 {}
    DEV-thread-0 {read addrA and populate device TLB}
    DEV-thread-2 {read addrB and populate device TLB}
    [Time N+1] ---------------------------------------------------------------
    CPU-thread-0 {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
    CPU-thread-1 {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
    CPU-thread-2 {}
    CPU-thread-3 {}
    DEV-thread-0 {}
    DEV-thread-2 {}
    [Time N+2] ---------------------------------------------------------------
    CPU-thread-0 {COW_step1: {update page table point to new page for addrA}}
    CPU-thread-1 {COW_step1: {update page table point to new page for addrB}}
    CPU-thread-2 {}
    CPU-thread-3 {}
    DEV-thread-0 {}
    DEV-thread-2 {}
    [Time N+3] ---------------------------------------------------------------
    CPU-thread-0 {preempted}
    CPU-thread-1 {preempted}
    CPU-thread-2 {write to addrA which is a write to new page}
    CPU-thread-3 {}
    DEV-thread-0 {}
    DEV-thread-2 {}
    [Time N+3] ---------------------------------------------------------------
    CPU-thread-0 {preempted}
    CPU-thread-1 {preempted}
    CPU-thread-2 {}
    CPU-thread-3 {write to addrB which is a write to new page}
    DEV-thread-0 {}
    DEV-thread-2 {}
    [Time N+4] ---------------------------------------------------------------
    CPU-thread-0 {preempted}
    CPU-thread-1 {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
    CPU-thread-2 {}
    CPU-thread-3 {}
    DEV-thread-0 {}
    DEV-thread-2 {}
    [Time N+5] ---------------------------------------------------------------
    CPU-thread-0 {preempted}
    CPU-thread-1 {}
    CPU-thread-2 {}
    CPU-thread-3 {}
    DEV-thread-0 {read addrA from old page}
    DEV-thread-2 {read addrB from new page}

    So here, because at time N+2 the cleared page table entry was not paired
    with a notification to invalidate the secondary TLB, the device sees the
    new value for addrB before seeing the new value for addrA. This breaks
    total memory ordering for the device.

    When changing a pte to write protect, or to point to a new write protected
    page with the same content (KSM), it is ok to delay the invalidate_range
    callback to mmu_notifier_invalidate_range_end() outside the page table
    lock. This is true even if the thread doing the page table update is
    preempted right after releasing the page table lock, before calling
    mmu_notifier_invalidate_range_end.

    Thanks to Andrea for thinking of a problematic scenario for COW.

    [jglisse@redhat.com: v2]
    Link: http://lkml.kernel.org/r/20171017031003.7481-2-jglisse@redhat.com
    Link: http://lkml.kernel.org/r/20170901173011.10745-1-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Andrea Arcangeli
    Cc: Nadav Amit
    Cc: Joerg Roedel
    Cc: Suravee Suthikulpanit
    Cc: David Woodhouse
    Cc: Alistair Popple
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Stephen Rothwell
    Cc: Andrew Donnellan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

09 Sep, 2017

4 commits

  • Allow interval trees to quickly check for overlaps to avoid unnecessary
    tree lookups in interval_tree_iter_first().

    As of this patch, all interval tree flavors will require using a
    'rb_root_cached' such that we can have the leftmost node easily
    available. While most users will make use of this feature, those with
    special functions (in addition to the generic insert, delete, search
    calls) will avoid using the cached option as they can do funky things
    with insertions -- for example, vma_interval_tree_insert_after().
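
    For the flavours that do use the cached variant, the root type and the
    leftmost-node lookup change roughly like this (sketch):

        /* before */
        struct rb_root root = RB_ROOT;
        struct rb_node *first = rb_first(&root);

        /* after: the leftmost node is cached alongside the root */
        struct rb_root_cached croot = RB_ROOT_CACHED;
        struct rb_node *leftmost = rb_first_cached(&croot);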

    [jglisse@redhat.com: fix deadlock from typo vm_lock_anon_vma()]
    Link: http://lkml.kernel.org/r/20170808225719.20723-1-jglisse@redhat.com
    Link: http://lkml.kernel.org/r/20170719014603.19029-12-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Jérôme Glisse
    Acked-by: Christian König
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Doug Ledford
    Acked-by: Michael S. Tsirkin
    Cc: David Airlie
    Cc: Jason Wang
    Cc: Christian Benvenuti
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Allow unmapping and restoring the special swap entry of un-addressable
    ZONE_DEVICE memory.

    Link: http://lkml.kernel.org/r/20170817000548.32038-17-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Kirill A. Shutemov
    Cc: Aneesh Kumar
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Paul E. McKenney
    Cc: Ross Zwisler
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • Add thp migration's core code, including conversions between a PMD entry
    and a swap entry, setting PMD migration entry, removing PMD migration
    entry, and waiting on PMD migration entries.

    This patch makes it possible to support thp migration. If you fail to
    allocate a destination page as a thp, you just split the source thp as
    we do now, and then enter the normal page migration. If you succeed in
    allocating a destination thp, you enter thp migration. Subsequent patches
    actually enable thp migration for each caller of page migration by
    allowing its get_new_page() callback to allocate thps.

    [zi.yan@cs.rutgers.edu: fix gcc-4.9.0 -Wmissing-braces warning]
    Link: http://lkml.kernel.org/r/A0ABA698-7486-46C3-B209-E95A9048B22C@cs.rutgers.edu
    [akpm@linux-foundation.org: fix x86_64 allnoconfig warning]
    Signed-off-by: Zi Yan
    Acked-by: Kirill A. Shutemov
    Cc: "H. Peter Anvin"
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Cc: David Nellans
    Cc: Ingo Molnar
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zi Yan
     
  • TTU_MIGRATION is used to convert pte into migration entry until thp
    split completes. This behavior conflicts with thp migration added in later
    patches, so let's introduce a new TTU flag specifically for freezing.

    try_to_unmap() is used both for thp split (via freeze_page()) and page
    migration (via __unmap_and_move()). In freeze_page(), ttu_flag given
    for head page is like below (assuming anonymous thp):

    (TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \
    TTU_MIGRATION | TTU_SPLIT_HUGE_PMD)

    and ttu_flag given for tail pages is:

    (TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \
    TTU_MIGRATION)

    __unmap_and_move() calls try_to_unmap() with ttu_flag:

    (TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS)

    Now I'm trying to insert a branch for thp migration at the top of
    try_to_unmap_one() like below

    static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
                                unsigned long address, void *arg)
    {
            ...
            /* PMD-mapped THP migration entry */
            if (!pvmw.pte && (flags & TTU_MIGRATION)) {
                    if (!PageAnon(page))
                            continue;

                    set_pmd_migration_entry(&pvmw, page);
                    continue;
            }
            ...
    }

    so try_to_unmap() for tail pages called by thp split can go into thp
    migration code path (which converts *pmd* into migration entry), while
    the expectation is to freeze thp (which converts *pte* into migration
    entry.)

    I detected this failure as a "bad page state" error in a testcase where
    split_huge_page() is called from queue_pages_pte_range().

    Link: http://lkml.kernel.org/r/20170717193955.20207-4-zi.yan@sent.com
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Zi Yan
    Acked-by: Kirill A. Shutemov
    Cc: "H. Peter Anvin"
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Cc: David Nellans
    Cc: Ingo Molnar
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

01 Sep, 2017

1 commit

  • Replace all mmu_notifier_invalidate_page() calls by *_invalidate_range()
    and make sure it is bracketed by calls to *_invalidate_range_start()/end().

    Note that because we can not presume the pmd value or pte value, we have
    to assume the worst and unconditionally report an invalidation as
    happening.

    Changed since v2:
    - try_to_unmap_one() only one call to mmu_notifier_invalidate_range()
    - compute end with PAGE_SIZE << compound_order(page)
    - fix PageHuge() case in try_to_unmap_one()

    Signed-off-by: Jérôme Glisse
    Reviewed-by: Andrea Arcangeli
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Bernhard Held
    Cc: Adam Borowski
    Cc: Radim Krčmář
    Cc: Wanpeng Li
    Cc: Paolo Bonzini
    Cc: Takashi Iwai
    Cc: Nadav Amit
    Cc: Mike Galbraith
    Cc: Kirill A. Shutemov
    Cc: axie
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

30 Aug, 2017

1 commit

  • This reverts commit aac2fea94f7a3df8ad1eeb477eb2643f81fd5393.

    It turns out that that patch was complete and utter garbage, and broke
    KVM, resulting in odd oopses.

    Quoting Andrea Arcangeli:
    "The aforementioned commit has 3 bugs.

    1) mmu_notifier_invalidate_range cannot be used in replacement of
    mmu_notifier_invalidate_range_start/end.

    For KVM mmu_notifier_invalidate_range is a noop and rightfully so.

    A MMU notifier implementation has to implement either
    ->invalidate_range method or the invalidate_range_start/end
    methods, not both. And if you implement invalidate_range_start/end
    like KVM is forced to do, calling mmu_notifier_invalidate_range in
    common code is a noop for KVM.

    For those MMU notifiers that can get away only implementing
    ->invalidate_range, the ->invalidate_range is implicitly called by
    mmu_notifier_invalidate_range_end(). And only those secondary MMUs
    that share the same pagetable with the primary MMU (like AMD
    iommuv2) can get away only implementing ->invalidate_range.

    So all cases (THP on/off) are broken right now.

    To fix this is enough to replace mmu_notifier_invalidate_range with
    mmu_notifier_invalidate_range_start;mmu_notifier_invalidate_range_end.
    Either that or call multiple mmu_notifier_invalidate_page like
    before.

    2) address + (1UL << compound_order(page)) is buggy, it should be
    PAGE_SIZE << compound_order(page), it's bytes not pages, 2M not
    512.

    3) The whole invalidate_range thing was an attempt to call a single
    invalidate while walking multiple 4k ptes that maps the same THP
    (after a pmd virtual split without physical compound page THP
    split).

    It's unclear if the rmap_walk will always provide an address that
    is 2M aligned as parameter to try_to_unmap_one, in presence of THP.
    I think it needs also an address &= (PAGE_SIZE <<
    compound_order(page)) - 1 to be safe"

    In general, we should stop making excuses for horrible MMU notifier
    users. It's much more important that the core VM is sane and safe, than
    letting MMU notifiers sleep.

    So if some MMU notifier is sleeping under a spinlock, we need to fix the
    notifier, not try to make excuses for that garbage in the core VM.

    Reported-and-tested-by: Bernhard Held
    Reported-and-tested-by: Adam Borowski
    Cc: Andrea Arcangeli
    Cc: Radim Krčmář
    Cc: Wanpeng Li
    Cc: Paolo Bonzini
    Cc: Takashi Iwai
    Cc: Nadav Amit
    Cc: Mike Galbraith
    Cc: Kirill A. Shutemov
    Cc: Jérôme Glisse
    Cc: axie
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

11 Aug, 2017

1 commit

  • MMU notifiers can sleep, but in page_mkclean_one() we call
    mmu_notifier_invalidate_page() under page table lock.

    Let's instead use mmu_notifier_invalidate_range() outside
    page_vma_mapped_walk() loop.

    [jglisse@redhat.com: try_to_unmap_one() do not call mmu_notifier under ptl]
    Link: http://lkml.kernel.org/r/20170809204333.27485-1-jglisse@redhat.com
    Link: http://lkml.kernel.org/r/20170804134928.l4klfcnqatni7vsc@black.fi.intel.com
    Fixes: c7ab0d2fdc84 ("mm: convert try_to_unmap_one() to use page_vma_mapped_walk()")
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Jérôme Glisse
    Reported-by: axie
    Cc: Alex Deucher
    Cc: "Writer, Tim"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

03 Aug, 2017

1 commit

  • Nadav Amit identified a theoretical race between page reclaim and
    mprotect due to TLB flushes being batched outside of the PTL being held.

    He described the race as follows:

    CPU0                                    CPU1
    ----                                    ----
                                            user accesses memory using RW PTE
                                            [PTE now cached in TLB]
    try_to_unmap_one()
    ==> ptep_get_and_clear()
    ==> set_tlb_ubc_flush_pending()
                                            mprotect(addr, PROT_READ)
                                            ==> change_pte_range()
                                            ==> [ PTE non-present - no flush ]

                                            user writes using cached RW PTE
    ...

    try_to_unmap_flush()

    The same type of race exists for reads when protecting for PROT_NONE and
    also exists for operations that can leave an old TLB entry behind such
    as munmap, mremap and madvise.

    For some operations like mprotect, it's not necessarily a data integrity
    issue but it is a correctness issue as there is a window where an
    mprotect that limits access still allows access. For munmap, it's
    potentially a data integrity issue although the race is massive as an
    munmap, mmap and return to userspace must all complete between the
    window when reclaim drops the PTL and flushes the TLB. However, it's
    theoritically possible so handle this issue by flushing the mm if
    reclaim is potentially currently batching TLB flushes.

    Other instances where a flush is required for a present pte should be ok
    as either the page lock is held preventing parallel reclaim or a page
    reference count is elevated preventing a parallel free leading to
    corruption. In the case of page_mkclean there isn't an obvious path
    that userspace could take advantage of without using the operations that
    are guarded by this patch. Other users such as gup racing with reclaim
    look just at PTEs. Huge page variants should be ok as they
    don't race with reclaim. mincore only looks at PTEs. userfault also
    should be ok as if a parallel reclaim takes place, it will either fault
    the page back in or read some of the data before the flush occurs
    triggering a fault.

    Note that a variant of this patch was acked by Andy Lutomirski but this
    was for the x86 parts on top of his PCID work which didn't make the 4.13
    merge window as expected. His ack is dropped from this version and
    there will be a follow-on patch on top of PCID that will include his
    ack.

    [akpm@linux-foundation.org: tweak comments]
    [akpm@linux-foundation.org: fix spello]
    Link: http://lkml.kernel.org/r/20170717155523.emckq2esjro6hf3z@suse.de
    Reported-by: Nadav Amit
    Signed-off-by: Mel Gorman
    Cc: Andy Lutomirski
    Cc: [v4.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

07 Jul, 2017

2 commits

  • lruvecs are at the intersection of the NUMA node and memcg, which is the
    scope for most paging activity.

    Introduce a convenient accounting infrastructure that maintains
    statistics per node, per memcg, and the lruvec itself.

    Then convert over accounting sites for statistics that are already
    tracked in both nodes and memcgs and can be easily switched.
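
    Call sites that previously bumped only the node counter can then feed the
    same item through the new lruvec helper, roughly (sketch; the stat item is
    just an example):

        /* before: node-level counter only */
        __mod_node_page_state(page_pgdat(page), NR_FILE_DIRTY, 1);

        /* after: node, memcg and lruvec counters are updated together */
        __mod_lruvec_page_state(page, NR_FILE_DIRTY, 1);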

    [hannes@cmpxchg.org: fix crash in the new cgroup stat keeping code]
    Link: http://lkml.kernel.org/r/20170531171450.GA10481@cmpxchg.org
    [hannes@cmpxchg.org: don't track uncharged pages at all
    Link: http://lkml.kernel.org/r/20170605175254.GA8547@cmpxchg.org
    [hannes@cmpxchg.org: add missing free_percpu()]
    Link: http://lkml.kernel.org/r/20170605175354.GB8547@cmpxchg.org
    [linux@roeck-us.net: hexagon: fix build error caused by include file order]
    Link: http://lkml.kernel.org/r/20170617153721.GA4382@roeck-us.net
    Link: http://lkml.kernel.org/r/20170530181724.27197-6-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Signed-off-by: Guenter Roeck
    Acked-by: Vladimir Davydov
    Cc: Josef Bacik
    Cc: Michal Hocko
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Using set_pte_at() does not do the right thing when putting down
    HWPOISON swap entries for hugepages on architectures that support
    contiguous ptes.

    Fix this problem by using set_huge_swap_pte_at() which was introduced to
    fix exactly this problem.

    Link: http://lkml.kernel.org/r/20170522133604.11392-7-punit.agrawal@arm.com
    Signed-off-by: Punit Agrawal
    Acked-by: Steve Capper
    Cc: "Kirill A. Shutemov"
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Naoya Horiguchi
    Cc: Mike Kravetz
    Cc: Mark Rutland
    Cc: Hillf Danton
    Cc: Aneesh Kumar K.V
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Punit Agrawal
     

24 May, 2017

1 commit

  • try_to_unmap_flush() used to open-code a rather x86-centric flush
    sequence: local_flush_tlb() + flush_tlb_others(). Rearrange the
    code so that the arch (only x86 for now) provides
    arch_tlbbatch_add_mm() and arch_tlbbatch_flush() and the core code
    calls those functions instead.

    I'll want this for x86 because, to enable address space ids, I can't
    support the flush_tlb_others() mode used by the existing
    try_to_unmap_flush() implementation with good performance. I can
    support the new API fairly easily, though.

    I imagine that other architectures may be in a similar position.
    Architectures with strong remote flush primitives (arm64?) may have
    even worse performance problems with flush_tlb_others() the way that
    try_to_unmap_flush() uses it.
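
    With the new interface the core batching code ends up doing roughly
    (sketch; tlb_ubc is the per-task tlbflush_unmap_batch):

        /* when deferring a flush for a pte cleared during reclaim */
        arch_tlbbatch_add_mm(&tlb_ubc->arch, mm);
        tlb_ubc->flush_required = true;

        /* later, when the batch is drained */
        arch_tlbbatch_flush(&tlb_ubc->arch);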

    Signed-off-by: Andy Lutomirski
    Acked-by: Kees Cook
    Cc: Andrew Morton
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Nadav Amit
    Cc: Nadav Amit
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Sasha Levin
    Cc: Thomas Gleixner
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/19f25a8581f9fb77876b7ff3b001f89835e34ea3.1495492063.git.luto@kernel.org
    Signed-off-by: Ingo Molnar

    Andy Lutomirski