30 Dec, 2020

1 commit

  • [ Upstream commit 013339df116c2ee0d796dd8bfb8f293a2030c063 ]

    Since commit 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic
    v2"), the code to check the secondary MMU's page table access bit is
    broken for !(TTU_IGNORE_ACCESS) because the page is unmapped from the
    secondary MMU's page table before the check. More specifically, this
    affects secondary MMUs which unmap the memory in
    mmu_notifier_invalidate_range_start(), such as KVM.

    However, memory reclaim is the only user of !(TTU_IGNORE_ACCESS), i.e. the
    absence of TTU_IGNORE_ACCESS, and it explicitly performs the page table
    access check before trying to unmap the page. So, at worst, reclaim will
    miss accesses in a very short window if we remove the page table access
    check from the unmapping code.

    There is also an unintended consequence of !(TTU_IGNORE_ACCESS) for memcg
    reclaim. During memcg reclaim, page_referenced() only accounts for
    accesses from processes in the same memcg as the target page, but the
    unmapping code considers accesses from all processes, decreasing the
    effectiveness of memcg reclaim.

    The simplest solution is to always assume TTU_IGNORE_ACCESS in unmapping
    code.
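
    For reference, a minimal sketch of the kind of access check that this
    removes from try_to_unmap_one() (reconstructed from the description above,
    not the verbatim diff):

        /* With TTU_IGNORE_ACCESS always assumed, this block goes away. */
        if (!(flags & TTU_IGNORE_ACCESS)) {
                if (ptep_clear_flush_young_notify(vma, address, pvmw.pte)) {
                        /* Recently referenced: give up on unmapping it. */
                        ret = false;
                        page_vma_mapped_walk_done(&pvmw);
                        break;
                }
        }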

    Link: https://lkml.kernel.org/r/20201104231928.1494083-1-shakeelb@google.com
    Fixes: 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2")
    Signed-off-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Jerome Glisse
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Shakeel Butt
     

15 Nov, 2020

1 commit

  • Qian Cai reported the following BUG in [1]

    LTP: starting move_pages12
    BUG: unable to handle page fault for address: ffffffffffffffe0
    ...
    RIP: 0010:anon_vma_interval_tree_iter_first+0xa2/0x170 avc_start_pgoff at mm/interval_tree.c:63
    Call Trace:
    rmap_walk_anon+0x141/0xa30 rmap_walk_anon at mm/rmap.c:1864
    try_to_unmap+0x209/0x2d0 try_to_unmap at mm/rmap.c:1763
    migrate_pages+0x1005/0x1fb0
    move_pages_and_store_status.isra.47+0xd7/0x1a0
    __x64_sys_move_pages+0xa5c/0x1100
    do_syscall_64+0x5f/0x310
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Hugh Dickins diagnosed this as a migration bug caused by code introduced
    to use i_mmap_rwsem for pmd sharing synchronization. Specifically, the
    routine unmap_and_move_huge_page() is always passing the TTU_RMAP_LOCKED
    flag to try_to_unmap() while holding i_mmap_rwsem. This is wrong for
    anon pages as the anon_vma_lock should be held in this case. Further
    analysis suggested that i_mmap_rwsem was not required to be held at all
    when calling try_to_unmap for anon pages as an anon page could never be
    part of a shared pmd mapping.

    Discussion also revealed that the hack in hugetlb_page_mapping_lock_write
    to drop page lock and acquire i_mmap_rwsem is wrong. There is no way to
    keep mapping valid while dropping page lock.

    This patch does the following:

    - Do not take i_mmap_rwsem and set TTU_RMAP_LOCKED for anon pages when
    calling try_to_unmap.

    - Remove the hacky code in hugetlb_page_mapping_lock_write. The routine
    will now simply do a 'trylock' while still holding the page lock. If
    the trylock fails, it will return NULL. This could impact the
    callers:

    - migration calling code will receive -EAGAIN and retry up to the
    hard coded limit (10).

    - memory error code will treat the page as BUSY. This will force
    killing (SIGKILL) any mapping tasks instead of sending them SIGBUS.

    Do note that this change in behavior only happens when there is a
    race. None of the standard kernel testing suites actually hit this
    race, but it is possible.
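
    A minimal sketch of the simplified routine, assuming it keeps the
    hugetlb_page_mapping_lock_write() name and uses i_mmap_trylock_write()
    (reconstructed from the description, not the exact diff):

        struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage)
        {
                struct address_space *mapping = page_mapping(hpage);

                if (!mapping)
                        return mapping;

                /* The page lock is still held; just try for i_mmap_rwsem. */
                if (i_mmap_trylock_write(mapping))
                        return mapping;

                return NULL;    /* callers see the -EAGAIN/BUSY cases above */
        }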

    [1] https://lore.kernel.org/lkml/20200708012044.GC992@lca.pw/
    [2] https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2010071833100.2214@eggly.anvils/

    Fixes: c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
    Reported-by: Qian Cai
    Suggested-by: Hugh Dickins
    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Acked-by: Naoya Horiguchi
    Cc:
    Link: https://lkml.kernel.org/r/20201105195058.78401-1-mike.kravetz@oracle.com
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

19 Oct, 2020

1 commit

    There is no need to check whether this process has the right to modify the
    specified process when they are the same, and we can also skip the
    security hook call when a process is modifying its own pages. Add a helper
    function to handle these cases.
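
    A sketch of such a helper in the move_pages() path (names and details are
    illustrative, reconstructed from the description rather than the patch):

        static struct mm_struct *find_mm_struct(pid_t pid, nodemask_t *mem_nodes)
        {
                struct task_struct *task;
                struct mm_struct *mm;

                /* pid == 0 means "myself": no permission or LSM checks needed. */
                if (!pid) {
                        mmget(current->mm);
                        *mem_nodes = cpuset_mems_allowed(current);
                        return current->mm;
                }

                rcu_read_lock();
                task = find_task_by_vpid(pid);
                if (!task) {
                        rcu_read_unlock();
                        return ERR_PTR(-ESRCH);
                }
                get_task_struct(task);
                rcu_read_unlock();

                /* Another task: keep the ptrace and security checks. */
                if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
                        mm = ERR_PTR(-EPERM);
                        goto out;
                }
                if (security_task_movememory(task)) {
                        mm = ERR_PTR(-EPERM);
                        goto out;
                }

                *mem_nodes = cpuset_mems_allowed(task);
                mm = get_task_mm(task);
        out:
                put_task_struct(task);
                return mm;
        }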

    Suggested-by: Matthew Wilcox
    Signed-off-by: Hongxiang Lou
    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Cc: Christopher Lameter
    Link: https://lkml.kernel.org/r/20200819083331.19012-1-linmiaohe@huawei.com
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     

17 Oct, 2020

1 commit

    This patch changes the way we set and handle in-use poisoned pages. Until
    now, poisoned pages were released to the buddy allocator, trusting that
    the checks that take place at allocation time would act as a safety net
    and would skip that page.

    This has proved to be wrong, as there are pfn walkers out there, like
    compaction, that only care whether the page is in a buddy freelist.

    Although this might not be the only user, having poisoned pages in the
    buddy allocator seems a bad idea as we should only have free pages that
    are ready and meant to be used as such.

    Before explaining the approach taken, let us break down the kinds of pages
    we can soft offline.

    - Anonymous THP (after the split, they end up being 4K pages)
    - Hugetlb
    - Order-0 pages (that can be either migrated or invalidated)

    * Normal pages (order-0 and anon-THP)

    - If they are clean and unmapped page cache pages, we invalidate
    them by means of invalidate_inode_page().
    - If they are mapped/dirty, we do the isolate-and-migrate dance.

    Either way, do not call put_page directly from those paths. Instead, we
    keep the page and send it to page_handle_poison to perform the right
    handling.

    page_handle_poison sets the HWPoison flag and does the last put_page.

    Down the chain, we placed a check for HWPoison page in
    free_pages_prepare, that just skips any poisoned page, so those pages
    do not end up in any pcplist/freelist.

    After that, we set the refcount on the page to 1 and we increment
    the poisoned pages counter.
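
    A sketch of the flow just described, assuming the helper keeps the
    page_handle_poison() name (abridged from the description, not the
    verbatim patch):

        static void page_handle_poison(struct page *page)
        {
                SetPageHWPoison(page);  /* free_pages_prepare() now bails out */
                put_page(page);         /* last reference heads for the allocator... */
                page_ref_inc(page);     /* ...but was skipped, so pin it back at 1 */
                num_poisoned_pages_inc();
        }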

    If we see that the check in free_pages_prepare creates trouble, we can
    always do what we do for free pages:

    - wait until the page hits buddy's freelists
    - take it off, and flag it

    The downside of the above approach is that we could race with an
    allocation, so by the time we want to take the page off the buddy, the
    page may have already been allocated, and so we cannot soft offline it.
    But the user could always retry it.

    * Hugetlb pages

    - We isolate-and-migrate them

    After the migration has been successful, we call dissolve_free_huge_page,
    and we set HWPoison on the page if we succeed.
    Hugetlb has a slightly different handling though.

    While for non-hugetlb pages we cared about closing the race with an
    allocation, doing so for hugetlb pages requires quite some additional
    and intrusive code (we would need to hook in free_huge_page and some other
    places).
    So I decided to not make the code overly complicated and just fail
    normally if the page was allocated in the meantime.

    We can always build on top of this.

    As a bonus, because of the way we now handle in-use pages, we no longer
    need the put-as-isolation-migratetype dance that guarded against poisoned
    pages ending up in pcplists.

    Signed-off-by: Oscar Salvador
    Signed-off-by: Andrew Morton
    Acked-by: Naoya Horiguchi
    Cc: "Aneesh Kumar K.V"
    Cc: Aneesh Kumar K.V
    Cc: Aristeu Rozanski
    Cc: Dave Hansen
    Cc: David Hildenbrand
    Cc: Dmitry Yakunin
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Oscar Salvador
    Cc: Qian Cai
    Cc: Tony Luck
    Link: https://lkml.kernel.org/r/20200922135650.1634-10-osalvador@suse.de
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     

14 Oct, 2020

3 commits

    Device public memory never had an in-tree consumer and was removed in
    commit 25b2995a35b6 ("mm: remove MEMORY_DEVICE_PUBLIC support"). Delete
    the obsolete comment.

    Signed-off-by: Ralph Campbell
    Signed-off-by: Andrew Morton
    Reviewed-by: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Christoph Hellwig
    Link: http://lkml.kernel.org/r/20200827190735.12752-2-rcampbell@nvidia.com
    Signed-off-by: Linus Torvalds

    Ralph Campbell
     
    The struct migrate_vma->cpages field is only used in
    migrate_vma_setup(). There is no need to decrement it in
    migrate_vma_finalize() since it is never checked.

    Signed-off-by: Ralph Campbell
    Signed-off-by: Andrew Morton
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Christoph Hellwig
    Link: http://lkml.kernel.org/r/20200827190735.12752-1-rcampbell@nvidia.com
    Signed-off-by: Linus Torvalds

    Ralph Campbell
     
  • Pull block updates from Jens Axboe:

    - Series of merge handling cleanups (Baolin, Christoph)

    - Series of blk-throttle fixes and cleanups (Baolin)

    - Series cleaning up BDI, separating the block device from the
    backing_dev_info (Christoph)

    - Removal of bdget() as a generic API (Christoph)

    - Removal of blkdev_get() as a generic API (Christoph)

    - Cleanup of is-partition checks (Christoph)

    - Series reworking disk revalidation (Christoph)

    - Series cleaning up bio flags (Christoph)

    - bio crypt fixes (Eric)

    - IO stats inflight tweak (Gabriel)

    - blk-mq tags fixes (Hannes)

    - Buffer invalidation fixes (Jan)

    - Allow soft limits for zone append (Johannes)

    - Shared tag set improvements (John, Kashyap)

    - Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel)

    - DM no-wait support (Mike, Konstantin)

    - Request allocation improvements (Ming)

    - Allow md/dm/bcache to use IO stat helpers (Song)

    - Series improving blk-iocost (Tejun)

    - Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang,
    Xianting, Yang, Yufen, yangerkun)

    * tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits)
    block: fix uapi blkzoned.h comments
    blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue
    blk-mq: get rid of the dead flush handle code path
    block: get rid of unnecessary local variable
    block: fix comment and add lockdep assert
    blk-mq: use helper function to test hw stopped
    block: use helper function to test queue register
    block: remove redundant mq check
    block: invoke blk_mq_exit_sched no matter whether have .exit_sched
    percpu_ref: don't refer to ref->data if it isn't allocated
    block: ratelimit handle_bad_sector() message
    blk-throttle: Re-use the throtl_set_slice_end()
    blk-throttle: Open code __throtl_de/enqueue_tg()
    blk-throttle: Move service tree validation out of the throtl_rb_first()
    blk-throttle: Move the list operation after list validation
    blk-throttle: Fix IO hang for a corner case
    blk-throttle: Avoid tracking latency if low limit is invalid
    blk-throttle: Avoid getting the current time if tg->last_finish_time is 0
    blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()
    block: Remove redundant 'return' statement
    ...

    Linus Torvalds
     

27 Sep, 2020

1 commit

    PageTransHuge() returns true for both THP and hugetlb pages, so the thp
    stats were counting both THP and hugetlb migrations. Exclude hugetlb
    migrations by setting the is_thp variable correctly.

    Clean up the thp handling code too while we are at it.
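
    The distinction boils down to something like the following (illustrative
    sketch of the described fix):

        /* A hugetlb page is also PageTransHuge(); exclude it explicitly. */
        is_thp = PageTransHuge(page) && !PageHuge(page);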

    Fixes: 1a5bae25e3cf ("mm/vmstat: add events for THP migration without split")
    Signed-off-by: Zi Yan
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Cc: Anshuman Khandual
    Link: https://lkml.kernel.org/r/20200917210413.1462975-1-zi.yan@sent.com
    Signed-off-by: Linus Torvalds

    Zi Yan
     

20 Sep, 2020

1 commit

  • hugetlbfs pages do not participate in memcg: so although they do find most
    of migrate_page_states() useful, it would be better if they did not call
    into mem_cgroup_migrate() - where Qian Cai reported that LTP's
    move_pages12 triggers the warning in Alex Shi's prospective commit
    "mm/memcg: warning on !memcg after readahead page charged".

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Alex Shi
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Qian Cai
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008301359460.5954@eggly.anvils
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

06 Sep, 2020

4 commits

  • The code to remove a migration PTE and replace it with a device private
    PTE was not copying the soft dirty bit from the migration entry. This
    could lead to page contents not being marked dirty when faulting the page
    back from device private memory.
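
    A hedged sketch of the missing propagation when rebuilding the device
    private entry in remove_migration_pte() (assumed location, not the
    verbatim patch):

        entry = make_device_private_entry(new, pte_write(pte));
        pte = swp_entry_to_pte(entry);
        if (pte_swp_soft_dirty(*pvmw.pte))
                pte = pte_swp_mksoft_dirty(pte);        /* previously dropped */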

    Signed-off-by: Ralph Campbell
    Signed-off-by: Andrew Morton
    Reviewed-by: Christoph Hellwig
    Cc: Jerome Glisse
    Cc: Alistair Popple
    Cc: Christoph Hellwig
    Cc: Jason Gunthorpe
    Cc: Bharata B Rao
    Link: https://lkml.kernel.org/r/20200831212222.22409-3-rcampbell@nvidia.com
    Signed-off-by: Linus Torvalds

    Ralph Campbell
     
  • Patch series "mm/migrate: preserve soft dirty in remove_migration_pte()".

    I happened to notice this from code inspection after seeing Alistair
    Popple's patch ("mm/rmap: Fixup copying of soft dirty and uffd ptes").

    This patch (of 2):

    The check for is_zone_device_page() and is_device_private_page() is
    unnecessary since the latter is sufficient to determine if the page is a
    device private page. Simplify the code for easier reading.

    Signed-off-by: Ralph Campbell
    Signed-off-by: Andrew Morton
    Reviewed-by: Christoph Hellwig
    Cc: Jerome Glisse
    Cc: Alistair Popple
    Cc: Christoph Hellwig
    Cc: Jason Gunthorpe
    Cc: Bharata B Rao
    Link: https://lkml.kernel.org/r/20200831212222.22409-1-rcampbell@nvidia.com
    Link: https://lkml.kernel.org/r/20200831212222.22409-2-rcampbell@nvidia.com
    Signed-off-by: Linus Torvalds

    Ralph Campbell
     
  • During memory migration a pte is temporarily replaced with a migration
    swap pte. Some pte bits from the existing mapping such as the soft-dirty
    and uffd write-protect bits are preserved by copying these to the
    temporary migration swap pte.

    However these bits are not stored at the same location for swap and
    non-swap ptes. Therefore testing these bits requires using the
    appropriate helper function for the given pte type.

    Unfortunately several code locations were found where the wrong helper
    function was being used to test the soft_dirty and uffd_wp bits, which
    leads to them getting incorrectly set or cleared during page migration.

    Fix these by using the correct tests based on pte type.
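
    Illustration of the distinction (a hedged sketch, not taken from the
    patch): the helper has to match the pte type at the point of the test.

        if (pte_present(pte)) {                 /* normal mapping */
                soft_dirty = pte_soft_dirty(pte);
                uffd_wp    = pte_uffd_wp(pte);
        } else {                                /* swap/migration entry */
                soft_dirty = pte_swp_soft_dirty(pte);
                uffd_wp    = pte_swp_uffd_wp(pte);
        }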

    Fixes: a5430dda8a3a ("mm/migrate: support un-addressable ZONE_DEVICE page in migration")
    Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
    Fixes: f45ec5ff16a7 ("userfaultfd: wp: support swap and page migration")
    Signed-off-by: Alistair Popple
    Signed-off-by: Andrew Morton
    Reviewed-by: Peter Xu
    Cc: Jérôme Glisse
    Cc: John Hubbard
    Cc: Ralph Campbell
    Cc: Alistair Popple
    Cc:
    Link: https://lkml.kernel.org/r/20200825064232.10023-2-alistair@popple.id.au
    Signed-off-by: Linus Torvalds

    Alistair Popple
     
  • Commit f45ec5ff16a75 ("userfaultfd: wp: support swap and page migration")
    introduced support for tracking the uffd wp bit during page migration.
    However the non-swap PTE variant was used to set the flag for zone device
    private pages which are a type of swap page.

    This leads to corruption of the swap offset if the original PTE has the
    uffd_wp flag set.

    Fixes: f45ec5ff16a75 ("userfaultfd: wp: support swap and page migration")
    Signed-off-by: Alistair Popple
    Signed-off-by: Andrew Morton
    Reviewed-by: Peter Xu
    Cc: Jérôme Glisse
    Cc: John Hubbard
    Cc: Ralph Campbell
    Link: https://lkml.kernel.org/r/20200825064232.10023-1-alistair@popple.id.au
    Signed-off-by: Linus Torvalds

    Alistair Popple
     

15 Aug, 2020

1 commit

  • The thp prefix is more frequently used than hpage and we should be
    consistent between the various functions.

    [akpm@linux-foundation.org: fix mm/migrate.c]

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Cc: Mike Kravetz
    Cc: David Hildenbrand
    Cc: "Kirill A. Shutemov"
    Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

13 Aug, 2020

9 commits

  • There is a well-defined migration target allocation callback. Use it.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Christoph Hellwig
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/1594622517-20681-7-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
    There are some similar functions for migration target allocation. Since
    there is no fundamental difference, it's better to keep just one rather
    than keeping all the variants. This patch implements a base migration
    target allocation function. In the following patches, the variants will
    be converted to use this function.

    Changes should be mechanical but, unfortunately, there are some
    differences. First, some callers' nodemask is assigned NULL, since a NULL
    nodemask is considered to mean all available nodes, that is,
    &node_states[N_MEMORY]. Second, for hugetlb page allocation, gfp_mask is
    redefined as the regular hugetlb allocation gfp_mask plus __GFP_THISNODE
    if the user-provided gfp_mask has it. This is because a future caller of
    this function requires this node constraint to be set. Lastly, if the
    provided nodeid is NUMA_NO_NODE, the nodeid is set to the node where the
    migration source lives. This helps to remove simple wrappers for setting
    up the nodeid.

    Note that PageHighmem() call in previous function is changed to open-code
    "is_highmem_idx()" since it provides more readability.

    [akpm@linux-foundation.org: tweak patch title, per Vlastimil]
    [akpm@linux-foundation.org: fix typo in comment]

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Christoph Hellwig
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/1594622517-20681-6-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • …egular THP allocations

    new_page_nodemask is a migration callback and it tries to use common gfp
    flags for the target page allocation whether it is a base page or a THP.
    The latter only adds GFP_TRANSHUGE to the given mask. This results in the
    allocation being slightly more aggressive than necessary because the
    resulting gfp mask will also contain __GFP_RECLAIM_KSWAPD. THP
    allocations usually exclude this flag to reduce over-eager background
    reclaim during a high THP allocation load, which has been seen during
    large mmap initialization. There is no indication that this is a problem
    for migration as well, but theoretically the same might happen when
    migrating large mappings to a different node. Make the migration callback
    consistent with regular THP allocations.

    [akpm@linux-foundation.org: fix comment typo, per Vlastimil]

    Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
    Cc: Roman Gushchin <guro@fb.com>
    Link: http://lkml.kernel.org/r/1594622517-20681-5-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Joonsoo Kim
     
    There is no difference between the two migration callback functions,
    alloc_huge_page_node() and alloc_huge_page_nodemask(), except
    __GFP_THISNODE handling. It's redundant to have two almost identical
    functions just to handle this flag. So, this patch removes one of them by
    introducing a new argument, gfp_mask, to alloc_huge_page_nodemask().

    After introducing the gfp_mask argument, it's the caller's job to provide
    the correct gfp_mask. So, every callsite of alloc_huge_page_nodemask() is
    changed to provide one.

    Note that it's safe to remove the node id check in alloc_huge_page_node()
    since there is no caller passing NUMA_NO_NODE as a node id.
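
    For reference, the resulting prototype presumably looks like this (sketch
    based on the description above):

        struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
                                              nodemask_t *nmask, gfp_t gfp_mask);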

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Reviewed-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Christoph Hellwig
    Cc: Naoya Horiguchi
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/1594622517-20681-4-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
    It's not a performance-sensitive function. Move it to a .c file. This is
    a preparation step for a future change.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Acked-by: Mike Kravetz
    Acked-by: Michal Hocko
    Cc: Christoph Hellwig
    Cc: Naoya Horiguchi
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/1594622517-20681-3-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Drop the repeated word "and".

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Zi Yan
    Link: http://lkml.kernel.org/r/20200801173822.14973-8-rdunlap@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
    Add the following new vmstat events, which will help in validating THP
    migration without split. Statistics reported through these new VM events
    will help in performance debugging.

    1. THP_MIGRATION_SUCCESS
    2. THP_MIGRATION_FAILURE
    3. THP_MIGRATION_SPLIT

    In addition, these new events also update the normal page migration
    statistics appropriately via PGMIGRATE_SUCCESS and PGMIGRATE_FAILURE.
    While here, this updates the current trace event 'mm_migrate_pages' to
    accommodate the now-available THP statistics.

    [akpm@linux-foundation.org: s/hpage_nr_pages/thp_nr_pages/]
    [ziy@nvidia.com: v2]
    Link: http://lkml.kernel.org/r/C5E3C65C-8253-4638-9D3C-71A61858BB8B@nvidia.com
    [anshuman.khandual@arm.com: s/thp_nr_pages/hpage_nr_pages/]
    Link: http://lkml.kernel.org/r/1594287583-16568-1-git-send-email-anshuman.khandual@arm.com

    Signed-off-by: Anshuman Khandual
    Signed-off-by: Zi Yan
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Cc: Zi Yan
    Cc: John Hubbard
    Cc: Naoya Horiguchi
    Link: http://lkml.kernel.org/r/1594080415-27924-1-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     
  • Patch series "mm/migrate: optimize migrate_vma_setup() for holes".

    A simple optimization for migrate_vma_*() when the source vma is not an
    anonymous vma and a new test case to exercise it.

    This patch (of 2):

    When migrating system memory to device private memory, if the source
    address range is a valid VMA range and there is no memory or a zero page,
    the source PFN array is marked as valid but with no PFN.

    This lets the device driver allocate private memory and clear it, then
    insert the new device private struct page into the CPU's page tables when
    migrate_vma_pages() is called. migrate_vma_pages() only inserts the new
    page if the VMA is an anonymous range.

    There is no point in telling the device driver to allocate device private
    memory and then not migrate the page. Instead, mark the source PFN array
    entries as not migrating to avoid this overhead.
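
    A sketch of the hole-collection path with this optimization, assuming the
    usual migrate_vma_collect_hole() walker (reconstructed from the
    description, not the verbatim patch):

        static int migrate_vma_collect_hole(unsigned long start, unsigned long end,
                                            __always_unused int depth,
                                            struct mm_walk *walk)
        {
                struct migrate_vma *migrate = walk->private;
                unsigned long addr;

                /* Only anonymous VMAs can be populated by the device driver. */
                if (!vma_is_anonymous(walk->vma)) {
                        for (addr = start; addr < end; addr += PAGE_SIZE) {
                                migrate->src[migrate->npages] = 0;  /* do not migrate */
                                migrate->dst[migrate->npages] = 0;
                                migrate->npages++;
                        }
                        return 0;
                }

                for (addr = start; addr < end; addr += PAGE_SIZE) {
                        migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE;
                        migrate->dst[migrate->npages] = 0;
                        migrate->npages++;
                }

                return 0;
        }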

    [rcampbell@nvidia.com: v2]
    Link: http://lkml.kernel.org/r/20200710194840.7602-2-rcampbell@nvidia.com

    Signed-off-by: Ralph Campbell
    Signed-off-by: Andrew Morton
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Christoph Hellwig
    Cc: Jason Gunthorpe
    Cc: "Bharata B Rao"
    Cc: Shuah Khan
    Link: http://lkml.kernel.org/r/20200710194840.7602-1-rcampbell@nvidia.com
    Link: http://lkml.kernel.org/r/20200709165711.26584-1-rcampbell@nvidia.com
    Link: http://lkml.kernel.org/r/20200709165711.26584-2-rcampbell@nvidia.com
    Signed-off-by: Linus Torvalds

    Ralph Campbell
     
    In the current implementation, newly created or swapped-in anonymous pages
    are started on the active list. Growing the active list results in
    rebalancing the active/inactive lists, so old pages on the active list are
    demoted to the inactive list. Hence, a page on the active list isn't
    protected at all.

    Following is an example of this situation.

    Assume that 50 hot pages on active list. Numbers denote the number of
    pages on active/inactive list (active | inactive).

    1. 50 hot pages on active list
    50(h) | 0

    2. workload: 50 newly created (used-once) pages
    50(uo) | 50(h)

    3. workload: another 50 newly created (used-once) pages
    50(uo) | 50(uo), swap-out 50(h)

    This patch tries to fix this issue. As with the file LRU, newly created or
    swapped-in anonymous pages will be inserted into the inactive list. They
    are promoted to the active list if enough references happen. This simple
    modification changes the above example as follows.

    1. 50 hot pages on active list
    50(h) | 0

    2. workload: 50 newly created (used-once) pages
    50(h) | 50(uo)

    3. workload: another 50 newly created (used-once) pages
    50(h) | 50(uo), swap-out 50(uo)

    As you can see, hot pages on active list would be protected.

    Note that this implementation has a drawback: a page cannot be promoted
    and will be swapped out if its re-access interval is greater than the size
    of the inactive list but less than the size of the total (active +
    inactive) lists. To solve this potential issue, a following patch will
    apply workingset detection similar to the one that's already applied to
    the file LRU.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Link: http://lkml.kernel.org/r/1595490560-15117-3-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

08 Aug, 2020

1 commit

  • On x86_64, when CONFIG_MMU_NOTIFIER is not set/enabled, there is a
    compiler error:

    mm/migrate.c: In function 'migrate_vma_collect':
    mm/migrate.c:2481:7: error: 'struct mmu_notifier_range' has no member named 'migrate_pgmap_owner'
    range.migrate_pgmap_owner = migrate->pgmap_owner;
    ^
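
    One way to avoid touching the notifier-only field in that configuration
    is a build-time guard along these lines (illustrative sketch only, not
    necessarily the fix that was applied):

        #ifdef CONFIG_MMU_NOTIFIER
                range.migrate_pgmap_owner = migrate->pgmap_owner;
        #endif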

    Fixes: 998427b3ad2c ("mm/notifier: add migration invalidation type")
    Reported-by: Randy Dunlap
    Signed-off-by: Ralph Campbell
    Signed-off-by: Andrew Morton
    Tested-by: Randy Dunlap
    Acked-by: Randy Dunlap
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Christoph Hellwig
    Cc: "Jason Gunthorpe"
    Link: http://lkml.kernel.org/r/20200806193353.7124-1-rcampbell@nvidia.com
    Signed-off-by: Linus Torvalds

    Ralph Campbell
     

06 Aug, 2020

1 commit

  • Pull hmm updates from Jason Gunthorpe:
    "Ralph has been working on nouveau's use of hmm_range_fault() and
    migrate_vma() which resulted in this small series. It adds reporting
    of the page table order from hmm_range_fault() and some optimization
    of migrate_vma():

    - Report the size of the page table mapping out of hmm_range_fault().

    This makes it easier to establish a large/huge/etc mapping in the
    device's page table.

    - Allow devices to ignore the invalidations during migration in cases
    where the migration is not going to change pages.

    For instance migrating pages to a device does not require the
    device to invalidate pages already in the device.

    - Update nouveau and hmm_tests to use the above"

    * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
    mm/hmm/test: use the new migration invalidation
    nouveau/svm: use the new migration invalidation
    mm/notifier: add migration invalidation type
    mm/migrate: add a flags parameter to migrate_vma
    nouveau: fix storing invalid ptes
    nouveau/hmm: support mapping large sysmem pages
    nouveau: fix mapping 2MB sysmem pages
    nouveau/hmm: fault one page at a time
    mm/hmm: add tests for hmm_pfn_to_map_order()
    mm/hmm: provide the page mapping order in hmm_range_fault()

    Linus Torvalds
     

29 Jul, 2020

2 commits

  • Currently migrate_vma_setup() calls mmu_notifier_invalidate_range_start()
    which flushes all device private page mappings whether or not a page is
    being migrated to/from device private memory.

    In order to not disrupt device mappings that are not being migrated, shift
    the responsibility for clearing device private mappings to the device
    driver and leave CPU page table unmapping handled by
    migrate_vma_setup().

    To support this, the caller of migrate_vma_setup() should always set
    struct migrate_vma::pgmap_owner to a non NULL value that matches the
    device private page->pgmap->owner. This value is then passed to the struct
    mmu_notifier_range with a new event type which the driver's invalidation
    function can use to avoid device MMU invalidations.
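
    A sketch of how a driver's interval-notifier callback can use this
    information (the driver names are hypothetical; the event type and field
    follow the description above):

        static bool my_dev_invalidate(struct mmu_interval_notifier *mni,
                                      const struct mmu_notifier_range *range,
                                      unsigned long cur_seq)
        {
                struct my_dev *mdev = container_of(mni, struct my_dev, notifier);

                /* Ignore invalidations caused by our own migrations. */
                if (range->event == MMU_NOTIFY_MIGRATE &&
                    range->migrate_pgmap_owner == mdev->pgmap_owner)
                        return true;

                if (mmu_notifier_range_blockable(range))
                        mutex_lock(&mdev->mutex);
                else if (!mutex_trylock(&mdev->mutex))
                        return false;

                mmu_interval_set_seq(mni, cur_seq);
                /* ...zap device page tables for [range->start, range->end)... */
                mutex_unlock(&mdev->mutex);
                return true;
        }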

    Link: https://lore.kernel.org/r/20200723223004.9586-4-rcampbell@nvidia.com
    Signed-off-by: Ralph Campbell
    Signed-off-by: Jason Gunthorpe

    Ralph Campbell
     
  • The src_owner field in struct migrate_vma is being used for two purposes,
    it acts as a selection filter for which types of pages are to be migrated
    and it identifies device private pages owned by the caller.

    Split this into separate parameters so the src_owner field can be used
    just to identify device private pages owned by the caller of
    migrate_vma_setup().

    Rename the src_owner field to pgmap_owner to reflect it is now used only
    to identify which device private pages to migrate.

    Link: https://lore.kernel.org/r/20200723223004.9586-3-rcampbell@nvidia.com
    Signed-off-by: Ralph Campbell
    Reviewed-by: Bharata B Rao
    Signed-off-by: Jason Gunthorpe

    Ralph Campbell
     

09 Jul, 2020

1 commit

  • I realize that we fairly recently raised it to 4.8, but the fact is, 4.9
    is a much better minimum version to target.

    We have a number of workarounds for actual bugs in pre-4.9 gcc versions
    (including things like internal compiler errors on ARM), but we also
    have some syntactic workarounds for lacking features.

    In particular, raising the minimum to 4.9 means that we can now just
    assume _Generic() exists, which is likely the much better replacement
    for a lot of very convoluted build-time magic with conditionals on
    sizeof and/or __builtin_choose_expr() with same_type() etc.

    Using _Generic also means that you will need to have a very recent
    version of 'sparse', but that's easy to build yourself, and much less of
    a hassle than some old gcc version can be.

    The latest (in a long string) of reasons for minimum compiler version
    upgrades was commit 5435f73d5c4a ("efi/x86: Fix build with gcc 4").

    Ard points out that RHEL 7 uses gcc-4.8, but the people who stay back on
    old RHEL versions presumably also don't build their own kernels anyway.
    And maybe they should cross-build or just have a little side affair with
    a newer compiler?
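
    As a quick illustration (generic C, not kernel code), _Generic() lets a
    macro dispatch on the type of its argument without any
    __builtin_choose_expr()/same_type() chain:

        /* Type-generic absolute value using the standard library functions. */
        #define abs_of(x) _Generic((x),         \
                int:            abs,            \
                long:           labs,           \
                long long:      llabs,          \
                double:         fabs)(x)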

    Acked-by: Ard Biesheuvel
    Acked-by: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

10 Jun, 2020

3 commits

  • Convert comments that reference mmap_sem to reference mmap_lock instead.

    [akpm@linux-foundation.org: fix up linux-next leftovers]
    [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
    [akpm@linux-foundation.org: more linux-next fixups, per Michel]

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Daniel Jordan
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Convert comments that reference old mmap_sem APIs to reference
    corresponding new mmap locking APIs instead.

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Davidlohr Bueso
    Reviewed-by: Daniel Jordan
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-12-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This change converts the existing mmap_sem rwsem calls to use the new mmap
    locking API instead.

    The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

04 Jun, 2020

5 commits

  • Swapin faults were the last event to charge pages after they had already
    been put on the LRU list. Now that we charge directly on swapin, the
    lrucare portion of the charge code is unused.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Joonsoo Kim
    Cc: Alex Shi
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Balbir Singh
    Cc: Shakeel Butt
    Link: http://lkml.kernel.org/r/20200508183105.225460-19-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • With the page->mapping requirement gone from memcg, we can charge anon and
    file-thp pages in one single step, right after they're allocated.

    This removes two out of three API calls - especially the tricky commit
    step that needed to happen at just the right time between when the page is
    "set up" and when it's "published" - somewhat vague and fluid concepts
    that varied by page type. All we need is a freshly allocated page and a
    memcg context to charge.
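
    A hedged sketch of the single-step charge for a freshly allocated anon
    page (the call shape follows the series' mem_cgroup_charge(); the
    surrounding details are illustrative):

        page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, addr);
        if (page && mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL)) {
                put_page(page);         /* charge failed: back out immediately */
                page = NULL;
        }
        /* no separate commit/cancel steps any more */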

    v2: prevent double charges on pre-allocated hugepages in khugepaged

    [hannes@cmpxchg.org: Fix crash - *hpage could be ERR_PTR instead of NULL]
    Link: http://lkml.kernel.org/r/20200512215813.GA487759@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Joonsoo Kim
    Cc: Alex Shi
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Shakeel Butt
    Cc: Balbir Singh
    Cc: Qian Cai
    Link: http://lkml.kernel.org/r/20200508183105.225460-13-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Memcg maintains a private MEMCG_RSS counter. This divergence from the
    generic VM accounting means unnecessary code overhead, and creates a
    dependency for memcg that page->mapping is set up at the time of charging,
    so that page types can be told apart.

    Convert the generic accounting sites to mod_lruvec_page_state and friends
    to maintain the per-cgroup vmstat counter of NR_ANON_MAPPED. We use
    lock_page_memcg() to stabilize page->mem_cgroup during rmap changes, the
    same way we do for NR_FILE_MAPPED.

    With the previous patch removing MEMCG_CACHE and the private NR_SHMEM
    counter, this patch finally eliminates the need to have page->mapping set
    up at charge time. However, we need to have page->mem_cgroup set up by
    the time rmap runs and does the accounting, so switch the commit and the
    rmap callbacks around.

    v2: fix temporary accounting bug by switching rmap<->commit order (Joonsoo)

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Cc: Alex Shi
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Shakeel Butt
    Cc: Balbir Singh
    Link: http://lkml.kernel.org/r/20200508183105.225460-11-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Memcg maintains private MEMCG_CACHE and NR_SHMEM counters. This
    divergence from the generic VM accounting means unnecessary code overhead,
    and creates a dependency for memcg that page->mapping is set up at the
    time of charging, so that page types can be told apart.

    Convert the generic accounting sites to mod_lruvec_page_state and friends
    to maintain the per-cgroup vmstat counters of NR_FILE_PAGES and NR_SHMEM.
    The page is already locked in these places, so page->mem_cgroup is stable;
    we only need minimal tweaks of two mem_cgroup_migrate() calls to ensure
    it's set up in time.

    Then replace MEMCG_CACHE with NR_FILE_PAGES and delete the private
    NR_SHMEM accounting sites.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Joonsoo Kim
    Cc: Alex Shi
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Shakeel Butt
    Cc: Balbir Singh
    Link: http://lkml.kernel.org/r/20200508183105.225460-10-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The memcg charging API carries a boolean @compound parameter that tells
    whether the page we're dealing with is a hugepage.
    mem_cgroup_commit_charge() has another boolean @lrucare that indicates
    whether the page needs LRU locking or not while charging. The majority of
    callsites know those parameters at compile time, which results in a lot of
    naked "false, false" argument lists. This makes for cryptic code and is a
    breeding ground for subtle mistakes.

    Thankfully, the huge page state can be inferred from the page itself and
    doesn't need to be passed along. This is safe because charging completes
    before the page is published and before anybody may split it.

    Simplify the callsites by removing @compound, and let memcg infer the
    state by using hpage_nr_pages() unconditionally. That function does
    PageTransHuge() to identify huge pages, which also helpfully asserts that
    nobody passes in tail pages by accident.

    The following patches will introduce a new charging API, best not to carry
    over unnecessary weight.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Alex Shi
    Reviewed-by: Joonsoo Kim
    Reviewed-by: Shakeel Butt
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Balbir Singh
    Link: http://lkml.kernel.org/r/20200508183105.225460-4-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

03 Jun, 2020

3 commits

  • Just finished bisecting mmotm, to find why a test which used to take
    four minutes now took more than an hour: the __buffer_migrate_page()
    cleanup left behind a get_page() which attach_page_private() now does.
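
    For context, a sketch of the helpers involved (their stock semantics):
    attach_page_private() already takes a page reference of its own, so a
    leftover get_page() pins the page a second time.

        static inline void attach_page_private(struct page *page, void *data)
        {
                get_page(page);                 /* takes its own reference */
                set_page_private(page, (unsigned long)data);
                SetPagePrivate(page);
        }

        static inline void *detach_page_private(struct page *page)
        {
                void *data = (void *)page_private(page);

                if (!PagePrivate(page))
                        return NULL;
                ClearPagePrivate(page);
                set_page_private(page, 0);
                put_page(page);                 /* drops that reference */

                return data;
        }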

    Fixes: cd0f37154443 ("mm/migrate.c: call detach_page_private to cleanup code")
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
    We can clean up the code a little by calling detach_page_private() here.

    [akpm@linux-foundation.org: use attach_page_private(), per Dave]
    http://lkml.kernel.org/r/20200521225220.GV2005@dread.disaster.area
    [akpm@linux-foundation.org: clear PagePrivate]
    Signed-off-by: Guoqing Jiang
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Chao Yu
    Cc: Cong Wang
    Cc: Dave Chinner
    Cc: Eric Biggers
    Cc: Gao Xiang
    Cc: Jaegeuk Kim
    Cc: John Hubbard
    Cc: Joseph Qi
    Cc: Junxiao Bi
    Cc: Michal Hocko
    Cc: Zi Yan
    Cc: Johannes Thumshirn
    Cc: Miklos Szeredi
    Link: http://lkml.kernel.org/r/20200519214049.15179-1-guoqing.jiang@cloud.ionos.com
    Signed-off-by: Linus Torvalds

    Guoqing Jiang
     
  • Implement the new readahead aop and convert all callers (block_dev,
    exfat, ext2, fat, gfs2, hpfs, isofs, jfs, nilfs2, ocfs2, omfs, qnx6,
    reiserfs & udf).

    The callers are all trivial except for GFS2 & OCFS2.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: Junxiao Bi # ocfs2
    Reviewed-by: Joseph Qi # ocfs2
    Reviewed-by: Dave Chinner
    Reviewed-by: John Hubbard
    Reviewed-by: Christoph Hellwig
    Reviewed-by: William Kucharski
    Cc: Chao Yu
    Cc: Cong Wang
    Cc: Darrick J. Wong
    Cc: Eric Biggers
    Cc: Gao Xiang
    Cc: Jaegeuk Kim
    Cc: Michal Hocko
    Cc: Zi Yan
    Cc: Johannes Thumshirn
    Cc: Miklos Szeredi
    Link: http://lkml.kernel.org/r/20200414150233.24495-17-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)