04 Sep, 2020

1 commit

  • Similarly to arch_validate_prot() called from do_mprotect_pkey(), an
    architecture may need to sanity-check the new vm_flags.

    Define a dummy function always returning true. In addition to
    do_mprotect_pkey(), also invoke it from mmap_region() prior to updating
    vma->vm_page_prot to allow the architecture code to veto potentially
    inconsistent vm_flags.
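
    A minimal sketch of such a dummy hook (shown here under the assumed name
    arch_validate_flags(); the generic fallback simply accepts everything and
    architectures override it):

    #ifndef arch_validate_flags
    /* Generic fallback: no architecture-specific restrictions on vm_flags. */
    static inline bool arch_validate_flags(unsigned long vm_flags)
    {
        return true;
    }
    #endif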

    Signed-off-by: Catalin Marinas
    Acked-by: Andrew Morton

    Catalin Marinas
     

10 Jun, 2020

3 commits

  • Convert comments that reference mmap_sem to reference mmap_lock instead.

    [akpm@linux-foundation.org: fix up linux-next leftovers]
    [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
    [akpm@linux-foundation.org: more linux-next fixups, per Michel]

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Daniel Jordan
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This change converts the existing mmap_sem rwsem calls to use the new mmap
    locking API instead.

    The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)
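
    For illustration, a sketch of what the new wrappers amount to, assuming
    they simply forward to the existing mmap_sem rwsem (as the conversion rule
    above implies):

    static inline void mmap_init_lock(struct mm_struct *mm)
    {
        init_rwsem(&mm->mmap_sem);
    }

    static inline void mmap_write_lock(struct mm_struct *mm)
    {
        down_write(&mm->mmap_sem);
    }

    static inline void mmap_read_lock(struct mm_struct *mm)
    {
        down_read(&mm->mmap_sem);
    }

    static inline void mmap_read_unlock(struct mm_struct *mm)
    {
        up_read(&mm->mmap_sem);
    }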

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • The include/linux/pgtable.h is going to be the home of generic page table
    manipulation functions.

    Start with moving asm-generic/pgtable.h to include/linux/pgtable.h and
    make the latter include asm/pgtable.h.
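
    A rough sketch of what the new header boils down to at this point in the
    series (later patches move the generic helpers into it):

    /* include/linux/pgtable.h */
    #ifndef _LINUX_PGTABLE_H
    #define _LINUX_PGTABLE_H

    #include <asm/pgtable.h>

    /* contents of the old asm-generic/pgtable.h follow here */

    #endif /* _LINUX_PGTABLE_H */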

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Brian Cain
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Ungerer
    Cc: Guan Xuetao
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Ingo Molnar
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Matthew Wilcox
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Nick Hu
    Cc: Paul Walmsley
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200514170327.31389-3-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

11 Apr, 2020

2 commits

  • Steps along the way to the 5.7-rc1 merge.

    Signed-off-by: Greg Kroah-Hartman
    Change-Id: Iaf237a174205979344cfa76274198e87e2ba7799

    Greg Kroah-Hartman
     
  • There are many places where all basic VMA access flags (read, write,
    exec) are initialized or checked against as a group. One such example
    is during page faults. The existing vma_is_accessible() wrapper already
    creates the notion of VMA accessibility as a group of access permissions.

    Hence let's just create VM_ACCESS_FLAGS (VM_READ|VM_WRITE|VM_EXEC), which
    will not only reduce code duplication but also extend the VMA
    accessibility concept in general.
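
    The new mask and a typical use (a sketch; the hypothetical helper
    vma_allows_access() only illustrates the kind of grouped check involved):

    #define VM_ACCESS_FLAGS (VM_READ | VM_WRITE | VM_EXEC)

    /* e.g. a fault handler checking basic accessibility in one place */
    static bool vma_allows_access(struct vm_area_struct *vma)
    {
        return vma->vm_flags & VM_ACCESS_FLAGS;
    }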

    Signed-off-by: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Mark Salter
    Cc: Nick Hu
    Cc: Ley Foon Tan
    Cc: Michael Ellerman
    Cc: Heiko Carstens
    Cc: Yoshinori Sato
    Cc: Guan Xuetao
    Cc: Dave Hansen
    Cc: Thomas Gleixner
    Cc: Rob Springer
    Cc: Greg Kroah-Hartman
    Cc: Geert Uytterhoeven
    Link: http://lkml.kernel.org/r/1583391014-8170-3-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     

08 Apr, 2020

4 commits

  • For both swap and page migration, we use bit 2 of the entry to
    identify whether the entry is uffd write-protected. It plays a similar
    role to the existing soft-dirty bit in swap entries, but only keeps
    the uffd-wp tracking for a specific PTE/PMD.

    Something special here is that when we want to recover the uffd-wp bit
    from a swap/migration entry to the PTE bit we'll also need to take care of
    the _PAGE_RW bit and make sure it's cleared, otherwise even with the
    _PAGE_UFFD_WP bit we can't trap it at all.

    In change_pte_range() we do nothing for uffd if the PTE is a swap entry.
    That can lead to data mismatch if the page that we are going to write
    protect is swapped out when sending the UFFDIO_WRITEPROTECT. This patch
    also applies/removes the uffd-wp bit even for the swap entries.

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Jerome Glisse
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-11-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Firstly, introduce two new flags MM_CP_UFFD_WP[_RESOLVE] for
    change_protection() when used with uffd-wp and make sure the two new flags
    are exclusively used. Then,

    - For MM_CP_UFFD_WP: apply the _PAGE_UFFD_WP bit and remove _PAGE_RW
    when a range of memory is write protected by uffd

    - For MM_CP_UFFD_WP_RESOLVE: remove the _PAGE_UFFD_WP bit and recover
    _PAGE_RW when write protection is resolved from userspace

    And use this new interface in mwriteprotect_range() to replace the old
    MM_CP_DIRTY_ACCT.
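
    A rough sketch of the PTE-level handling this describes, inside
    change_pte_range() (simplified; the hypothetical uffd_wp/uffd_wp_resolve
    booleans are derived from the new cp_flags):

    if (uffd_wp) {
        ptent = pte_wrprotect(ptent);      /* drop _PAGE_RW */
        ptent = pte_mkuffd_wp(ptent);      /* set _PAGE_UFFD_WP */
    } else if (uffd_wp_resolve) {
        ptent = pte_mkwrite(ptent);        /* recover _PAGE_RW */
        ptent = pte_clear_uffd_wp(ptent);  /* clear _PAGE_UFFD_WP */
    }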

    Do this change for both PTEs and huge PMDs. Then we can start to identify
    which PTE/PMD is write protected by the general logic (e.g., COW or
    soft-dirty tracking), and which is write protected for userfaultfd-wp.

    Since we should keep the _PAGE_UFFD_WP when doing pte_modify(), add it
    into _PAGE_CHG_MASK as well. Meanwhile, since we have this new bit, we
    can be even more strict when detecting uffd-wp page faults in either
    do_wp_page() or wp_huge_pmd().

    Now that we have _PAGE_UFFD_WP, a special case is when a page is
    protected both by the general COW logic and by userfault-wp. Here
    userfault-wp has higher priority and is handled first. Only after the
    uffd-wp bit is cleared on the PTE/PMD will we continue to handle the
    general COW. These are the steps for what happens with such a page:

    1. The CPU accesses a write protected shared page (so protected by both
    general COW and uffd-wp) and is blocked by uffd-wp first, because in
    do_wp_page() we handle uffd-wp first, so it has higher priority than
    general COW.

    2. The uffd service thread receives the request and does
    UFFDIO_WRITEPROTECT to remove the uffd-wp bit from the PTE/PMD.
    However, we still keep the write bit cleared here. Notify the
    blocked CPU.

    3. The blocked CPU resumes the page fault process with a fault
    retry; during the retry it notices that the uffd-wp bit is gone this
    time but the page is still write protected by general COW, so it goes
    through the COW path in the fault handler, copies the page, applies
    the write bit where necessary, and retries again.

    4. The CPU will be able to access this page with the write bit set.

    Suggested-by: Andrea Arcangeli
    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Cc: Brian Geffon
    Cc: Pavel Emelyanov
    Cc: Mike Kravetz
    Cc: David Hildenbrand
    Cc: Martin Cracauer
    Cc: Mel Gorman
    Cc: Bobby Powers
    Cc: Mike Rapoport
    Cc: "Kirill A . Shutemov"
    Cc: Maya Gokhale
    Cc: Johannes Weiner
    Cc: Marty McFadden
    Cc: Denis Plotnikov
    Cc: Hugh Dickins
    Cc: "Dr . David Alan Gilbert"
    Cc: Jerome Glisse
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-8-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • change_protection() is used by either the NUMA or the mprotect() code,
    and there's one parameter for each of the callers (dirty_accountable and
    prot_numa). Further, these parameters are passed along the call chain:

    - change_protection_range()
    - change_p4d_range()
    - change_pud_range()
    - change_pmd_range()
    - ...

    Now we introduce a flags argument for change_protection() and all these
    helpers to replace those parameters, so we can avoid passing multiple
    parameters multiple times along the way.
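
    For reference, a sketch of the flag-based interface (flag names taken from
    this series; the exact prototype may differ slightly):

    #define MM_CP_DIRTY_ACCT   (1UL << 0)
    #define MM_CP_PROT_NUMA    (1UL << 1)

    unsigned long change_protection(struct vm_area_struct *vma,
                                    unsigned long start, unsigned long end,
                                    pgprot_t newprot, unsigned long cp_flags);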

    More importantly, it'll greatly simplify the work if we want to introduce
    any new parameters to change_protection(). In the follow-up patches, a
    new parameter for userfaultfd write protection will be introduced.

    No functional change at all.

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jerome Glisse
    Cc: Andrea Arcangeli
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-7-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Some comments for MADV_FREE are revised and added to help people understand
    the MADV_FREE code, especially the page flag PG_swapbacked. This makes
    page_is_file_cache() inconsistent with its comments, so the function is
    renamed to page_is_file_lru() to make them consistent again. All of this
    is put in one patch as one logical change.
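
    The renamed helper keeps the same one-line body (sketch):

    static inline int page_is_file_lru(struct page *page)
    {
        return !PageSwapBacked(page);
    }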

    Suggested-by: David Hildenbrand
    Suggested-by: Johannes Weiner
    Suggested-by: David Rientjes
    Signed-off-by: "Huang, Ying"
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Acked-by: Pankaj Gupta
    Acked-by: Vlastimil Babka
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/20200317100342.2730705-1-ying.huang@intel.com
    Signed-off-by: Linus Torvalds

    Huang Ying
     

06 Mar, 2020

1 commit

  • : A user reported a bug against a distribution kernel while running a
    : proprietary workload described as "memory intensive that is not swapping"
    : that is expected to apply to mainline kernels. The workload is
    : read/write/modifying ranges of memory and checking the contents. They
    : reported that within a few hours a bad PMD would be reported, followed
    : by a memory corruption where expected data was all zeros. A partial
    : report of the bad PMD looked like
    :
    : [ 5195.338482] ../mm/pgtable-generic.c:33: bad pmd ffff8888157ba008(000002e0396009e2)
    : [ 5195.341184] ------------[ cut here ]------------
    : [ 5195.356880] kernel BUG at ../mm/pgtable-generic.c:35!
    : ....
    : [ 5195.410033] Call Trace:
    : [ 5195.410471] [] change_protection_range+0x7dd/0x930
    : [ 5195.410716] [] change_prot_numa+0x18/0x30
    : [ 5195.410918] [] task_numa_work+0x1fe/0x310
    : [ 5195.411200] [] task_work_run+0x72/0x90
    : [ 5195.411246] [] exit_to_usermode_loop+0x91/0xc2
    : [ 5195.411494] [] prepare_exit_to_usermode+0x31/0x40
    : [ 5195.411739] [] retint_user+0x8/0x10
    :
    : Decoding revealed that the PMD was a valid prot_numa PMD and the bad PMD
    : was a false detection. The bug does not trigger if automatic NUMA
    : balancing or transparent huge pages is disabled.
    :
    : The bug is due to a race in change_pmd_range between a pmd_trans_huge
    : and pmd_none_or_clear_bad check without any locks held. During the
    : pmd_trans_huge check, a parallel protection update under lock can have
    : cleared the PMD and filled it with a prot_numa entry between the transhuge
    : check and the pmd_none_or_clear_bad check.
    :
    : While this could be fixed with heavy locking, it's only necessary to make
    : a copy of the PMD on the stack during change_pmd_range and avoid races. A
    : new helper is created for this as the check is quite subtle and the
    : existing similar helper is not suitable. This passed 154 hours of
    : testing (usually triggers between 20 minutes and 24 hours) without
    : detecting bad PMDs or corruption. A basic test of an autonuma-intensive
    : workload showed no significant change in behaviour.

    Although Mel withdrew the patch in the face of the LKML comment at
    https://lkml.org/lkml/2017/4/10/922, the aforementioned race window is
    still open, and we have reports of the Linpack test reporting bad
    residuals after the bad PMD warning is observed. In addition to that,
    bad rss-counter and non-zero pgtables assertions are triggered on mm
    teardown for the task hitting the bad PMD.

    host kernel: mm/pgtable-generic.c:40: bad pmd 00000000b3152f68(8000000d2d2008e7)
    ....
    host kernel: BUG: Bad rss-counter state mm:00000000b583043d idx:1 val:512
    host kernel: BUG: non-zero pgtables_bytes on freeing mm: 4096

    The issue is observed on a v4.18-based distribution kernel, but the race
    window is expected to be applicable to mainline kernels, as well.
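
    A simplified sketch of the approach: take one snapshot of the PMD and run
    all the checks against that copy, so a concurrent protection update cannot
    change the answer mid-check (a helper along these lines; the upstream
    version also has a barrier for the THP case):

    static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
    {
        pmd_t pmdval = pmd_read_atomic(pmd);

        if (pmd_none(pmdval))
            return 1;
        if (pmd_trans_huge(pmdval))
            return 0;
        if (unlikely(pmd_bad(pmdval))) {
            pmd_clear_bad(pmd);
            return 1;
        }
        return 0;
    }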

    [akpm@linux-foundation.org: fix comment typo, per Rafael]
    Signed-off-by: Andrew Morton
    Signed-off-by: Rafael Aquini
    Signed-off-by: Mel Gorman
    Cc:
    Cc: Zi Yan
    Cc: "Kirill A. Shutemov"
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200216191800.22423-1-aquini@redhat.com
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

02 Dec, 2019

1 commit

  • In the auto NUMA balancing page table scan, if pte_protnone() is
    true, the PTE need not be changed because it is already in the target
    state, so the other checks on the corresponding struct page are
    unnecessary too.

    So, if we check pte_protnone() first for each PTE, we can avoid the
    unnecessary struct page accesses and thereby reduce the cache footprint
    of the NUMA balancing page table scan.

    In a performance test with the pmbench memory accessing benchmark, using
    an 80:20 read/write ratio and a normal access address distribution on a
    2-socket Intel server with Optane DC Persistent Memory, perf profiling
    shows that the autonuma page table scanning time is reduced from 1.23%
    to 0.97% (a 21% reduction) with the patch.
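
    A sketch of the reordering in change_pte_range()'s prot_numa path: test
    the PTE first, and only then look up the struct page (simplified fragment):

    if (prot_numa) {
        struct page *page;

        /* Already NUMA-protected: nothing to change, skip the page lookup. */
        if (pte_protnone(oldpte))
            continue;

        page = vm_normal_page(vma, addr, oldpte);
        if (!page || PageKsm(page))
            continue;

        /* further per-page checks follow */
    }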

    Link: http://lkml.kernel.org/r/20191101075727.26683-3-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Acked-by: Mel Gorman
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Dave Hansen
    Cc: Dan Williams
    Cc: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

26 Sep, 2019

1 commit

  • This patch is a part of a series that extends the kernel ABI to allow
    passing tagged user pointers (with the top byte set to something other
    than 0x00) as syscall arguments.

    This patch allows tagged pointers to be passed to the following memory
    syscalls: get_mempolicy, madvise, mbind, mincore, mlock, mlock2, mprotect,
    mremap, msync, munlock, move_pages.

    The mmap and mremap syscalls do not currently accept tagged addresses.
    Architectures may interpret the tag as a background colour for the
    corresponding vma.
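
    For example, the pattern used in the touched syscall entry points looks
    roughly like this (mprotect shown; untagged_addr() strips the tag on
    architectures that support tagged pointers):

    SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
                    unsigned long, prot)
    {
        /* Accept a tagged 'start' by untagging it before any validation. */
        start = untagged_addr(start);
        return do_mprotect_pkey(start, len, prot, -1);
    }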

    Link: http://lkml.kernel.org/r/aaf0c0969d46b2feb9017f3e1b3ef3970b633d91.1563904656.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Reviewed-by: Khalid Aziz
    Reviewed-by: Vincenzo Frascino
    Reviewed-by: Catalin Marinas
    Reviewed-by: Kees Cook
    Cc: Al Viro
    Cc: Dave Hansen
    Cc: Eric Auger
    Cc: Felix Kuehling
    Cc: Jens Wiklander
    Cc: Mauro Carvalho Chehab
    Cc: Mike Rapoport
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     

07 Sep, 2019

2 commits

  • The mm_walk structure currently mixes data and code. Split out the
    operations vectors into a new mm_walk_ops structure, and while we are
    changing the API also declare the mm_walk structure inside the
    walk_page_range and walk_page_vma functions.

    Based on patch from Linus Torvalds.
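
    An abridged sketch of the split (the real structures carry more callbacks
    and fields):

    struct mm_walk;

    struct mm_walk_ops {
        int (*pmd_entry)(pmd_t *pmd, unsigned long addr,
                         unsigned long next, struct mm_walk *walk);
        int (*pte_entry)(pte_t *pte, unsigned long addr,
                         unsigned long next, struct mm_walk *walk);
    };

    struct mm_walk {
        const struct mm_walk_ops *ops;  /* code: the callbacks */
        struct mm_struct *mm;           /* data: per-walk state */
        struct vm_area_struct *vma;
        void *private;
    };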

    Link: https://lore.kernel.org/r/20190828141955.22210-3-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Thomas Hellstrom
    Reviewed-by: Steven Price
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • Add a new header for the handful of users of the walk_page_range /
    walk_page_vma interface instead of polluting all users of mm.h with it.

    Link: https://lore.kernel.org/r/20190828141955.22210-2-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Thomas Hellstrom
    Reviewed-by: Steven Price
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     

15 May, 2019

3 commits

  • Since 0cbe3e26abe0 ("mm: update ptep_modify_prot_start/commit to take
    vm_area_struct as arg") the only place that uses the local 'mm' variable
    in change_pte_range() is the call to set_pte_at().

    Many architectures define set_pte_at() as a macro that does not use the 'mm'
    parameter, which generates the following compilation warning:

    CC mm/mprotect.o
    mm/mprotect.c: In function 'change_pte_range':
    mm/mprotect.c:42:20: warning: unused variable 'mm' [-Wunused-variable]
    struct mm_struct *mm = vma->vm_mm;
    ^~

    Fix it by passing vma->vm_mm to set_pte_at() and dropping the local 'mm'
    variable in change_pte_range().

    [liu.song.a23@gmail.com: fix missed conversions]
    Link: http://lkml.kernel.org/r/CAPhsuW6wcQgYLHNdBdw6m0YiR4RWsS4XzfpSKU7wBLLeOCTbpw@mail.gmail.com
    Link: http://lkml.kernel.org/r/1557305432-4940-1-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Cc: Song Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • This updates each existing invalidation to use the correct mmu notifier
    event that represents what is happening to the CPU page table. See the
    patch which introduced the events for the rationale behind this.

    Link: http://lkml.kernel.org/r/20190326164747.24405-7-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Ralph Campbell
    Reviewed-by: Ira Weiny
    Cc: Christian König
    Cc: Joonas Lahtinen
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Peter Xu
    Cc: Felix Kuehling
    Cc: Jason Gunthorpe
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: John Hubbard
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • CPU page table updates can happen for many reasons, not only as a result
    of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
    a result of kernel activities (memory compression, reclaim, migration,
    ...).

    Users of the mmu notifier API track changes to the CPU page table and
    take specific actions for them. However, the current API only provides
    the range of virtual addresses affected by the change, not why the
    change is happening.

    This patchset does the initial mechanical conversion of all the places
    that call mmu_notifier_range_init() to also provide the default
    MMU_NOTIFY_UNMAP event as well as the vma if it is known (most
    invalidations happen against a given vma). Passing down the vma allows
    the users of the mmu notifier to inspect the new vma page protection.

    MMU_NOTIFY_UNMAP is always the safe default, as users of the mmu
    notifier should assume that everything in the range is going away when
    that event happens. A later patch converts the mm call paths to use
    more appropriate events for each call.

    This is done as 2 patches so that no call site is forgotten, especially
    as it uses the following coccinelle patch:

    [coccinelle semantic patch omitted: garbled in this log. It rewrites
    mmu_notifier_range_init() call sites to also pass the MMU_NOTIFY_UNMAP
    event and, when the enclosing function has one, the vma.]

    Applied with:
    spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
    spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
    spatch --sp-file mmu-notifier.spatch --dir mm --in-place

    Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Ralph Campbell
    Reviewed-by: Ira Weiny
    Cc: Christian König
    Cc: Joonas Lahtinen
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Peter Xu
    Cc: Felix Kuehling
    Cc: Jason Gunthorpe
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: John Hubbard
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

04 May, 2019

2 commits

  • Change-Id: I4380c68c3474026a42ffa9f95c525f9a563ba7a3

    Todd Kjos
     
  • Userspace processes often have multiple allocators that each do
    anonymous mmaps to get memory. When examining memory usage of
    individual processes or systems as a whole, it is useful to be
    able to break down the various heaps that were allocated by
    each layer and examine their size, RSS, and physical memory
    usage.

    This patch adds a user pointer to the shared union in
    vm_area_struct that points to a null terminated string inside
    the user process containing a name for the vma. vmas that
    point to the same address will be merged, but vmas that
    point to equivalent strings at different addresses will
    not be merged.

    Userspace can set the name for a region of memory by calling
    prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name);
    Setting the name to NULL clears it.
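
    A hypothetical userspace usage sketch (the PR_SET_VMA constants below are
    assumptions based on this patch series; the mapping should then show up as
    [anon:myheap] in /proc/pid/maps):

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/prctl.h>

    #ifndef PR_SET_VMA
    #define PR_SET_VMA              0x53564d41  /* assumed value */
    #define PR_SET_VMA_ANON_NAME    0
    #endif

    int main(void)
    {
        size_t len = 1 << 20;
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
            return 1;

        /* Name the anonymous region; failure is non-fatal. */
        if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
                  (unsigned long)p, len, (unsigned long)"myheap"))
            perror("prctl");

        memset(p, 0, len);
        return 0;
    }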

    The names of named anonymous vmas are shown in /proc/pid/maps
    as [anon:<name>] and in /proc/pid/smaps in a new "Name" field
    that is only present for named vmas. If the userspace pointer
    is no longer valid, all or part of the name will be replaced
    with "<fault>".

    The idea to store a userspace pointer to reduce the complexity
    within mm (at the expense of the complexity of reading
    /proc/pid/mem) came from Dave Hansen. This results in no
    runtime overhead in the mm subsystem other than comparing
    the anon_name pointers when considering vma merging. The pointer
    is stored in a union with fields that are only used on file-backed
    mappings, so it does not increase memory usage.

    Includes a fix from Jed Davis for a typo in
    prctl_set_vma_anon_name, which could cause it to attempt to set the
    name across two vmas at the same time and potentially corrupt the
    vma list. Fix it to use tmp instead of end to limit the name setting
    to a single vma at a time.

    Bug: 120441514
    Change-Id: I9aa7b6b5ef536cd780599ba4e2fba8ceebe8b59f
    Signed-off-by: Dmitry Shmidt
    [AmitP: Fix get_user_pages_remote() call to align with upstream commit
    5b56d49fc31d ("mm: add locked parameter to get_user_pages_remote()")]
    Signed-off-by: Amit Pundir

    Colin Cross
     

06 Mar, 2019

2 commits

  • Architectures like ppc64 need to do a conditional TLB flush based on
    the old and new values of the pte. Enable that by passing the old pte
    value as an argument.
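
    The resulting helper signature looks roughly like this (old_pte added so
    the architecture can compare old and new values when deciding whether to
    flush):

    void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr,
                                 pte_t *ptep, pte_t old_pte, pte_t pte);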

    Link: http://lkml.kernel.org/r/20190116085035.29729-3-aneesh.kumar@linux.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: Nicholas Piggin
    Cc: Paul Mackerras
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • Patch series "NestMMU pte upgrade workaround for mprotect", v5.

    We can upgrade pte access (R -> RW transition) via mprotect. We need to
    make sure we follow the recommended pte update sequence as outlined in
    commit bd5050e38aec ("powerpc/mm/radix: Change pte relax sequence to
    handle nest MMU hang") for such updates. This patch series does that.

    This patch (of 5):

    Some architectures may want to call flush_tlb_range from these helpers.

    Link: http://lkml.kernel.org/r/20190116085035.29729-2-aneesh.kumar@linux.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Cc: Nicholas Piggin
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     

29 Dec, 2018

1 commit

  • To avoid having to change many call sites every time we want to add a
    parameter, use a structure to group all parameters for the mmu_notifier
    invalidate_range_start/end calls. No functional changes with this patch.
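
    A sketch of the grouping structure and the resulting call pattern (fields
    and the init signature are abridged/approximate):

    struct mmu_notifier_range {
        struct mm_struct *mm;
        unsigned long start;
        unsigned long end;
        /* later patches add more fields (vma, event, ...) */
    };

    /* call pattern at an invalidation site: */
    struct mmu_notifier_range range;

    mmu_notifier_range_init(&range, mm, start, end);
    mmu_notifier_invalidate_range_start(&range);
    /* ... update the page tables ... */
    mmu_notifier_invalidate_range_end(&range);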

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Acked-by: Christian König
    Acked-by: Jan Kara
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Felix Kuehling
    Cc: Ralph Campbell
    Cc: John Hubbard
    From: Jérôme Glisse
    Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3

    fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n

    Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

21 Jun, 2018

1 commit

  • For L1TF, PROT_NONE mappings are protected by inverting the PFN in the
    page table entry. This sets the high bits in the CPU's address space,
    thus making sure an unmapped entry does not point to valid cached memory.

    Some server system BIOSes put the MMIO mappings high up in the physical
    address space. If such a high mapping were exposed to unprivileged users,
    they could attack low memory by setting such a mapping to PROT_NONE. This
    could happen through a special device driver which is not access
    protected. Normal /dev/mem is of course access protected.

    To avoid this, forbid PROT_NONE mappings or mprotect for high MMIO mappings.

    Valid page mappings are allowed because the system is then unsafe anyway.

    It's not expected that users commonly use PROT_NONE on MMIO. But to
    minimize any impact, this is only enforced if the mapping actually refers
    to a high MMIO address (defined as the MAX_PA-1 bit being set), and the
    check is also skipped for root.

    For mmaps this is straightforward and can be handled in vm_insert_pfn and
    in remap_pfn_range().

    For mprotect it's a bit trickier. At the point where the actual PTEs are
    accessed, a lot of state has been changed and it would be difficult to undo
    on an error. Since this is an uncommon case, use a separate early page
    table walk pass for MMIO PROT_NONE mappings that checks for this condition
    early. For non-MMIO and non-PROT_NONE mappings there are no changes.

    Signed-off-by: Andi Kleen
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Josh Poimboeuf
    Acked-by: Dave Hansen

    Andi Kleen
     

12 Apr, 2018

1 commit

  • change_pte_range is called from task work context to mark PTEs for
    receiving NUMA faulting hints. If the marked pages are dirty then
    migration may fail. Some filesystems cannot migrate dirty pages without
    blocking, so they are skipped in MIGRATE_ASYNC mode, which just wastes
    CPU. Even when they can, it can be a waste of cycles when the pages are
    shared, forcing higher scan rates. This patch avoids marking shared
    dirty pages for hinting faults and will also skip a migration if the
    page was dirtied after the scanner updated a clean page.

    This is most noticeable running the NAS Parallel Benchmark when backed
    by btrfs, the default root filesystem for some distributions, but it is
    also noticeable when using XFS.

    The following are results from a 4-socket machine running a 4.16-rc4
    kernel with some scheduler patches that are pending for the next merge
    window.

    4.16.0-rc4 4.16.0-rc4
    schedtip-20180309 nodirty-v1
    Time cg.D 459.07 ( 0.00%) 444.21 ( 3.24%)
    Time ep.D 76.96 ( 0.00%) 77.69 ( -0.95%)
    Time is.D 25.55 ( 0.00%) 27.85 ( -9.00%)
    Time lu.D 601.58 ( 0.00%) 596.87 ( 0.78%)
    Time mg.D 107.73 ( 0.00%) 108.22 ( -0.45%)

    is.D regresses slightly in terms of absolute time but note that that
    particular load varies quite a bit from run to run. The more relevant
    observation is the total system CPU usage.

    4.16.0-rc4 4.16.0-rc4
    schedtip-20180309 nodirty-v1
    User 71471.91 70627.04
    System 11078.96 8256.13
    Elapsed 661.66 632.74

    That is a substantial drop in system CPU usage and overall the workload
    completes faster. The NUMA balancing statistics are also interesting

    NUMA base PTE updates 111407972 139848884
    NUMA huge PMD updates 206506 264869
    NUMA page range updates 217139044 275461812
    NUMA hint faults 4300924 3719784
    NUMA hint local faults 3012539 3416618
    NUMA hint local percent 70 91
    NUMA pages migrated 1517487 1358420

    While more PTEs are scanned due to changes in what faults are gathered,
    it's clear that a far higher percentage of faults are local as the bulk
    of the remote hits were dirty pages that, in this case with btrfs, had
    no chance of migrating.

    The following is a comparison when using XFS as that is a more realistic
    filesystem choice for a data partition

    4.16.0-rc4 4.16.0-rc4
    schedtip-20180309 nodirty-v1r47
    Time cg.D 485.28 ( 0.00%) 442.62 ( 8.79%)
    Time ep.D 77.68 ( 0.00%) 77.54 ( 0.18%)
    Time is.D 26.44 ( 0.00%) 24.79 ( 6.24%)
    Time lu.D 597.46 ( 0.00%) 597.11 ( 0.06%)
    Time mg.D 142.65 ( 0.00%) 105.83 ( 25.81%)

    That is a reasonable gain on two relatively long-lived workloads. While
    not presented, there is also a substantial drop in system CPU usage and
    the NUMA balancing stats show similar improvements in locality as btrfs
    did.

    Link: http://lkml.kernel.org/r/20180326094334.zserdec62gwmmfqf@techsingularity.net
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

18 Mar, 2018

2 commits

  • When protection bits are changed on a VMA, some of the architecture-
    specific flags should be cleared as well. An example of this is the
    PKEY flags on x86. This patch expands the current code that clears
    PKEY flags for x86 to support similar functionality for other
    architectures as well.
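
    A sketch of the generalization, assuming a per-arch clear mask along these
    lines (an architecture defines VM_ARCH_CLEAR for the flags it needs dropped
    on protection changes; the exact names and call site are assumptions):

    #ifndef VM_ARCH_CLEAR
    #define VM_ARCH_CLEAR   VM_NONE
    #endif
    #define VM_FLAGS_CLEAR  (ARCH_VM_PKEY_FLAGS | VM_ARCH_CLEAR)

    /* in do_mprotect_pkey(): drop these old flags before applying new ones */
    mask_off_old_flags = VM_READ | VM_WRITE | VM_EXEC | VM_FLAGS_CLEAR;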

    Signed-off-by: Khalid Aziz
    Cc: Khalid Aziz
    Reviewed-by: Anthony Yznaga
    Acked-by: Andrew Morton
    Signed-off-by: David S. Miller

    Khalid Aziz
     
  • A protection flag may not be valid across the entire address space and
    hence arch_validate_prot() might need the address a protection bit is
    being set on to ensure it is a valid protection flag. For example, sparc
    processors support memory corruption detection (as part of ADI feature)
    flag on memory addresses mapped on to physical RAM but not on PFN mapped
    pages or addresses mapped on to devices. This patch adds address to the
    parameters being passed to arch_validate_prot() so protection bits can
    be validated in the relevant context.
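
    The hook's generic fallback, with the new address parameter (sketch;
    architectures such as sparc override it to validate their extra bits
    against the given address):

    #ifndef arch_validate_prot
    /* Generic default: only the standard protection bits are accepted,
     * regardless of the address being changed. */
    #define arch_validate_prot(prot, addr) \
        (((prot) & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM)) == 0)
    #endif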

    Signed-off-by: Khalid Aziz
    Cc: Khalid Aziz
    Reviewed-by: Anthony Yznaga
    Acked-by: Michael Ellerman (powerpc)
    Acked-by: Andrew Morton
    Signed-off-by: David S. Miller

    Khalid Aziz
     

01 Feb, 2018

1 commit

  • Workloads consisting of a large number of processes running the same
    program with a very large shared data segment may experience performance
    problems when numa balancing attempts to migrate the shared cow pages.
    This manifests itself with many processes or tasks in
    TASK_UNINTERRUPTIBLE state waiting for the shared pages to be migrated.

    The program listed below simulates the conditions with these results
    when run with 288 processes on a 144 core/8 socket machine.

    Average throughput      Average throughput      Average throughput
    with numa_balancing=0   with numa_balancing=1   with numa_balancing=1
                            without the patch       with the patch
    ---------------------   ---------------------   ---------------------
    2118782                 2021534                 2107979

    Complex production environments show less variability and fewer poorly
    performing outliers accompanied with a smaller number of processes
    waiting on NUMA page migration with this patch applied. In some cases,
    %iowait drops from 16%-26% to 0.

    // SPDX-License-Identifier: GPL-2.0
    /*
    * Copyright (c) 2017 Oracle and/or its affiliates. All rights reserved.
    */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/time.h>
    #include <sys/wait.h>

    int a[1000000] = {13};

    int main(int argc, const char **argv)
    {
        int n = 0;
        int i;
        pid_t pid;
        int stat;
        int *count_array;
        int cpu_count = 288;
        long total = 0;

        struct timeval t1, t2 = {(argc > 1 ? atoi(argv[1]) : 10), 0};

        if (argc > 2)
            cpu_count = atoi(argv[2]);

        count_array = mmap(NULL, cpu_count * sizeof(int),
                           (PROT_READ|PROT_WRITE),
                           (MAP_SHARED|MAP_ANONYMOUS), 0, 0);

        if (count_array == MAP_FAILED) {
            perror("mmap:");
            return 0;
        }

        /* children (and a failed fork) leave this loop with their index in i */
        for (i = 0; i < cpu_count; ++i) {
            pid = fork();
            if (pid <= 0)
                break;
        }

        if (pid != 0) {
            /* parent: wait for the children, then report the total */
            for (i = 0; i < cpu_count; ++i)
                wait(&stat);
            for (i = 0; i < cpu_count; ++i)
                total += count_array[i];

            printf("Total %ld\n", total);
            munmap(count_array, cpu_count * sizeof(int));
            return 0;
        }

        /* child: repeatedly read the shared data segment until time is up */
        gettimeofday(&t1, 0);
        timeradd(&t1, &t2, &t1);
        while (timercmp(&t2, &t1, <)) {
            int b = 0;
            int j;

            for (j = 0; j < 1000000; j++)
                b += a[j];
            gettimeofday(&t2, 0);
            n++;
        }
        count_array[i] = n;
        return 0;
    }

    This patch changes change_pte_range() to skip shared copy-on-write pages
    when called from change_prot_numa().
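
    The check amounts to something like the following in change_pte_range()'s
    prot_numa path (sketch, close to but not necessarily identical to the
    final hunk):

    /* Also skip shared copy-on-write pages */
    if (is_cow_mapping(vma->vm_flags) &&
        page_mapcount(page) != 1)
        continue;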

    NOTE: change_prot_numa() is nominally called from task_numa_work() and
    queue_pages_test_walk(). task_numa_work() is the auto NUMA balancing
    path, and queue_pages_test_walk() is part of explicit NUMA policy
    management. However, queue_pages_test_walk() only calls
    change_prot_numa() when MPOL_MF_LAZY is specified and currently that is
    not allowed, so change_prot_numa() is only called from auto NUMA
    balancing.

    In the case of explicit NUMA policy management, shared pages are not
    migrated unless MPOL_MF_MOVE_ALL is specified, and MPOL_MF_MOVE_ALL
    depends on CAP_SYS_NICE. Currently, there is no way to pass information
    about MPOL_MF_MOVE_ALL to change_pte_range. This will have to be fixed
    if MPOL_MF_LAZY is enabled and MPOL_MF_MOVE_ALL is to be honored in lazy
    migration mode.

    task_numa_work() skips the read-only VMAs of programs and shared
    libraries.

    Link: http://lkml.kernel.org/r/1516751617-7369-1-git-send-email-henry.willard@oracle.com
    Signed-off-by: Henry Willard
    Reviewed-by: Håkon Bugge
    Reviewed-by: Steve Sistare
    Acked-by: Mel Gorman
    Cc: Kate Stewart
    Cc: Zi Yan
    Cc: Philippe Ombredanne
    Cc: Andrea Arcangeli
    Cc: Greg Kroah-Hartman
    Cc: Aneesh Kumar K.V
    Cc: Kirill A. Shutemov
    Cc: "Jérôme Glisse"
    Signed-off-by: Andrew Morton

    Signed-off-by: Linus Torvalds

    Henry Willard
     

05 Jan, 2018

1 commit

  • While testing on a large CPU system, we detected the following RCU stall
    many times over the span of the workload. This problem is solved by
    adding a cond_resched() in the change_pmd_range() function.
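
    The fix is essentially a voluntary reschedule point in the PMD-level loop,
    roughly like this fragment of change_pmd_range():

    do {
        next = pmd_addr_end(addr, end);
        /* ... existing PMD/PTE protection-change work ... */
        cond_resched();
    } while (pmd++, addr = next, addr != end);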

    INFO: rcu_sched detected stalls on CPUs/tasks:
    154-....: (670 ticks this GP) idle=022/140000000000000/0 softirq=2825/2825 fqs=612
    (detected by 955, t=6002 jiffies, g=4486, c=4485, q=90864)
    Sending NMI from CPU 955 to CPUs 154:
    NMI backtrace for cpu 154
    CPU: 154 PID: 147071 Comm: workload Not tainted 4.15.0-rc3+ #3
    NIP: c0000000000b3f64 LR: c0000000000b33d4 CTR: 000000000000aa18
    REGS: 00000000a4b0fb44 TRAP: 0501 Not tainted (4.15.0-rc3+)
    MSR: 8000000000009033 CR: 22422082 XER: 00000000
    CFAR: 00000000006cf8f0 SOFTE: 1
    GPR00: 0010000000000000 c00003ef9b1cb8c0 c0000000010cc600 0000000000000000
    GPR04: 8e0000018c32b200 40017b3858fd6e00 8e0000018c32b208 40017b3858fd6e00
    GPR08: 8e0000018c32b210 40017b3858fd6e00 8e0000018c32b218 40017b3858fd6e00
    GPR12: ffffffffffffffff c00000000fb25100
    NIP [c0000000000b3f64] plpar_hcall9+0x44/0x7c
    LR [c0000000000b33d4] pSeries_lpar_flush_hash_range+0x384/0x420
    Call Trace:
    flush_hash_range+0x48/0x100
    __flush_tlb_pending+0x44/0xd0
    hpte_need_flush+0x408/0x470
    change_protection_range+0xaac/0xf10
    change_prot_numa+0x30/0xb0
    task_numa_work+0x2d0/0x3e0
    task_work_run+0x130/0x190
    do_notify_resume+0x118/0x120
    ret_from_except_lite+0x70/0x74
    Instruction dump:
    60000000 f8810028 7ca42b78 7cc53378 7ce63b78 7d074378 7d284b78 7d495378
    e9410060 e9610068 e9810070 44000022 e9810028 f88c0000 f8ac0008

    Link: http://lkml.kernel.org/r/20171214140551.5794-1-khandual@linux.vnet.ibm.com
    Signed-off-by: Anshuman Khandual
    Suggested-by: Nicholas Piggin
    Acked-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if <5
    lines).

    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

09 Sep, 2017

2 commits

  • HMM (heterogeneous memory management) needs struct page to support
    migration from system main memory to device memory. The reasons for HMM
    and for migration to device memory are explained in the HMM core patch.

    This patch deals with device memory that is un-addressable (i.e., the CPU
    cannot access it). Hence we do not want those struct pages to be managed
    like regular memory. That is why we extend ZONE_DEVICE to support
    different types of memory.

    A persistent memory type is defined for the existing users of ZONE_DEVICE,
    and a new device un-addressable type is added for the un-addressable
    memory type. There is a clear separation between what is expected from
    each memory type; existing users of ZONE_DEVICE are unaffected by the new
    requirement and the new use of the un-addressable type. All specific code
    paths are protected with tests against the memory type.

    Because the memory is un-addressable, we use a new special swap type for
    when a page is migrated to device memory (this reduces the maximum number
    of swap files).

    The two main additions to ZONE_DEVICE besides the memory type are two
    callbacks. The first one, page_free(), is called whenever the page
    refcount reaches 1 (which means the page is free, as a ZONE_DEVICE page
    never reaches a refcount of 0). This allows the device driver to manage
    its memory and the associated struct pages.

    The second callback, page_fault(), happens when there is a CPU access to
    an address that is backed by a device page (which is un-addressable by
    the CPU). This callback is responsible for migrating the page back to
    system main memory. The device driver cannot block migration back to
    system memory; HMM makes sure that such pages cannot be pinned into
    device memory.

    If the device is in some error condition and cannot migrate memory back,
    then a CPU page fault to device memory should end with SIGBUS.

    [arnd@arndb.de: fix warning]
    Link: http://lkml.kernel.org/r/20170823133213.712917-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/20170817000548.32038-8-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Arnd Bergmann
    Acked-by: Dan Williams
    Cc: Ross Zwisler
    Cc: Aneesh Kumar
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Paul E. McKenney
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • When THP migration is being used, memory management code needs to handle
    pmd migration entries properly. This patch uses !pmd_present() or
    is_swap_pmd() (depending on whether pmd_none() needs separate code or
    not) to check pmd migration entries at the places where a pmd entry is
    present.

    Since pmd-related code uses split_huge_page(), split_huge_pmd(),
    pmd_trans_huge(), pmd_trans_unstable(), or
    pmd_none_or_trans_huge_or_clear_bad(), this patch:

    1. adds pmd migration entry split code in split_huge_pmd(),

    2. takes care of pmd migration entries whenever pmd_trans_huge() is present,

    3. makes pmd_none_or_trans_huge_or_clear_bad() pmd migration entry aware.

    Since split_huge_page() uses split_huge_pmd() and pmd_trans_unstable()
    is equivalent to pmd_none_or_trans_huge_or_clear_bad(), we do not change
    them.

    Until this commit, a pmd entry should be:
    1. pointing to a pte page,
    2. is_swap_pmd(),
    3. pmd_trans_huge(),
    4. pmd_devmap(), or
    5. pmd_none().

    Signed-off-by: Zi Yan
    Cc: Kirill A. Shutemov
    Cc: "H. Peter Anvin"
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Cc: David Nellans
    Cc: Ingo Molnar
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zi Yan