04 Sep, 2020

1 commit

  • Similarly to arch_validate_prot() called from do_mprotect_pkey(), an
    architecture may need to sanity-check the new vm_flags.

    Define a dummy function always returning true. In addition to
    do_mprotect_pkey(), also invoke it from mmap_region() prior to updating
    vma->vm_page_prot to allow the architecture code to veto potentially
    inconsistent vm_flags.

    Signed-off-by: Catalin Marinas
    Acked-by: Andrew Morton

    Catalin Marinas
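
    The veto hook described in the commit above can be modelled in userspace as a callback consulted before new flags are committed. This is a minimal sketch, not the kernel implementation: default_validate_flags(), strict_validate_flags() and set_flags() are hypothetical names, and the write+exec policy is only an illustrative example of what an architecture might reject.

```c
#include <stdbool.h>

/* Flag values used for illustration only. */
#define VM_READ   0x1UL
#define VM_WRITE  0x2UL
#define VM_EXEC   0x4UL

/* Generic fallback: a dummy that accepts every flag combination. */
static bool default_validate_flags(unsigned long vm_flags)
{
    (void)vm_flags;
    return true;
}

/* Hypothetical architecture override: refuse writable+executable mappings. */
static bool strict_validate_flags(unsigned long vm_flags)
{
    return (vm_flags & (VM_WRITE | VM_EXEC)) != (VM_WRITE | VM_EXEC);
}

/* Model of a caller: consult the hook before committing the new flags. */
static int set_flags(unsigned long *cur, unsigned long newflags,
                     bool (*validate)(unsigned long))
{
    if (!validate(newflags))
        return -1;          /* vetoed, *cur is left untouched */
    *cur = newflags;
    return 0;
}
```

    With the dummy hook every request succeeds; with the strict hook a write+exec request fails and the previous flags are preserved.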
     

10 Jun, 2020

3 commits

  • Convert comments that reference mmap_sem to reference mmap_lock instead.

    [akpm@linux-foundation.org: fix up linux-next leftovers]
    [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
    [akpm@linux-foundation.org: more linux-next fixups, per Michel]

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Daniel Jordan
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This change converts the existing mmap_sem rwsem calls to use the new mmap
    locking API instead.

    The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
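
    The effect of the coccinelle rule above is that callers name the mm rather than reaching into it for the semaphore. The sketch below illustrates that wrapper pattern only: toy_rwsem is a single-threaded, non-atomic stand-in invented for this example, and only the mmap_* wrapper names are taken from the rule.

```c
#include <stdbool.h>

/* Toy, single-threaded stand-in for a reader/writer semaphore. */
struct toy_rwsem {
    int readers;
    bool writer;
};

struct mm_struct {
    struct toy_rwsem mmap_sem;
};

static bool toy_down_read_trylock(struct toy_rwsem *s)
{
    if (s->writer)
        return false;
    s->readers++;
    return true;
}

static bool toy_down_write_trylock(struct toy_rwsem *s)
{
    if (s->writer || s->readers)
        return false;
    s->writer = true;
    return true;
}

/* Wrappers in the style of the new API: the lock is an implementation detail. */
static inline void mmap_init_lock(struct mm_struct *mm)
{
    mm->mmap_sem.readers = 0;
    mm->mmap_sem.writer = false;
}

static inline bool mmap_read_trylock(struct mm_struct *mm)
{
    return toy_down_read_trylock(&mm->mmap_sem);
}

static inline void mmap_read_unlock(struct mm_struct *mm)
{
    mm->mmap_sem.readers--;
}

static inline bool mmap_write_trylock(struct mm_struct *mm)
{
    return toy_down_write_trylock(&mm->mmap_sem);
}

static inline void mmap_write_unlock(struct mm_struct *mm)
{
    mm->mmap_sem.writer = false;
}
```

    Because call sites no longer mention mmap_sem, the underlying lock can later be renamed or replaced without touching them again.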
     
  • The include/linux/pgtable.h is going to be the home of generic page table
    manipulation functions.

    Start with moving asm-generic/pgtable.h to include/linux/pgtable.h and
    make the latter include asm/pgtable.h.

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Brian Cain
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Ungerer
    Cc: Guan Xuetao
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Ingo Molnar
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Matthew Wilcox
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Nick Hu
    Cc: Paul Walmsley
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200514170327.31389-3-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

11 Apr, 2020

1 commit

  • There are many places where all basic VMA access flags (read, write,
    exec) are initialized or checked against as a group. One such example
    is during page fault. The existing vma_is_accessible() wrapper already
    treats VMA accessibility as a group of access permissions.

    Hence let's just create VM_ACCESS_FLAGS (VM_READ|VM_WRITE|VM_EXEC), which
    will not only reduce code duplication but also extend the VMA
    accessibility concept in general.

    Signed-off-by: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Mark Salter
    Cc: Nick Hu
    Cc: Ley Foon Tan
    Cc: Michael Ellerman
    Cc: Heiko Carstens
    Cc: Yoshinori Sato
    Cc: Guan Xuetao
    Cc: Dave Hansen
    Cc: Thomas Gleixner
    Cc: Rob Springer
    Cc: Greg Kroah-Hartman
    Cc: Geert Uytterhoeven
    Link: http://lkml.kernel.org/r/1583391014-8170-3-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
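
    As a minimal sketch of the grouping described above, the following defines the combined mask and uses it for a group check and a group clear. The struct is reduced to the single field needed here, and clear_access() is a hypothetical helper added for illustration.

```c
#include <stdbool.h>

#define VM_READ   0x1UL
#define VM_WRITE  0x2UL
#define VM_EXEC   0x4UL
#define VM_SHARED 0x8UL

/* All basic access permissions treated as one group. */
#define VM_ACCESS_FLAGS (VM_READ | VM_WRITE | VM_EXEC)

/* Reduced stand-in for the real structure. */
struct vm_area_struct {
    unsigned long vm_flags;
};

/* True if any of read, write or exec is permitted. */
static inline bool vma_is_accessible(const struct vm_area_struct *vma)
{
    return (vma->vm_flags & VM_ACCESS_FLAGS) != 0;
}

/* Hypothetical helper: drop every access bit, keep all other flags. */
static inline unsigned long clear_access(unsigned long vm_flags)
{
    return vm_flags & ~VM_ACCESS_FLAGS;
}
```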
     

08 Apr, 2020

4 commits

  • For both swap and page migration, we use bit 2 of the entry to identify
    whether the entry is uffd write-protected. It plays a similar role to
    the existing soft dirty bit in swap entries, but only keeps the uffd-wp
    tracking for a specific PTE/PMD.

    Something special here is that when we want to recover the uffd-wp bit
    from a swap/migration entry to the PTE bit we'll also need to take care of
    the _PAGE_RW bit and make sure it's cleared, otherwise even with the
    _PAGE_UFFD_WP bit we can't trap it at all.

    In change_pte_range() we do nothing for uffd if the PTE is a swap entry.
    That can lead to data mismatch if the page that we are going to write
    protect is swapped out when sending the UFFDIO_WRITEPROTECT. This patch
    also applies/removes the uffd-wp bit even for the swap entries.

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Jerome Glisse
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-11-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Firstly, introduce two new flags MM_CP_UFFD_WP[_RESOLVE] for
    change_protection() when used with uffd-wp and make sure the two new flags
    are exclusively used. Then,

    - For MM_CP_UFFD_WP: apply the _PAGE_UFFD_WP bit and remove _PAGE_RW
    when a range of memory is write protected by uffd

    - For MM_CP_UFFD_WP_RESOLVE: remove the _PAGE_UFFD_WP bit and recover
    _PAGE_RW when write protection is resolved from userspace

    And use this new interface in mwriteprotect_range() to replace the old
    MM_CP_DIRTY_ACCT.

    Do this change for both PTEs and huge PMDs. Then we can start to identify
    which PTE/PMD is write protected by general (e.g., COW or soft dirty
    tracking), and which is for userfaultfd-wp.

    Since we should keep the _PAGE_UFFD_WP when doing pte_modify(), add it
    into _PAGE_CHG_MASK as well. Meanwhile, since we have this new bit, we
    can be even more strict when detecting uffd-wp page faults in either
    do_wp_page() or wp_huge_pmd().

    Once _PAGE_UFFD_WP is in place, a special case is a page that is both
    protected by the general COW logic and by userfault-wp. Here the
    userfault-wp has higher priority and is handled first. Only after the
    uffd-wp bit is cleared on the PTE/PMD do we continue to handle the
    general COW. These are the steps that such a page goes through:

    1. CPU accesses write protected shared page (so both protected by
    general COW and uffd-wp), blocked by uffd-wp first because in
    do_wp_page we'll handle uffd-wp first, so it has higher priority
    than general COW.

    2. Uffd service thread receives the request and does UFFDIO_WRITEPROTECT
    to remove the uffd-wp bit from the PTE/PMD. However, here we
    still keep the write bit cleared. Notify the blocked CPU.

    3. The blocked CPU resumes the page fault process with a fault
    retry. During the retry it notices that the uffd-wp bit is no longer
    set but the page is still write protected by general COW, so
    it goes through the COW path in the fault handler, copies the page,
    applies the write bit where necessary, and retries again.

    4. The CPU will be able to access this page with write bit set.

    Suggested-by: Andrea Arcangeli
    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Cc: Brian Geffon
    Cc: Pavel Emelyanov
    Cc: Mike Kravetz
    Cc: David Hildenbrand
    Cc: Martin Cracauer
    Cc: Mel Gorman
    Cc: Bobby Powers
    Cc: Mike Rapoport
    Cc: "Kirill A . Shutemov"
    Cc: Maya Gokhale
    Cc: Johannes Weiner
    Cc: Marty McFadden
    Cc: Denis Plotnikov
    Cc: Hugh Dickins
    Cc: "Dr . David Alan Gilbert"
    Cc: Jerome Glisse
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-8-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
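
    The four steps above can be traced with a small bitmask model. This is a sketch under stated assumptions, not kernel code: the TOY_PTE_* bit positions, the flag values, the cow_protected parameter and both toy_* functions are invented for illustration, and the model only tracks which handler a write fault would reach first.

```c
#include <stdbool.h>

/* Illustrative bit positions, not tied to any architecture. */
#define TOY_PTE_RW       (1UL << 1)
#define TOY_PTE_UFFD_WP  (1UL << 10)

/* Flag names from the commit text; values chosen for the example. */
#define MM_CP_UFFD_WP          (1UL << 2)
#define MM_CP_UFFD_WP_RESOLVE  (1UL << 3)

enum toy_fault { TOY_FAULT_NONE, TOY_FAULT_UFFD_WP, TOY_FAULT_COW };

/* Apply or resolve uffd write protection on one modelled PTE. */
static int toy_change_pte(unsigned long *pte, unsigned long cp_flags,
                          bool cow_protected)
{
    bool wp = cp_flags & MM_CP_UFFD_WP;
    bool resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;

    if (wp && resolve)
        return -1;                  /* the two flags are exclusive */
    if (wp) {
        *pte |= TOY_PTE_UFFD_WP;    /* mark as uffd write-protected */
        *pte &= ~TOY_PTE_RW;        /* and make sure writes trap */
    } else if (resolve) {
        *pte &= ~TOY_PTE_UFFD_WP;
        if (!cow_protected)
            *pte |= TOY_PTE_RW;     /* restore write only if COW allows it */
    }
    return 0;
}

/* Which handler a write access reaches first: uffd-wp wins over COW. */
static enum toy_fault toy_write_fault(unsigned long pte)
{
    if (pte & TOY_PTE_UFFD_WP)
        return TOY_FAULT_UFFD_WP;
    if (!(pte & TOY_PTE_RW))
        return TOY_FAULT_COW;
    return TOY_FAULT_NONE;
}
```

    For a page that is both COW-protected and uffd-wp protected, the model faults into the uffd-wp handler first, then into COW after the resolve, and stops faulting once the write bit is set.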
     
  • change_protection() is used by both the NUMA and the mprotect() code,
    and there is one parameter for each of the callers (dirty_accountable
    and prot_numa). Further, these parameters are passed along the calls:

    - change_protection_range()
    - change_p4d_range()
    - change_pud_range()
    - change_pmd_range()
    - ...

    Now we introduce a flag for change_protection() and all these helpers to
    replace these parameters. Then we can avoid passing multiple parameters
    multiple times along the way.

    More importantly, it'll greatly simplify the work if we want to introduce
    any new parameters to change_protection(). In the follow up patches, a
    new parameter for userfaultfd write protection will be introduced.

    No functional change at all.

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jerome Glisse
    Cc: Andrea Arcangeli
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-7-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
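
    A minimal sketch of the refactoring described above: one flags word travels down the call chain in place of one boolean per caller. The toy_* functions and the result struct are hypothetical, and the flag values are chosen for the example.

```c
#include <stdbool.h>

#define MM_CP_DIRTY_ACCT  (1UL << 0)   /* replaces the dirty_accountable parameter */
#define MM_CP_PROT_NUMA   (1UL << 1)   /* replaces the prot_numa parameter */

struct toy_result {
    bool dirty_accountable;
    bool prot_numa;
};

/* Leaf of the chain: the only place that decodes the flags. */
static struct toy_result toy_change_pte_range(unsigned long cp_flags)
{
    struct toy_result r = {
        .dirty_accountable = (cp_flags & MM_CP_DIRTY_ACCT) != 0,
        .prot_numa = (cp_flags & MM_CP_PROT_NUMA) != 0,
    };
    return r;
}

/* Intermediate levels forward a single word, whatever it contains. */
static struct toy_result toy_change_pmd_range(unsigned long cp_flags)
{
    return toy_change_pte_range(cp_flags);
}

static struct toy_result toy_change_protection(unsigned long cp_flags)
{
    return toy_change_pmd_range(cp_flags);
}
```

    Adding a further option then means defining one more bit, with no change to the signatures of the intermediate levels.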
     
  • Some comments for MADV_FREE are revised and added to help people
    understand the MADV_FREE code, especially the page flag PG_swapbacked.
    This leaves page_is_file_cache() inconsistent with its comments, so the
    function is renamed to page_is_file_lru() to make them consistent again.
    All these are put in one patch as one logical change.

    Suggested-by: David Hildenbrand
    Suggested-by: Johannes Weiner
    Suggested-by: David Rientjes
    Signed-off-by: "Huang, Ying"
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Acked-by: Pankaj Gupta
    Acked-by: Vlastimil Babka
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/20200317100342.2730705-1-ying.huang@intel.com
    Signed-off-by: Linus Torvalds

    Huang Ying
     

06 Mar, 2020

1 commit

  • : A user reported a bug against a distribution kernel while running a
    : proprietary workload described as "memory intensive that is not swapping"
    : that is expected to apply to mainline kernels. The workload is
    : read/write/modifying ranges of memory and checking the contents. They
    : reported that within a few hours a bad PMD would be reported, followed
    : by a memory corruption where expected data was all zeros. A partial
    : report of the bad PMD looked like
    :
    : [ 5195.338482] ../mm/pgtable-generic.c:33: bad pmd ffff8888157ba008(000002e0396009e2)
    : [ 5195.341184] ------------[ cut here ]------------
    : [ 5195.356880] kernel BUG at ../mm/pgtable-generic.c:35!
    : ....
    : [ 5195.410033] Call Trace:
    : [ 5195.410471] [] change_protection_range+0x7dd/0x930
    : [ 5195.410716] [] change_prot_numa+0x18/0x30
    : [ 5195.410918] [] task_numa_work+0x1fe/0x310
    : [ 5195.411200] [] task_work_run+0x72/0x90
    : [ 5195.411246] [] exit_to_usermode_loop+0x91/0xc2
    : [ 5195.411494] [] prepare_exit_to_usermode+0x31/0x40
    : [ 5195.411739] [] retint_user+0x8/0x10
    :
    : Decoding revealed that the PMD was a valid prot_numa PMD and the bad PMD
    : was a false detection. The bug does not trigger if automatic NUMA
    : balancing or transparent huge pages is disabled.
    :
    : The bug is due to a race in change_pmd_range between a pmd_trans_huge and
    : pmd_none_or_clear_bad check without any locks held. During the
    : pmd_trans_huge check, a parallel protection update under lock can have
    : cleared the PMD and filled it with a prot_numa entry between the transhuge
    : check and the pmd_none_or_clear_bad check.
    :
    : While this could be fixed with heavy locking, it's only necessary to make
    : a copy of the PMD on the stack during change_pmd_range and avoid races. A
    : new helper is created for this as the check is quite subtle and the
    : existing similar helper is not suitable. This passed 154 hours of
    : testing (usually triggers between 20 minutes and 24 hours) without
    : detecting bad PMDs or corruption. A basic test of an autonuma-intensive
    : workload showed no significant change in behaviour.

    Although Mel withdrew the patch in the face of the LKML comment at
    https://lkml.org/lkml/2017/4/10/922, the race window described above is
    still open, and we have reports of Linpack test reporting bad residuals
    after the bad PMD warning is observed. In addition to that, bad
    rss-counter and non-zero pgtables assertions are triggered on mm teardown
    for the task hitting the bad PMD.

    host kernel: mm/pgtable-generic.c:40: bad pmd 00000000b3152f68(8000000d2d2008e7)
    ....
    host kernel: BUG: Bad rss-counter state mm:00000000b583043d idx:1 val:512
    host kernel: BUG: non-zero pgtables_bytes on freeing mm: 4096

    The issue is observed on a v4.18-based distribution kernel, but the race
    window is expected to be applicable to mainline kernels, as well.

    [akpm@linux-foundation.org: fix comment typo, per Rafael]
    Signed-off-by: Andrew Morton
    Signed-off-by: Rafael Aquini
    Signed-off-by: Mel Gorman
    Cc:
    Cc: Zi Yan
    Cc: "Kirill A. Shutemov"
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200216191800.22423-1-aquini@redhat.com
    Signed-off-by: Linus Torvalds

    Mel Gorman
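
    The fix described above amounts to reading the PMD once and running every check against that local copy, so a concurrent update cannot slip in between two checks. The sketch below models this with C11 atomics; the TOY_PMD_* bit layout and the function name are assumptions made for the example.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative entry layout. */
#define TOY_PMD_TABLE  (1ULL << 0)   /* points to a lower-level table */
#define TOY_PMD_HUGE   (1ULL << 7)   /* maps a huge page directly */

/*
 * Returns true if the caller should skip this entry.
 * Every check uses the same snapshot of the entry.
 */
static bool toy_pmd_none_or_clear_bad_unless_huge(_Atomic uint64_t *pmdp)
{
    uint64_t pmdval = atomic_load(pmdp);   /* single read onto the stack */

    if (pmdval == 0)
        return true;                       /* empty: nothing to do */
    if (pmdval & TOY_PMD_HUGE)
        return false;                      /* huge entry: valid, not "bad" */
    if (!(pmdval & TOY_PMD_TABLE)) {
        atomic_store(pmdp, 0);             /* genuinely bad: clear and skip */
        return true;
    }
    return false;
}
```

    A racy variant would load the entry separately for the huge check and for the bad check, and it is the gap between those two loads that allows a valid entry to be misreported as bad.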
     

02 Dec, 2019

1 commit

  • In auto NUMA balancing page table scanning, if pte_protnone() is
    true, the PTE need not be changed because it is already in the target
    state. Any further checks on the corresponding struct page are
    therefore unnecessary too.

    So, if we check pte_protnone() first for each PTE, we can avoid
    unnecessary struct page accesses and thereby reduce the cache footprint
    of NUMA balancing page table scanning.

    In a performance test of the pmbench memory accessing benchmark with an
    80:20 read/write ratio and normal access address distribution on a
    2-socket Intel server with Optane DC Persistent Memory, perf profiling
    shows that the autonuma page table scanning time drops from 1.23% to
    0.97% (that is, a 21% reduction) with the patch.

    Link: http://lkml.kernel.org/r/20191101075727.26683-3-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Acked-by: Mel Gorman
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Dave Hansen
    Cc: Dan Williams
    Cc: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
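
    The reordering described above can be sketched as a scan loop that tests the cheap per-entry condition before touching any per-page data. The toy_pte structure and the lookup counter are stand-ins invented for this example, and the counter merely represents accesses to struct page.

```c
#include <stdbool.h>
#include <stddef.h>

/* Modelled page table entry. */
struct toy_pte {
    bool protnone;   /* already in the target state */
};

/*
 * Scan n entries and return how many were updated.
 * *page_lookups counts how often per-page data had to be consulted.
 */
static size_t toy_scan(struct toy_pte *ptes, size_t n, size_t *page_lookups)
{
    size_t updated = 0;

    *page_lookups = 0;
    for (size_t i = 0; i < n; i++) {
        if (ptes[i].protnone)
            continue;              /* check first: no page access needed */
        (*page_lookups)++;         /* stands in for reading struct page */
        ptes[i].protnone = true;
        updated++;
    }
    return updated;
}
```

    A second pass over the same entries performs no lookups at all, which is the cache footprint saving the commit describes.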
     

26 Sep, 2019

1 commit

  • This patch is a part of a series that extends kernel ABI to allow to pass
    tagged user pointers (with the top byte set to something else other than
    0x00) as syscall arguments.

    This patch allows tagged pointers to be passed to the following memory
    syscalls: get_mempolicy, madvise, mbind, mincore, mlock, mlock2, mprotect,
    mremap, msync, munlock, move_pages.

    The mmap and mremap syscalls do not currently accept tagged addresses.
    Architectures may interpret the tag as a background colour for the
    corresponding vma.

    Link: http://lkml.kernel.org/r/aaf0c0969d46b2feb9017f3e1b3ef3970b633d91.1563904656.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Reviewed-by: Khalid Aziz
    Reviewed-by: Vincenzo Frascino
    Reviewed-by: Catalin Marinas
    Reviewed-by: Kees Cook
    Cc: Al Viro
    Cc: Dave Hansen
    Cc: Eric Auger
    Cc: Felix Kuehling
    Cc: Jens Wiklander
    Cc: Mauro Carvalho Chehab
    Cc: Mike Rapoport
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
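
    To illustrate what accepting a tagged pointer involves, the sketch below strips the top byte from a user address before it is used for a lookup. This is a simplification for userspace addresses only; the function name is hypothetical and the kernel's own helper may differ in detail.

```c
#include <stdint.h>

/* Bits 63..56 carry the tag in this model. */
#define TOY_TAG_SHIFT 56
#define TOY_TAG_MASK  (0xffULL << TOY_TAG_SHIFT)

/* Return the address with the tag byte cleared. */
static inline uint64_t toy_untag_user_addr(uint64_t addr)
{
    return addr & ~TOY_TAG_MASK;
}

/* Return the tag carried in the top byte. */
static inline uint8_t toy_addr_tag(uint64_t addr)
{
    return (uint8_t)(addr >> TOY_TAG_SHIFT);
}
```

    A syscall that accepts tagged pointers would untag the address once on entry and use the untagged value for every VMA lookup and range check.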
     

07 Sep, 2019

2 commits

  • The mm_walk structure currently mixes data and code. Split out the
    operations vectors into a new mm_walk_ops structure, and while we are
    changing the API also declare the mm_walk structure inside the
    walk_page_range and walk_page_vma functions.

    Based on patch from Linus Torvalds.

    Link: https://lore.kernel.org/r/20190828141955.22210-3-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Thomas Hellstrom
    Reviewed-by: Steven Price
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
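
    The split described above separates a constant table of callbacks from the per-walk state, which the walker itself now declares. A minimal sketch of that pattern follows; every toy_* name is hypothetical and the walker is reduced to a fixed-stride loop.

```c
#include <stddef.h>

struct toy_walk;

/* Code: a constant operations vector that can be shared between walks. */
struct toy_walk_ops {
    int (*entry)(unsigned long addr, struct toy_walk *walk);
};

/* Data: per-walk state, created by the walker rather than by the caller. */
struct toy_walk {
    const struct toy_walk_ops *ops;
    void *private;
};

static int toy_walk_range(unsigned long start, unsigned long end,
                          unsigned long step,
                          const struct toy_walk_ops *ops, void *private)
{
    struct toy_walk walk = { .ops = ops, .private = private };

    for (unsigned long addr = start; addr < end; addr += step) {
        int err = walk.ops->entry(addr, &walk);
        if (err)
            return err;     /* a callback can abort the walk */
    }
    return 0;
}

/* Example callback: count visited entries via the private pointer. */
static int toy_count_entry(unsigned long addr, struct toy_walk *walk)
{
    (void)addr;
    ++*(int *)walk->private;
    return 0;
}

static const struct toy_walk_ops toy_count_ops = {
    .entry = toy_count_entry,
};
```

    Since the operations table is const, it can live in read-only memory, while callers pass only the table and their private pointer.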
     
  • Add a new header for the two handful of users of the walk_page_range /
    walk_page_vma interface instead of polluting all users of mm.h with it.

    Link: https://lore.kernel.org/r/20190828141955.22210-2-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Thomas Hellstrom
    Reviewed-by: Steven Price
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     

15 May, 2019

3 commits

  • Since 0cbe3e26abe0 ("mm: update ptep_modify_prot_start/commit to take
    vm_area_struct as arg") the only place that uses the local 'mm' variable
    in change_pte_range() is the call to set_pte_at().

    Many architectures define set_pte_at() as a macro that does not use the
    'mm' parameter, which generates the following compilation warning:

    CC mm/mprotect.o
    mm/mprotect.c: In function 'change_pte_range':
    mm/mprotect.c:42:20: warning: unused variable 'mm' [-Wunused-variable]
    struct mm_struct *mm = vma->vm_mm;
    ^~

    Fix it by passing vma->vm_mm to set_pte_at() and dropping the local 'mm'
    variable in change_pte_range().

    [liu.song.a23@gmail.com: fix missed conversions]
    Link: http://lkml.kernel.org/r/CAPhsuW6wcQgYLHNdBdw6m0YiR4RWsS4XzfpSKU7wBLLeOCTbpw@mail.gmail.com
    Link: http://lkml.kernel.org/r/1557305432-4940-1-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Cc: Song Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • This updates each existing invalidation to use the correct mmu notifier
    event that represents what is happening to the CPU page table. See the
    patch which introduced the events to see the rationale behind this.

    Link: http://lkml.kernel.org/r/20190326164747.24405-7-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Ralph Campbell
    Reviewed-by: Ira Weiny
    Cc: Christian König
    Cc: Joonas Lahtinen
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Peter Xu
    Cc: Felix Kuehling
    Cc: Jason Gunthorpe
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: John Hubbard
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • CPU page table updates can happen for many reasons, not only as a result
    of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
    a result of kernel activities (memory compression, reclaim, migration,
    ...).

    Users of the mmu notifier API track changes to the CPU page table and
    take specific actions for them. However, the current API only provides
    the range of virtual addresses affected by the change, not why the
    change is happening.

    This patchset does the initial mechanical conversion of all the places
    that call mmu_notifier_range_init to also provide the default
    MMU_NOTIFY_UNMAP event as well as the vma if it is known (most
    invalidations happen against a given vma). Passing down the vma allows
    the users of mmu notifier to inspect the new vma page protection.

    MMU_NOTIFY_UNMAP is always the safe default, as users of mmu notifier
    should assume that everything in the range is going away when that event
    happens. A later patch converts the mm call paths to use a more
    appropriate event for each call.

    This is done as 2 patches so that no call site is forgotten, especially
    as it uses the following coccinelle patch:

    %vm_mm, E3, E4)
    ...>

    @@
    expression E1, E2, E3, E4;
    identifier FN, VMA;
    @@
    FN(..., struct vm_area_struct *VMA, ...) {
    }

    @@
    expression E1, E2, E3, E4;
    identifier FN, VMA;
    @@
    FN(...) {
    struct vm_area_struct *VMA;
    }

    @@
    expression E1, E2, E3, E4;
    identifier FN;
    @@
    FN(...) {
    }
    ---------------------------------------------------------------------->%

    Applied with:
    spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
    spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
    spatch --sp-file mmu-notifier.spatch --dir mm --in-place

    Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Ralph Campbell
    Reviewed-by: Ira Weiny
    Cc: Christian König
    Cc: Joonas Lahtinen
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Peter Xu
    Cc: Felix Kuehling
    Cc: Jason Gunthorpe
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: John Hubbard
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

06 Mar, 2019

2 commits

  • Architectures like ppc64 need to do a conditional tlb flush based on
    the old and new value of the pte. Enable that by passing the old pte
    value as an argument.

    Link: http://lkml.kernel.org/r/20190116085035.29729-3-aneesh.kumar@linux.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: Nicholas Piggin
    Cc: Paul Mackerras
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • Patch series "NestMMU pte upgrade workaround for mprotect", v5.

    We can upgrade pte access (R -> RW transition) via mprotect. We need to
    make sure we follow the recommended pte update sequence as outlined in
    commit bd5050e38aec ("powerpc/mm/radix: Change pte relax sequence to
    handle nest MMU hang") for such updates. This patch series does that.

    This patch (of 5):

    Some architectures may want to call flush_tlb_range from these helpers.

    Link: http://lkml.kernel.org/r/20190116085035.29729-2-aneesh.kumar@linux.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Cc: Nicholas Piggin
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     

29 Dec, 2018

1 commit

  • To avoid having to change many call sites every time we want to add a
    parameter, use a structure to group all parameters for the mmu_notifier
    invalidate_range_start/end calls. No functional changes with this patch.

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Acked-by: Christian König
    Acked-by: Jan Kara
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Felix Kuehling
    Cc: Ralph Campbell
    Cc: John Hubbard
    From: Jérôme Glisse
    Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3

    fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n

    Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
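
    The parameter-grouping idea above can be sketched as follows. The toy_* names and the choice of fields are assumptions for illustration: call sites fill in one structure and pass a pointer, so a new parameter becomes a new field, not a new argument at every call site.

```c
/* Minimal stand-in for the address space being modified. */
struct toy_mm {
    int id;
};

/* All parameters of an invalidation grouped in one place. */
struct toy_notifier_range {
    struct toy_mm *mm;
    unsigned long start;
    unsigned long end;
};

static inline void toy_range_init(struct toy_notifier_range *range,
                                  struct toy_mm *mm,
                                  unsigned long start, unsigned long end)
{
    range->mm = mm;
    range->start = start;
    range->end = end;
}

/* Callbacks take the structure; here it only reports the span in bytes. */
static unsigned long toy_invalidate_range_start(const struct toy_notifier_range *range)
{
    return range->end - range->start;
}
```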
     

21 Jun, 2018

1 commit

  • For L1TF, PROT_NONE mappings are protected by inverting the PFN in the
    page table entry. This sets the high bits in the CPU's address space,
    thus making sure an unmapped entry does not point to valid cached memory.

    Some server system BIOSes put the MMIO mappings high up in the physical
    address space. If such a high mapping were mapped to unprivileged users,
    they could attack low memory by setting such a mapping to PROT_NONE. This
    could happen through a special device driver which is not access
    protected. Normal /dev/mem is of course access protected.

    To avoid this, forbid PROT_NONE mappings or mprotect for high MMIO mappings.

    Valid page mappings are allowed because the system is then unsafe anyways.

    It's not expected that users commonly use PROT_NONE on MMIO. But to
    minimize any impact this is only enforced if the mapping actually refers to
    a high MMIO address (defined as the MAX_PA-1 bit being set), and also skip
    the check for root.

    For mmaps this is straightforward and can be handled in vm_insert_pfn and
    in remap_pfn_range().

    For mprotect it's a bit trickier. At the point where the actual PTEs are
    accessed a lot of state has been changed and it would be difficult to undo
    on an error. Since this is an uncommon case, use a separate early page
    table walk pass for MMIO PROT_NONE mappings that checks for this condition
    early. For non MMIO and non PROT_NONE there are no changes.

    Signed-off-by: Andi Kleen
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Josh Poimboeuf
    Acked-by: Dave Hansen

    Andi Kleen
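
    The policy described above can be written down as a single predicate. This is a sketch of the stated rules only: TOY_MAX_PA_BITS, TOY_PAGE_SHIFT and both function names are assumptions for the example, and the real limit depends on the CPU.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT   12
#define TOY_MAX_PA_BITS  46   /* hypothetical physical address width */

/* "High MMIO" is defined as the MAX_PA-1 bit of the physical address being set. */
static bool toy_is_high_mmio(uint64_t pfn)
{
    uint64_t phys = pfn << TOY_PAGE_SHIFT;

    return (phys >> (TOY_MAX_PA_BITS - 1)) & 1;
}

/* May this PFN be mapped with the requested protection? */
static bool toy_pfn_modify_allowed(uint64_t pfn, bool prot_none,
                                   bool valid_page, bool is_root)
{
    if (!prot_none)
        return true;    /* only PROT_NONE entries get their PFN inverted */
    if (valid_page)
        return true;    /* valid page mappings are always allowed */
    if (is_root)
        return true;    /* the check is skipped for root */
    return !toy_is_high_mmio(pfn);
}
```

    Only the combination of PROT_NONE, a non-RAM PFN in the top half of the physical address space, and an unprivileged caller is refused.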
     

12 Apr, 2018

1 commit

  • change_pte_range is called from task work context to mark PTEs for
    receiving NUMA faulting hints. If the marked pages are dirty then
    migration may fail. Some filesystems cannot migrate dirty pages without
    blocking so are skipped in MIGRATE_ASYNC mode which just wastes CPU.
    Even when they can, it can be a waste of cycles when the pages are
    shared forcing higher scan rates. This patch avoids marking shared
    dirty pages for hinting faults but also will skip a migration if the
    page was dirtied after the scanner updated a clean page.

    This is most noticeable running the NASA Parallel Benchmark when backed
    by btrfs, the default root filesystem for some distributions, but also
    noticeable when using XFS.

    The following are results from a 4-socket machine running a 4.16-rc4
    kernel with some scheduler patches that are pending for the next merge
    window.

                          4.16.0-rc4           4.16.0-rc4
                   schedtip-20180309           nodirty-v1
    Time cg.D       459.07 (  0.00%)     444.21 (  3.24%)
    Time ep.D        76.96 (  0.00%)      77.69 ( -0.95%)
    Time is.D        25.55 (  0.00%)      27.85 ( -9.00%)
    Time lu.D       601.58 (  0.00%)     596.87 (  0.78%)
    Time mg.D       107.73 (  0.00%)     108.22 ( -0.45%)

    is.D regresses slightly in terms of absolute time but note that that
    particular load varies quite a bit from run to run. The more relevant
    observation is the total system CPU usage.

                          4.16.0-rc4           4.16.0-rc4
                   schedtip-20180309           nodirty-v1
    User                    71471.91             70627.04
    System                  11078.96              8256.13
    Elapsed                   661.66               632.74

    That is a substantial drop in system CPU usage and overall the workload
    completes faster. The NUMA balancing statistics are also interesting

    NUMA base PTE updates        111407972    139848884
    NUMA huge PMD updates           206506       264869
    NUMA page range updates      217139044    275461812
    NUMA hint faults               4300924      3719784
    NUMA hint local faults         3012539      3416618
    NUMA hint local percent             70           91
    NUMA pages migrated            1517487      1358420

    While more PTEs are scanned due to changes in what faults are gathered,
    it's clear that a far higher percentage of faults are local as the bulk
    of the remote hits were dirty pages that, in this case with btrfs, had
    no chance of migrating.

    The following is a comparison when using XFS as that is a more realistic
    filesystem choice for a data partition

                          4.16.0-rc4           4.16.0-rc4
                   schedtip-20180309        nodirty-v1r47
    Time cg.D       485.28 (  0.00%)     442.62 (  8.79%)
    Time ep.D        77.68 (  0.00%)      77.54 (  0.18%)
    Time is.D        26.44 (  0.00%)      24.79 (  6.24%)
    Time lu.D       597.46 (  0.00%)     597.11 (  0.06%)
    Time mg.D       142.65 (  0.00%)     105.83 ( 25.81%)

    That is a reasonable gain on two relatively long-lived workloads. While
    not presented, there is also a substantial drop in system CPU usage and
    the NUMA balancing stats show similar improvements in locality as btrfs
    did.

    Link: http://lkml.kernel.org/r/20180326094334.zserdec62gwmmfqf@techsingularity.net
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

18 Mar, 2018

2 commits

  • When protection bits are changed on a VMA, some of the architecture
    specific flags should be cleared as well. An example of this is the
    PKEY flags on x86. This patch expands the current code that clears
    PKEY flags for x86, to support similar functionality for other
    architectures as well.

    Signed-off-by: Khalid Aziz
    Cc: Khalid Aziz
    Reviewed-by: Anthony Yznaga
    Acked-by: Andrew Morton
    Signed-off-by: David S. Miller

    Khalid Aziz
     
  • A protection flag may not be valid across entire address space and
    hence arch_validate_prot() might need the address a protection bit is
    being set on to ensure it is a valid protection flag. For example, sparc
    processors support memory corruption detection (as part of ADI feature)
    flag on memory addresses mapped on to physical RAM but not on PFN mapped
    pages or addresses mapped on to devices. This patch adds address to the
    parameters being passed to arch_validate_prot() so protection bits can
    be validated in the relevant context.

    Signed-off-by: Khalid Aziz
    Cc: Khalid Aziz
    Reviewed-by: Anthony Yznaga
    Acked-by: Michael Ellerman (powerpc)
    Acked-by: Andrew Morton
    Signed-off-by: David S. Miller

    Khalid Aziz
     

01 Feb, 2018

1 commit

  • Workloads consisting of a large number of processes running the same
    program with a very large shared data segment may experience performance
    problems when numa balancing attempts to migrate the shared cow pages.
    This manifests itself with many processes or tasks in
    TASK_UNINTERRUPTIBLE state waiting for the shared pages to be migrated.

    The program listed below simulates the conditions with these results
    when run with 288 processes on a 144 core/8 socket machine.

    Average throughput      Average throughput      Average throughput
    with numa_balancing=0   with numa_balancing=1   with numa_balancing=1
                            without the patch       with the patch
    ---------------------   ---------------------   ---------------------
    2118782                 2021534                 2107979

    Complex production environments show less variability and fewer poorly
    performing outliers accompanied with a smaller number of processes
    waiting on NUMA page migration with this patch applied. In some cases,
    %iowait drops from 16%-26% to 0.

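    // SPDX-License-Identifier: GPL-2.0
    /*
     * Copyright (c) 2017 Oracle and/or its affiliates. All rights reserved.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/time.h>
    #include <sys/wait.h>

    int a[1000000] = {13};

    int main(int argc, const char **argv)
    {
        int n = 0;
        int i;
        pid_t pid;
        int stat;
        int *count_array;
        int cpu_count = 288;
        long total = 0;

        /* t2 first holds the run time in seconds (default 10). */
        struct timeval t1, t2 = {(argc > 1 ? atoi(argv[1]) : 10), 0};

        if (argc > 2)
            cpu_count = atoi(argv[2]);

        /* Shared array: one iteration counter per child. */
        count_array = mmap(NULL, cpu_count * sizeof(int),
                           (PROT_READ|PROT_WRITE),
                           (MAP_SHARED|MAP_ANONYMOUS), 0, 0);

        if (count_array == MAP_FAILED) {
            perror("mmap:");
            return 0;
        }

        /* A child leaves the loop at once, keeping i as its slot index. */
        for (i = 0; i < cpu_count; ++i) {
            pid = fork();
            if (pid <= 0)
                break;
        }

        if (pid != 0) {
            /* Parent: wait for every child, then report the total. */
            for (;;) {
                pid = wait(&stat);
                if (pid < 0)
                    break;
            }

            for (i = 0; i < cpu_count; ++i)
                total += count_array[i];

            printf("Total %ld\n", total);
            munmap(count_array, cpu_count * sizeof(int));
            return 0;
        }

        /* Child: sum the shared data segment until the deadline passes. */
        gettimeofday(&t1, 0);
        timeradd(&t1, &t2, &t1);
        while (timercmp(&t2, &t1, <)) {
            int b = 0;
            int j;

            for (j = 0; j < 1000000; j++)
                b += a[j];
            gettimeofday(&t2, 0);
            n++;
        }
        count_array[i] = n;
        return 0;
    }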

    This patch changes change_pte_range() to skip shared copy-on-write pages
    when called from change_prot_numa().
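
    The skip condition can be sketched as a small predicate. This is a
    userspace model, not the kernel code: skip_numa_hint() and its
    mapcount parameter are assumptions made for illustration, and the flag
    values are only there to make the example self-contained.

```c
#include <stdbool.h>

#define VM_SHARED    0x08UL   /* illustrative values */
#define VM_MAYWRITE  0x20UL

/* A mapping is copy-on-write when it is private but may become writable. */
static bool is_cow_mapping(unsigned long vm_flags)
{
    return (vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

/* Skip NUMA hinting for a COW page that more than one process still maps. */
static bool skip_numa_hint(unsigned long vm_flags, int mapcount)
{
    return is_cow_mapping(vm_flags) && mapcount != 1;
}
```

    In the test program above, the data segment holding a[] is a private
    writable mapping shared by all 288 children until one of them writes
    to it, so its pages would be skipped under this rule.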

    NOTE: change_prot_numa() is nominally called from task_numa_work() and
    queue_pages_test_walk(). task_numa_work() is the auto NUMA balancing
    path, and queue_pages_test_walk() is part of explicit NUMA policy
    management. However, queue_pages_test_walk() only calls
    change_prot_numa() when MPOL_MF_LAZY is specified and currently that is
    not allowed, so change_prot_numa() is only called from auto NUMA
    balancing.

    In the case of explicit NUMA policy management, shared pages are not
    migrated unless MPOL_MF_MOVE_ALL is specified, and MPOL_MF_MOVE_ALL
    depends on CAP_SYS_NICE. Currently, there is no way to pass information
    about MPOL_MF_MOVE_ALL to change_pte_range. This will have to be fixed
    if MPOL_MF_LAZY is enabled and MPOL_MF_MOVE_ALL is to be honored in lazy
    migration mode.

    task_numa_work() skips the read-only VMAs of programs and shared
    libraries.

    Link: http://lkml.kernel.org/r/1516751617-7369-1-git-send-email-henry.willard@oracle.com
    Signed-off-by: Henry Willard
    Reviewed-by: Håkon Bugge
    Reviewed-by: Steve Sistare
    Acked-by: Mel Gorman
    Cc: Kate Stewart
    Cc: Zi Yan
    Cc: Philippe Ombredanne
    Cc: Andrea Arcangeli
    Cc: Greg Kroah-Hartman
    Cc: Aneesh Kumar K.V
    Cc: Kirill A. Shutemov
    Cc: "Jérôme Glisse"
    Signed-off-by: Andrew Morton

    Signed-off-by: Linus Torvalds

    Henry Willard
     

05 Jan, 2018

1 commit

  • While testing on a large CPU system, detected the following RCU stall
    many times over the span of the workload. This problem is solved by
    adding a cond_resched() in the change_pmd_range() function.

    INFO: rcu_sched detected stalls on CPUs/tasks:
    154-....: (670 ticks this GP) idle=022/140000000000000/0 softirq=2825/2825 fqs=612
    (detected by 955, t=6002 jiffies, g=4486, c=4485, q=90864)
    Sending NMI from CPU 955 to CPUs 154:
    NMI backtrace for cpu 154
    CPU: 154 PID: 147071 Comm: workload Not tainted 4.15.0-rc3+ #3
    NIP: c0000000000b3f64 LR: c0000000000b33d4 CTR: 000000000000aa18
    REGS: 00000000a4b0fb44 TRAP: 0501 Not tainted (4.15.0-rc3+)
    MSR: 8000000000009033 CR: 22422082 XER: 00000000
    CFAR: 00000000006cf8f0 SOFTE: 1
    GPR00: 0010000000000000 c00003ef9b1cb8c0 c0000000010cc600 0000000000000000
    GPR04: 8e0000018c32b200 40017b3858fd6e00 8e0000018c32b208 40017b3858fd6e00
    GPR08: 8e0000018c32b210 40017b3858fd6e00 8e0000018c32b218 40017b3858fd6e00
    GPR12: ffffffffffffffff c00000000fb25100
    NIP [c0000000000b3f64] plpar_hcall9+0x44/0x7c
    LR [c0000000000b33d4] pSeries_lpar_flush_hash_range+0x384/0x420
    Call Trace:
    flush_hash_range+0x48/0x100
    __flush_tlb_pending+0x44/0xd0
    hpte_need_flush+0x408/0x470
    change_protection_range+0xaac/0xf10
    change_prot_numa+0x30/0xb0
    task_numa_work+0x2d0/0x3e0
    task_work_run+0x130/0x190
    do_notify_resume+0x118/0x120
    ret_from_except_lite+0x70/0x74
    Instruction dump:
    60000000 f8810028 7ca42b78 7cc53378 7ce63b78 7d074378 7d284b78 7d495378
    e9410060 e9610068 e9810070 44000022 e9810028 f88c0000 f8ac0008

    Link: http://lkml.kernel.org/r/20171214140551.5794-1-khandual@linux.vnet.ibm.com
    Signed-off-by: Anshuman Khandual
    Suggested-by: Nicholas Piggin
    Acked-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

09 Sep, 2017

2 commits

  • HMM (heterogeneous memory management) needs struct page to support
    migration from system main memory to device memory. The reasons for HMM
    and for migration to device memory are explained in the HMM core patch.

    This patch deals with device memory that is un-addressable (i.e. the CPU
    cannot access it). Hence we do not want those struct pages to be managed
    like regular memory. That is why we extend ZONE_DEVICE to support
    different types of memory.

    A persistent memory type is defined for existing users of ZONE_DEVICE,
    and a new device un-addressable type is added for the un-addressable
    memory. There is a clear separation between what is expected from each
    memory type, and existing users of ZONE_DEVICE are unaffected by the new
    requirements and the new use of the un-addressable type. All specific
    code paths are protected with a test against the memory type.

    Because the memory is un-addressable, we use a new special swap type for
    when a page is migrated to device memory (this reduces the maximum
    number of swap files).

    Besides the memory type, the two main additions to ZONE_DEVICE are two
    callbacks. The first one, page_free(), is called whenever the page
    refcount reaches 1 (which means the page is free, as a ZONE_DEVICE page
    never reaches a refcount of 0). This allows the device driver to manage
    its memory and the associated struct page.

    The second callback, page_fault(), is called when there is a CPU access
    to an address that is backed by a device page (which is un-addressable
    by the CPU). This callback is responsible for migrating the page back to
    system main memory. A device driver cannot block migration back to
    system memory; HMM makes sure that such pages cannot be pinned into
    device memory.

    If the device is in some error condition and cannot migrate memory back,
    then a CPU page fault to device memory should end with SIGBUS.

    [arnd@arndb.de: fix warning]
    Link: http://lkml.kernel.org/r/20170823133213.712917-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/20170817000548.32038-8-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Arnd Bergmann
    Acked-by: Dan Williams
    Cc: Ross Zwisler
    Cc: Aneesh Kumar
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Paul E. McKenney
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • When THP migration is being used, memory management code needs to handle
    pmd migration entries properly. This patch uses !pmd_present() or
    is_swap_pmd() (depending on whether pmd_none() needs separate code or
    not) to check pmd migration entries at the places where a pmd entry is
    present.

    Since pmd-related code uses split_huge_page(), split_huge_pmd(),
    pmd_trans_huge(), pmd_trans_unstable(), or
    pmd_none_or_trans_huge_or_clear_bad(), this patch:

    1. adds pmd migration entry split code in split_huge_pmd(),

    2. takes care of pmd migration entries whenever pmd_trans_huge() is present,

    3. makes pmd_none_or_trans_huge_or_clear_bad() pmd migration entry aware.

    Since split_huge_page() uses split_huge_pmd() and pmd_trans_unstable()
    is equivalent to pmd_none_or_trans_huge_or_clear_bad(), we do not change
    them.

    Until this commit, a pmd entry should be:
    1. pointing to a pte page,
    2. is_swap_pmd(),
    3. pmd_trans_huge(),
    4. pmd_devmap(), or
    5. pmd_none().
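
    The five cases can be pictured as a classifier. The sketch below is a
    toy model only: struct model_pmd and its boolean fields are invented
    for illustration, and the order of the checks is an assumption about
    how such states might be told apart, with the !present case standing
    in for a pmd migration entry.

```c
#include <stdbool.h>

enum pmd_kind {
    PMD_KIND_NONE,
    PMD_KIND_SWAP,        /* not present: e.g. a pmd migration entry */
    PMD_KIND_TRANS_HUGE,
    PMD_KIND_DEVMAP,
    PMD_KIND_PTE_TABLE    /* points to a pte page */
};

/* Hypothetical stand-in for the bits a real pmd would encode. */
struct model_pmd {
    bool none;
    bool present;
    bool huge;
    bool devmap;
};

static enum pmd_kind classify_pmd(struct model_pmd pmd)
{
    if (pmd.none)
        return PMD_KIND_NONE;
    if (!pmd.present)
        return PMD_KIND_SWAP;       /* must be tested before huge/devmap */
    if (pmd.devmap)
        return PMD_KIND_DEVMAP;
    if (pmd.huge)
        return PMD_KIND_TRANS_HUGE;
    return PMD_KIND_PTE_TABLE;
}
```

    The point of the model is the second check: code that only asked
    "is it huge?" would previously never have met a non-present entry.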

    Signed-off-by: Zi Yan
    Cc: Kirill A. Shutemov
    Cc: "H. Peter Anvin"
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Cc: David Nellans
    Cc: Ingo Molnar
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zi Yan
     

11 Aug, 2017

1 commit

  • Patch series "fixes of TLB batching races", v6.

    It turns out that the Linux TLB batching mechanism suffers from various
    races. Races caused by batching during reclamation were recently
    handled by Mel, and this patch-set deals with the others. The more
    fundamental issue is that concurrent updates of the page-tables allow
    for TLB flushes to be batched on one core, while another core changes
    the page-tables. This other core may assume a PTE change does not
    require a flush based on the updated PTE value, while it is unaware that
    TLB flushes are still pending.

    This behavior affects KSM (which may result in memory corruption) and
    MADV_FREE and MADV_DONTNEED (which may result in incorrect behavior). A
    proof-of-concept can easily produce the wrong behavior of MADV_DONTNEED.
    Memory corruption in KSM is harder to produce in practice, but was
    observed by hacking the kernel and adding a delay before flushing and
    replacing the KSM page.

    Finally, there is also one memory barrier missing, which may affect
    architectures with a weak memory model.

    This patch (of 7):

    Setting and clearing mm->tlb_flush_pending can be performed by multiple
    threads, since mmap_sem may only be acquired for read in
    task_numa_work(). If this happens, tlb_flush_pending might be cleared
    while one of the threads still changes PTEs and batches TLB flushes.

    This can lead to the same race between migration and
    change_protection_range() that led to the introduction of
    tlb_flush_pending. The result of this race was data corruption, which
    means that this patch also addresses a theoretically possible data
    corruption.

    An actual data corruption was not observed, yet the race was confirmed
    by adding an assertion to check that tlb_flush_pending is not set by
    two threads, adding artificial latency in change_protection_range() and
    using sysctl to reduce kernel.numa_balancing_scan_delay_ms.
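
    One way to see why a plain flag is fragile here is to compare it with
    a counter. The sketch below uses C11 atomics in userspace and assumes
    a counter-based scheme purely for illustration; struct model_mm and
    the helper names are hypothetical and this is not a copy of the patch.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-mm state. */
struct model_mm {
    atomic_int tlb_flush_pending;   /* number of threads still batching */
};

/* Called before a thread starts changing PTEs and batching flushes. */
static void inc_tlb_flush_pending(struct model_mm *mm)
{
    atomic_fetch_add(&mm->tlb_flush_pending, 1);
}

/* Called after that thread has issued its TLB flush. */
static void dec_tlb_flush_pending(struct model_mm *mm)
{
    atomic_fetch_sub(&mm->tlb_flush_pending, 1);
}

/* Pending stays true until the last batching thread has finished. */
static bool mm_tlb_flush_pending(struct model_mm *mm)
{
    return atomic_load(&mm->tlb_flush_pending) > 0;
}
```

    With a boolean, the first thread to finish would clear the flag while
    the second is still batching; with a counter, the state only drops to
    "not pending" once both have finished.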

    Link: http://lkml.kernel.org/r/20170802000818.4760-2-namit@vmware.com
    Fixes: 20841405940e ("mm: fix TLB flush race between migration, and
    change_protection_range")
    Signed-off-by: Nadav Amit
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: Minchan Kim
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: "David S. Miller"
    Cc: Andrea Arcangeli
    Cc: Heiko Carstens
    Cc: Ingo Molnar
    Cc: Jeff Dike
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Russell King
    Cc: Sergey Senozhatsky
    Cc: Tony Luck
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nadav Amit
     

03 Aug, 2017

1 commit

  • Nadav Amit identified a theoretical race between page reclaim and
    mprotect due to TLB flushes being batched outside of the PTL being held.

    He described the race as follows:

    CPU0                            CPU1
    ----                            ----
                                    user accesses memory using RW PTE
                                    [PTE now cached in TLB]
    try_to_unmap_one()
    ==> ptep_get_and_clear()
    ==> set_tlb_ubc_flush_pending()
                                    mprotect(addr, PROT_READ)
                                    ==> change_pte_range()
                                    ==> [ PTE non-present - no flush ]

                                    user writes using cached RW PTE
                                    ...

    try_to_unmap_flush()

    The same type of race exists for reads when protecting for PROT_NONE and
    also exists for operations that can leave an old TLB entry behind such
    as munmap, mremap and madvise.

    For some operations like mprotect, it's not necessarily a data integrity
    issue but it is a correctness issue as there is a window where an
    mprotect that limits access still allows access. For munmap, it's
    potentially a data integrity issue although the race is massive as an
    munmap, mmap and return to userspace must all complete between the
    window when reclaim drops the PTL and flushes the TLB. However, it's
    theoretically possible so handle this issue by flushing the mm if
    reclaim is potentially currently batching TLB flushes.

    Other instances where a flush is required for a present pte should be ok
    as either the page lock is held preventing parallel reclaim or a page
    reference count is elevated preventing a parallel free leading to
    corruption. In the case of page_mkclean there isn't an obvious path
    that userspace could take advantage of without using the operations that
    are guarded by this patch. Other users such as gup, when racing with
    reclaim, look just at PTEs. Huge page variants should be ok as they
    don't race with reclaim. mincore only looks at PTEs. userfault also
    should be ok as if a parallel reclaim takes place, it will either fault
    the page back in or read some of the data before the flush occurs
    triggering a fault.

    Note that a variant of this patch was acked by Andy Lutomirski but this
    was for the x86 parts on top of his PCID work which didn't make the 4.13
    merge window as expected. His ack is dropped from this version and
    there will be a follow-on patch on top of PCID that will include his
    ack.

    [akpm@linux-foundation.org: tweak comments]
    [akpm@linux-foundation.org: fix spello]
    Link: http://lkml.kernel.org/r/20170717155523.emckq2esjro6hf3z@suse.de
    Reported-by: Nadav Amit
    Signed-off-by: Mel Gorman
    Cc: Andy Lutomirski
    Cc: [v4.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

07 Jul, 2017

1 commit

  • pte_offset_map_lock() finds and takes ptl, and returns pte. But some
    callers return without unlocking the ptl when pte == NULL, which seems
    weird.

    Git history shows that the !pte check in change_pte_range() was introduced in
    commit 1ad9f620c3a2 ("mm: numa: recheck for transhuge pages under lock
    during protection changes") and still remains after commit 175ad4f1e7a2
    ("mm: mprotect: use pmd_trans_unstable instead of taking the pmd_lock")
    which partially reverts 1ad9f620c3a2. So I think that it's just dead
    code.

    Many other callers of pte_offset_map_lock() never check for a NULL return, so
    let's do likewise.

    Link: http://lkml.kernel.org/r/1495089737-1292-1-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

10 Mar, 2017

1 commit


25 Feb, 2017

1 commit

  • Patch series "Numabalancing preserve write fix", v2.

    This patch series addresses an issue w.r.t. THP migration and the
    autonuma preserve write feature. migrate_misplaced_transhuge_page()
    cannot deal with concurrent modification of the page. It does a page
    copy without following the migration pte sequence. IIUC, this was done
    to keep the migration simpler, and at the time of implementation we
    didn't have THP page cache, which would have required a more elaborate
    migration scheme. That means THP autonuma migration expects the protnone
    with saved write to be done such that neither the kernel nor the user
    can update the page content. This patch series enables archs like ppc64
    to do that. We are good with the hash translation mode with the current
    code, because we never create a hardware page table entry for a
    protnone pte.

    This patch (of 2):

    Autonuma preserves the write permission across a numa fault to avoid
    taking a write fault after a numa fault (Commit: b191f9b106ea "mm: numa:
    preserve PTE write permissions across a NUMA hinting fault").
    An architecture can implement protnone in different ways, and some may
    choose to implement it by clearing the Read/Write/Exec bits of the pte.
    Setting the write bit on such a pte can result in wrong behaviour. Fix
    this up by allowing the arch to override how the write bit is saved on
    a protnone pte.
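
    An override of this kind is commonly expressed with preprocessor
    fallbacks. The sketch below models that pattern in userspace; the
    pte_t layout, the PTE_WRITE bit and the helper names
    pte_mk_savedwrite()/pte_savedwrite() are assumptions for illustration
    and are not taken from the text above.

```c
#include <stdbool.h>

typedef struct { unsigned long val; } pte_t;

#define PTE_WRITE 0x2UL  /* illustrative bit position */

static pte_t pte_mkwrite(pte_t pte)
{
    pte.val |= PTE_WRITE;
    return pte;
}

static bool pte_write(pte_t pte)
{
    return (pte.val & PTE_WRITE) != 0;
}

/*
 * Generic fallbacks: used only when the architecture has not defined
 * its own way of recording "was writable" on a protnone pte.
 */
#ifndef pte_mk_savedwrite
#define pte_mk_savedwrite pte_mkwrite
#endif

#ifndef pte_savedwrite
#define pte_savedwrite pte_write
#endif
```

    An architecture that clears the write bit to implement protnone would
    define both names before this point, for example to use a spare
    software bit, and the generic code would pick those up instead.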

    [aneesh.kumar@linux.vnet.ibm.com: don't mark pte saved write in case of dirty_accountable]
    Link: http://lkml.kernel.org/r/1487942884-16517-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    [aneesh.kumar@linux.vnet.ibm.com: v3]
    Link: http://lkml.kernel.org/r/1487498625-10891-2-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/1487050314-3892-2-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Acked-by: Michael Neuling
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     

23 Feb, 2017

1 commit

  • pmd_trans_unstable does an atomic read on the pmd so it doesn't require
    the pmd_lock for the same check.

    This also removes the special assumption that the mmap_sem is held for
    writing if prot_numa is not set. userfaultfd will hold the mmap_sem
    only for reading in change_pte_range like prot_numa, but it will not set
    prot_numa.

    This is always a valid micro-optimization regardless of userfaultfd.

    [kirill@shutemov.name: drop unneeded pmd_trans_unstable(pmd) check after __split_huge_pmd()]
    Link: http://lkml.kernel.org/r/20170208120421.GE5578@node.shutemov.name
    Link: http://lkml.kernel.org/r/20161216144821.5183-43-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Hillf Danton
    Cc: Michael Rapoport
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Pavel Emelyanov
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

25 Dec, 2016

1 commit


13 Dec, 2016

3 commits

  • Having code for the pkey_mprotect, pkey_alloc and pkey_free system calls
    only makes sense if ARCH_HAS_PKEYS is selected. If not selected these
    system calls will always return -ENOSPC or -EINVAL.

    To simplify things and generate less code, build the pkey system call
    code only if ARCH_HAS_PKEYS is selected.

    For architectures which have already wired up the system calls, but do
    not select ARCH_HAS_PKEYS this will result in less generated code and a
    different return code: the three system calls will now always return
    -ENOSYS, using the cond_syscall mechanism.

    For architectures which have not wired up the system calls less
    unreachable code will be generated.

    Link: http://lkml.kernel.org/r/20161114111251.70084-1-heiko.carstens@de.ibm.com
    Signed-off-by: Heiko Carstens
    Acked-by: Dave Hansen
    Cc: Mark Rutland
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     
  • While doing MADV_DONTNEED on a large area of thp memory, I noticed we
    encountered many unlikely() branches in profiles for each backing
    hugepage. This is because zap_pmd_range() would call split_huge_pmd(),
    which rechecked the conditions that were already validated, but as part
    of an unlikely() branch.

    Avoid the unlikely() branch when in a context where pmd is known to be
    good for __split_huge_pmd() directly.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1610181600300.84525@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • We had some problems with pages getting unmapped in single threaded
    affinitized processes. It was tracked down to NUMA scanning.

    In this case it doesn't make any sense to unmap pages if the process is
    single threaded and the page is already on the node the process is
    running on.

    Add a check for this case into the numa protection code, and skip
    unmapping if true.
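
    The check described above amounts to a two-part condition. The
    function below is a hedged sketch with hypothetical parameter names
    (mm_users for the number of users of the address space, task_node and
    page_node for NUMA node ids); it is a model, not the kernel code.

```c
#include <stdbool.h>

/*
 * Skip unmapping when the address space has a single user and the
 * page already sits on the node that user is running on.
 */
static bool skip_numa_unmap(int mm_users, int task_node, int page_node)
{
    return mm_users == 1 && page_node == task_node;
}
```

    A multi-threaded process never takes the shortcut in this model,
    because another thread may be running on a different node.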

    In theory the process could be migrated later, but we will eventually
    rescan and unmap and migrate then.

    In theory this could be made more fancy: remembering this state per
    process or even whole mm. However that would need extra tracking and be
    more complicated, and the simple check seems to work fine so far.

    [ak@linux.intel.com: v3: Minor updates from Mel. Change code layout]
    Link: http://lkml.kernel.org/r/1476382117-5440-1-git-send-email-andi@firstfloor.org
    Link: http://lkml.kernel.org/r/1476288949-20970-1-git-send-email-andi@firstfloor.org
    Signed-off-by: Andi Kleen
    Acked-by: Mel Gorman
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

19 Oct, 2016

1 commit