15 May, 2019

3 commits

  • Patch series "mm: Use vm_map_pages() and vm_map_pages_zero() API", v5.

    This patch (of 5):

    Previously, drivers had their own way of mapping a range of kernel
    pages/memory into a user vma, and this was done by invoking vm_insert_page()
    within a loop.

    As this pattern is common across different drivers, it can be generalized
    by creating new functions and using them across the drivers.

    vm_map_pages() is the API which can be used to map kernel memory/pages in
    drivers which have considered vm_pgoff.

    vm_map_pages_zero() is the API which can be used to map a range of kernel
    memory/pages in drivers which have not considered vm_pgoff; vm_pgoff is
    passed as 0 by default for those drivers.

    We _could_ then, at a later point, "fix" these drivers which are using
    vm_map_pages_zero() to behave according to the normal vm_pgoff offsetting
    simply by removing the _zero suffix on the function name and, if that
    causes regressions, it gives us an easy way to revert.
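
    For illustration only (not part of the patch; the buffer names are made
    up), a driver ->mmap() handler that used the open-coded loop can be
    switched over roughly like this:

    /* Hypothetical driver; my_pages/my_npages are placeholder names. */
    #include <linux/mm.h>

    static int my_drv_mmap_old(struct file *file, struct vm_area_struct *vma)
    {
            unsigned long uaddr = vma->vm_start;
            unsigned long i;
            int ret;

            for (i = 0; i < my_npages; i++) {
                    ret = vm_insert_page(vma, uaddr, my_pages[i]);
                    if (ret)
                            return ret;
                    uaddr += PAGE_SIZE;
            }
            return 0;
    }

    static int my_drv_mmap_new(struct file *file, struct vm_area_struct *vma)
    {
            /* Maps my_npages pages, honouring vma->vm_pgoff. */
            return vm_map_pages(vma, my_pages, my_npages);
    }

    vm_map_pages_zero() takes the same arguments but ignores vm_pgoff.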

    Tested on Rockchip hardware and display is working, including talking to
    Lima via prime.

    Link: http://lkml.kernel.org/r/751cb8a0f4c3e67e95c58a3b072937617f338eea.1552921225.git.jrdr.linux@gmail.com
    Signed-off-by: Souptick Joarder
    Suggested-by: Russell King
    Suggested-by: Matthew Wilcox
    Reviewed-by: Mike Rapoport
    Tested-by: Heiko Stuebner
    Cc: Michal Hocko
    Cc: "Kirill A. Shutemov"
    Cc: Vlastimil Babka
    Cc: Rik van Riel
    Cc: Stephen Rothwell
    Cc: Peter Zijlstra
    Cc: Robin Murphy
    Cc: Joonsoo Kim
    Cc: Thierry Reding
    Cc: Kees Cook
    Cc: Marek Szyprowski
    Cc: Stefan Richter
    Cc: Sandy Huang
    Cc: David Airlie
    Cc: Oleksandr Andrushchenko
    Cc: Joerg Roedel
    Cc: Pawel Osciak
    Cc: Kyungmin Park
    Cc: Boris Ostrovsky
    Cc: Juergen Gross
    Cc: Mauro Carvalho Chehab
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Souptick Joarder
     
    This updates each existing invalidation to use the correct mmu notifier
    event that represents what is happening to the CPU page table. See the
    patch which introduced the events for the rationale behind this.
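
    A rough sketch of a converted call site (the argument order shown is an
    assumption based on the post-series API, with the event passed right
    after the range pointer):

    struct mmu_notifier_range range;

    /* e.g. a CPU page-table clear for vma in [start, end) */
    mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm, start, end);
    mmu_notifier_invalidate_range_start(&range);
    /* ... update the page table ... */
    mmu_notifier_invalidate_range_end(&range);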

    Link: http://lkml.kernel.org/r/20190326164747.24405-7-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Ralph Campbell
    Reviewed-by: Ira Weiny
    Cc: Christian König
    Cc: Joonas Lahtinen
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Peter Xu
    Cc: Felix Kuehling
    Cc: Jason Gunthorpe
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: John Hubbard
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
    CPU page table updates can happen for many reasons, not only as a result
    of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
    a result of kernel activities (memory compression, reclaim, migration,
    ...).

    Users of the mmu notifier API track changes to the CPU page table and take
    specific action for them. However, the current API only provides the range
    of virtual addresses affected by the change, not why the change is happening.

    This patchset does the initial mechanical conversion of all the places that
    call mmu_notifier_range_init to also provide the default MMU_NOTIFY_UNMAP
    event as well as the vma if it is known (most invalidation happens against
    a given vma). Passing down the vma allows the users of mmu notifier to
    inspect the new vma page protection.

    MMU_NOTIFY_UNMAP is always the safe default, as users of mmu notifier
    should assume that everything in the range is going away when that event
    happens. A later patch converts the mm call paths to use more appropriate
    events for each call.

    This is done as 2 patches so that no call site is forgotten, especially
    as it uses the following coccinelle patch:

    %vm_mm, E3, E4)
    ...>

    @@
    expression E1, E2, E3, E4;
    identifier FN, VMA;
    @@
    FN(..., struct vm_area_struct *VMA, ...) {
    }

    @@
    expression E1, E2, E3, E4;
    identifier FN, VMA;
    @@
    FN(...) {
    struct vm_area_struct *VMA;
    }

    @@
    expression E1, E2, E3, E4;
    identifier FN;
    @@
    FN(...) {
    }
    ---------------------------------------------------------------------->%

    Applied with:
    spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
    spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
    spatch --sp-file mmu-notifier.spatch --dir mm --in-place

    Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Ralph Campbell
    Reviewed-by: Ira Weiny
    Cc: Christian König
    Cc: Joonas Lahtinen
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Peter Xu
    Cc: Felix Kuehling
    Cc: Jason Gunthorpe
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: John Hubbard
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

08 May, 2019

1 commit

  • Pull printk updates from Petr Mladek:

    - Allow state reset of printk_once() calls.

    - Prevent crashes when dereferencing invalid pointers in vsprintf().
    Only the first byte is checked for simplicity.

    - Make vsprintf warnings consistent and inlined.

    - Treewide conversion of obsolete %pf, %pF to %ps, %pS printf
    modifiers.

    - Some clean up of vsprintf and test_printf code.

    * tag 'printk-for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
    lib/vsprintf: Make function pointer_string static
    vsprintf: Limit the length of inlined error messages
    vsprintf: Avoid confusion between invalid address and value
    vsprintf: Prevent crash when dereferencing invalid pointers
    vsprintf: Consolidate handling of unknown pointer specifiers
    vsprintf: Factor out %pO handler as kobject_string()
    vsprintf: Factor out %pV handler as va_format()
    vsprintf: Factor out %p[iI] handler as ip_addr_string()
    vsprintf: Do not check address of well-known strings
    vsprintf: Consistent %pK handling for kptr_restrict == 0
    vsprintf: Shuffle restricted_pointer()
    printk: Tie printk_once / printk_deferred_once into .data.once for reset
    treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively
    lib/test_printf: Switch to bitmap_zalloc()

    Linus Torvalds
     

09 Apr, 2019

1 commit

  • %pF and %pf are functionally equivalent to %pS and %ps conversion
    specifiers. The former are deprecated, therefore switch the current users
    to use the preferred variant.

    The changes have been produced by the following command:

    git grep -l '%p[fF]' | grep -v '^\(tools\|Documentation\)/' | \
    while read i; do perl -i -pe 's/%pf/%ps/g; s/%pF/%pS/g;' $i; done

    And verifying the result.
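
    For reference, %ps prints a bare symbol name and %pS prints the symbol
    plus offset; a minimal usage sketch ('fn' stands for any function pointer
    the caller wants to log):

    pr_info("handler: %ps\n", fn);                   /* -> "my_handler" */
    pr_info("called from: %pS\n", (void *)_RET_IP_); /* -> "func+0x34/0x60" */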

    Link: http://lkml.kernel.org/r/20190325193229.23390-1-sakari.ailus@linux.intel.com
    Cc: Andy Shevchenko
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: sparclinux@vger.kernel.org
    Cc: linux-um@lists.infradead.org
    Cc: xen-devel@lists.xenproject.org
    Cc: linux-acpi@vger.kernel.org
    Cc: linux-pm@vger.kernel.org
    Cc: drbd-dev@lists.linbit.com
    Cc: linux-block@vger.kernel.org
    Cc: linux-mmc@vger.kernel.org
    Cc: linux-nvdimm@lists.01.org
    Cc: linux-pci@vger.kernel.org
    Cc: linux-scsi@vger.kernel.org
    Cc: linux-btrfs@vger.kernel.org
    Cc: linux-f2fs-devel@lists.sourceforge.net
    Cc: linux-mm@kvack.org
    Cc: ceph-devel@vger.kernel.org
    Cc: netdev@vger.kernel.org
    Signed-off-by: Sakari Ailus
    Acked-by: David Sterba (for btrfs)
    Acked-by: Mike Rapoport (for mm/memblock.c)
    Acked-by: Bjorn Helgaas (for drivers/pci)
    Acked-by: Rafael J. Wysocki
    Signed-off-by: Petr Mladek

    Sakari Ailus
     

03 Apr, 2019

2 commits

    As the comment notes, it is a potentially dangerous operation. Just
    use tlb_flush_mmu(); that will skip the (double) TLB invalidate if
    it really isn't needed anyway.

    No change in behavior intended.

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Will Deacon
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Move the mmu_gather::page_size things into the generic code instead of
    PowerPC specific bits.

    No change in behavior intended.

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Will Deacon
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Aneesh Kumar K.V
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

30 Mar, 2019

1 commit

    Aneesh has reported that PPC triggers the following warning when
    exercising DAX code:

    IP set_pte_at+0x3c/0x190
    LR insert_pfn+0x208/0x280
    Call Trace:
    insert_pfn+0x68/0x280
    dax_iomap_pte_fault.isra.7+0x734/0xa40
    __xfs_filemap_fault+0x280/0x2d0
    do_wp_page+0x48c/0xa40
    __handle_mm_fault+0x8d0/0x1fd0
    handle_mm_fault+0x140/0x250
    __do_page_fault+0x300/0xd60
    handle_page_fault+0x18

    That is the WARN_ON in set_pte_at(), which is

    VM_WARN_ON(pte_hw_valid(*ptep) && !pte_protnone(*ptep));

    The problem is that on some architectures set_pte_at() cannot cope with
    a situation where there is already some (different) valid entry present.

    Use ptep_set_access_flags() instead to modify the pfn; it is built to
    deal with modifying an existing PTE.
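
    A sketch of the resulting mkwrite path in insert_pfn() (shape only, not
    the exact upstream diff):

    if (!pte_none(*pte)) {
            if (mkwrite) {
                    entry = pte_mkyoung(*pte);
                    entry = maybe_mkwrite(pte_mkdirty(entry), vma);
                    /* ptep_set_access_flags() is meant for updating a PTE
                     * that is already valid, unlike set_pte_at(). */
                    if (ptep_set_access_flags(vma, addr, pte, entry, 1))
                            update_mmu_cache(vma, addr, pte);
            }
            goto out_unlock;
    }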

    Link: http://lkml.kernel.org/r/20190311084537.16029-1-jack@suse.cz
    Fixes: b2770da64254 ("mm: add vm_insert_mixed_mkwrite()")
    Signed-off-by: Jan Kara
    Reported-by: "Aneesh Kumar K.V"
    Reviewed-by: Aneesh Kumar K.V
    Acked-by: Dan Williams
    Cc: Chandan Rajendra
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

06 Mar, 2019

9 commits

  • LTP testcase mtest06 [1] can trigger a crash on s390x running 5.0.0-rc8.
    This is a stress test, where one thread mmaps/writes/munmaps memory area
    and other thread is trying to read from it:

    CPU: 0 PID: 2611 Comm: mmap1 Not tainted 5.0.0-rc8+ #51
    Hardware name: IBM 2964 N63 400 (z/VM 6.4.0)
    Krnl PSW : 0404e00180000000 00000000001ac8d8 (__lock_acquire+0x7/0x7a8)
    Call Trace:
    ([] (null))
    [] lock_acquire+0xec/0x258
    [] _raw_spin_lock_bh+0x5c/0x98
    [] page_table_free+0x48/0x1a8
    [] do_fault+0xdc/0x670
    [] __handle_mm_fault+0x416/0x5f0
    [] handle_mm_fault+0x1b0/0x320
    [] do_dat_exception+0x19c/0x2c8
    [] pgm_check_handler+0x19e/0x200

    page_table_free() is called with NULL mm parameter, but because "0" is a
    valid address on s390 (see S390_lowcore), it keeps going until it
    eventually crashes in lockdep's lock_acquire. This crash is
    reproducible at least since 4.14.

    Problem is that "vmf->vma" used in do_fault() can become stale. Because
    mmap_sem may be released, other threads can come in, call munmap() and
    cause "vma" be returned to kmem cache, and get zeroed/re-initialized and
    re-used:

    handle_mm_fault |
    __handle_mm_fault |
    do_fault |
    vma = vmf->vma |
    do_read_fault |
    __do_fault |
    vma->vm_ops->fault(vmf); |
    mmap_sem is released |
    |
    | do_munmap()
    | remove_vma_list()
    | remove_vma()
    | vm_area_free()
    | # vma is released
    | ...
    | # same vma is allocated
    | # from kmem cache
    | do_mmap()
    | vm_area_alloc()
    | memset(vma, 0, ...)
    |
    pte_free(vma->vm_mm, ...); |
    page_table_free |
    spin_lock_bh(&mm->context.lock);|
    |

    Cache mm_struct to avoid using potentially stale "vma".
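
    The shape of the fix is simply to snapshot the mm before ->fault() can
    drop mmap_sem; a rough sketch, not the exact hunk:

    static vm_fault_t do_fault(struct vm_fault *vmf)
    {
            struct vm_area_struct *vma = vmf->vma;
            /* Cached up front: vma may be freed and reused once mmap_sem is
             * dropped inside ->fault(). */
            struct mm_struct *vm_mm = vma->vm_mm;
            vm_fault_t ret;

            /* ... read/cow/shared fault handling ... */

            if (vmf->prealloc_pte) {
                    pte_free(vm_mm, vmf->prealloc_pte);   /* not vma->vm_mm */
                    vmf->prealloc_pte = NULL;
            }
            return ret;
    }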

    [1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c

    Link: http://lkml.kernel.org/r/5b3fdf19e2a5be460a384b936f5b56e13733f1b8.1551595137.git.jstancek@redhat.com
    Signed-off-by: Jan Stancek
    Reviewed-by: Andrea Arcangeli
    Reviewed-by: Matthew Wilcox
    Acked-by: Rafael Aquini
    Reviewed-by: Minchan Kim
    Acked-by: Kirill A. Shutemov
    Cc: Rik van Riel
    Cc: Michal Hocko
    Cc: Huang Ying
    Cc: Souptick Joarder
    Cc: Jerome Glisse
    Cc: Aneesh Kumar K.V
    Cc: David Hildenbrand
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Stancek
     
    Many kernel-doc comments in mm/ have the return value descriptions
    either misformatted or omitted altogether, which makes the kernel-doc
    script unhappy:

    $ make V=1 htmldocs
    ...
    ./mm/util.c:36: info: Scanning doc for kstrdup
    ./mm/util.c:41: warning: No description found for return value of 'kstrdup'
    ./mm/util.c:57: info: Scanning doc for kstrdup_const
    ./mm/util.c:66: warning: No description found for return value of 'kstrdup_const'
    ./mm/util.c:75: info: Scanning doc for kstrndup
    ./mm/util.c:83: warning: No description found for return value of 'kstrndup'
    ...

    Fixing the formatting and adding the missing return value descriptions
    eliminates ~100 such warnings.
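
    A fixed comment just needs a "Return:" section that kernel-doc can
    recognise, for example for kstrdup() above:

    /**
     * kstrdup - allocate space for and copy an existing string
     * @s: the string to duplicate
     * @gfp: the GFP mask used in the kmalloc() call when allocating memory
     *
     * Return: newly allocated copy of @s or %NULL in case of error
     */
    char *kstrdup(const char *s, gfp_t gfp);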

    Link: http://lkml.kernel.org/r/1549549644-4903-4-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
    Architectures like ppc64 need to do a conditional TLB flush based on
    the old and new values of the pte. Enable that by passing the old pte
    value as an argument.

    Link: http://lkml.kernel.org/r/20190116085035.29729-3-aneesh.kumar@linux.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: Nicholas Piggin
    Cc: Paul Mackerras
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • Patch series "NestMMU pte upgrade workaround for mprotect", v5.

    We can upgrade pte access (R -> RW transition) via mprotect. We need to
    make sure we follow the recommended pte update sequence as outlined in
    commit bd5050e38aec ("powerpc/mm/radix: Change pte relax sequence to
    handle nest MMU hang") for such updates. This patch series does that.

    This patch (of 5):

    Some architectures may want to call flush_tlb_range from these helpers.

    Link: http://lkml.kernel.org/r/20190116085035.29729-2-aneesh.kumar@linux.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Cc: Nicholas Piggin
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • When calling debugfs functions, there is no need to ever check the
    return value. The function can work or not, but the code logic should
    never do something different based on this.
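
    In practice that means dropping the error handling entirely; a sketch
    (directory and file names made up):

    struct dentry *dir;

    /* Before: pointless checking */
    dir = debugfs_create_dir("mydrv", NULL);
    if (!dir)
            return -ENOMEM;
    debugfs_create_u32("count", 0444, dir, &count);

    /* After: just create it and carry on */
    dir = debugfs_create_dir("mydrv", NULL);
    debugfs_create_u32("count", 0444, dir, &count);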

    Link: http://lkml.kernel.org/r/20190122152151.16139-14-gregkh@linuxfoundation.org
    Signed-off-by: Greg Kroah-Hartman
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Laura Abbott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Kroah-Hartman
     
  • Pages which use page_type must never be mapped to userspace as it would
    destroy their page type. Add an explicit check for this instead of
    assuming that kernel drivers always get this right.
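
    Conceptually the added check is of this shape (a sketch, not the literal
    hunk), in the path that backs vm_insert_page():

    /* Pages whose mapcount field is reused as a page type must never
     * reach userspace. */
    if (page_has_type(page))
            return -EINVAL;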

    Link: http://lkml.kernel.org/r/20190129053830.3749-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Reviewed-by: Kees Cook
    Reviewed-by: David Hildenbrand
    Cc: Michael Ellerman
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • It's never appropriate to map a page allocated by SLAB into userspace.
    A buggy device driver might try this, or an attacker might be able to
    find a way to make it happen.

    Christoph said:

    : Let's just fail the code. Currently this may work with SLUB. But SLAB
    : and SLOB overlay fields with mapcount. So you would have a corrupted page
    : struct if you mapped a slab page to user space.

    Link: http://lkml.kernel.org/r/20190125173827.2658-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Reviewed-by: Kees Cook
    Acked-by: Pekka Enberg
    Cc: Rik van Riel
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
    Add an optimization for KSM pages almost in the same way that we have
    for ordinary anonymous pages. If there is a write fault in a page which
    is mapped by only one pte and is not related to swap cache, the page may
    be reused without copying its content.

    [ Note that we do not consider PageSwapCache() pages at least for now,
    since we don't want to complicate __get_ksm_page(), which has a nice
    optimization based on this (for the migration case). Currently it is
    spinning on PageSwapCache() pages, waiting for when they have
    unfrozen counters (i.e., for the migration to finish). But we don't want
    to make it also spin on swap cache pages, which we try to reuse,
    since there is not a very high probability to reuse them. So, for now
    we do not consider PageSwapCache() pages at all. ]

    So in reuse_ksm_page() we check for 1) PageSwapCache() and 2)
    page_stable_node(), to skip a page which KSM is currently trying to
    link into the stable tree. Then we do page_ref_freeze() to prohibit KSM
    from merging one more page into the page we are reusing. After that,
    nobody can refer to the page being reused: KSM skips !PageSwapCache()
    pages with zero refcount; and the protection against all other
    participants is the same as for reused ordinary anon pages: pte lock,
    page lock and mmap_sem.
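
    Roughly, the reuse path therefore looks like this (a sketch following the
    description above, not the verbatim function):

    bool reuse_ksm_page(struct page *page, struct vm_area_struct *vma,
                        unsigned long address)
    {
            /* Skip pages KSM may still be working on. */
            if (PageSwapCache(page) || !page_stable_node(page))
                    return false;
            /* Freeze the refcount so KSM cannot merge more pages into it. */
            if (!page_ref_freeze(page, 1))
                    return false;

            page_move_anon_rmap(page, vma);
            page->index = linear_page_index(vma, address);
            page_ref_unfreeze(page, 1);
            return true;
    }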

    [akpm@linux-foundation.org: replace BUG_ON()s with WARN_ON()s]
    Link: http://lkml.kernel.org/r/154471491016.31352.1168978849911555609.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Reviewed-by: Yang Shi
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: Christian Koenig
    Cc: Claudio Imbrenda
    Cc: Rik van Riel
    Cc: Huang Ying
    Cc: Minchan Kim
    Cc: Kirill Tkhai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • Patch series "Replace all open encodings for NUMA_NO_NODE", v3.

    All these places for replacement were found by running the following
    grep patterns on the entire kernel code. Please let me know if this
    might have missed some instances. This might also have replaced some
    false positives. I will appreciate suggestions, inputs and review.

    1. git grep "nid == -1"
    2. git grep "node == -1"
    3. git grep "nid = -1"
    4. git grep "node = -1"

    This patch (of 2):

    At present there are multiple places where an invalid node number is
    encoded as -1. Even though implicitly understood, it is always better to
    have macros there. Replace these open encodings for an invalid node
    number with the global macro NUMA_NO_NODE. This helps remove NUMA
    related assumptions like 'invalid node' from various places, redirecting
    them to a common definition.
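
    The replacement itself is mechanical, e.g.:

    #include <linux/numa.h>         /* NUMA_NO_NODE (== -1) */

    /* Before */
    int nid = -1;

    /* After */
    int nid = NUMA_NO_NODE;
    if (nid == NUMA_NO_NODE)
            nid = numa_mem_id();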

    Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Anshuman Khandual
    Reviewed-by: David Hildenbrand
    Acked-by: Jeff Kirsher [ixgbe]
    Acked-by: Jens Axboe [mtip32xx]
    Acked-by: Vinod Koul [dmaengine.c]
    Acked-by: Michael Ellerman [powerpc]
    Acked-by: Doug Ledford [drivers/infiniband]
    Cc: Joseph Qi
    Cc: Hans Verkuil
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     

09 Jan, 2019

2 commits

  • One of the paths in follow_pte_pmd() initialised the mmu_notifier_range
    incorrectly.

    Link: http://lkml.kernel.org/r/20190103002126.GM6310@bombadil.infradead.org
    Fixes: ac46d4f3c432 ("mm/mmu_notifier: use structure for invalidate_range_start/end calls v2")
    Signed-off-by: Matthew Wilcox
    Tested-by: Dave Chinner
    Reviewed-by: Jérôme Glisse
    Cc: John Hubbard
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
    Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
    ext4 writeback:

    task1:
    wait_on_page_bit+0x82/0xa0
    shrink_page_list+0x907/0x960
    shrink_inactive_list+0x2c7/0x680
    shrink_node_memcg+0x404/0x830
    shrink_node+0xd8/0x300
    do_try_to_free_pages+0x10d/0x330
    try_to_free_mem_cgroup_pages+0xd5/0x1b0
    try_charge+0x14d/0x720
    memcg_kmem_charge_memcg+0x3c/0xa0
    memcg_kmem_charge+0x7e/0xd0
    __alloc_pages_nodemask+0x178/0x260
    alloc_pages_current+0x95/0x140
    pte_alloc_one+0x17/0x40
    __pte_alloc+0x1e/0x110
    alloc_set_pte+0x5fe/0xc20
    do_fault+0x103/0x970
    handle_mm_fault+0x61e/0xd10
    __do_page_fault+0x252/0x4d0
    do_page_fault+0x30/0x80
    page_fault+0x28/0x30

    task2:
    __lock_page+0x86/0xa0
    mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
    ext4_writepages+0x479/0xd60
    do_writepages+0x1e/0x30
    __writeback_single_inode+0x45/0x320
    writeback_sb_inodes+0x272/0x600
    __writeback_inodes_wb+0x92/0xc0
    wb_writeback+0x268/0x300
    wb_workfn+0xb4/0x390
    process_one_work+0x189/0x420
    worker_thread+0x4e/0x4b0
    kthread+0xe6/0x100
    ret_from_fork+0x41/0x50

    He adds:
    "task1 is waiting for the PageWriteback bit of the page that task2 has
    collected in mpd->io_submit->io_bio, and task2 is waiting for the
    LOCKED bit of the page which task1 has locked"

    More precisely task1 is handling a page fault and it has a page locked
    while it charges a new page table to a memcg. That in turn hits a
    memory limit reclaim and the memcg reclaim for legacy controller is
    waiting on the writeback but that is never going to finish because the
    writeback itself is waiting for the page locked in the #PF path. So
    this is essentially an ABBA deadlock:

    lock_page(A)
    SetPageWriteback(A)
    unlock_page(A)
    lock_page(B)
    lock_page(B)
    pte_alloc_pne
    shrink_page_list
    wait_on_page_writeback(A)
    SetPageWriteback(B)
    unlock_page(B)

    # flush A, B to clear the writeback

    This accumulation of more pages to flush is used by several filesystems
    to generate more optimal IO patterns.

    Waiting for the writeback in legacy memcg controller is a workaround for
    premature OOM killer invocations because there is no dirty IO
    throttling available for the controller. There is no easy way around
    that unfortunately. Therefore fix this specific issue by pre-allocating
    the page table outside of the page lock. We have that handy
    infrastructure for that already so simply reuse the fault-around pattern
    which already does this.

    There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
    from under a locked fs page, but they should be really rare. I am not
    aware of a better solution unfortunately.
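
    The fix in __do_fault() therefore has roughly this shape (a sketch; the
    exact pte_alloc_one() arguments changed around this time, so they are an
    assumption here):

    /* Pre-allocate the page table before the fs fault handler runs with
     * the page lock held, so the __GFP_ACCOUNT allocation cannot recurse
     * into memcg reclaim while that lock is held. */
    if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
            vmf->prealloc_pte = pte_alloc_one(vma->vm_mm);
            if (!vmf->prealloc_pte)
                    return VM_FAULT_OOM;
    }

    ret = vma->vm_ops->fault(vmf);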

    [akpm@linux-foundation.org: fix mm/memory.c:__do_fault()]
    [akpm@linux-foundation.org: coding-style fixes]
    [mhocko@kernel.org: enhance comment, per Johannes]
    Link: http://lkml.kernel.org/r/20181214084948.GA5624@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20181213092221.27270-1-mhocko@kernel.org
    Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
    Signed-off-by: Michal Hocko
    Reported-by: Liu Bo
    Debugged-by: Liu Bo
    Acked-by: Kirill A. Shutemov
    Acked-by: Johannes Weiner
    Reviewed-by: Liu Bo
    Cc: Jan Kara
    Cc: Dave Chinner
    Cc: Theodore Ts'o
    Cc: Vladimir Davydov
    Cc: Shakeel Butt
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

05 Jan, 2019

1 commit

  • Patch series "Add support for fast mremap".

    This series speeds up the mremap(2) syscall by copying page tables at
    the PMD level even for non-THP systems. There is concern that the extra
    'address' argument that mremap passes to pte_alloc may do something
    subtle architecture related in the future that may make the scheme not
    work. We also find that there is no point in passing the 'address' to
    pte_alloc since it is unused. This patch therefore removes the argument
    tree-wide, resulting in a nice negative diff as well. Along the way we
    also ensure that the enabled architectures do not do anything funky
    with the 'address' argument that would go unnoticed by the optimization.

    Build and boot tested on x86-64. Build tested on arm64. The config
    enablement patch for arm64 will be posted in the future after more
    testing.

    The changes were obtained by applying the following Coccinelle script.
    (thanks Julia for answering all Coccinelle questions!).
    Following fix ups were done manually:
    * Removal of address argument from pte_fragment_alloc
    * Removal of pte_alloc_one_fast definitions from m68k and microblaze.

    // Options: --include-headers --no-includes
    // Note: I split the 'identifier fn' line, so if you are manually
    // running it, please unsplit it so it runs for you.

    virtual patch

    @pte_alloc_func_def depends on patch exists@
    identifier E2;
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    type T2;
    @@

    fn(...
    - , T2 E2
    )
    { ... }

    @pte_alloc_func_proto_noarg depends on patch exists@
    type T1, T2, T3, T4;
    identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    @@

    (
    - T3 fn(T1, T2);
    + T3 fn(T1);
    |
    - T3 fn(T1, T2, T4);
    + T3 fn(T1, T2);
    )

    @pte_alloc_func_proto depends on patch exists@
    identifier E1, E2, E4;
    type T1, T2, T3, T4;
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    @@

    (
    - T3 fn(T1 E1, T2 E2);
    + T3 fn(T1 E1);
    |
    - T3 fn(T1 E1, T2 E2, T4 E4);
    + T3 fn(T1 E1, T2 E2);
    )

    @pte_alloc_func_call depends on patch exists@
    expression E2;
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    @@

    fn(...
    -, E2
    )

    @pte_alloc_macro depends on patch exists@
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    identifier a, b, c;
    expression e;
    position p;
    @@

    (
    - #define fn(a, b, c) e
    + #define fn(a, b) e
    |
    - #define fn(a, b) e
    + #define fn(a) e
    )

    Link: http://lkml.kernel.org/r/20181108181201.88826-2-joelaf@google.com
    Signed-off-by: Joel Fernandes (Google)
    Suggested-by: Kirill A. Shutemov
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Julia Lawall
    Cc: Kirill A. Shutemov
    Cc: William Kucharski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joel Fernandes (Google)
     

29 Dec, 2018

2 commits

    Userspace falls short when trying to find out whether a specific memory
    range is eligible for THP. There are usecases that would like to know
    that; see
    http://lkml.kernel.org/r/alpine.DEB.2.21.1809251248450.50347@chino.kir.corp.google.com
    : This is used to identify heap mappings that should be able to fault thp
    : but do not, and they normally point to a low-on-memory or fragmentation
    : issue.

    The only way to deduce this now is to query for the hg resp. nh flags and
    confront that state with the global setting. Except that there is also
    PR_SET_THP_DISABLE, which might change the picture. So the final logic is
    not trivial. Moreover, the eligibility of the vma depends on the type of
    VMA as well. In the past we supported only anonymous memory VMAs, but
    things have changed: shmem-based vmas are supported as well these days,
    and the query logic gets even more complicated because the eligibility
    depends on the mount option and another global configuration knob.

    Simplify the current state and report the THP eligibility in
    /proc/<pid>/smaps for each existing vma. Reuse
    transparent_hugepage_enabled() for this purpose. The original
    implementation of this function assumes that the caller knows that the vma
    itself is supported for THP, so make the core checks into
    __transparent_hugepage_enabled() and use it for existing callers.
    __show_smap() just uses the new transparent_hugepage_enabled(), which also
    checks the vma support status (please note that this one has to be out of
    line due to include dependency issues).
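
    The result is a single extra line per mapping in smaps (field name as
    introduced by this series, spacing approximate):

    THPeligible:    1

    with the value 0 or 1 depending on whether the mapping can be backed by
    THP.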

    [mhocko@kernel.org: fix oops with NULL ->f_mapping]
    Link: http://lkml.kernel.org/r/20181224185106.GC16738@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20181211143641.3503-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Dan Williams
    Cc: David Rientjes
    Cc: Jan Kara
    Cc: Mike Rapoport
    Cc: Paul Oppenheimer
    Cc: William Kucharski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    To avoid having to change many call sites every time we want to add a
    parameter, use a structure to group all parameters for the mmu_notifier
    invalidate_range_start/end calls. No functional changes with this patch.
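
    A before/after sketch of a call site (this is the pre-event form of the
    API as introduced here; the exact argument order is an assumption):

    /* Before: each parameter passed at every call site. */
    mmu_notifier_invalidate_range_start(mm, start, end);
    /* ... */
    mmu_notifier_invalidate_range_end(mm, start, end);

    /* After: parameters grouped in one structure. */
    struct mmu_notifier_range range;

    mmu_notifier_range_init(&range, mm, start, end);
    mmu_notifier_invalidate_range_start(&range);
    /* ... */
    mmu_notifier_invalidate_range_end(&range);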

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Acked-by: Christian König
    Acked-by: Jan Kara
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Felix Kuehling
    Cc: Ralph Campbell
    Cc: John Hubbard
    From: Jérôme Glisse
    Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3

    fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n

    Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

31 Oct, 2018

1 commit

  • In DAX mode a write pagefault can race with write(2) in the following
    way:

    CPU0 CPU1
    write fault for mapped zero page (hole)
    dax_iomap_rw()
    iomap_apply()
    xfs_file_iomap_begin()
    - allocates blocks
    dax_iomap_actor()
    invalidate_inode_pages2_range()
    - invalidates radix tree entries in given range
    dax_iomap_pte_fault()
    grab_mapping_entry()
    - no entry found, creates empty
    ...
    xfs_file_iomap_begin()
    - finds already allocated block
    ...
    vmf_insert_mixed_mkwrite()
    - WARNs and does nothing because there
    is still zero page mapped in PTE
    unmap_mapping_pages()

    This race results in WARN_ON from insert_pfn() and is occasionally
    triggered by fstest generic/344. Note that the race is otherwise
    harmless as before write(2) on CPU0 is finished, we will invalidate page
    tables properly and thus the user of mmap will see modified data from
    write(2) from that point on. So just restrict the warning only to the
    case when the PFN in PTE is not zero page.

    Link: http://lkml.kernel.org/r/20180824154542.26872-1-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Andrew Morton
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Dave Jiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

27 Oct, 2018

8 commits

  • We clear the pte temporarily during read/modify/write update of the pte.
    If we take a page fault while the pte is cleared, the application can get
    SIGBUS. One such case is with remap_pfn_range without a backing
    vm_ops->fault callback. do_fault will return SIGBUS in that case.

    cpu 0 cpu1
    mprotect()
    ptep_modify_prot_start()/pte cleared.
    .
    . page fault.
    .
    .
    ptep_modify_prot_commit()

    Fix this by taking page table lock and rechecking for pte_none.
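
    A sketch of the resulting do_fault() handling for VMAs without a ->fault
    callback (structure approximate):

    if (!vma->vm_ops->fault) {
            vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
                                           vmf->address, &vmf->ptl);
            /* Recheck under the PTE lock: the pte may only have been
             * cleared transiently by ptep_modify_prot_start(). */
            if (unlikely(pte_none(*vmf->pte)))
                    ret = VM_FAULT_SIGBUS;
            else
                    ret = VM_FAULT_NOPAGE;
            pte_unmap_unlock(vmf->pte, vmf->ptl);
    }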

    [aneesh.kumar@linux.ibm.com: fix crash observed with syzkaller run]
    Link: http://lkml.kernel.org/r/87va6bwlfg.fsf@linux.ibm.com
    Link: http://lkml.kernel.org/r/20180926031858.9692-1-aneesh.kumar@linux.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Acked-by: Kirill A. Shutemov
    Cc: Willem de Bruijn
    Cc: Eric Dumazet
    Cc: Ido Schimmel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • All callers convert its errno into a vm_fault_t, so convert it to return a
    vm_fault_t directly.
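
    The conversion is the usual errno-to-fault-code mapping, roughly (the
    helper name here is made up for the sketch):

    static vm_fault_t errno_to_vm_fault(int err)
    {
            if (err == -ENOMEM)
                    return VM_FAULT_OOM;
            if (err < 0 && err != -EBUSY)
                    return VM_FAULT_SIGBUS;
            return VM_FAULT_NOPAGE;    /* 0 and -EBUSY: page is in place */
    }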

    Link: http://lkml.kernel.org/r/20180828145728.11873-11-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Reviewed-by: Andrew Morton
    Cc: Nicolas Pitre
    Cc: Souptick Joarder
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Both of its callers currently convert its errno return into a vm_fault_t,
    so move the conversion into __vm_insert_mixed().

    Link: http://lkml.kernel.org/r/20180828145728.11873-10-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Reviewed-by: Andrew Morton
    Cc: Nicolas Pitre
    Cc: Souptick Joarder
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • vm_insert_pfn_prot() is only called from vmf_insert_pfn_prot(), so inline
    it and convert some of the errnos into vm_fault codes earlier.

    Link: http://lkml.kernel.org/r/20180828145728.11873-9-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Reviewed-by: Andrew Morton
    Cc: Nicolas Pitre
    Cc: Souptick Joarder
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • All callers are now converted to vmf_insert_pfn() so convert
    vmf_insert_pfn() from being a compatibility wrapper around vm_insert_pfn()
    to being a compatibility wrapper around vmf_insert_pfn_prot().

    Link: http://lkml.kernel.org/r/20180828145728.11873-8-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Reviewed-by: Andrew Morton
    Cc: Nicolas Pitre
    Cc: Souptick Joarder
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
    Now that this is no longer used outside mm/memory.c, make it static.

    Link: http://lkml.kernel.org/r/20180828145728.11873-6-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Reviewed-by: Andrew Morton
    Cc: Nicolas Pitre
    Cc: Souptick Joarder
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Like vm_insert_pfn_prot(), but returns a vm_fault_t instead of an errno.
    Also unexport vm_insert_pfn_prot as it has no modular users.

    Link: http://lkml.kernel.org/r/20180828145728.11873-4-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Reviewed-by: Andrew Morton
    Cc: Nicolas Pitre
    Cc: Souptick Joarder
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • All callers are now converted to vmf_insert_mixed() so convert
    vmf_insert_mixed() from being a compatibility wrapper into the real
    function.

    Link: http://lkml.kernel.org/r/20180828145728.11873-3-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Reviewed-by: Andrew Morton
    Cc: Nicolas Pitre
    Cc: Souptick Joarder
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

04 Sep, 2018

1 commit

  • It is common for architectures with hugepage support to require only a
    single TLB invalidation operation per hugepage during unmap(), rather than
    iterating through the mapping at a PAGE_SIZE increment. Currently,
    however, the level in the page table where the unmap() operation occurs
    is not stored in the mmu_gather structure, therefore forcing
    architectures to issue additional TLB invalidation operations or to give
    up and over-invalidate by e.g. invalidating the entire TLB.

    Ideally, we could add an interval rbtree to the mmu_gather structure,
    which would allow us to associate the correct mapping granule with the
    various sub-mappings within the range being invalidated. However, this
    is costly in terms of book-keeping and memory management, so instead we
    approximate by keeping track of the page table levels that are cleared
    and provide a means to query the smallest granule required for invalidation.
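
    The bookkeeping amounts to a few bits in struct mmu_gather plus a query
    helper, along these lines (a sketch; the field and helper names are
    assumptions):

    struct mmu_gather {
            /* ... */
            unsigned int cleared_ptes : 1;
            unsigned int cleared_pmds : 1;
            unsigned int cleared_puds : 1;
            unsigned int cleared_p4ds : 1;
    };

    static inline unsigned long tlb_get_unmap_shift(struct mmu_gather *tlb)
    {
            if (tlb->cleared_ptes)
                    return PAGE_SHIFT;
            if (tlb->cleared_pmds)
                    return PMD_SHIFT;
            if (tlb->cleared_puds)
                    return PUD_SHIFT;
            return PAGE_SHIFT;
    }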

    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Nicholas Piggin
    Signed-off-by: Will Deacon

    Will Deacon
     

26 Aug, 2018

1 commit

  • This is not normally noticeable, but repeated forks are unnecessarily
    expensive because they repeatedly dirty the parent page tables during
    the page table copy operation.

    It's trivial to just avoid write protecting the page table entry if it
    was already not writable.
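
    The change boils down to one extra condition in the PTE copy loop
    (a sketch, not the exact hunk):

    /* Only write-protect (and hence dirty the parent's PTE) when the
     * entry is actually writable; read-only PTEs are left untouched. */
    if (is_cow_mapping(vm_flags) && pte_write(pte)) {
            ptep_set_wrprotect(src_mm, addr, src_pte);
            pte = pte_wrprotect(pte);
    }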

    This patch was inspired by

    https://bugzilla.kernel.org/show_bug.cgi?id=200447

    which points to an ancient "waste time re-doing fork" issue in the
    presence of lots of signals.

    That bug was fixed by Eric Biederman's signal handling series
    culminating in commit c3ad2c3b02e9 ("signal: Don't restart fork when
    signals come in"), but the unnecessary work for repeated forks is still
    worth just fixing, particularly since the fix is trivial.

    Cc: Eric Biederman
    Cc: Oleg Nesterov
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

24 Aug, 2018

6 commits

  • Merge yet more updates from Andrew Morton:

    - the rest of MM

    - various misc fixes and tweaks

    * emailed patches from Andrew Morton : (22 commits)
    mm: Change return type int to vm_fault_t for fault handlers
    lib/fonts: convert comments to utf-8
    s390: ebcdic: convert comments to UTF-8
    treewide: convert ISO_8859-1 text comments to utf-8
    drivers/gpu/drm/gma500/: change return type to vm_fault_t
    docs/core-api: mm-api: add section about GFP flags
    docs/mm: make GFP flags descriptions usable as kernel-doc
    docs/core-api: split memory management API to a separate file
    docs/core-api: move *{str,mem}dup* to "String Manipulation"
    docs/core-api: kill trailing whitespace in kernel-api.rst
    mm/util: add kernel-doc for kvfree
    mm/util: make strndup_user description a kernel-doc comment
    fs/proc/vmcore.c: hide vmcoredd_mmap_dumps() for nommu builds
    treewide: correct "differenciate" and "instanciate" typos
    fs/afs: use new return type vm_fault_t
    drivers/hwtracing/intel_th/msu.c: change return type to vm_fault_t
    mm: soft-offline: close the race against page allocation
    mm: fix race on soft-offlining free huge pages
    namei: allow restricted O_CREAT of FIFOs and regular files
    hfs: prevent crash on exit from failed search
    ...

    Linus Torvalds
     
  • Use new return type vm_fault_t for fault handler. For now, this is just
    documenting that the function returns a VM_FAULT value rather than an
    errno. Once all instances are converted, vm_fault_t will become a
    distinct type.

    Ref-> commit 1c8f422059ae ("mm: change return type to vm_fault_t")

    The aim is to change the return type of finish_fault() and
    handle_mm_fault() to vm_fault_t type. As part of that clean up return
    type of all other recursively called functions have been changed to
    vm_fault_t type.

    The places from where handle_mm_fault() is getting invoked will be
    changed to the vm_fault_t type, but in a separate patch.

    vmf_error() is the newly introduced inline function in 4.17-rc6.

    [akpm@linux-foundation.org: don't shadow outer local `ret' in __do_huge_pmd_anonymous_page()]
    Link: http://lkml.kernel.org/r/20180604171727.GA20279@jordon-HP-15-Notebook-PC
    Signed-off-by: Souptick Joarder
    Reviewed-by: Matthew Wilcox
    Reviewed-by: Andrew Morton
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Souptick Joarder
     
  • The generic tlb_end_vma does not call invalidate_range mmu notifier, and
    it resets the mmu_gather range, which means the notifier won't be
    called on part of the range in case of an unmap that spans multiple
    vmas.

    ARM64 seems to be the only arch I could see that has notifiers and uses
    the generic tlb_end_vma. I have not actually tested it.

    [ Catalin and Will point out that ARM64 currently only uses the
    notifiers for KVM, which doesn't use the ->invalidate_range()
    callback right now, so it's a bug, but one that happens to
    not affect them. So not necessary for stable. - Linus ]

    Signed-off-by: Nicholas Piggin
    Acked-by: Catalin Marinas
    Acked-by: Will Deacon
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     
  • Jann reported that x86 was missing required TLB invalidates when he
    hit the !*batch slow path in tlb_remove_table().

    This is indeed the case; RCU_TABLE_FREE does not provide TLB (cache)
    invalidates, the PowerPC-hash where this code originated and the
    Sparc-hash where this was subsequently used did not need that. ARM
    which later used this put an explicit TLB invalidate in their
    __p*_free_tlb() functions, and PowerPC-radix followed that example.

    But when we hooked up x86 we failed to consider this. Fix this by
    (optionally) hooking tlb_remove_table() into the TLB invalidate code.

    NOTE: s390 was also needing something like this and might now
    be able to use the generic code again.

    [ Modified to be on top of Nick's cleanups, which simplified this patch
    now that tlb_flush_mmu_tlbonly() really only flushes the TLB - Linus ]

    Fixes: 9e52fc2b50de ("x86/mm: Enable RCU based page table freeing (CONFIG_HAVE_RCU_TABLE_FREE=y)")
    Reported-by: Jann Horn
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Rik van Riel
    Cc: Nicholas Piggin
    Cc: David Miller
    Cc: Will Deacon
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Will noted that only checking mm_users is incorrect; we should also
    check mm_count in order to cover CPUs that have a lazy reference to
    this mm (and could do speculative TLB operations).

    If removing this turns out to be a performance issue, we can
    re-instate a more complete check, but in tlb_table_flush() eliding the
    call_rcu_sched().

    Fixes: 267239116987 ("mm, powerpc: move the RCU page-table freeing into generic code")
    Reported-by: Will Deacon
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Rik van Riel
    Acked-by: Will Deacon
    Cc: Nicholas Piggin
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • There is no need to call this from tlb_flush_mmu_tlbonly, it logically
    belongs with tlb_flush_mmu_free. This makes future fixes simpler.

    [ This was originally done to allow code consolidation for the
    mmu_notifier fix, but it also ends up helping simplify the
    HAVE_RCU_TABLE_INVALIDATE fix. - Linus ]

    Signed-off-by: Nicholas Piggin
    Acked-by: Will Deacon
    Cc: Peter Zijlstra
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Nicholas Piggin