01 Oct, 2020

2 commits

  • [ Upstream commit c3e5ea6ee574ae5e845a40ac8198de1fb63bb3ab ]

    Jeff Moyer has reported that one of xfstests triggers a warning when run
    on DAX-enabled filesystem:

    WARNING: CPU: 76 PID: 51024 at mm/memory.c:2317 wp_page_copy+0xc40/0xd50
    ...
    wp_page_copy+0x98c/0xd50 (unreliable)
    do_wp_page+0xd8/0xad0
    __handle_mm_fault+0x748/0x1b90
    handle_mm_fault+0x120/0x1f0
    __do_page_fault+0x240/0xd70
    do_page_fault+0x38/0xd0
    handle_page_fault+0x10/0x30

    The warning happens on failed __copy_from_user_inatomic() which tries to
    copy data into a CoW page.

    This happens because of a race between MADV_DONTNEED and the CoW page fault:

    CPU0                                    CPU1
    handle_mm_fault()
      do_wp_page()
        wp_page_copy()
          do_wp_page()
                                            madvise(MADV_DONTNEED)
                                              zap_page_range()
                                                zap_pte_range()
                                                  ptep_get_and_clear_full()

          __copy_from_user_inatomic()
          sees empty PTE and fails
          WARN_ON_ONCE(1)
          clear_page()

    The solution is to re-try __copy_from_user_inatomic() under the PTL after
    checking that the PTE matches orig_pte.

    The second copy attempt can still fail, for example due to a non-readable
    PTE, but there's nothing reasonable we can do about it, except clearing
    the CoW page.
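
    As a rough sketch of the retry described above (simplified; the
    kaddr/uaddr locals, error handling and the surrounding cow_user_page()
    plumbing are assumed):

        if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) {
                /* First copy failed: take the PTL and re-validate the PTE. */
                vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
                if (!pte_same(*vmf->pte, vmf->orig_pte)) {
                        /* PTE changed under us (e.g. MADV_DONTNEED): retry the fault. */
                        ret = false;
                        goto pte_unlock;
                }

                /* Try the copy again with the PTL held. */
                if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) {
                        /* Still failing (e.g. non-readable PTE): give up, zero the page. */
                        WARN_ON_ONCE(1);
                        clear_page(kaddr);
                }
        }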

    Reported-by: Jeff Moyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Jeff Moyer
    Cc:
    Cc: Justin He
    Cc: Dan Williams
    Link: http://lkml.kernel.org/r/20200218154151.13349-1-kirill.shutemov@linux.intel.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Kirill A. Shutemov
     
  • [ Upstream commit 83d116c53058d505ddef051e90ab27f57015b025 ]

    When we tested the pmdk unit test [1] vmmalloc_fork TEST3 on an arm64
    guest, there was a double page fault in __copy_from_user_inatomic() of
    cow_user_page().

    To reproduce the bug, the cmd is as follows after you deployed everything:
    make -C src/test/vmmalloc_fork/ TEST_TIME=60m check

    Below call trace is from arm64 do_page_fault for debugging purpose:
    [ 110.016195] Call trace:
    [ 110.016826] do_page_fault+0x5a4/0x690
    [ 110.017812] do_mem_abort+0x50/0xb0
    [ 110.018726] el1_da+0x20/0xc4
    [ 110.019492] __arch_copy_from_user+0x180/0x280
    [ 110.020646] do_wp_page+0xb0/0x860
    [ 110.021517] __handle_mm_fault+0x994/0x1338
    [ 110.022606] handle_mm_fault+0xe8/0x180
    [ 110.023584] do_page_fault+0x240/0x690
    [ 110.024535] do_mem_abort+0x50/0xb0
    [ 110.025423] el0_da+0x20/0x24

    The pte info before __copy_from_user_inatomic is (PTE_AF is cleared):
    [ffff9b007000] pgd=000000023d4f8003, pud=000000023da9b003,
    pmd=000000023d4b3003, pte=360000298607bd3

    As told by Catalin: "On arm64 without hardware Access Flag, copying from
    user will fail because the pte is old and cannot be marked young. So we
    always end up with zeroed page after fork() + CoW for pfn mappings. we
    don't always have a hardware-managed access flag on arm64."

    This patch fixes it by calling pte_mkyoung. Also, the parameter list is
    changed because vmf should be passed to cow_user_page().

    Add a WARN_ON_ONCE when __copy_from_user_inatomic() returns an error,
    in case there is some obscure use-case (by Kirill).

    [1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork
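
    A condensed sketch of the resulting change in cow_user_page() (the mm,
    addr and vma locals and the enclosing function are assumed;
    arch_faults_on_old_pte() is the helper this patch introduces):

        if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) {
                pte_t entry;

                vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
                if (!pte_same(*vmf->pte, vmf->orig_pte)) {
                        /* Another thread already handled the fault; bail out. */
                        ret = false;
                        goto pte_unlock;
                }

                /* Mark the pte young so the copy below cannot double-fault. */
                entry = pte_mkyoung(vmf->orig_pte);
                if (ptep_set_access_flags(vma, addr, vmf->pte, entry, 0))
                        update_mmu_cache(vma, addr, vmf->pte);
        }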

    Signed-off-by: Jia He
    Reported-by: Yibo Cai
    Reviewed-by: Catalin Marinas
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Catalin Marinas
    Signed-off-by: Sasha Levin

    Jia He
     

09 Jan, 2020

1 commit

  • [ Upstream commit 89b15332af7c0312a41e50846819ca6613b58b4c ]

    One of our services is observing hanging ps/top/etc under heavy write
    IO, and the task states show this is an mmap_sem priority inversion:

    A write fault is holding the mmap_sem in read-mode and waiting for
    (heavily cgroup-limited) IO in balance_dirty_pages():

    balance_dirty_pages+0x724/0x905
    balance_dirty_pages_ratelimited+0x254/0x390
    fault_dirty_shared_page.isra.96+0x4a/0x90
    do_wp_page+0x33e/0x400
    __handle_mm_fault+0x6f0/0xfa0
    handle_mm_fault+0xe4/0x200
    __do_page_fault+0x22b/0x4a0
    page_fault+0x45/0x50

    Somebody tries to change the address space, contending for the mmap_sem in
    write-mode:

    call_rwsem_down_write_failed_killable+0x13/0x20
    do_mprotect_pkey+0xa8/0x330
    SyS_mprotect+0xf/0x20
    do_syscall_64+0x5b/0x100
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    The waiting writer locks out all subsequent readers to avoid lock
    starvation, and several threads can be seen hanging like this:

    call_rwsem_down_read_failed+0x14/0x30
    proc_pid_cmdline_read+0xa0/0x480
    __vfs_read+0x23/0x140
    vfs_read+0x87/0x130
    SyS_read+0x42/0x90
    do_syscall_64+0x5b/0x100
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    To fix this, do what we do for cache read faults already: drop the
    mmap_sem before calling into anything IO bound, in this case the
    balance_dirty_pages() function, and return VM_FAULT_RETRY.
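
    A condensed sketch of the resulting throttling tail of
    fault_dirty_shared_page(), assuming the maybe_unlock_mmap_for_io()
    helper already used by the cache read fault path:

        if (mapping) {
                struct file *fpin;

                /* Drop mmap_sem (pinning the file instead) before waiting on IO. */
                fpin = maybe_unlock_mmap_for_io(vmf, NULL);
                balance_dirty_pages_ratelimited(mapping);
                if (fpin) {
                        fput(fpin);
                        return VM_FAULT_RETRY;
                }
        }

        return 0;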

    Link: http://lkml.kernel.org/r/20190924194238.GA29030@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Matthew Wilcox (Oracle)
    Acked-by: Kirill A. Shutemov
    Cc: Josef Bacik
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Johannes Weiner
     

25 Sep, 2019

3 commits

  • Use %px to show the actual address in print_bad_pte() to help debug
    issues.

    Link: http://lkml.kernel.org/r/20190831011816.141002-1-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kefeng Wang
     
  • In our testing (camera recording), Miguel and Wei found that
    unmap_page_range() easily takes more than 6ms with preemption disabled.
    The reason is that it holds the page table spinlock for the entire
    512-page operation in a PMD. 6.2ms is never trivial for user experience:
    if an RT task cannot run during that time, it can cause frame drops or
    audio glitches.

    I took the time to benchmark it by adding some trace_printk hooks between
    pte_offset_map_lock and pte_unmap_unlock in zap_pte_range. The testing
    device is a 2018 premium mobile device.

    I can get a 2ms delay rather easily when releasing 2M (ie, 512 pages)
    while the task runs on a little core, even though there is no IPI and no
    LRU lock contention. That is already too heavy.

    If I remove activate_page, 35-40% of the zap_pte_range overhead is gone,
    so most of the overhead (about 0.7ms) comes from activate_page via
    mark_page_accessed. Thus, if there is LRU contention, that 0.7ms could
    accumulate up to several ms.

    So this patch adds a check for need_resched() in the loop, and a
    preemption point if necessary.
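
    A rough sketch of the shape of the change in zap_pte_range() (heavily
    condensed; the actual zapping and TLB bookkeeping are elided):

        again:
                start_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
                do {
                        pte_t ptent = *pte;

                        if (pte_none(ptent))
                                continue;
                        if (need_resched())
                                break;          /* drop the PTL below and reschedule */
                        /* ... zap the entry, accumulate TLB flush work ... */
                } while (pte++, addr += PAGE_SIZE, addr != end);
                pte_unmap_unlock(start_pte, ptl);

                if (force_flush) {
                        force_flush = 0;
                        tlb_flush_mmu(tlb);
                }
                if (addr != end) {
                        cond_resched();
                        goto again;             /* continue where we left off */
                }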

    Link: http://lkml.kernel.org/r/20190731061440.GC155569@google.com
    Signed-off-by: Minchan Kim
    Reported-by: Miguel de Dios
    Reported-by: Wei Wang
    Reviewed-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Since ptent will not be changed after the previous assignment of entry,
    it is not necessary to do the assignment again.

    Link: http://lkml.kernel.org/r/20190708082740.21111-1-richardw.yang@linux.intel.com
    Signed-off-by: Wei Yang
    Acked-by: Matthew Wilcox (Oracle)
    Cc: Will Deacon
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     

19 Jul, 2019

1 commit

  • transhuge_vma_suitable() was only available for shmem THP, but anonymous
    THP has the same check except for the pgoff check. It will also be used
    for the THP eligibility check in a later patch, so make it available for
    all kinds of THP. This also helps reduce code duplication slightly.

    Since anonymous THP doesn't have to check pgoff, make the pgoff check
    shmem-vma only.

    Also regroup some functions in include/linux/mm.h to solve a compile
    issue, since transhuge_vma_suitable() needs to call vma_is_anonymous(),
    which was defined after huge_mm.h is included.
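
    The resulting helper looks roughly like this (sketch of the generalized
    check; HPAGE_PMD_NR/HPAGE_PMD_SIZE come from huge_mm.h):

        static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
                        unsigned long haddr)
        {
                /* Only shmem/file vmas need the pgoff alignment check. */
                if (!vma_is_anonymous(vma)) {
                        if (((vma->vm_start >> PAGE_SHIFT) & (HPAGE_PMD_NR - 1)) !=
                                        (vma->vm_pgoff & (HPAGE_PMD_NR - 1)))
                                return false;
                }

                if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
                        return false;
                return true;
        }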

    [akpm@linux-foundation.org: fix typo]
    [yang.shi@linux.alibaba.com: v4]
    Link: http://lkml.kernel.org/r/1563400758-124759-2-git-send-email-yang.shi@linux.alibaba.com
    Link: http://lkml.kernel.org/r/1560401041-32207-2-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Acked-by: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

15 Jul, 2019

1 commit

  • Pull HMM updates from Jason Gunthorpe:
    "Improvements and bug fixes for the hmm interface in the kernel:

    - Improve clarity, locking and APIs related to the 'hmm mirror'
    feature merged last cycle. In linux-next we now see AMDGPU and
    nouveau to be using this API.

    - Remove old or transitional hmm APIs. These are hold overs from the
    past with no users, or APIs that existed only to manage cross tree
    conflicts. There are still a few more of these cleanups that didn't
    make the merge window cut off.

    - Improve some core mm APIs:
    - export alloc_pages_vma() for driver use
    - refactor into devm_request_free_mem_region() to manage
    DEVICE_PRIVATE resource reservations
    - refactor duplicative driver code into the core dev_pagemap
    struct

    - Remove hmm wrappers of improved core mm APIs, instead have drivers
    use the simplified API directly

    - Remove DEVICE_PUBLIC

    - Simplify the kconfig flow for the hmm users and core code"

    * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (42 commits)
    mm: don't select MIGRATE_VMA_HELPER from HMM_MIRROR
    mm: remove the HMM config option
    mm: sort out the DEVICE_PRIVATE Kconfig mess
    mm: simplify ZONE_DEVICE page private data
    mm: remove hmm_devmem_add
    mm: remove hmm_vma_alloc_locked_page
    nouveau: use devm_memremap_pages directly
    nouveau: use alloc_page_vma directly
    PCI/P2PDMA: use the dev_pagemap internal refcount
    device-dax: use the dev_pagemap internal refcount
    memremap: provide an optional internal refcount in struct dev_pagemap
    memremap: replace the altmap_valid field with a PGMAP_ALTMAP_VALID flag
    memremap: remove the data field in struct dev_pagemap
    memremap: add a migrate_to_ram method to struct dev_pagemap_ops
    memremap: lift the devmap_enable manipulation into devm_memremap_pages
    memremap: pass a struct dev_pagemap to ->kill and ->cleanup
    memremap: move dev_pagemap callbacks into a separate structure
    memremap: validate the pagemap type passed to devm_memremap_pages
    mm: factor out a devm_request_free_mem_region helper
    mm: export alloc_pages_vma
    ...

    Linus Torvalds
     

13 Jul, 2019

5 commits

  • This function is used by ptrace and proc files like /proc/pid/cmdline and
    /proc/pid/environ.

    Access_remote_vm never returns error codes, all errors are ignored and
    only size of successfully read data is returned. So, if current task was
    killed we'll simply return 0 (bytes read).

    Mmap_sem could be locked for a long time or forever if something goes
    wrong. Using a killable lock permits cleanup of stuck tasks and
    simplifies investigation.
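
    The change in __access_remote_vm() boils down to (sketch):

        /* was: down_read(&mm->mmap_sem); */
        if (down_read_killable(&mm->mmap_sem))
                return 0;       /* callers treat this as "0 bytes read" */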

    Link: http://lkml.kernel.org/r/156007494202.3335.16782303099589302087.stgit@buzz
    Signed-off-by: Konstantin Khlebnikov
    Reviewed-by: Michal Koutný
    Acked-by: Oleg Nesterov
    Acked-by: Michal Hocko
    Cc: Alexey Dobriyan
    Cc: Matthew Wilcox
    Cc: Cyrill Gorcunov
    Cc: Kirill Tkhai
    Cc: Al Viro
    Cc: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • If the caller asks us for offset == num, we should already fail in the
    first check, i.e. the one testing for offsets beyond the object.

    At the moment, we are failing on the second test anyway, since count
    cannot be 0. Still, to agree with the comment of the first test, we
    should first test it there.

    Link: http://lkml.kernel.org/r/20190528193004.GA7744@gmail.com
    Signed-off-by: Miguel Ojeda
    Reviewed-by: Andrew Morton
    Cc: Souptick Joarder
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: "Aneesh Kumar K.V"
    Cc: Huang Ying
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miguel Ojeda
     
  • Drop the pgtable_t variable from all implementations of pte_fn_t as none
    of them use it. apply_to_pte_range() should stop computing it as well.
    This should help us save some cycles.
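
    After the change, a pte_fn_t callback passed to apply_to_page_range()
    takes only the pte, the address and the opaque data pointer; a made-up
    example callback (count_pte_cb and its payload are purely illustrative):

        static int count_pte_cb(pte_t *pte, unsigned long addr, void *data)
        {
                unsigned long *count = data;    /* illustrative payload */

                (*count)++;                     /* just count the ptes visited */
                return 0;
        }

        /* used as: apply_to_page_range(mm, addr, size, count_pte_cb, &count); */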

    Link: http://lkml.kernel.org/r/1556803126-26596-1-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Anshuman Khandual
    Acked-by: Matthew Wilcox
    Cc: Ard Biesheuvel
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Michal Hocko
    Cc: Logan Gunthorpe
    Cc: "Kirill A. Shutemov"
    Cc: Dan Williams
    Cc:
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     
  • When swapin is performed, after getting the swap entry information from
    the page table, the system will swap in the swap entry without any lock
    held to prevent the swap device from being swapped off. This may cause a
    race like the one below:

    CPU 1                                   CPU 2
    -----                                   -----
    do_swap_page
      swapin_readahead
        __read_swap_cache_async
                                            swapoff
                                              p->swap_map = NULL
          swapcache_prepare
            __swap_duplicate
              p->swap_map[?] /* !!! NULL pointer access */

    Because swapoff is usually done only when the system shuts down, the race
    may not hit many people in practice. But it is still a race that needs to
    be fixed.

    To fix the race, get_swap_device() is added to check whether the specified
    swap entry is valid in its swap device. If so, it will keep the swap
    entry valid by preventing the swap device from being swapped off, until
    put_swap_device() is called.

    Because swapoff() is a very rare code path, to make the normal path run
    as fast as possible, rcu_read_lock/unlock() and synchronize_rcu() are
    used instead of a reference count to implement get/put_swap_device().
    From get_swap_device() to put_swap_device(), the RCU reader side is
    locked, so synchronize_rcu() in swapoff() will wait until
    put_swap_device() is called.
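
    The usage pattern this introduces looks roughly like the following
    (sketch; entry is assumed to be a swp_entry_t read from the page table):

        struct swap_info_struct *si;

        si = get_swap_device(entry);    /* enters an RCU read-side section */
        if (!si)
                goto out;               /* raced with swapoff; entry is stale */

        /* ... safely access si->swap_map, the swap cache, etc. ... */

        put_swap_device(si);            /* leaves the RCU read-side section */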

    In addition to the swap_map, cluster_info, etc. data structures in struct
    swap_info_struct, the swap cache radix tree will be freed after swapoff,
    so this patch fixes the race between swap cache lookup and swapoff too.

    Races between some other swap cache usages and swapoff are fixed too, by
    calling synchronize_rcu() between clearing PageSwapCache() and freeing
    the swap cache data structure.

    Another possible method to fix this is to use preempt_off() +
    stop_machine() to prevent the swap device from being swapoff when its data
    structure is being accessed. The overhead in hot-path of both methods is
    similar. The advantages of RCU based method are,

    1. stop_machine() may disturb the normal execution code path on other
    CPUs.

    2. File cache uses RCU to protect its radix tree. If the similar
    mechanism is used for swap cache too, it is easier to share code
    between them.

    3. RCU is used to protect swap cache in total_swapcache_pages() and
    exit_swap_address_space() already. The two mechanisms can be
    merged to simplify the logic.

    Link: http://lkml.kernel.org/r/20190522015423.14418-1-ying.huang@intel.com
    Fixes: 235b62176712 ("mm/swap: add cluster lock")
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Andrea Parri
    Not-nacked-by: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: Paul E. McKenney
    Cc: Daniel Jordan
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: Tim Chen
    Cc: Mel Gorman
    Cc: Jérôme Glisse
    Cc: Yang Shi
    Cc: David Rientjes
    Cc: Rik van Riel
    Cc: Jan Kara
    Cc: Dave Jiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • Make the success case use the same cleanup path as the failure case.

    Link: http://lkml.kernel.org/r/20190523134024.GC24093@localhost.localdomain
    Signed-off-by: Miklos Szeredi
    Reviewed-by: Andrew Morton
    Reviewed-by: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

15 May, 2019

3 commits

  • Patch series "mm: Use vm_map_pages() and vm_map_pages_zero() API", v5.

    This patch (of 5):

    Previously, drivers had their own way of mapping a range of kernel
    pages/memory into a user vma, done by invoking vm_insert_page() within a
    loop.

    As this pattern is common across different drivers, it can be generalized
    by creating new functions and using them across the drivers.

    vm_map_pages() is the API which can be used to map kernel memory/pages in
    drivers which have considered vm_pgoff.

    vm_map_pages_zero() is the API which can be used to map a range of kernel
    memory/pages in drivers which have not considered vm_pgoff; vm_pgoff is
    passed as 0 by default for those drivers.

    We _could_ then at a later point "fix" these drivers which are using
    vm_map_pages_zero() to behave according to the normal vm_pgoff offsetting
    simply by removing the _zero suffix on the function name; if that causes
    regressions, it gives us an easy way to revert.

    Tested on Rockchip hardware and display is working, including talking to
    Lima via prime.
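
    A hypothetical driver mmap handler using the new helper might look like
    this (the mydrv_* names and the pages/npages bookkeeping are made up for
    illustration):

        static int mydrv_mmap(struct file *file, struct vm_area_struct *vma)
        {
                struct mydrv_buf *buf = file->private_data;

                /* Maps buf->npages kernel pages, honouring vma->vm_pgoff. */
                return vm_map_pages(vma, buf->pages, buf->npages);
        }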

    Link: http://lkml.kernel.org/r/751cb8a0f4c3e67e95c58a3b072937617f338eea.1552921225.git.jrdr.linux@gmail.com
    Signed-off-by: Souptick Joarder
    Suggested-by: Russell King
    Suggested-by: Matthew Wilcox
    Reviewed-by: Mike Rapoport
    Tested-by: Heiko Stuebner
    Cc: Michal Hocko
    Cc: "Kirill A. Shutemov"
    Cc: Vlastimil Babka
    Cc: Rik van Riel
    Cc: Stephen Rothwell
    Cc: Peter Zijlstra
    Cc: Robin Murphy
    Cc: Joonsoo Kim
    Cc: Thierry Reding
    Cc: Kees Cook
    Cc: Marek Szyprowski
    Cc: Stefan Richter
    Cc: Sandy Huang
    Cc: David Airlie
    Cc: Oleksandr Andrushchenko
    Cc: Joerg Roedel
    Cc: Pawel Osciak
    Cc: Kyungmin Park
    Cc: Boris Ostrovsky
    Cc: Juergen Gross
    Cc: Mauro Carvalho Chehab
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Souptick Joarder
     
  • This updates each existing invalidation to use the correct mmu notifier
    event that represents what is happening to the CPU page table. See the
    patch which introduced the events for the rationale behind this.

    Link: http://lkml.kernel.org/r/20190326164747.24405-7-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Ralph Campbell
    Reviewed-by: Ira Weiny
    Cc: Christian König
    Cc: Joonas Lahtinen
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Peter Xu
    Cc: Felix Kuehling
    Cc: Jason Gunthorpe
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: John Hubbard
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • CPU page table updates can happen for many reasons, not only as a result
    of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
    a result of kernel activities (memory compression, reclaim, migration,
    ...).

    Users of the mmu notifier API track changes to the CPU page table and
    take specific actions for them. The current API only provides the range
    of virtual addresses affected by the change, not why the change is
    happening.

    This patchset does the initial mechanical conversion of all the places
    that call mmu_notifier_range_init to also provide the default
    MMU_NOTIFY_UNMAP event, as well as the vma if it is known (most
    invalidations happen against a given vma). Passing down the vma allows
    the users of mmu notifiers to inspect the new vma page protection.

    MMU_NOTIFY_UNMAP is always the safe default, as users of mmu notifiers
    should assume that every mapping in the range is going away when that
    event happens. A later patch converts the mm call paths to use more
    appropriate events for each call.

    This is done as 2 patches so that no call site is forgotten, especially
    as it uses the following coccinelle patch:

    %<----------------------------------------------------------------------
    @@
    expression E1, E3, E4;
    identifier I1;
    @@
    <...
    mmu_notifier_range_init(E1,
    +MMU_NOTIFY_UNMAP, 0, I1,
    I1->vm_mm, E3, E4)
    ...>

    @@
    expression E1, E2, E3, E4;
    identifier FN, VMA;
    @@
    FN(..., struct vm_area_struct *VMA, ...) {
    <...
    mmu_notifier_range_init(E1,
    +MMU_NOTIFY_UNMAP, 0, VMA,
    E2, E3, E4)
    ...>
    }

    @@
    expression E1, E2, E3, E4;
    identifier FN, VMA;
    @@
    FN(...) {
    struct vm_area_struct *VMA;
    <...
    mmu_notifier_range_init(E1,
    +MMU_NOTIFY_UNMAP, 0, VMA,
    E2, E3, E4)
    ...>
    }

    @@
    expression E1, E2, E3, E4;
    identifier FN;
    @@
    FN(...) {
    <...
    mmu_notifier_range_init(E1,
    +MMU_NOTIFY_UNMAP, 0, NULL,
    E2, E3, E4)
    ...>
    }
    ---------------------------------------------------------------------->%

    Applied with:
    spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
    spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
    spatch --sp-file mmu-notifier.spatch --dir mm --in-place

    Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Ralph Campbell
    Reviewed-by: Ira Weiny
    Cc: Christian König
    Cc: Joonas Lahtinen
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Peter Xu
    Cc: Felix Kuehling
    Cc: Jason Gunthorpe
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: John Hubbard
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

08 May, 2019

1 commit

  • Pull printk updates from Petr Mladek:

    - Allow state reset of printk_once() calls.

    - Prevent crashes when dereferencing invalid pointers in vsprintf().
    Only the first byte is checked for simplicity.

    - Make vsprintf warnings consistent and inlined.

    - Treewide conversion of the obsolete %pf and %pF printf modifiers to %ps
    and %pS, respectively.

    - Some clean up of vsprintf and test_printf code.

    * tag 'printk-for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
    lib/vsprintf: Make function pointer_string static
    vsprintf: Limit the length of inlined error messages
    vsprintf: Avoid confusion between invalid address and value
    vsprintf: Prevent crash when dereferencing invalid pointers
    vsprintf: Consolidate handling of unknown pointer specifiers
    vsprintf: Factor out %pO handler as kobject_string()
    vsprintf: Factor out %pV handler as va_format()
    vsprintf: Factor out %p[iI] handler as ip_addr_string()
    vsprintf: Do not check address of well-known strings
    vsprintf: Consistent %pK handling for kptr_restrict == 0
    vsprintf: Shuffle restricted_pointer()
    printk: Tie printk_once / printk_deferred_once into .data.once for reset
    treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively
    lib/test_printf: Switch to bitmap_zalloc()

    Linus Torvalds
     

09 Apr, 2019

1 commit

  • %pF and %pf are functionally equivalent to %pS and %ps conversion
    specifiers. The former are deprecated, therefore switch the current users
    to use the preferred variant.

    The changes have been produced by the following command:

    git grep -l '%p[fF]' | grep -v '^\(tools\|Documentation\)/' | \
    while read i; do perl -i -pe 's/%pf/%ps/g; s/%pF/%pS/g;' $i; done

    And verifying the result.

    Link: http://lkml.kernel.org/r/20190325193229.23390-1-sakari.ailus@linux.intel.com
    Cc: Andy Shevchenko
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: sparclinux@vger.kernel.org
    Cc: linux-um@lists.infradead.org
    Cc: xen-devel@lists.xenproject.org
    Cc: linux-acpi@vger.kernel.org
    Cc: linux-pm@vger.kernel.org
    Cc: drbd-dev@lists.linbit.com
    Cc: linux-block@vger.kernel.org
    Cc: linux-mmc@vger.kernel.org
    Cc: linux-nvdimm@lists.01.org
    Cc: linux-pci@vger.kernel.org
    Cc: linux-scsi@vger.kernel.org
    Cc: linux-btrfs@vger.kernel.org
    Cc: linux-f2fs-devel@lists.sourceforge.net
    Cc: linux-mm@kvack.org
    Cc: ceph-devel@vger.kernel.org
    Cc: netdev@vger.kernel.org
    Signed-off-by: Sakari Ailus
    Acked-by: David Sterba (for btrfs)
    Acked-by: Mike Rapoport (for mm/memblock.c)
    Acked-by: Bjorn Helgaas (for drivers/pci)
    Acked-by: Rafael J. Wysocki
    Signed-off-by: Petr Mladek

    Sakari Ailus
     

03 Apr, 2019

2 commits

  • As the comment notes, it is a potentially dangerous operation. Just
    use tlb_flush_mmu(), which will skip the (double) TLB invalidate if
    it really isn't needed anyway.

    No change in behavior intended.

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Will Deacon
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Move the mmu_gather::page_size things into the generic code instead of
    PowerPC specific bits.

    No change in behavior intended.

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Will Deacon
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Aneesh Kumar K.V
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

30 Mar, 2019

1 commit

  • Aneesh has reported that PPC triggers the following warning when
    exercising DAX code:

    IP set_pte_at+0x3c/0x190
    LR insert_pfn+0x208/0x280
    Call Trace:
    insert_pfn+0x68/0x280
    dax_iomap_pte_fault.isra.7+0x734/0xa40
    __xfs_filemap_fault+0x280/0x2d0
    do_wp_page+0x48c/0xa40
    __handle_mm_fault+0x8d0/0x1fd0
    handle_mm_fault+0x140/0x250
    __do_page_fault+0x300/0xd60
    handle_page_fault+0x18

    Now, that is the WARN_ON in set_pte_at(), which is:

    VM_WARN_ON(pte_hw_valid(*ptep) && !pte_protnone(*ptep));

    The problem is that on some architectures set_pte_at() cannot cope with
    a situation where there is already some (different) valid entry present.

    Use ptep_set_access_flags() instead to modify the pfn, as it is built to
    deal with modifying an existing PTE.
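
    Roughly, the mkwrite path of insert_pfn() ends up looking like this after
    the change (condensed sketch; the pfn-mismatch handling comes from an
    earlier fix):

        if (!pte_none(*pte)) {
                if (mkwrite) {
                        if (pte_pfn(*pte) != pfn_t_to_pfn(pfn))
                                goto out_unlock;        /* racing with invalidation */
                        entry = pte_mkyoung(*pte);
                        entry = maybe_mkwrite(pte_mkdirty(entry), vma);
                        /* Safe way to update an already-present, valid entry. */
                        if (ptep_set_access_flags(vma, addr, pte, entry, 1))
                                update_mmu_cache(vma, addr, pte);
                }
                goto out_unlock;
        }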

    Link: http://lkml.kernel.org/r/20190311084537.16029-1-jack@suse.cz
    Fixes: b2770da64254 "mm: add vm_insert_mixed_mkwrite()"
    Signed-off-by: Jan Kara
    Reported-by: "Aneesh Kumar K.V"
    Reviewed-by: Aneesh Kumar K.V
    Acked-by: Dan Williams
    Cc: Chandan Rajendra
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

06 Mar, 2019

9 commits

  • LTP testcase mtest06 [1] can trigger a crash on s390x running 5.0.0-rc8.
    This is a stress test, where one thread mmaps/writes/munmaps a memory
    area and another thread is trying to read from it:

    CPU: 0 PID: 2611 Comm: mmap1 Not tainted 5.0.0-rc8+ #51
    Hardware name: IBM 2964 N63 400 (z/VM 6.4.0)
    Krnl PSW : 0404e00180000000 00000000001ac8d8 (__lock_acquire+0x7/0x7a8)
    Call Trace:
    ([] (null))
    [] lock_acquire+0xec/0x258
    [] _raw_spin_lock_bh+0x5c/0x98
    [] page_table_free+0x48/0x1a8
    [] do_fault+0xdc/0x670
    [] __handle_mm_fault+0x416/0x5f0
    [] handle_mm_fault+0x1b0/0x320
    [] do_dat_exception+0x19c/0x2c8
    [] pgm_check_handler+0x19e/0x200

    page_table_free() is called with a NULL mm parameter, but because "0" is
    a valid address on s390 (see S390_lowcore), it keeps going until it
    eventually crashes in lockdep's lock_acquire. This crash is
    reproducible at least since 4.14.

    The problem is that "vmf->vma" used in do_fault() can become stale.
    Because mmap_sem may be released, other threads can come in, call
    munmap() and cause "vma" to be returned to the kmem cache, then get
    zeroed/re-initialized and re-used:

    handle_mm_fault                        |
      __handle_mm_fault                    |
        do_fault                           |
          vma = vmf->vma                   |
          do_read_fault                    |
            __do_fault                     |
              vma->vm_ops->fault(vmf);     |
                mmap_sem is released       |
                                           |
                                           | do_munmap()
                                           |   remove_vma_list()
                                           |     remove_vma()
                                           |       vm_area_free()
                                           |         # vma is released
                                           | ...
                                           | # same vma is allocated
                                           | # from kmem cache
                                           | do_mmap()
                                           |   vm_area_alloc()
                                           |     memset(vma, 0, ...)
                                           |
          pte_free(vma->vm_mm, ...);       |
            page_table_free                |
              spin_lock_bh(&mm->context.lock); |
                                           |

    Cache mm_struct to avoid using potentially stale "vma".

    [1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c
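
    The fix is essentially a one-liner in do_fault() (sketch):

        struct vm_area_struct *vma = vmf->vma;
        struct mm_struct *vm_mm = vma->vm_mm;   /* cached while vma is known valid */

        /* ... ->fault() may drop mmap_sem, after which vma may be stale ... */

        pte_free(vm_mm, vmf->prealloc_pte);     /* was: pte_free(vma->vm_mm, ...) */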

    Link: http://lkml.kernel.org/r/5b3fdf19e2a5be460a384b936f5b56e13733f1b8.1551595137.git.jstancek@redhat.com
    Signed-off-by: Jan Stancek
    Reviewed-by: Andrea Arcangeli
    Reviewed-by: Matthew Wilcox
    Acked-by: Rafael Aquini
    Reviewed-by: Minchan Kim
    Acked-by: Kirill A. Shutemov
    Cc: Rik van Riel
    Cc: Michal Hocko
    Cc: Huang Ying
    Cc: Souptick Joarder
    Cc: Jerome Glisse
    Cc: Aneesh Kumar K.V
    Cc: David Hildenbrand
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Stancek
     
  • Many kernel-doc comments in mm/ have the return value descriptions
    either misformatted or omitted altogether, which makes the kernel-doc
    script unhappy:

    $ make V=1 htmldocs
    ...
    ./mm/util.c:36: info: Scanning doc for kstrdup
    ./mm/util.c:41: warning: No description found for return value of 'kstrdup'
    ./mm/util.c:57: info: Scanning doc for kstrdup_const
    ./mm/util.c:66: warning: No description found for return value of 'kstrdup_const'
    ./mm/util.c:75: info: Scanning doc for kstrndup
    ./mm/util.c:83: warning: No description found for return value of 'kstrndup'
    ...

    Fixing the formatting and adding the missing return value descriptions
    eliminates ~100 such warnings.

    Link: http://lkml.kernel.org/r/1549549644-4903-4-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Architectures like ppc64 need to do a conditional TLB flush based on
    the old and new values of the pte. Enable that by passing the old pte
    value as an argument.
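
    Callers then follow roughly this pattern (sketch, e.g. from the mprotect
    path; both helpers already take the vma after an earlier patch in this
    series):

        oldpte = ptep_modify_prot_start(vma, addr, pte);
        ptent = pte_modify(oldpte, newprot);

        /* The old value is handed back so the arch can decide whether to flush. */
        ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);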

    Link: http://lkml.kernel.org/r/20190116085035.29729-3-aneesh.kumar@linux.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: Nicholas Piggin
    Cc: Paul Mackerras
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • Patch series "NestMMU pte upgrade workaround for mprotect", v5.

    We can upgrade pte access (R -> RW transition) via mprotect. We need to
    make sure we follow the recommended pte update sequence as outlined in
    commit bd5050e38aec ("powerpc/mm/radix: Change pte relax sequence to
    handle nest MMU hang") for such updates. This patch series does that.

    This patch (of 5):

    Some architectures may want to call flush_tlb_range from these helpers.

    Link: http://lkml.kernel.org/r/20190116085035.29729-2-aneesh.kumar@linux.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Cc: Nicholas Piggin
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • When calling debugfs functions, there is no need to ever check the
    return value. The function can work or not, but the code logic should
    never do something different based on this.

    Link: http://lkml.kernel.org/r/20190122152151.16139-14-gregkh@linuxfoundation.org
    Signed-off-by: Greg Kroah-Hartman
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Laura Abbott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Kroah-Hartman
     
  • Pages which use page_type must never be mapped to userspace as it would
    destroy their page type. Add an explicit check for this instead of
    assuming that kernel drivers always get this right.
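
    The resulting check in insert_page() is roughly as follows (the
    PageAnon() part is pre-existing; a companion patch adds a similar
    PageSlab() check):

        /* Refuse pages whose type field would be clobbered by a mapcount. */
        if (PageAnon(page) || page_has_type(page))
                return -EINVAL;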

    Link: http://lkml.kernel.org/r/20190129053830.3749-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Reviewed-by: Kees Cook
    Reviewed-by: David Hildenbrand
    Cc: Michael Ellerman
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • It's never appropriate to map a page allocated by SLAB into userspace.
    A buggy device driver might try this, or an attacker might be able to
    find a way to make it happen.

    Christoph said:

    : Let's just fail the code. Currently this may work with SLUB. But SLAB
    : and SLOB overlay fields with mapcount. So you would have a corrupted page
    : struct if you mapped a slab page to user space.

    Link: http://lkml.kernel.org/r/20190125173827.2658-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Reviewed-by: Kees Cook
    Acked-by: Pekka Enberg
    Cc: Rik van Riel
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Add an optimization for KSM pages almost in the same way that we have
    for ordinary anonymous pages. If there is a write fault in a page which
    is mapped by only one pte and is not related to the swap cache, the
    page may be reused without copying its content.

    [ Note that we do not consider PageSwapCache() pages, at least for now,
    since we don't want to complicate __get_ksm_page(), which has a nice
    optimization based on this (for the migration case). Currently it
    spins on PageSwapCache() pages, waiting for their counters to be
    unfrozen (i.e., for the migration to finish). But we don't want
    to make it also spin on swap cache pages which we try to reuse,
    since there is not a very high probability of reusing them. So, for now,
    we do not consider PageSwapCache() pages at all. ]

    So in reuse_ksm_page() we check for 1) PageSwapCache() and 2)
    page_stable_node(), to skip a page which KSM is currently trying to
    link to the stable tree. Then we do page_ref_freeze() to prohibit KSM
    from merging one more page into the page we are reusing. After that,
    nobody can refer to the page being reused: KSM skips !PageSwapCache()
    pages with zero refcount; and the protection against all other
    participants is the same as for reused ordinary anon pages: pte lock,
    page lock and mmap_sem.

    [akpm@linux-foundation.org: replace BUG_ON()s with WARN_ON()s]
    Link: http://lkml.kernel.org/r/154471491016.31352.1168978849911555609.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Reviewed-by: Yang Shi
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: Christian Koenig
    Cc: Claudio Imbrenda
    Cc: Rik van Riel
    Cc: Huang Ying
    Cc: Minchan Kim
    Cc: Kirill Tkhai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • Patch series "Replace all open encodings for NUMA_NO_NODE", v3.

    All these places for replacement were found by running the following
    grep patterns on the entire kernel code. Please let me know if this
    might have missed some instances. This might also have replaced some
    false positives. I will appreciate suggestions, inputs and review.

    1. git grep "nid == -1"
    2. git grep "node == -1"
    3. git grep "nid = -1"
    4. git grep "node = -1"

    This patch (of 2):

    At present there are multiple places where an invalid node number is
    encoded as -1. Even though implicitly understood, it is always better to
    have macros for it. Replace these open encodings of an invalid node
    number with the global macro NUMA_NO_NODE. This helps remove NUMA-related
    assumptions like 'invalid node' from various places, redirecting
    them to a common definition.
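
    The shape of the replacement, for illustration (NUMA_NO_NODE is defined
    as -1 in include/linux/numa.h; the numa_node_id() fallback is just an
    example):

        int nid = NUMA_NO_NODE;         /* was: int nid = -1; */

        if (nid == NUMA_NO_NODE)        /* was: if (nid == -1) */
                nid = numa_node_id();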

    Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Anshuman Khandual
    Reviewed-by: David Hildenbrand
    Acked-by: Jeff Kirsher [ixgbe]
    Acked-by: Jens Axboe [mtip32xx]
    Acked-by: Vinod Koul [dmaengine.c]
    Acked-by: Michael Ellerman [powerpc]
    Acked-by: Doug Ledford [drivers/infiniband]
    Cc: Joseph Qi
    Cc: Hans Verkuil
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     

09 Jan, 2019

2 commits

  • One of the paths in follow_pte_pmd() initialised the mmu_notifier_range
    incorrectly.

    Link: http://lkml.kernel.org/r/20190103002126.GM6310@bombadil.infradead.org
    Fixes: ac46d4f3c432 ("mm/mmu_notifier: use structure for invalidate_range_start/end calls v2")
    Signed-off-by: Matthew Wilcox
    Tested-by: Dave Chinner
    Reviewed-by: Jérôme Glisse
    Cc: John Hubbard
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Liu Bo has experienced a deadlock between memcg (legacy) reclaim and
    ext4 writeback:

    task1:
    wait_on_page_bit+0x82/0xa0
    shrink_page_list+0x907/0x960
    shrink_inactive_list+0x2c7/0x680
    shrink_node_memcg+0x404/0x830
    shrink_node+0xd8/0x300
    do_try_to_free_pages+0x10d/0x330
    try_to_free_mem_cgroup_pages+0xd5/0x1b0
    try_charge+0x14d/0x720
    memcg_kmem_charge_memcg+0x3c/0xa0
    memcg_kmem_charge+0x7e/0xd0
    __alloc_pages_nodemask+0x178/0x260
    alloc_pages_current+0x95/0x140
    pte_alloc_one+0x17/0x40
    __pte_alloc+0x1e/0x110
    alloc_set_pte+0x5fe/0xc20
    do_fault+0x103/0x970
    handle_mm_fault+0x61e/0xd10
    __do_page_fault+0x252/0x4d0
    do_page_fault+0x30/0x80
    page_fault+0x28/0x30

    task2:
    __lock_page+0x86/0xa0
    mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
    ext4_writepages+0x479/0xd60
    do_writepages+0x1e/0x30
    __writeback_single_inode+0x45/0x320
    writeback_sb_inodes+0x272/0x600
    __writeback_inodes_wb+0x92/0xc0
    wb_writeback+0x268/0x300
    wb_workfn+0xb4/0x390
    process_one_work+0x189/0x420
    worker_thread+0x4e/0x4b0
    kthread+0xe6/0x100
    ret_from_fork+0x41/0x50

    He adds
    "task1 is waiting for the PageWriteback bit of the page that task2 has
    collected in mpd->io_submit->io_bio, and tasks2 is waiting for the
    LOCKED bit the page which tasks1 has locked"

    More precisely task1 is handling a page fault and it has a page locked
    while it charges a new page table to a memcg. That in turn hits a
    memory limit reclaim and the memcg reclaim for legacy controller is
    waiting on the writeback but that is never going to finish because the
    writeback itself is waiting for the page locked in the #PF path. So
    this is essentially ABBA deadlock:

    lock_page(A)
    SetPageWriteback(A)
    unlock_page(A)
                                      lock_page(B)
    lock_page(B)
                                      pte_alloc_pne
                                        shrink_page_list
                                          wait_on_page_writeback(A)
    SetPageWriteback(B)
    unlock_page(B)

    # flush A, B to clear the writeback

    This accumulating of more pages to flush is used by several filesystems
    to generate more optimal IO patterns.

    Waiting for the writeback in the legacy memcg controller is a workaround
    for premature OOM killer invocations because there is no dirty IO
    throttling available for the controller. There is no easy way around
    that, unfortunately. Therefore fix this specific issue by pre-allocating
    the page table outside of the page lock. We have handy infrastructure
    for that already, so simply reuse the fault-around pattern, which already
    does this.
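
    A sketch of the pre-allocation in __do_fault() (the exact pte_alloc_one()
    signature depends on the kernel version, since its unused address
    argument was dropped around the same time):

        if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
                vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm);
                if (!vmf->prealloc_pte)
                        return VM_FAULT_OOM;
                smp_wmb(); /* See comment in __pte_alloc() */
        }

        ret = vma->vm_ops->fault(vmf);  /* may lock a page; pte already preallocated */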

    There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
    from under a locked fs page, but they should be really rare. I am not
    aware of a better solution, unfortunately.

    [akpm@linux-foundation.org: fix mm/memory.c:__do_fault()]
    [akpm@linux-foundation.org: coding-style fixes]
    [mhocko@kernel.org: enhance comment, per Johannes]
    Link: http://lkml.kernel.org/r/20181214084948.GA5624@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20181213092221.27270-1-mhocko@kernel.org
    Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
    Signed-off-by: Michal Hocko
    Reported-by: Liu Bo
    Debugged-by: Liu Bo
    Acked-by: Kirill A. Shutemov
    Acked-by: Johannes Weiner
    Reviewed-by: Liu Bo
    Cc: Jan Kara
    Cc: Dave Chinner
    Cc: Theodore Ts'o
    Cc: Vladimir Davydov
    Cc: Shakeel Butt
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

05 Jan, 2019

1 commit

  • Patch series "Add support for fast mremap".

    This series speeds up the mremap(2) syscall by copying page tables at
    the PMD level even for non-THP systems. There is a concern that the extra
    'address' argument that mremap passes to pte_alloc may do something
    subtly architecture-related in the future that may make the scheme not
    work. Also, we find that there is no point in passing the 'address' to
    pte_alloc since it is unused. This patch therefore removes this argument
    tree-wide, resulting in a nice negative diff as well. It also ensures
    along the way that the enabled architectures do not do anything funky
    with the 'address' argument that goes unnoticed by the optimization.

    Build and boot tested on x86-64. Build tested on arm64. The config
    enablement patch for arm64 will be posted in the future after more
    testing.

    The changes were obtained by applying the following Coccinelle script.
    (thanks Julia for answering all Coccinelle questions!).
    Following fix ups were done manually:
    * Removal of address argument from pte_fragment_alloc
    * Removal of pte_alloc_one_fast definitions from m68k and microblaze.

    // Options: --include-headers --no-includes
    // Note: I split the 'identifier fn' line, so if you are manually
    // running it, please unsplit it so it runs for you.

    virtual patch

    @pte_alloc_func_def depends on patch exists@
    identifier E2;
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    type T2;
    @@

    fn(...
    - , T2 E2
    )
    { ... }

    @pte_alloc_func_proto_noarg depends on patch exists@
    type T1, T2, T3, T4;
    identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    @@

    (
    - T3 fn(T1, T2);
    + T3 fn(T1);
    |
    - T3 fn(T1, T2, T4);
    + T3 fn(T1, T2);
    )

    @pte_alloc_func_proto depends on patch exists@
    identifier E1, E2, E4;
    type T1, T2, T3, T4;
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    @@

    (
    - T3 fn(T1 E1, T2 E2);
    + T3 fn(T1 E1);
    |
    - T3 fn(T1 E1, T2 E2, T4 E4);
    + T3 fn(T1 E1, T2 E2);
    )

    @pte_alloc_func_call depends on patch exists@
    expression E2;
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    @@

    fn(...
    -, E2
    )

    @pte_alloc_macro depends on patch exists@
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    identifier a, b, c;
    expression e;
    position p;
    @@

    (
    - #define fn(a, b, c) e
    + #define fn(a, b) e
    |
    - #define fn(a, b) e
    + #define fn(a) e
    )

    Link: http://lkml.kernel.org/r/20181108181201.88826-2-joelaf@google.com
    Signed-off-by: Joel Fernandes (Google)
    Suggested-by: Kirill A. Shutemov
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Julia Lawall
    Cc: Kirill A. Shutemov
    Cc: William Kucharski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joel Fernandes (Google)
     

29 Dec, 2018

2 commits

  • Userspace falls short when trying to find out whether a specific memory
    range is eligible for THP. There are usecases that would like to know
    that
    http://lkml.kernel.org/r/alpine.DEB.2.21.1809251248450.50347@chino.kir.corp.google.com
    : This is used to identify heap mappings that should be able to fault thp
    : but do not, and they normally point to a low-on-memory or fragmentation
    : issue.

    The only way to deduce this now is to query for the hg resp. nh flags and
    confront the state with the global setting. Except that there is also
    PR_SET_THP_DISABLE that might change the picture. So the final logic is
    not trivial. Moreover, the eligibility of the vma depends on the type of
    VMA as well. In the past we have supported only anonymous memory VMAs,
    but things have changed and shmem-based vmas are supported as well these
    days, and the query logic gets even more complicated because the
    eligibility depends on the mount option and another global configuration
    knob.

    Simplify the current state and report the THP eligibility in
    /proc/<pid>/smaps for each existing vma. Reuse
    transparent_hugepage_enabled for this purpose. The original
    implementation of this function assumes that the caller knows that the
    vma itself is supported for THP, so make the core checks into
    __transparent_hugepage_enabled and use it for existing callers.
    __show_smap just uses the new transparent_hugepage_enabled, which also
    checks the vma support status (please note that this one has to be out of
    line due to include dependency issues).

    [mhocko@kernel.org: fix oops with NULL ->f_mapping]
    Link: http://lkml.kernel.org/r/20181224185106.GC16738@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20181211143641.3503-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Dan Williams
    Cc: David Rientjes
    Cc: Jan Kara
    Cc: Mike Rapoport
    Cc: Paul Oppenheimer
    Cc: William Kucharski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • To avoid having to change many call sites every time we want to add a
    parameter, use a structure to group all parameters for the mmu_notifier
    invalidate_range_start/end calls. No functional changes with this patch.

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Acked-by: Christian König
    Acked-by: Jan Kara
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Felix Kuehling
    Cc: Ralph Campbell
    Cc: John Hubbard
    From: Jérôme Glisse
    Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3

    fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n

    Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

31 Oct, 2018

1 commit

  • In DAX mode a write pagefault can race with write(2) in the following
    way:

    CPU0                                      CPU1
                                              write fault for mapped zero page (hole)
    dax_iomap_rw()
      iomap_apply()
        xfs_file_iomap_begin()
          - allocates blocks
        dax_iomap_actor()
          invalidate_inode_pages2_range()
            - invalidates radix tree entries in given range
                                              dax_iomap_pte_fault()
                                                grab_mapping_entry()
                                                  - no entry found, creates empty
                                                ...
                                                xfs_file_iomap_begin()
                                                  - finds already allocated block
                                                ...
                                                vmf_insert_mixed_mkwrite()
                                                  - WARNs and does nothing because
                                                    there is still zero page mapped
                                                    in PTE
            unmap_mapping_pages()

    This race results in a WARN_ON from insert_pfn() and is occasionally
    triggered by fstest generic/344. Note that the race is otherwise
    harmless: before write(2) on CPU0 is finished, we will invalidate the
    page tables properly and thus the user of mmap will see modified data
    from write(2) from that point on. So just restrict the warning to the
    case when the PFN in the PTE is not the zero page.
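
    Condensed, the resulting check in insert_pfn()'s mkwrite path is roughly:

        if (pte_pfn(*pte) != pfn_t_to_pfn(pfn)) {
                /* Only warn if the stale entry is something other than the zero page. */
                WARN_ON_ONCE(!is_zero_pfn(pte_pfn(*pte)));
                goto out_unlock;
        }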

    Link: http://lkml.kernel.org/r/20180824154542.26872-1-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Andrew Morton
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Dave Jiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara