24 Mar, 2019

1 commit

  • commit fc8efd2ddfed3f343c11b693e87140ff358d7ff5 upstream.

    LTP testcase mtest06 [1] can trigger a crash on s390x running 5.0.0-rc8.
    This is a stress test, where one thread mmaps/writes/munmaps a memory
    area while another thread tries to read from it:

    CPU: 0 PID: 2611 Comm: mmap1 Not tainted 5.0.0-rc8+ #51
    Hardware name: IBM 2964 N63 400 (z/VM 6.4.0)
    Krnl PSW : 0404e00180000000 00000000001ac8d8 (__lock_acquire+0x7/0x7a8)
    Call Trace:
     lock_acquire+0xec/0x258
     _raw_spin_lock_bh+0x5c/0x98
     page_table_free+0x48/0x1a8
     do_fault+0xdc/0x670
     __handle_mm_fault+0x416/0x5f0
     handle_mm_fault+0x1b0/0x320
     do_dat_exception+0x19c/0x2c8
     pgm_check_handler+0x19e/0x200

    page_table_free() is called with NULL mm parameter, but because "0" is a
    valid address on s390 (see S390_lowcore), it keeps going until it
    eventually crashes in lockdep's lock_acquire. This crash is
    reproducible at least since 4.14.

    The problem is that "vmf->vma" used in do_fault() can become stale.
    Because mmap_sem may be released, other threads can come in, call
    munmap() and cause "vma" to be returned to the kmem cache, where it gets
    zeroed/re-initialized and re-used:

    handle_mm_fault                   |
      __handle_mm_fault               |
        do_fault                      |
          vma = vmf->vma              |
          do_read_fault               |
            __do_fault                |
              vma->vm_ops->fault(vmf);|
              mmap_sem is released    |
                                      |
                                      | do_munmap()
                                      |   remove_vma_list()
                                      |     remove_vma()
                                      |       vm_area_free()
                                      |         # vma is released
                                      | ...
                                      | # same vma is allocated
                                      | # from kmem cache
                                      | do_mmap()
                                      |   vm_area_alloc()
                                      |     memset(vma, 0, ...)
                                      |
          pte_free(vma->vm_mm, ...);  |
            page_table_free           |
              spin_lock_bh(&mm->context.lock);
                                      |

    Cache mm_struct to avoid using potentially stale "vma".

    [1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c
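
    A minimal sketch of the fix pattern described above (hedged; the real
    do_fault() in mm/memory.c dispatches to do_read_fault()/do_cow_fault()/
    do_shared_fault(), which is elided here): snapshot vma->vm_mm before the
    fault handler can drop mmap_sem, and only use the cached pointer
    afterwards.

        static vm_fault_t do_fault(struct vm_fault *vmf)
        {
                struct vm_area_struct *vma = vmf->vma;
                struct mm_struct *vm_mm = vma->vm_mm; /* cached before mmap_sem may be dropped */
                vm_fault_t ret;

                /* dispatch (read fault shown as the example path) */
                ret = do_read_fault(vmf);

                /* The preallocated page table was not used: free it. */
                if (vmf->prealloc_pte) {
                        pte_free(vm_mm, vmf->prealloc_pte); /* not vma->vm_mm: vma may be stale */
                        vmf->prealloc_pte = NULL;
                }
                return ret;
        }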

    Link: http://lkml.kernel.org/r/5b3fdf19e2a5be460a384b936f5b56e13733f1b8.1551595137.git.jstancek@redhat.com
    Signed-off-by: Jan Stancek
    Reviewed-by: Andrea Arcangeli
    Reviewed-by: Matthew Wilcox
    Acked-by: Rafael Aquini
    Reviewed-by: Minchan Kim
    Acked-by: Kirill A. Shutemov
    Cc: Rik van Riel
    Cc: Michal Hocko
    Cc: Huang Ying
    Cc: Souptick Joarder
    Cc: Jerome Glisse
    Cc: Aneesh Kumar K.V
    Cc: David Hildenbrand
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jan Stancek
     

17 Jan, 2019

1 commit

  • commit 63f3655f950186752236bb88a22f8252c11ce394 upstream.

    Liu Bo has experienced a deadlock between memcg (legacy) reclaim and
    ext4 writeback:

    task1:
    wait_on_page_bit+0x82/0xa0
    shrink_page_list+0x907/0x960
    shrink_inactive_list+0x2c7/0x680
    shrink_node_memcg+0x404/0x830
    shrink_node+0xd8/0x300
    do_try_to_free_pages+0x10d/0x330
    try_to_free_mem_cgroup_pages+0xd5/0x1b0
    try_charge+0x14d/0x720
    memcg_kmem_charge_memcg+0x3c/0xa0
    memcg_kmem_charge+0x7e/0xd0
    __alloc_pages_nodemask+0x178/0x260
    alloc_pages_current+0x95/0x140
    pte_alloc_one+0x17/0x40
    __pte_alloc+0x1e/0x110
    alloc_set_pte+0x5fe/0xc20
    do_fault+0x103/0x970
    handle_mm_fault+0x61e/0xd10
    __do_page_fault+0x252/0x4d0
    do_page_fault+0x30/0x80
    page_fault+0x28/0x30

    task2:
    __lock_page+0x86/0xa0
    mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
    ext4_writepages+0x479/0xd60
    do_writepages+0x1e/0x30
    __writeback_single_inode+0x45/0x320
    writeback_sb_inodes+0x272/0x600
    __writeback_inodes_wb+0x92/0xc0
    wb_writeback+0x268/0x300
    wb_workfn+0xb4/0x390
    process_one_work+0x189/0x420
    worker_thread+0x4e/0x4b0
    kthread+0xe6/0x100
    ret_from_fork+0x41/0x50

    He adds:
    "task1 is waiting for the PageWriteback bit of the page that task2 has
    collected in mpd->io_submit->io_bio, and task2 is waiting for the
    LOCKED bit of the page which task1 has locked"

    More precisely, task1 is handling a page fault and has a page locked
    while it charges a new page table to a memcg. That in turn hits a
    memory limit reclaim, and the memcg reclaim for the legacy controller
    waits on the writeback, but that is never going to finish because the
    writeback itself is waiting for the page locked in the #PF path. So
    this is essentially an ABBA deadlock:

    lock_page(A)
    SetPageWriteback(A)
    unlock_page(A)
                              lock_page(B)
    lock_page(B)
    pte_alloc_pne
      shrink_page_list
        wait_on_page_writeback(A)
                              SetPageWriteback(B)
                              unlock_page(B)

                              # flush A, B to clear the writeback

    This accumulation of more pages to flush is used by several filesystems
    to generate more optimal IO patterns.

    Waiting for the writeback in the legacy memcg controller is a workaround
    for premature OOM killer invocations because there is no dirty IO
    throttling available for the controller. There is no easy way around
    that unfortunately. Therefore fix this specific issue by pre-allocating
    the page table outside of the page lock. We already have handy
    infrastructure for that, so simply reuse the fault-around pattern, which
    already does this.
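
    A hedged sketch of the resulting shape of __do_fault() (names and
    placement follow mm/memory.c of that era; error handling elided): the
    page table is preallocated before ->fault() takes the page lock, so the
    __GFP_ACCOUNT allocation can no longer wait on writeback while a locked
    page blocks the flusher.

        static vm_fault_t __do_fault(struct vm_fault *vmf)
        {
                struct vm_area_struct *vma = vmf->vma;
                vm_fault_t ret;

                /*
                 * Preallocate the pte before taking the page lock inside
                 * ->fault(), so any memcg reclaim triggered by this charged
                 * allocation cannot wait for writeback we would deadlock on.
                 */
                if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
                        vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm,
                                                          vmf->address);
                        if (!vmf->prealloc_pte)
                                return VM_FAULT_OOM;
                        smp_wmb(); /* See comment in __pte_alloc() */
                }

                ret = vma->vm_ops->fault(vmf);
                /* ... retry/locked-page checks elided ... */
                return ret;
        }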

    There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
    made from under a locked fs page, but they should be really rare. I am
    not aware of a better solution unfortunately.

    [akpm@linux-foundation.org: fix mm/memory.c:__do_fault()]
    [akpm@linux-foundation.org: coding-style fixes]
    [mhocko@kernel.org: enhance comment, per Johannes]
    Link: http://lkml.kernel.org/r/20181214084948.GA5624@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20181213092221.27270-1-mhocko@kernel.org
    Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
    Signed-off-by: Michal Hocko
    Reported-by: Liu Bo
    Debugged-by: Liu Bo
    Acked-by: Kirill A. Shutemov
    Acked-by: Johannes Weiner
    Reviewed-by: Liu Bo
    Cc: Jan Kara
    Cc: Dave Chinner
    Cc: Theodore Ts'o
    Cc: Vladimir Davydov
    Cc: Shakeel Butt
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

01 Dec, 2018

1 commit

  • commit ff09d7ec9786be4ad7589aa987d7dc66e2dd9160 upstream.

    We clear the pte temporarily during read/modify/write update of the pte.
    If we take a page fault while the pte is cleared, the application can get
    SIGBUS. One such case is with remap_pfn_range without a backing
    vm_ops->fault callback. do_fault will return SIGBUS in that case.

    cpu 0                                    cpu 1
    mprotect()
    ptep_modify_prot_start()/pte cleared.
    .
    .                                        page fault.
    .
    .
    ptep_modify_prot_commit()

    Fix this by taking page table lock and rechecking for pte_none.
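
    A hedged sketch of the recheck, as done in do_fault() for a VMA without
    a ->fault handler (trimmed from the upstream change):

        if (!vma->vm_ops->fault) {
                if (unlikely(!pmd_present(*vmf->pmd))) {
                        ret = VM_FAULT_SIGBUS;
                } else {
                        vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm,
                                                       vmf->pmd, vmf->address,
                                                       &vmf->ptl);
                        /*
                         * Recheck under the ptl: a pte_none seen without the
                         * lock may just be the transient clear done by a
                         * concurrent R/M/W update such as mprotect's
                         * ptep_modify_prot_start().
                         */
                        if (unlikely(pte_none(*vmf->pte)))
                                ret = VM_FAULT_SIGBUS;
                        else
                                ret = VM_FAULT_NOPAGE;
                        pte_unmap_unlock(vmf->pte, vmf->ptl);
                }
        }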

    [aneesh.kumar@linux.ibm.com: fix crash observed with syzkaller run]
    Link: http://lkml.kernel.org/r/87va6bwlfg.fsf@linux.ibm.com
    Link: http://lkml.kernel.org/r/20180926031858.9692-1-aneesh.kumar@linux.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Acked-by: Kirill A. Shutemov
    Cc: Willem de Bruijn
    Cc: Eric Dumazet
    Cc: Ido Schimmel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Aneesh Kumar K.V
     

26 Aug, 2018

1 commit

  • This is not normally noticeable, but repeated forks are unnecessarily
    expensive because they repeatedly dirty the parent page tables during
    the page table copy operation.

    It's trivial to just avoid write protecting the page table entry if it
    was already not writable.
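
    A sketch of the change in the fork-time PTE copy (copy_one_pte() in
    mm/memory.c at the time; the added pte_write() test is the whole fix):

        /*
         * If it's a COW mapping, write protect it both in the parent and
         * the child -- but only if the source pte is actually writable,
         * so repeated forks don't keep re-dirtying already write-protected
         * parent page tables.
         */
        if (is_cow_mapping(vm_flags) && pte_write(pte)) {
                ptep_set_wrprotect(src_mm, addr, src_pte);
                pte = pte_wrprotect(pte);
        }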

    This patch was inspired by

    https://bugzilla.kernel.org/show_bug.cgi?id=200447

    which points to an ancient "waste time re-doing fork" issue in the
    presence of lots of signals.

    That bug was fixed by Eric Biederman's signal handling series
    culminating in commit c3ad2c3b02e9 ("signal: Don't restart fork when
    signals come in"), but the unnecessary work for repeated forks is still
    worth just fixing, particularly since the fix is trivial.

    Cc: Eric Biederman
    Cc: Oleg Nesterov
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

24 Aug, 2018

6 commits

  • Merge yet more updates from Andrew Morton:

    - the rest of MM

    - various misc fixes and tweaks

    * emailed patches from Andrew Morton : (22 commits)
    mm: Change return type int to vm_fault_t for fault handlers
    lib/fonts: convert comments to utf-8
    s390: ebcdic: convert comments to UTF-8
    treewide: convert ISO_8859-1 text comments to utf-8
    drivers/gpu/drm/gma500/: change return type to vm_fault_t
    docs/core-api: mm-api: add section about GFP flags
    docs/mm: make GFP flags descriptions usable as kernel-doc
    docs/core-api: split memory management API to a separate file
    docs/core-api: move *{str,mem}dup* to "String Manipulation"
    docs/core-api: kill trailing whitespace in kernel-api.rst
    mm/util: add kernel-doc for kvfree
    mm/util: make strndup_user description a kernel-doc comment
    fs/proc/vmcore.c: hide vmcoredd_mmap_dumps() for nommu builds
    treewide: correct "differenciate" and "instanciate" typos
    fs/afs: use new return type vm_fault_t
    drivers/hwtracing/intel_th/msu.c: change return type to vm_fault_t
    mm: soft-offline: close the race against page allocation
    mm: fix race on soft-offlining free huge pages
    namei: allow restricted O_CREAT of FIFOs and regular files
    hfs: prevent crash on exit from failed search
    ...

    Linus Torvalds
     
    Use the new return type vm_fault_t for fault handlers. For now, this is
    just documenting that the function returns a VM_FAULT value rather than
    an errno. Once all instances are converted, vm_fault_t will become a
    distinct type.

    Ref: commit 1c8f422059ae ("mm: change return type to vm_fault_t")

    The aim is to change the return type of finish_fault() and
    handle_mm_fault() to vm_fault_t. As part of that cleanup, the return
    types of all other recursively called functions have been changed to
    vm_fault_t as well.

    The places from which handle_mm_fault() is invoked will be changed to
    vm_fault_t in a separate patch.

    vmf_error() is a newly introduced inline function in 4.17-rc6.
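
    For reference, a sketch of that helper as it appears in
    include/linux/mm.h (reproduced from memory, so treat it as hedged):

        static inline vm_fault_t vmf_error(int err)
        {
                if (err == -ENOMEM)
                        return VM_FAULT_OOM;
                return VM_FAULT_SIGBUS;
        }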

    [akpm@linux-foundation.org: don't shadow outer local `ret' in __do_huge_pmd_anonymous_page()]
    Link: http://lkml.kernel.org/r/20180604171727.GA20279@jordon-HP-15-Notebook-PC
    Signed-off-by: Souptick Joarder
    Reviewed-by: Matthew Wilcox
    Reviewed-by: Andrew Morton
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Souptick Joarder
     
    The generic tlb_end_vma does not call the invalidate_range mmu notifier,
    and it resets the mmu_gather range, which means the notifier won't be
    called on part of the range in case of an unmap that spans multiple
    vmas.

    ARM64 seems to be the only arch I could see that has notifiers and uses
    the generic tlb_end_vma. I have not actually tested it.

    [ Catalin and Will point out that ARM64 currently only uses the
    notifiers for KVM, which doesn't use the ->invalidate_range()
    callback right now, so it's a bug, but one that happens to
    not affect them. So not necessary for stable. - Linus ]
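
    A hedged sketch of the resulting generic code: the mmu notifier
    invalidate is tied to the TLB flush itself rather than to tlb_end_vma(),
    so the whole gathered range is notified even when an unmap spans
    multiple vmas.

        static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
        {
                if (!tlb->end)
                        return;

                tlb_flush(tlb);
                /* Notify on the full gathered range, not per-vma pieces. */
                mmu_notifier_invalidate_range(tlb->mm, tlb->start, tlb->end);
                __tlb_reset_range(tlb);
        }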

    Signed-off-by: Nicholas Piggin
    Acked-by: Catalin Marinas
    Acked-by: Will Deacon
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     
  • Jann reported that x86 was missing required TLB invalidates when he
    hit the !*batch slow path in tlb_remove_table().

    This is indeed the case; RCU_TABLE_FREE does not provide TLB (cache)
    invalidates, the PowerPC-hash where this code originated and the
    Sparc-hash where this was subsequently used did not need that. ARM
    which later used this put an explicit TLB invalidate in their
    __p*_free_tlb() functions, and PowerPC-radix followed that example.

    But when we hooked up x86 we failed to consider this. Fix this by
    (optionally) hooking tlb_remove_table() into the TLB invalidate code.

    NOTE: s390 was also needing something like this and might now
    be able to use the generic code again.

    [ Modified to be on top of Nick's cleanups, which simplified this patch
    now that tlb_flush_mmu_tlbonly() really only flushes the TLB - Linus ]
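
    A hedged sketch of the hook-up (close to the upstream shape, details
    trimmed): a CONFIG_HAVE_RCU_TABLE_INVALIDATE helper forces a TLB flush
    in the paths that hand page-table pages back, including the !*batch
    slow path that was missing it.

        static inline void tlb_table_invalidate(struct mmu_gather *tlb)
        {
        #ifdef CONFIG_HAVE_RCU_TABLE_INVALIDATE
                /*
                 * Invalidate page-table caches used by hardware walkers;
                 * the RCU-sched wait is still needed for software walkers.
                 */
                tlb_flush_mmu_tlbonly(tlb);
        #endif
        }

        void tlb_remove_table(struct mmu_gather *tlb, void *table)
        {
                struct mmu_table_batch **batch = &tlb->batch;

                if (*batch == NULL) {
                        *batch = (struct mmu_table_batch *)
                                __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
                        if (*batch == NULL) {
                                tlb_table_invalidate(tlb);  /* previously missing */
                                tlb_remove_table_one(table);
                                return;
                        }
                        (*batch)->nr = 0;
                }
                (*batch)->tables[(*batch)->nr++] = table;
                if ((*batch)->nr == MAX_TABLE_BATCH)
                        tlb_table_flush(tlb);
        }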

    Fixes: 9e52fc2b50de ("x86/mm: Enable RCU based page table freeing (CONFIG_HAVE_RCU_TABLE_FREE=y)")
    Reported-by: Jann Horn
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Rik van Riel
    Cc: Nicholas Piggin
    Cc: David Miller
    Cc: Will Deacon
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Will noted that only checking mm_users is incorrect; we should also
    check mm_count in order to cover CPUs that have a lazy reference to
    this mm (and could do speculative TLB operations).

    If removing this turns out to be a performance issue, we can
    re-instate a more complete check, but in tlb_table_flush() eliding the
    call_rcu_sched().
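
    For context, a hedged sketch of the shortcut being removed from
    tlb_remove_table():

        /*
         * Removed: checking mm_users alone misses lazy-TLB CPUs that still
         * hold an mm_count reference and may walk these tables
         * speculatively, so the table cannot be freed immediately.
         */
        if (atomic_read(&tlb->mm->mm_users) < 2) {
                __tlb_remove_table(table);
                return;
        }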

    Fixes: 267239116987 ("mm, powerpc: move the RCU page-table freeing into generic code")
    Reported-by: Will Deacon
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Rik van Riel
    Acked-by: Will Deacon
    Cc: Nicholas Piggin
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • There is no need to call this from tlb_flush_mmu_tlbonly, it logically
    belongs with tlb_flush_mmu_free. This makes future fixes simpler.

    [ This was originally done to allow code consolidation for the
    mmu_notifier fix, but it also ends up helping simplify the
    HAVE_RCU_TABLE_INVALIDATE fix. - Linus ]

    Signed-off-by: Nicholas Piggin
    Acked-by: Will Deacon
    Cc: Peter Zijlstra
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     

23 Aug, 2018

1 commit

  • Revert commits:

    95b0e6357d3e x86/mm/tlb: Always use lazy TLB mode
    64482aafe55f x86/mm/tlb: Only send page table free TLB flush to lazy TLB CPUs
    ac0315896970 x86/mm/tlb: Make lazy TLB mode lazier
    61d0beb5796a x86/mm/tlb: Restructure switch_mm_irqs_off()
    2ff6ddf19c0e x86/mm/tlb: Leave lazy TLB mode at page table free time

    In order to simplify the TLB invalidate fixes for x86 and unify the
    parts that need backporting. We'll try again later.

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Rik van Riel
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

18 Aug, 2018

6 commits

  • There was a bug in Linux that could cause madvise (and mprotect?) system
    calls to return to userspace without the TLB having been flushed for all
    the pages involved.

    This could happen when multiple threads of a process made simultaneous
    madvise and/or mprotect calls.

    This was noticed in the summer of 2017, at which time two solutions
    were created:

    56236a59556c ("mm: refactor TLB gathering API")
    99baac21e458 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem")
    and
    4647706ebeee ("mm: always flush VMA ranges affected by zap_page_range")

    We need only one of these solutions, and the former appears to be a
    little more efficient than the latter, so revert that one.

    This reverts 4647706ebeee6e50 ("mm: always flush VMA ranges affected by
    zap_page_range")

    Link: http://lkml.kernel.org/r/20180706131019.51e3a5f0@imladris.surriel.com
    Signed-off-by: Rik van Riel
    Acked-by: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: "Aneesh Kumar K.V"
    Cc: Nicholas Piggin
    Cc: Nadav Amit
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
    Commit 3812c8c8f395 ("mm: memcg: do not trap chargers with full
    callstack on OOM") has changed the ENOMEM semantic of memcg charges.
    Rather than invoking the oom killer from the charging context it delays
    the oom killer to the page fault path (pagefault_out_of_memory). This
    in turn means that many users (e.g. slab or g-u-p) will get ENOMEM when
    the corresponding memcg hits the hard limit and the memcg is OOM.
    This behavior is inconsistent with the !memcg case, where the oom killer
    is invoked from the allocation context and the allocator keeps retrying
    until it succeeds.

    The difference in the behavior is user visible. mmap(MAP_POPULATE)
    might result in not fully populated ranges while the mmap return code
    doesn't tell that to the userspace. Random syscalls might fail with
    ENOMEM etc.

    The primary motivation of the different memcg oom semantic was the
    deadlock avoidance. Things have changed since then, though. We have an
    async oom teardown by the oom reaper now and so we do not have to rely
    on the victim to tear down its memory anymore. Therefore we can return
    to the original semantic as long as the memcg oom killer is not handed
    over to the users space.

    There is still one thing to be careful about here though. If the oom
    killer is not able to make any forward progress - e.g. because there is
    no eligible task to kill - then we have to bail out of the charge path
    to prevent the same class of deadlocks. We have basically two options
    here. Either we fail the charge with ENOMEM or force the charge and
    allow overcharge. The first option has been considered more harmful
    than useful because rare inconsistencies in the ENOMEM behavior are hard
    to test for and error prone - basically the same reason why the page
    allocator doesn't fail allocations under such conditions. The latter
    might allow runaways, but those should be really unlikely unless
    somebody misconfigures the system, e.g. by allowing tasks to migrate
    away from the memcg to a different unlimited memcg with
    move_charge_at_immigrate disabled.

    Link: http://lkml.kernel.org/r/20180628151101.25307-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Greg Thelen
    Cc: Johannes Weiner
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    Huge pages help to reduce the TLB miss rate, but they have a higher
    cache footprint, which can sometimes cause problems. For example, when
    copying a huge page on an x86_64 platform, the cache footprint is 4M.
    But on a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only
    45M LLC (last level cache). That is, on average, there is 2.5M of LLC
    for each core and 1.25M for each thread.

    If cache contention is heavy while copying the huge page, and we copy
    the huge page from beginning to end, it is possible that the beginning
    of the huge page has been evicted from the cache by the time we finish
    copying its end. And it is possible for the application to access the
    beginning of the huge page right after the copy.

    In c79b57e462b5d ("mm: hugetlb: clear target sub-page last when clearing
    huge page"), to keep the cache lines of the target subpage hot, the
    order in which clear_huge_page() clears the subpages of the huge page
    was changed to clear the subpage furthest from the target subpage first
    and the target subpage last. A similar ordering change helps huge page
    copying too, and that is what this patch implements. Because the order
    algorithm has been put into a separate function, the implementation is
    quite simple.

    The patch is a generic optimization which should benefit quite a few
    workloads rather than a specific use case. To demonstrate the
    performance benefit of the patch, we tested it with vm-scalability run
    on transparent huge pages.

    With this patch, the throughput increases ~16.6% in the vm-scalability
    anon-cow-seq test case with 36 processes on a 2-socket Xeon E5 v3 2699
    system (36 cores, 72 threads). The test case sets
    /sys/kernel/mm/transparent_hugepage/enabled to always, mmaps a big
    anonymous memory area and populates it, then forks 36 child processes,
    each of which writes to the anonymous memory area from beginning to
    end, causing copy-on-write. For each child process, the other child
    processes can be seen as other workloads generating heavy cache
    pressure. At the same time, the IPC (instructions per cycle) increased
    from 0.63 to 0.78, and the time spent in user space is reduced ~7.2%.

    Link: http://lkml.kernel.org/r/20180524005851.4079-3-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Mike Kravetz
    Cc: Andi Kleen
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Shaohua Li
    Cc: Christopher Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • Patch series "mm, huge page: Copy target sub-page last when copy huge
    page", v2.

    Huge pages help to reduce the TLB miss rate, but they have a higher
    cache footprint, which can sometimes cause problems. For example, when
    copying a huge page on an x86_64 platform, the cache footprint is 4M.
    But on a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only
    45M LLC (last level cache). That is, on average, there is 2.5M of LLC
    for each core and 1.25M for each thread.

    If cache contention is heavy while copying the huge page, and we copy
    the huge page from beginning to end, it is possible that the beginning
    of the huge page has been evicted from the cache by the time we finish
    copying its end. And it is possible for the application to access the
    beginning of the huge page right after the copy.

    In c79b57e462b5d ("mm: hugetlb: clear target sub-page last when clearing
    huge page"), to keep the cache lines of the target subpage hot, the
    order in which clear_huge_page() clears the subpages of the huge page
    was changed to clear the subpage furthest from the target subpage first
    and the target subpage last. A similar ordering change helps huge page
    copying too, and that is what this patchset implements.

    The patchset is a generic optimization which should benefit quite a few
    workloads rather than a specific use case. To demonstrate the
    performance benefit of the patchset, we have tested it with
    vm-scalability run on transparent huge pages.

    With this patchset, the throughput increases ~16.6% in the
    vm-scalability anon-cow-seq test case with 36 processes on a 2-socket
    Xeon E5 v3 2699 system (36 cores, 72 threads). The test case sets
    /sys/kernel/mm/transparent_hugepage/enabled to always, mmaps a big
    anonymous memory area and populates it, then forks 36 child processes,
    each of which writes to the anonymous memory area from beginning to
    end, causing copy-on-write. For each child process, the other child
    processes can be seen as other workloads generating heavy cache
    pressure. At the same time, the IPC (instructions per cycle) increased
    from 0.63 to 0.78, and the time spent in user space is reduced ~7.2%.

    This patch (of 4):

    In c79b57e462b5d ("mm: hugetlb: clear target sub-page last when clearing
    huge page"), to keep the cache lines of the target subpage hot, the
    order in which clear_huge_page() clears the subpages of the huge page
    was changed to clear the subpage furthest from the target subpage first
    and the target subpage last. This optimization can be applied to huge
    page copying too, with the same order algorithm. To avoid code
    duplication and reduce maintenance overhead, this patch moves the order
    algorithm out of clear_huge_page() into a separate function,
    process_huge_page(), so that we can use it for copying huge pages too.
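
    A much-simplified sketch of the ordering idea (hypothetical helper name;
    the real process_huge_page() works on chunks and differs in detail):
    visit the sub-pages in order of decreasing distance from the target
    sub-page, and the target last, so its cache lines stay hot.

        /* Apply @op to each sub-page: furthest from target first, target last. */
        static void process_huge_page_sketch(unsigned long addr_hint,
                                             unsigned long huge_base,
                                             unsigned int nr_subpages,
                                             void (*op)(unsigned long subpage_addr))
        {
                unsigned int target = (addr_hint - huge_base) / PAGE_SIZE;
                unsigned int d, dmax;

                dmax = target > nr_subpages - 1 - target ?
                        target : nr_subpages - 1 - target;

                for (d = dmax; d >= 1; d--) {
                        if (target >= d)                /* left neighbour exists */
                                op(huge_base + PAGE_SIZE * (target - d));
                        if (target + d < nr_subpages)   /* right neighbour exists */
                                op(huge_base + PAGE_SIZE * (target + d));
                }
                op(huge_base + PAGE_SIZE * target);     /* target last */
        }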

    This changes the direct calls to clear_user_highpage() into indirect
    calls, but with proper inlining support from the compiler the indirect
    calls are optimized back into direct calls. Our tests show no
    performance change with the patch.

    This patch is a code cleanup without functionality change.

    Link: http://lkml.kernel.org/r/20180524005851.4079-2-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Suggested-by: Mike Kravetz
    Reviewed-by: Mike Kravetz
    Cc: Andi Kleen
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Shaohua Li
    Cc: Christopher Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
    Since commit eca56ff906bd ("mm, shmem: add internal shmem resident
    memory accounting"), MM_SHMEMPAGES has been added to separate shmem
    accounting from regular files. So, all shmem pages should be accounted
    to MM_SHMEMPAGES instead of MM_FILEPAGES.

    Normal 4K shmem pages are already accounted to MM_SHMEMPAGES, so shmem
    THP pages should not be treated differently. Account them to
    MM_SHMEMPAGES via mm_counter_file() - shmem pages are swap backed - to
    keep them consistent with normal 4K shmem pages.
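
    A hedged sketch of the accounting change in the THP fault path
    (mm/memory.c of that era):

        /* Before: every file-backed THP was charged to MM_FILEPAGES. */
        add_mm_counter(vma->vm_mm, MM_FILEPAGES, HPAGE_PMD_NR);

        /* After: shmem THPs are charged to MM_SHMEMPAGES, like 4K shmem. */
        add_mm_counter(vma->vm_mm, mm_counter_file(page), HPAGE_PMD_NR);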

    This will not change the rss counter of processes since shmem pages are
    still a part of it.

    The /proc/pid/status and /proc/pid/statm counters will however be more
    accurate wrt shmem usage, as originally intended. And as eca56ff906bd
    ("mm, shmem: add internal shmem resident memory accounting") mentioned,
    oom also could report more accurate "shmem-rss".

    Link: http://lkml.kernel.org/r/1529442518-17398-1-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Acked-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • This patch is reworked from an earlier patch that Dan has posted:
    https://patchwork.kernel.org/patch/10131727/

    VM_MIXEDMAP is used by dax to tell mm paths like vm_normal_page() that
    the memory page it is dealing with is not typical memory from the linear
    map. The get_user_pages_fast() path, since it does not resolve the vma,
    is already using {pte,pmd}_devmap() as a stand-in for VM_MIXEDMAP, so we
    use that as a VM_MIXEDMAP replacement in some locations. In the cases
    where there is no pte to consult we fall back to using vma_is_dax() to
    detect the VM_MIXEDMAP special case.

    Now that we have explicit driver pfn_t-flag opt-in/opt-out for
    get_user_pages() support for DAX we can stop setting VM_MIXEDMAP. This
    also means we no longer need to worry about safely manipulating vm_flags
    in a future where we support dynamically changing the dax mode of a
    file.

    DAX should also now be supported with madvise_behavior(), vma_merge(),
    and copy_page_range().

    This patch has been tested against the ndctl unit tests. It has also
    been tested against xfstests commit 625515d, using fake pmem created via
    memmap, and no additional issues have been observed.

    Link: http://lkml.kernel.org/r/152847720311.55924.16999195879201817653.stgit@djiang5-desk3.ch.intel.com
    Signed-off-by: Dave Jiang
    Acked-by: Dan Williams
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jiang
     

15 Aug, 2018

2 commits

  • Pull block updates from Jens Axboe:
    "First pull request for this merge window, there will also be a
    followup request with some stragglers.

    This pull request contains:

    - Fix for a thundering herd issue in the wbt block code (Anchal
    Agarwal)

    - A few NVMe pull requests:
    * Improved tracepoints (Keith)
    * Larger inline data support for RDMA (Steve Wise)
    * RDMA setup/teardown fixes (Sagi)
    * Effects log support for NVMe target (Chaitanya Kulkarni)
    * Buffered IO support for NVMe target (Chaitanya Kulkarni)
    * TP4004 (ANA) support (Christoph)
    * Various NVMe fixes

    - Block io-latency controller support. Much needed support for
    properly containing block devices. (Josef)

    - Series improving how we handle sense information on the stack
    (Kees)

    - Lightnvm fixes and updates/improvements (Mathias/Javier et al)

    - Zoned device support for null_blk (Matias)

    - AIX partition fixes (Mauricio Faria de Oliveira)

    - DIF checksum code made generic (Max Gurtovoy)

    - Add support for discard in iostats (Michael Callahan / Tejun)

    - Set of updates for BFQ (Paolo)

    - Removal of async write support for bsg (Christoph)

    - Bio page dirtying and clone fixups (Christoph)

    - Set of bcache fix/changes (via Coly)

    - Series improving blk-mq queue setup/teardown speed (Ming)

    - Series improving merging performance on blk-mq (Ming)

    - Lots of other fixes and cleanups from a slew of folks"

    * tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits)
    blkcg: Make blkg_root_lookup() work for queues in bypass mode
    bcache: fix error setting writeback_rate through sysfs interface
    null_blk: add lock drop/acquire annotation
    Blk-throttle: reduce tail io latency when iops limit is enforced
    block: paride: pd: mark expected switch fall-throughs
    block: Ensure that a request queue is dissociated from the cgroup controller
    block: Introduce blk_exit_queue()
    blkcg: Introduce blkg_root_lookup()
    block: Remove two superfluous #include directives
    blk-mq: count the hctx as active before allocating tag
    block: bvec_nr_vecs() returns value for wrong slab
    bcache: trivial - remove tailing backslash in macro BTREE_FLAG
    bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
    bcache: set max writeback rate when I/O request is idle
    bcache: add code comments for bset.c
    bcache: fix mistaken comments in request.c
    bcache: fix mistaken code comments in bcache.h
    bcache: add a comment in super.c
    bcache: avoid unncessary cache prefetch bch_btree_node_get()
    bcache: display rate debug parameters to 0 when writeback is not running
    ...

    Linus Torvalds
     
  • Merge L1 Terminal Fault fixes from Thomas Gleixner:
    "L1TF, aka L1 Terminal Fault, is yet another speculative hardware
    engineering trainwreck. It's a hardware vulnerability which allows
    unprivileged speculative access to data which is available in the
    Level 1 Data Cache when the page table entry controlling the virtual
    address, which is used for the access, has the Present bit cleared or
    other reserved bits set.

    If an instruction accesses a virtual address for which the relevant
    page table entry (PTE) has the Present bit cleared or other reserved
    bits set, then speculative execution ignores the invalid PTE and loads
    the referenced data if it is present in the Level 1 Data Cache, as if
    the page referenced by the address bits in the PTE was still present
    and accessible.

    While this is a purely speculative mechanism and the instruction will
    raise a page fault when it is retired eventually, the pure act of
    loading the data and making it available to other speculative
    instructions opens up the opportunity for side channel attacks to
    unprivileged malicious code, similar to the Meltdown attack.

    While Meltdown breaks the user space to kernel space protection, L1TF
    allows to attack any physical memory address in the system and the
    attack works across all protection domains. It allows an attack of SGX
    and also works from inside virtual machines because the speculation
    bypasses the extended page table (EPT) protection mechanism.

    The associated CVEs are: CVE-2018-3615, CVE-2018-3620, CVE-2018-3646

    The mitigations provided by this pull request include:

    - Host side protection by inverting the upper address bits of a non
    present page table entry so the entry points to uncacheable memory.

    - Hypervisor protection by flushing L1 Data Cache on VMENTER.

    - SMT (HyperThreading) control knobs, which allow to 'turn off' SMT
    by offlining the sibling CPU threads. The knobs are available on
    the kernel command line and at runtime via sysfs

    - Control knobs for the hypervisor mitigation, related to L1D flush
    and SMT control. The knobs are available on the kernel command line
    and at runtime via sysfs

    - Extensive documentation about L1TF including various degrees of
    mitigations.

    Thanks to all people who have contributed to this in various ways -
    patches, review, testing, backporting - and the fruitful, sometimes
    heated, but at the end constructive discussions.

    There is work in progress to provide other forms of mitigations, which
    might be less horrible performance wise for a particular kind of
    workloads, but this is not yet ready for consumption due to their
    complexity and limitations"

    * 'l1tf-final' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
    x86/microcode: Allow late microcode loading with SMT disabled
    tools headers: Synchronise x86 cpufeatures.h for L1TF additions
    x86/mm/kmmio: Make the tracer robust against L1TF
    x86/mm/pat: Make set_memory_np() L1TF safe
    x86/speculation/l1tf: Make pmd/pud_mknotpresent() invert
    x86/speculation/l1tf: Invert all not present mappings
    cpu/hotplug: Fix SMT supported evaluation
    KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentry
    x86/speculation: Use ARCH_CAPABILITIES to skip L1D flush on vmentry
    x86/speculation: Simplify sysfs report of VMX L1TF vulnerability
    Documentation/l1tf: Remove Yonah processors from not vulnerable list
    x86/KVM/VMX: Don't set l1tf_flush_l1d from vmx_handle_external_intr()
    x86/irq: Let interrupt handlers set kvm_cpu_l1tf_flush_l1d
    x86: Don't include linux/irq.h from asm/hardirq.h
    x86/KVM/VMX: Introduce per-host-cpu analogue of l1tf_flush_l1d
    x86/irq: Demote irq_cpustat_t::__softirq_pending to u16
    x86/KVM/VMX: Move the l1tf_flush_l1d test to vmx_l1d_flush()
    x86/KVM/VMX: Replace 'vmx_l1d_flush_always' with 'vmx_l1d_flush_cond'
    x86/KVM/VMX: Don't set l1tf_flush_l1d to true from vmx_l1d_flush()
    cpu/hotplug: detect SMT disabled by BIOS
    ...

    Linus Torvalds
     

14 Aug, 2018

1 commit

  • Pull x86 mm updates from Thomas Gleixner:

    - Make lazy TLB mode even lazier to avoid pointless switch_mm()
    operations, which reduces CPU load by 1-2% for memcache workloads

    - Small cleanups and improvements all over the place

    * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/mm: Remove redundant check for kmem_cache_create()
    arm/asm/tlb.h: Fix build error implicit func declaration
    x86/mm/tlb: Make clear_asid_other() static
    x86/mm/tlb: Skip atomic operations for 'init_mm' in switch_mm_irqs_off()
    x86/mm/tlb: Always use lazy TLB mode
    x86/mm/tlb: Only send page table free TLB flush to lazy TLB CPUs
    x86/mm/tlb: Make lazy TLB mode lazier
    x86/mm/tlb: Restructure switch_mm_irqs_off()
    x86/mm/tlb: Leave lazy TLB mode at page table free time
    mm: Allocate the mm_cpumask (mm->cpu_bitmap[]) dynamically based on nr_cpu_ids
    x86/mm: Add TLB purge to free pmd/pte page interfaces
    ioremap: Update pgtable free interfaces with addr
    x86/mm: Disable ioremap free page handling on x86-PAE

    Linus Torvalds
     

11 Aug, 2018

1 commit

  • ioremap_prot() can return NULL which could lead to an oops.

    Link: http://lkml.kernel.org/r/1533195441-58594-1-git-send-email-chenjie6@huawei.com
    Signed-off-by: chen jie
    Reviewed-by: Andrew Morton
    Cc: Li Zefan
    Cc: chenjie
    Cc: Yang Shi
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    jie@chenjie6@huwei.com
     

02 Aug, 2018

1 commit

  • Delete the old VM_BUG_ON_VMA() from zap_pmd_range(), which asserted
    that mmap_sem must be held when splitting an "anonymous" vma there.
    Whether that's still strictly true nowadays is not entirely clear,
    but the danger of sometimes crashing on the BUG is now fairly clear.

    Even with the new stricter rules for anonymous vma marking, the
    condition it checks for can possibly trigger. Commit 44960f2a7b63
    ("staging: ashmem: Fix SIGBUS crash when traversing mmaped ashmem
    pages") is good, and originally I thought it was safe from that
    VM_BUG_ON_VMA(), because the /dev/ashmem fd exposed to the user is
    disconnected from the vm_file in the vma, and madvise(,,MADV_REMOVE)
    insists on VM_SHARED.

    But after I read John's earlier mail, drawing attention to the
    vfs_fallocate() in there: I may be wrong, and I don't know if Android
    has THP in the config anyway, but it looks to me like an
    unmap_mapping_range() from ashmem's vfs_fallocate() could hit precisely
    the VM_BUG_ON_VMA(), once it's vma_is_anonymous().
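
    For context, a hedged reconstruction of the assertion being deleted from
    zap_pmd_range():

        /* Removed: */
        VM_BUG_ON_VMA(vma_is_anonymous(vma) &&
                      !rwsem_is_locked(&tlb->mm->mmap_sem), vma);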

    Signed-off-by: Hugh Dickins
    Cc: John Stultz
    Cc: Kirill Shutemov
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

17 Jul, 2018

1 commit

  • Andy discovered that speculative memory accesses while in lazy
    TLB mode can crash a system, when a CPU tries to dereference a
    speculative access using memory contents that used to be valid
    page table memory, but have since been reused for something else
    and point into la-la land.

    The latter problem can be prevented in two ways. The first is to
    always send a TLB shootdown IPI to CPUs in lazy TLB mode, while
    the second one is to only send the TLB shootdown at page table
    freeing time.

    The second should result in fewer IPIs, since operations like
    mprotect and madvise are very common with some workloads, but
    do not involve page table freeing. Also, on munmap, batching
    of page table freeing covers much larger ranges of virtual
    memory than the batching of unmapped user pages.

    Tested-by: Song Liu
    Signed-off-by: Rik van Riel
    Acked-by: Dave Hansen
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: efault@gmx.de
    Cc: kernel-team@fb.com
    Cc: luto@kernel.org
    Link: http://lkml.kernel.org/r/20180716190337.26133-3-riel@surriel.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     

09 Jul, 2018

1 commit

  • Memory allocations can induce swapping via kswapd or direct reclaim. If
    we are having IO done for us by kswapd and don't actually go into direct
    reclaim we may never get scheduled for throttling. So instead check to
    see if our cgroup is congested, and if so schedule the throttling.
    Before we return to user space the throttling stuff will only throttle
    if we actually required it.

    Signed-off-by: Tejun Heo
    Signed-off-by: Josef Bacik
    Acked-by: Johannes Weiner
    Acked-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Tejun Heo
     

21 Jun, 2018

1 commit

    For L1TF, PROT_NONE mappings are protected by inverting the PFN in the
    page table entry. This sets the high bits in the CPU's address space,
    thus making sure an unmapped entry does not point to valid cached memory.

    Some server system BIOSes put the MMIO mappings high up in the physical
    address space. If such a high mapping were exposed to unprivileged users,
    they could attack low memory by setting such a mapping to PROT_NONE. This
    could happen through a special device driver which is not access
    protected. Normal /dev/mem is of course access protected.

    To avoid this forbid PROT_NONE mappings or mprotect for high MMIO mappings.

    Valid page mappings are allowed because the system is then unsafe anyways.

    It's not expected that users commonly use PROT_NONE on MMIO. But to
    minimize any impact this is only enforced if the mapping actually refers to
    a high MMIO address (defined as the MAX_PA-1 bit being set), and also skip
    the check for root.

    For mmaps this is straightforward and can be handled in vm_insert_pfn()
    and in remap_pfn_range().
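
    A hedged sketch of the mmap-side check (pfn_modify_allowed() is the
    architecture hook added by the x86 L1TF series; its placement here is
    illustrative):

        /* In vm_insert_pfn_prot() / the remap_pfn_range() PTE loop: */
        if (!pfn_modify_allowed(pfn, pgprot))
                return -EACCES;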

    For mprotect it's a bit trickier. At the point where the actual PTEs are
    accessed a lot of state has been changed and it would be difficult to undo
    on an error. Since this is an uncommon case, use a separate early page
    table walk pass for MMIO PROT_NONE mappings that checks for this condition
    early. For non-MMIO and non-PROT_NONE mappings there are no changes.

    Signed-off-by: Andi Kleen
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Josh Poimboeuf
    Acked-by: Dave Hansen

    Andi Kleen
     

08 Jun, 2018

4 commits

  • Merge updates from Andrew Morton:

    - a few misc things

    - ocfs2 updates

    - v9fs updates

    - MM

    - procfs updates

    - lib/ updates

    - autofs updates

    * emailed patches from Andrew Morton : (118 commits)
    autofs: small cleanup in autofs_getpath()
    autofs: clean up includes
    autofs: comment on selinux changes needed for module autoload
    autofs: update MAINTAINERS entry for autofs
    autofs: use autofs instead of autofs4 in documentation
    autofs: rename autofs documentation files
    autofs: create autofs Kconfig and Makefile
    autofs: delete fs/autofs4 source files
    autofs: update fs/autofs4/Makefile
    autofs: update fs/autofs4/Kconfig
    autofs: copy autofs4 to autofs
    autofs4: use autofs instead of autofs4 everywhere
    autofs4: merge auto_fs.h and auto_fs4.h
    fs/binfmt_misc.c: do not allow offset overflow
    checkpatch: improve patch recognition
    lib/ucs2_string.c: add MODULE_LICENSE()
    lib/mpi: headers cleanup
    lib/percpu_ida.c: use _irqsave() instead of local_irq_save() + spin_lock
    lib/idr.c: remove simple_ida_lock
    lib/bitmap.c: micro-optimization for __bitmap_complement()
    ...

    Linus Torvalds
     
  • Remove the additional define HAVE_PTE_SPECIAL and rely directly on
    CONFIG_ARCH_HAS_PTE_SPECIAL.

    There is no functional change introduced by this patch

    Link: http://lkml.kernel.org/r/1523533733-25437-1-git-send-email-ldufour@linux.vnet.ibm.com
    Signed-off-by: Laurent Dufour
    Acked-by: David Rientjes
    Reviewed-by: Andrew Morton
    Cc: Jerome Glisse
    Cc: Michal Hocko
    Cc: Christophe LEROY
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laurent Dufour
     
    Currently, PTE special support is turned on in per-architecture header
    files. Most of the time, it is defined in
    arch/*/include/asm/pgtable.h, depending or not on some other
    per-architecture static definition.

    This patch introduces a new configuration variable to manage this
    directly in the Kconfig files. It would later replace
    __HAVE_ARCH_PTE_SPECIAL.

    Here are notes for some architectures where the definition of
    __HAVE_ARCH_PTE_SPECIAL is not obvious:

    arm
    __HAVE_ARCH_PTE_SPECIAL which is currently defined in
    arch/arm/include/asm/pgtable-3level.h which is included by
    arch/arm/include/asm/pgtable.h when CONFIG_ARM_LPAE is set.
    So select ARCH_HAS_PTE_SPECIAL if ARM_LPAE.

    powerpc
    __HAVE_ARCH_PTE_SPECIAL is defined in 2 files:
    - arch/powerpc/include/asm/book3s/64/pgtable.h
    - arch/powerpc/include/asm/pte-common.h
    The first one is included if (PPC_BOOK3S & PPC64) while the second is
    included in all the other cases.
    So select ARCH_HAS_PTE_SPECIAL all the time.

    sparc:
    __HAVE_ARCH_PTE_SPECIAL is defined if defined(__sparc__) &&
    defined(__arch64__) which are defined through the compiler in
    sparc/Makefile if !SPARC32 which I assume to be if SPARC64.
    So select ARCH_HAS_PTE_SPECIAL if SPARC64

    There is no functional change introduced by this patch.

    Link: http://lkml.kernel.org/r/1523433816-14460-2-git-send-email-ldufour@linux.vnet.ibm.com
    Signed-off-by: Laurent Dufour
    Suggested-by: Jerome Glisse
    Reviewed-by: Jerome Glisse
    Acked-by: David Rientjes
    Cc: Michal Hocko
    Cc: "Aneesh Kumar K . V"
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Jonathan Corbet
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Cc: Rich Felker
    Cc: David S. Miller
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Vineet Gupta
    Cc: Palmer Dabbelt
    Cc: Albert Ou
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: David Rientjes
    Cc: Robin Murphy
    Cc: Christophe LEROY
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laurent Dufour
     
  • Use new return type vm_fault_t for fault handler. For now, this is just
    documenting that the function returns a VM_FAULT value rather than an
    errno. Once all instances are converted, vm_fault_t will become a
    distinct type.

    commit 1c8f422059ae ("mm: change return type to vm_fault_t")

    There was an existing bug inside dax_load_hole(): if vm_insert_mixed had
    failed to allocate a page table, we'd return VM_FAULT_NOPAGE instead of
    VM_FAULT_OOM. With the new vmf_insert_mixed() this issue is addressed.

    vm_insert_mixed_mkwrite() is inefficient when it returns an error value,
    since the driver has to convert it to the vm_fault_t type. With the new
    vmf_insert_mixed_mkwrite() this limitation will be addressed.

    Link: http://lkml.kernel.org/r/20180510181121.GA15239@jordon-HP-15-Notebook-PC
    Signed-off-by: Souptick Joarder
    Reviewed-by: Jan Kara
    Reviewed-by: Matthew Wilcox
    Reviewed-by: Ross Zwisler
    Cc: Alexander Viro
    Cc: Dan Williams
    Cc: Michal Hocko
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Souptick Joarder
     

01 Jun, 2018

1 commit


06 Apr, 2018

2 commits

    This patch makes do_swap_page() no longer need to be aware of the two
    different swap readahead algorithms. Just unify the cluster-based and
    vma-based readahead function calls.
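
    A hedged sketch of the unified entry point: do_swap_page() only calls
    swapin_readahead(), which picks the policy internally.

        struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
                                      struct vm_fault *vmf)
        {
                /* The fault path no longer cares which readahead policy runs. */
                return swap_use_vma_readahead() ?
                        swap_vma_readahead(entry, gfp_mask, vmf) :
                        swap_cluster_readahead(entry, gfp_mask, vmf);
        }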

    Link: http://lkml.kernel.org/r/1509520520-32367-3-git-send-email-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20180220085249.151400-3-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Reviewed-by: Andrew Morton
    Cc: Hugh Dickins
    Cc: Huang Ying
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    Looking at the recent swap readahead changes, I am very unhappy about
    the current code structure, which diverges into two swap readahead
    algorithms in do_swap_page(). This patch is to clean it up.

    The main motivation is that the fault handler doesn't need to be aware
    of the readahead algorithms; it should just call swapin_readahead().

    As a first step, this patch cleans up a little bit, but is not perfect
    (it is split out to make review easier); the next patch will make the
    goal complete.

    [minchan@kernel.org: do not check readahead flag with THP anon]
    Link: http://lkml.kernel.org/r/874lm83zho.fsf@yhuang-dev.intel.com
    Link: http://lkml.kernel.org/r/20180227232611.169883-1-minchan@kernel.org
    Link: http://lkml.kernel.org/r/1509520520-32367-2-git-send-email-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20180220085249.151400-2-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Reviewed-by: Andrew Morton
    Cc: Hugh Dickins
    Cc: Huang Ying
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

18 Mar, 2018

1 commit

  • If a processor supports special metadata for a page, for example ADI
    version tags on SPARC M7, this metadata must be saved when the page is
    swapped out. The same metadata must be restored when the page is swapped
    back in. This patch adds two new architecture specific functions -
    arch_do_swap_page() to be called when a page is swapped in, and
    arch_unmap_one() to be called when a page is being unmapped for swap
    out. These architecture hooks allow page metadata to be saved if the
    architecture supports it.
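
    A hedged sketch of the default, no-op hooks for architectures that do
    not provide their own (as added to the generic pgtable header):

        #ifndef __HAVE_ARCH_DO_SWAP_PAGE
        /* Restore per-page metadata (e.g. SPARC ADI tags) when swapping in. */
        static inline void arch_do_swap_page(struct mm_struct *mm,
                                             struct vm_area_struct *vma,
                                             unsigned long addr,
                                             pte_t pte, pte_t oldpte)
        {
        }
        #endif

        #ifndef __HAVE_ARCH_UNMAP_ONE
        /* Save per-page metadata when a page is unmapped for swap out. */
        static inline int arch_unmap_one(struct mm_struct *mm,
                                         struct vm_area_struct *vma,
                                         unsigned long addr, pte_t orig_pte)
        {
                return 0;
        }
        #endif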

    Signed-off-by: Khalid Aziz
    Cc: Khalid Aziz
    Acked-by: Jerome Marchand
    Reviewed-by: Anthony Yznaga
    Acked-by: Andrew Morton
    Signed-off-by: David S. Miller

    Khalid Aziz
     

17 Feb, 2018

1 commit

  • We get a warning about some slow configurations in randconfig kernels:

    mm/memory.c:83:2: error: #warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_cpupid. [-Werror=cpp]

    The warning is reasonable by itself, but gets in the way of randconfig
    build testing, so I'm hiding it whenever CONFIG_COMPILE_TEST is set.

    The warning was added in 2013 in commit 75980e97dacc ("mm: fold
    page->_last_nid into page->flags where possible").

    Cc: stable@vger.kernel.org
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     

07 Feb, 2018

3 commits

  • Merge misc updates from Andrew Morton:

    - kasan updates

    - procfs

    - lib/bitmap updates

    - other lib/ updates

    - checkpatch tweaks

    - rapidio

    - ubsan

    - pipe fixes and cleanups

    - lots of other misc bits

    * emailed patches from Andrew Morton : (114 commits)
    Documentation/sysctl/user.txt: fix typo
    MAINTAINERS: update ARM/QUALCOMM SUPPORT patterns
    MAINTAINERS: update various PALM patterns
    MAINTAINERS: update "ARM/OXNAS platform support" patterns
    MAINTAINERS: update Cortina/Gemini patterns
    MAINTAINERS: remove ARM/CLKDEV SUPPORT file pattern
    MAINTAINERS: remove ANDROID ION pattern
    mm: docs: add blank lines to silence sphinx "Unexpected indentation" errors
    mm: docs: fix parameter names mismatch
    mm: docs: fixup punctuation
    pipe: read buffer limits atomically
    pipe: simplify round_pipe_size()
    pipe: reject F_SETPIPE_SZ with size over UINT_MAX
    pipe: fix off-by-one error when checking buffer limits
    pipe: actually allow root to exceed the pipe buffer limits
    pipe, sysctl: remove pipe_proc_fn()
    pipe, sysctl: drop 'min' parameter from pipe-max-size converter
    kasan: rework Kconfig settings
    crash_dump: is_kdump_kernel can be boolean
    kernel/mutex: mutex_is_locked can be boolean
    ...

    Linus Torvalds
     
  • The file was converted from print_symbol() to %pSR a while ago in commit
    071361d3473e ("mm: Convert print_symbol to %pSR"). kallsyms does not
    seem to be needed anymore.

    Link: http://lkml.kernel.org/r/20171208025616.16267-3-sergey.senozhatsky@gmail.com
    Signed-off-by: Sergey Senozhatsky
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Pull libnvdimm updates from Ross Zwisler:

    - Require struct page by default for filesystem DAX to remove a number
    of surprising failure cases. This includes failures with direct I/O,
    gdb and fork(2).

    - Add support for the new Platform Capabilities Structure added to the
    NFIT in ACPI 6.2a. This new table tells us whether the platform
    supports flushing of CPU and memory controller caches on unexpected
    power loss events.

    - Revamp vmem_altmap and dev_pagemap handling to clean up code and
    better support future PCI P2P uses.

    - Deprecate the ND_IOCTL_SMART_THRESHOLD command whose payload has
    become out-of-sync with recent versions of the NVDIMM_FAMILY_INTEL
    spec, and instead rely on the generic ND_CMD_CALL approach used by
    the two other IOCTL families, NVDIMM_FAMILY_{HPE,MSFT}.

    - Enhance nfit_test so we can test some of the new things added in
    version 1.6 of the DSM specification. This includes testing firmware
    download and simulating the Last Shutdown State (LSS) status.

    * tag 'libnvdimm-for-4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (37 commits)
    libnvdimm, namespace: remove redundant initialization of 'nd_mapping'
    acpi, nfit: fix register dimm error handling
    libnvdimm, namespace: make min namespace size 4K
    tools/testing/nvdimm: force nfit_test to depend on instrumented modules
    libnvdimm/nfit_test: adding support for unit testing enable LSS status
    libnvdimm/nfit_test: add firmware download emulation
    nfit-test: Add platform cap support from ACPI 6.2a to test
    libnvdimm: expose platform persistence attribute for nd_region
    acpi: nfit: add persistent memory control flag for nd_region
    acpi: nfit: Add support for detect platform CPU cache flush on power loss
    device-dax: Fix trailing semicolon
    libnvdimm, btt: fix uninitialized err_lock
    dax: require 'struct page' by default for filesystem dax
    ext2: auto disable dax instead of failing mount
    ext4: auto disable dax instead of failing mount
    mm, dax: introduce pfn_t_special()
    mm: Fix devm_memremap_pages() collision handling
    mm: Fix memory size alignment in devm_memremap_pages_release()
    memremap: merge find_dev_pagemap into get_dev_pagemap
    memremap: change devm_memremap_pages interface to use struct dev_pagemap
    ...

    Linus Torvalds
     

01 Feb, 2018

3 commits

    There are multiple comments surrounding do_fault_around that mention
    fault_around_pages() and fault_around_mask(), two routines that do not
    exist. These comments should be reworded to reference
    fault_around_bytes, the value which is used to determine how much
    do_fault_around() will attempt to read when processing a fault.

    These comments should have been updated when fault_around_pages() and
    fault_around_mask() were removed in commit aecd6f44266c ("mm: close race
    between do_fault_around() and fault_around_bytes_set()").

    Fixes: aecd6f44266c1 ("mm: close race between do_fault_around() and fault_around_bytes_set()")
    Link: http://lkml.kernel.org/r/302D0B14-C7E9-44C6-8BED-033F9ACBD030@oracle.com
    Signed-off-by: William Kucharski
    Reviewed-by: Larry Bassel
    Cc: Michal Hocko
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    William Kucharski
     
  • Several users of unmap_mapping_range() would prefer to express their
    range in pages rather than bytes. Unfortunately, on a 32-bit kernel, you
    have to remember to cast your page number to a 64-bit type before
    shifting it, and four places in the current tree didn't remember to do
    that. That's a sign of a bad interface.

    Conveniently, unmap_mapping_range() actually converts from bytes into
    pages, so hoist the guts of unmap_mapping_range() into a new function
    unmap_mapping_pages() and convert the callers which want to use pages.
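
    A hedged sketch of the resulting split: unmap_mapping_range() keeps the
    byte-based interface, does the conversion (with the 32-bit overflow
    check) in one place, and hands off to the new page-based helper.

        void unmap_mapping_range(struct address_space *mapping,
                                 loff_t const holebegin, loff_t const holelen,
                                 int even_cows)
        {
                pgoff_t hba = holebegin >> PAGE_SHIFT;
                pgoff_t hlen = (holelen + PAGE_SIZE - 1) >> PAGE_SHIFT;

                /* Check for overflow: loff_t is wider than pgoff_t on 32-bit. */
                if (sizeof(holelen) > sizeof(hlen)) {
                        long long holeend =
                                (holebegin + holelen + PAGE_SIZE - 1) >> PAGE_SHIFT;
                        if (holeend & ~(long long)ULONG_MAX)
                                hlen = ULONG_MAX - hba + 1;
                }

                unmap_mapping_pages(mapping, hba, hlen, even_cows);
        }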

    Link: http://lkml.kernel.org/r/20171206142627.GD32044@bombadil.infradead.org
    Signed-off-by: Matthew Wilcox
    Reported-by: "zhangyi (F)"
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • The comment describes @fullmm argument, but the function has no such
    parameter.

    Update the comment to match the code and convert it to kernel-doc
    markup.

    Link: http://lkml.kernel.org/r/1512394531-2264-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport