10 Sep, 2018

1 commit

  • commit a6f572084fbee8b30f91465f4a085d7a90901c57 upstream.

    Will noted that only checking mm_users is incorrect; we should also
    check mm_count in order to cover CPUs that have a lazy reference to
    this mm (and could do speculative TLB operations).

    If removing this turns out to be a performance issue, we can reinstate
    a more complete check, but then do it in tlb_table_flush(), eliding the
    call_rcu_sched().

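    A simplified sketch of the kind of mm_users fast path being removed here
    (illustrative, not the verbatim upstream hunk):

        void tlb_remove_table(struct mmu_gather *tlb, void *table)
        {
                /*
                 * Insufficient: a CPU running lazy-TLB holds only an
                 * mm_count reference, so mm_users == 1 does not rule out
                 * concurrent (speculative) page-table walks.
                 */
                if (atomic_read(&tlb->mm->mm_users) < 2) {
                        __tlb_remove_table(table);
                        return;
                }
                /* ... otherwise batch the table and free it via RCU ... */
        }
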
    Fixes: 267239116987 ("mm, powerpc: move the RCU page-table freeing into generic code")
    Reported-by: Will Deacon
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Rik van Riel
    Acked-by: Will Deacon
    Cc: Nicholas Piggin
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     

05 Sep, 2018

4 commits

  • commit d86564a2f085b79ec046a5cba90188e612352806 upstream.

    Jann reported that x86 was missing required TLB invalidates when he
    hit the !*batch slow path in tlb_remove_table().

    This is indeed the case; RCU_TABLE_FREE does not provide TLB (cache)
    invalidates. The PowerPC-hash implementation where this code originated,
    and the Sparc-hash implementation that subsequently used it, did not need
    them. ARM, which used this later, put an explicit TLB invalidate in its
    __p*_free_tlb() functions, and PowerPC-radix followed that example.

    But when we hooked up x86 we failed to consider this. Fix this by
    (optionally) hooking tlb_remove_table() into the TLB invalidate code.

    NOTE: s390 also needed something like this and might now
    be able to use the generic code again.

    [ Modified to be on top of Nick's cleanups, which simplified this patch
    now that tlb_flush_mmu_tlbonly() really only flushes the TLB - Linus ]

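    A minimal sketch of the invalidate hook, assuming a
    HAVE_RCU_TABLE_INVALIDATE-style config switch (illustrative, not the
    exact upstream hunk):

        static void tlb_table_invalidate(struct mmu_gather *tlb)
        {
        #ifdef CONFIG_HAVE_RCU_TABLE_INVALIDATE
                /*
                 * Invalidate page-table caches used by hardware walkers;
                 * the RCU grace period still covers software walkers.
                 */
                tlb_flush_mmu_tlbonly(tlb);
        #endif
        }

    which the tlb_remove_table() paths would then call before freeing the
    batched tables.
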
    Fixes: 9e52fc2b50de ("x86/mm: Enable RCU based page table freeing (CONFIG_HAVE_RCU_TABLE_FREE=y)")
    Reported-by: Jann Horn
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Rik van Riel
    Cc: Nicholas Piggin
    Cc: David Miller
    Cc: Will Deacon
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     
  • commit db7ddef301128dad394f1c0f77027f86ee9a4edb upstream.

    There is no need to call this from tlb_flush_mmu_tlbonly(); it logically
    belongs with tlb_flush_mmu_free(). This makes future fixes simpler.

    [ This was originally done to allow code consolidation for the
    mmu_notifier fix, but it also ends up helping simplify the
    HAVE_RCU_TABLE_INVALIDATE fix. - Linus ]

    Signed-off-by: Nicholas Piggin
    Acked-by: Will Deacon
    Cc: Peter Zijlstra
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Nicholas Piggin
     
  • [ Upstream commit 24eee1e4c47977bdfb71d6f15f6011e7b6188d04 ]

    ioremap_prot() can return NULL, which could lead to an oops.

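    A hedged sketch of the defensive pattern (the exact call site may
    differ):

        maddr = ioremap_prot(phys_addr, PAGE_ALIGN(len + offset), prot);
        if (!maddr)
                return -ENOMEM;
        /* ... access the mapping, then iounmap(maddr) ... */
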
    Link: http://lkml.kernel.org/r/1533195441-58594-1-git-send-email-chenjie6@huawei.com
    Signed-off-by: chen jie
    Reviewed-by: Andrew Morton
    Cc: Li Zefan
    Cc: chenjie
    Cc: Yang Shi
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    jie@chenjie6@huwei.com
     
  • [ Upstream commit 53406ed1bcfdabe4b5bc35e6d17946c6f9f563e2 ]

    Delete the old VM_BUG_ON_VMA() from zap_pmd_range(), which asserted
    that mmap_sem must be held when splitting an "anonymous" vma there.
    Whether that's still strictly true nowadays is not entirely clear,
    but the danger of sometimes crashing on the BUG is now fairly clear.

    Even with the new stricter rules for anonymous vma marking, the
    condition it checks for can possibly trigger. Commit 44960f2a7b63
    ("staging: ashmem: Fix SIGBUS crash when traversing mmaped ashmem
    pages") is good, and originally I thought it was safe from that
    VM_BUG_ON_VMA(), because the /dev/ashmem fd exposed to the user is
    disconnected from the vm_file in the vma, and madvise(,,MADV_REMOVE)
    insists on VM_SHARED.

    But after I read John's earlier mail, drawing attention to the
    vfs_fallocate() in there: I may be wrong, and I don't know if Android
    has THP in the config anyway, but it looks to me like an
    unmap_mapping_range() from ashmem's vfs_fallocate() could hit precisely
    the VM_BUG_ON_VMA(), once it's vma_is_anonymous().

    Signed-off-by: Hugh Dickins
    Cc: John Stultz
    Cc: Kirill Shutemov
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Hugh Dickins
     

16 Aug, 2018

1 commit

  • commit 42e4089c7890725fcd329999252dc489b72f2921 upstream

    For L1TF, PROT_NONE mappings are protected by inverting the PFN in the page
    table entry. This sets the high bits in the CPU's address space, thus
    making sure an unmapped entry does not point to valid cached memory.

    Some server system BIOSes put the MMIO mappings high up in the physical
    address space. If such a high mapping was mapped to unprivileged users,
    they could attack low memory by setting such a mapping to PROT_NONE. This
    could happen through a special device driver which is not access
    protected. Normal /dev/mem is of course access protected.

    To avoid this, forbid PROT_NONE mappings or mprotect for high MMIO mappings.

    Valid page mappings are allowed because the system is then unsafe anyway.

    It's not expected that users commonly use PROT_NONE on MMIO. But to
    minimize any impact, this is only enforced if the mapping actually refers to
    a high MMIO address (defined as the MAX_PA-1 bit being set), and the check
    is also skipped for root.

    For mmaps this is straightforward and can be handled in vm_insert_pfn() and
    in remap_pfn_range().

    For mprotect it's a bit trickier. At the point where the actual PTEs are
    accessed, a lot of state has been changed and it would be difficult to undo
    on an error. Since this is an uncommon case, use a separate early page
    table walk pass for MMIO PROT_NONE mappings that checks for this condition
    early. For non-MMIO and non-PROT_NONE mappings there are no changes.

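    A rough sketch of the kind of check being introduced; the helper and
    predicate names here are illustrative, the real logic lives in arch code:

        /* Refuse inverted (PROT_NONE) mappings of high, non-RAM pfns. */
        static bool pfn_allowed_sketch(unsigned long pfn, pgprot_t prot)
        {
                if (!prot_none_inverted(prot))  /* hypothetical predicate */
                        return true;
                if (pfn_valid(pfn))             /* real memory is fine */
                        return true;
                /* high MMIO pfn: only root may create such a mapping */
                return pfn < l1tf_pfn_limit() || capable(CAP_SYS_ADMIN);
        }
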
    Signed-off-by: Andi Kleen
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Josh Poimboeuf
    Acked-by: Dave Hansen
    Signed-off-by: Greg Kroah-Hartman

    Andi Kleen
     

22 Feb, 2018

1 commit

  • commit af27d9403f5b80685b79c88425086edccecaf711 upstream.

    We get a warning about some slow configurations in randconfig kernels:

    mm/memory.c:83:2: error: #warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_cpupid. [-Werror=cpp]

    The warning is reasonable by itself, but gets in the way of randconfig
    build testing, so I'm hiding it whenever CONFIG_COMPILE_TEST is set.

    The warning was added in 2013 in commit 75980e97dacc ("mm: fold
    page->_last_nid into page->flags where possible").

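    The change amounts to something like the following (sketch; the real
    condition also depends on the NUMA/last_cpupid configuration):

        #if defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS) && !defined(CONFIG_COMPILE_TEST)
        #warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_cpupid.
        #endif
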
    Cc: stable@vger.kernel.org
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Arnd Bergmann
     

04 Oct, 2017

1 commit

  • With device public pages at the end of my memory space, I'm getting
    output from _vm_normal_page():

    BUG: Bad page map in process migrate_pages pte:c0800001ffff0d06 pmd:f95d3000
    addr:00007fff89330000 vm_flags:00100073 anon_vma:c0000000fa899320 mapping: (null) index:7fff8933
    file: (null) fault: (null) mmap: (null) readpage: (null)
    CPU: 0 PID: 13963 Comm: migrate_pages Tainted: P B OE 4.14.0-rc1-wip #155
    Call Trace:
    dump_stack+0xb0/0xf4 (unreliable)
    print_bad_pte+0x28c/0x340
    _vm_normal_page+0xc0/0x140
    zap_pte_range+0x664/0xc10
    unmap_page_range+0x318/0x670
    unmap_vmas+0x74/0xe0
    exit_mmap+0xe8/0x1f0
    mmput+0xac/0x1f0
    do_exit+0x348/0xcd0
    do_group_exit+0x5c/0xf0
    SyS_exit_group+0x1c/0x20
    system_call+0x58/0x6c

    The pfn causing this is the very last one. Correct the bounds check
    accordingly.

    Fixes: df6ad69838fc ("mm/device-public-memory: device memory cache coherent with CPU")
    Link: http://lkml.kernel.org/r/1506092178-20351-1-git-send-email-arbab@linux.vnet.ibm.com
    Signed-off-by: Reza Arbab
    Reviewed-by: Jérôme Glisse
    Reviewed-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Reza Arbab
     

09 Sep, 2017

6 commits

    Allow interval trees to quickly check for overlaps to avoid unnecessary
    tree lookups in interval_tree_iter_first().

    As of this patch, all interval tree flavors will require using a
    'rb_root_cached' such that we can have the leftmost node easily
    available. While most users will make use of this feature, those with
    special functions (in addition to the generic insert, delete, search
    calls) will avoid using the cached option as they can do funky things
    with insertions -- for example, vma_interval_tree_insert_after().

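    For reference, the cached flavour keeps the leftmost node available in
    O(1) (sketch using the generic rbtree API):

        struct rb_root_cached root = RB_ROOT_CACHED;

        /* insert/erase go through the *_cached variants ... */
        struct rb_node *leftmost = rb_first_cached(&root);  /* no tree walk */
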
    [jglisse@redhat.com: fix deadlock from typo vm_lock_anon_vma()]
    Link: http://lkml.kernel.org/r/20170808225719.20723-1-jglisse@redhat.com
    Link: http://lkml.kernel.org/r/20170719014603.19029-12-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Jérôme Glisse
    Acked-by: Christian König
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Doug Ledford
    Acked-by: Michael S. Tsirkin
    Cc: David Airlie
    Cc: Jason Wang
    Cc: Christian Benvenuti
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
    Seen while reading the code: in handle_mm_fault(), when
    arch_vma_access_permitted() fails, the call to mem_cgroup_oom_disable()
    is not made.

    To fix that, move the call to mem_cgroup_oom_enable() to after calling
    arch_vma_access_permitted(), as it should not have entered the memcg OOM
    in that case.

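    In handle_mm_fault() the ordering then becomes roughly (sketch):

        if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
                                       flags & FAULT_FLAG_INSTRUCTION,
                                       flags & FAULT_FLAG_REMOTE))
                return VM_FAULT_SIGSEGV;

        if (flags & FAULT_FLAG_USER)
                mem_cgroup_oom_enable();        /* only after the access check */
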
    Link: http://lkml.kernel.org/r/1504625439-31313-1-git-send-email-ldufour@linux.vnet.ibm.com
    Fixes: bae473a423f6 ("mm: introduce fault_env")
    Signed-off-by: Laurent Dufour
    Acked-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laurent Dufour
     
    The flags argument has been copied into vmf.flags and is not changed in
    between. Hence a single write-access check can be used for both PUD and
    PMD.

    Link: http://lkml.kernel.org/r/20170823082839.1812-1-khandual@linux.vnet.ibm.com
    Signed-off-by: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     
    Platforms with an advanced system bus (like CAPI or CCIX) allow device
    memory to be accessed by the CPU in a cache-coherent fashion. Add a new
    type of ZONE_DEVICE to represent such memory. The use cases are the same
    as for un-addressable device memory, but without all the corner cases.

    Link: http://lkml.kernel.org/r/20170817000548.32038-19-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Aneesh Kumar
    Cc: Paul E. McKenney
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Balbir Singh
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
    HMM (heterogeneous memory management) needs struct page to support
    migration from system main memory to device memory. The reasons for HMM
    and migration to device memory are explained in the HMM core patch.

    This patch deals with device memory that is un-addressable (i.e. the CPU
    cannot access it). Hence we do not want those struct pages to be managed
    like regular memory. That is why we extend ZONE_DEVICE to support
    different types of memory.

    A persistent memory type is defined for the existing users of ZONE_DEVICE,
    and a new device un-addressable type is added for the un-addressable
    memory. There is a clear separation between what is expected from each
    memory type; existing users of ZONE_DEVICE are unaffected by the new
    requirements and the new use of the un-addressable type. All type-specific
    code paths are protected with tests against the memory type.

    Because the memory is un-addressable, we use a new special swap type for
    when a page is migrated to device memory (this reduces the maximum number
    of swap files).

    Besides the memory type, the main additions to ZONE_DEVICE are two
    callbacks. The first one, page_free(), is called whenever the page
    refcount reaches 1 (which means the page is free, as a ZONE_DEVICE page
    never reaches a refcount of 0). This allows the device driver to manage
    its memory and the associated struct page.

    The second callback, page_fault(), happens when there is a CPU access to
    an address that is backed by a device page (which is un-addressable by the
    CPU). This callback is responsible for migrating the page back to system
    main memory. Device drivers cannot block migration back to system memory;
    HMM makes sure that such pages cannot be pinned into device memory.

    If the device is in some error condition and cannot migrate memory back,
    then a CPU page fault to device memory should end with SIGBUS.

    [arnd@arndb.de: fix warning]
    Link: http://lkml.kernel.org/r/20170823133213.712917-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/20170817000548.32038-8-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Arnd Bergmann
    Acked-by: Dan Williams
    Cc: Ross Zwisler
    Cc: Aneesh Kumar
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Paul E. McKenney
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • When THP migration is being used, memory management code needs to handle
    pmd migration entries properly. This patch uses !pmd_present() or
    is_swap_pmd() (depending on whether pmd_none() needs separate code or
    not) to check pmd migration entries at the places where a pmd entry is
    present.

    Since pmd-related code uses split_huge_page(), split_huge_pmd(),
    pmd_trans_huge(), pmd_trans_unstable(), or
    pmd_none_or_trans_huge_or_clear_bad(), this patch:

    1. adds pmd migration entry split code in split_huge_pmd(),

    2. takes care of pmd migration entries whenever pmd_trans_huge() is present,

    3. makes pmd_none_or_trans_huge_or_clear_bad() pmd migration entry aware.

    Since split_huge_page() uses split_huge_pmd() and pmd_trans_unstable()
    is equivalent to pmd_none_or_trans_huge_or_clear_bad(), we do not change
    them.

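    The typical check pattern at a pmd-present site looks roughly like this
    (sketch; helper names as used by the THP migration series):

        if (unlikely(is_swap_pmd(*pmd))) {
                VM_BUG_ON(!is_pmd_migration_entry(*pmd));
                pmd_migration_entry_wait(mm, pmd);      /* wait for migration */
                return 0;
        }
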
    Until this commit, a pmd entry should be:
    1. pointing to a pte page,
    2. is_swap_pmd(),
    3. pmd_trans_huge(),
    4. pmd_devmap(), or
    5. pmd_none().

    Signed-off-by: Zi Yan
    Cc: Kirill A. Shutemov
    Cc: "H. Peter Anvin"
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Cc: David Nellans
    Cc: Ingo Molnar
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zi Yan
     

07 Sep, 2017

5 commits

    Huge pages help to reduce the TLB miss rate, but they have a larger cache
    footprint, which may sometimes cause problems. For example, when
    clearing a huge page on an x86_64 platform, the cache footprint is 2M.
    But on a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only
    45M LLC (last level cache). That is, on average, there is 2.5M of LLC
    for each core and 1.25M for each thread.

    If the cache pressure is heavy when clearing the huge page, and we clear
    the huge page from the beginning to the end, it is possible that the
    beginning of the huge page has been evicted from the cache by the time we
    finish clearing its end. And it is possible for the application to
    access the beginning of the huge page right after it has been cleared.

    To help with the above situation, this patch changes the order in which
    the sub-pages are cleared. In quite a few situations, we can know the
    address that the application will access after we clear the huge page,
    for example, in a page fault handler. Instead of clearing the huge page
    from beginning to end, we clear the sub-pages farthest from the sub-page
    to be accessed first, and clear the sub-page to be accessed last. This
    makes the sub-page to be accessed the most cache-hot, and the sub-pages
    around it more cache-hot too. If we cannot know the address the
    application will access, the beginning of the huge page is assumed to be
    the address the application will access.

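    The clearing order can be modelled as follows (an illustrative
    user-space sketch of the idea, not the kernel implementation):

        #include <string.h>

        /* Clear sub-pages farthest from the target (to-be-accessed) one
         * first, so the target and its neighbours end up the most
         * cache-hot. */
        static void clear_towards_target(char *buf, long subpages,
                                         long subsz, long target)
        {
                long l = 0, r = subpages - 1;

                while (l <= r) {
                        /* whichever remaining end is farther goes first */
                        long i = (target - l >= r - target) ? l++ : r--;

                        memset(buf + i * subsz, 0, subsz);
                }
        }
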
    With this patch, the throughput increases ~28.3% in the vm-scalability
    anon-w-seq test case with 72 processes on a 2-socket Xeon E5 v3 2699
    system (36 cores, 72 threads). The test case creates 72 processes, each
    of which mmaps a big anonymous memory area and writes to it from
    beginning to end. For each process, the other processes can be seen as a
    background workload which generates heavy cache pressure. At the same
    time, the cache miss rate is reduced from ~33.4% to ~31.7%, the IPC
    (instructions per cycle) increases from 0.56 to 0.74, and the time spent
    in user space is reduced by ~7.9%.

    Christopher Lameter suggested clearing the bytes inside a sub-page from
    end to beginning too, but tests show no visible performance difference,
    perhaps because the sub-page size is small compared with the cache size.

    Thanks to Andi Kleen for proposing to use the address to be accessed to
    determine the order in which the sub-pages are cleared.

    The hugetlbfs access address could be improved; that will be done in
    another patch.

    [ying.huang@intel.com: improve readability of clear_huge_page()]
    Link: http://lkml.kernel.org/r/20170830051842.1397-1-ying.huang@intel.com
    Link: http://lkml.kernel.org/r/20170815014618.15842-1-ying.huang@intel.com
    Suggested-by: Andi Kleen
    Signed-off-by: "Huang, Ying"
    Acked-by: Jan Kara
    Reviewed-by: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Cc: Nadia Yvette Chambers
    Cc: Matthew Wilcox
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Shaohua Li
    Cc: Christopher Lameter
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
    Swap readahead is an important mechanism to reduce swap-in latency.
    Although a purely sequential memory access pattern isn't very common for
    anonymous memory, spatial locality is still considered valid.

    In the original swap readahead implementation, consecutive blocks in the
    swap device are read ahead based on a global estimation of spatial
    locality. But consecutive blocks in the swap device merely reflect the
    order of page reclaim and don't necessarily reflect the access pattern
    in virtual memory. And different tasks in the system may have different
    access patterns, which makes the global spatial locality estimation
    incorrect.

    In this patch, when a page fault occurs, the virtual pages near the
    fault address are read ahead instead of the swap slots near the faulting
    swap slot in the swap device. This avoids reading ahead unrelated swap
    slots. At the same time, swap readahead is changed to work per-VMA
    instead of globally, so that the different access patterns of different
    VMAs can be distinguished and a different readahead policy applied
    accordingly. The original core readahead detection and scaling algorithm
    is reused, because it is an effective algorithm for detecting spatial
    locality.

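    The window computation can be pictured as (illustrative sketch with
    hypothetical helper names, not the kernel code):

        start = fault_addr - (win / 2) * PAGE_SIZE;
        if (start < vma->vm_start)
                start = vma->vm_start;                  /* clamp to the VMA */
        end = start + win * PAGE_SIZE;
        if (end > vma->vm_end)
                end = vma->vm_end;

        for (addr = start; addr < end; addr += PAGE_SIZE)
                swapin_readahead_one(vma, addr);        /* hypothetical helper */
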
    The tests and results are as follows:

    Common test condition
    =====================

    Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM) Swap device:
    NVMe disk

    Micro-benchmark with combined access pattern
    ============================================

    vm-scalability, sequential swap test case: 4 processes eat 50G of
    virtual memory space and repeat the sequential memory writing until 300
    seconds have passed. The first round of writing triggers swap-out; the
    following rounds trigger sequential swap-in and swap-out.

    At the same time, run the vm-scalability random swap test case in the
    background: 8 processes eat 30G of virtual memory space and repeat the
    random memory writes until 300 seconds have passed. This triggers random
    swap-in in the background.

    This is a combined workload with sequential and random memory accesses
    at the same time. The results (for the sequential workload) are as
    follows:

    Base Optimized
    ---- ---------
    throughput 345413 KB/s 414029 KB/s (+19.9%)
    latency.average 97.14 us 61.06 us (-37.1%)
    latency.50th 2 us 1 us
    latency.60th 2 us 1 us
    latency.70th 98 us 2 us
    latency.80th 160 us 2 us
    latency.90th 260 us 217 us
    latency.95th 346 us 369 us
    latency.99th 1.34 ms 1.09 ms
    ra_hit% 52.69% 99.98%

    The original swap readahead algorithm is confused by the background
    random access workload, so its readahead hit rate is lower. The
    VMA-based readahead algorithm works much better.

    Linpack
    =======

    The test memory size is bigger than RAM to trigger swapping.

    Base Optimized
    ---- ---------
    elapsed_time 393.49 s 329.88 s (-16.2%)
    ra_hit% 86.21% 98.82%

    The scores of the base and optimized kernels show no visible change.
    But the elapsed time is reduced and the readahead hit rate improves, so
    the optimized kernel runs better in the startup and teardown stages.
    And the absolute value of the readahead hit rate is high, showing that
    spatial locality is still valid in some practical workloads.

    Link: http://lkml.kernel.org/r/20170807054038.1843-4-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Fengguang Wu
    Cc: Tim Chen
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
    After adding support for delaying THP (Transparent Huge Page) splitting
    until after swap-out, it is possible that some page table mappings of the
    THP are turned into swap entries. So reuse_swap_page() needs to check
    the swap count in addition to the map count, as before. This patch does
    that.

    In the huge PMD write-protect fault handler, in addition to the page map
    count, the swap count needs to be checked too, so the page lock needs to
    be acquired when calling reuse_swap_page(), in addition to the page
    table lock.

    [ying.huang@intel.com: silence a compiler warning]
    Link: http://lkml.kernel.org/r/87bmnzizjy.fsf@yhuang-dev.intel.com
    Link: http://lkml.kernel.org/r/20170724051840.2309-4-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Dan Williams
    Cc: Jens Axboe
    Cc: Michal Hocko
    Cc: Ross Zwisler [for brd.c, zram_drv.c, pmem.c]
    Cc: Vishal L Verma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
    Nadav Amit reported that zap_page_range() only specifies that the caller
    protect the VMA list but does not specify whether it is held for read or
    write, with callers using either. madvise holds mmap_sem for read,
    meaning that a parallel zap operation can unmap PTEs which are then
    potentially skipped by madvise, which can then return with stale TLB
    entries present. While the API could be extended, it would be a
    difficult API to use. This patch causes zap_page_range() to always
    consider flushing the full affected range. For small ranges or sparsely
    populated mappings, this may result in one additional spurious TLB
    flush. For larger ranges, it is possible that the TLB has already been
    flushed and the overhead is negligible. Either way, this approach is
    safer overall and avoids stale entries being present when madvise
    returns.

    This can be illustrated with the following program provided by Nadav
    Amit and slightly modified. With the patch applied, it has an exit code
    of 0, indicating that a stale TLB entry did not leak to userspace.

    ---8<---
    /*
     * The program header (includes, constants, globals and the rdtsc()
     * helper) was truncated in this log; the preamble below is a minimal
     * reconstruction so the snippet compiles.
     */
    #include <stdio.h>
    #include <string.h>
    #include <pthread.h>
    #include <sys/mman.h>

    #ifndef PAGE_SIZE
    #define PAGE_SIZE 4096UL
    #endif
    #define N_PAGES 32768UL         /* size is illustrative */

    static volatile int sync_step;
    static volatile char *p;

    static inline unsigned long rdtsc(void)
    {
        unsigned int lo, hi;

        asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
        return lo | ((unsigned long)hi << 32);
    }

    static inline void wait_rdtsc(unsigned long cycles)
    {
        unsigned long tsc = rdtsc();

        while (rdtsc() - tsc < cycles);
    }

    void *big_madvise_thread(void *ign)
    {
        sync_step = 1;
        while (sync_step != 2);
        madvise((void*)p, PAGE_SIZE * N_PAGES, MADV_DONTNEED);
        return NULL;
    }

    int main(void)
    {
        pthread_t aux_thread;

        p = mmap(0, PAGE_SIZE * N_PAGES, PROT_READ|PROT_WRITE,
                 MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
        memset((void*)p, 8, PAGE_SIZE * N_PAGES);

        pthread_create(&aux_thread, NULL, big_madvise_thread, NULL);
        while (sync_step != 1);

        *p = 8;         /* Cache in TLB */
        sync_step = 2;
        wait_rdtsc(100000);
        madvise((void*)p, PAGE_SIZE, MADV_DONTNEED);
        printf("data: %d (%s)\n", *p, (*p == 8 ? "stale, broken" : "cleared, fine"));
        return *p == 8 ? -1 : 0;
    }
    ---8<---
    Reported-by: Nadav Amit
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton

    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When servicing mmap() reads from file holes the current DAX code
    allocates a page cache page of all zeroes and places the struct page
    pointer in the mapping->page_tree radix tree. This has three major
    drawbacks:

    1) It consumes memory unnecessarily. For every 4k page that is read via
    a DAX mmap() over a hole, we allocate a new page cache page. This
    means that if you read 1GiB worth of pages, you end up using 1GiB of
    zeroed memory.

    2) It is slower than using a common zero page because each page fault
    has more work to do. Instead of just inserting a common zero page we
    have to allocate a page cache page, zero it, and then insert it.

    3) The fact that we had to check for both DAX exceptional entries and
    for page cache pages in the radix tree made the DAX code more
    complex.

    This series solves these issues by following the lead of the DAX PMD
    code and using a common 4k zero page instead. This reduces memory usage
    and decreases latencies for some workloads, and it simplifies the DAX
    code, removing over 100 lines in total.

    This patch (of 5):

    To be able to use the common 4k zero page in DAX we need to have our PTE
    fault path look more like our PMD fault path where a PTE entry can be
    marked as dirty and writeable as it is first inserted rather than
    waiting for a follow-up dax_pfn_mkwrite() => finish_mkwrite_fault()
    call.

    Right now we can rely on having a dax_pfn_mkwrite() call because we can
    distinguish between these two cases in do_wp_page():

    case 1: 4k zero page => writable DAX storage
    case 2: read-only DAX storage => writeable DAX storage

    This distinction is made via vm_normal_page(). vm_normal_page()
    returns false for the common 4k zero page, though, just as it does for
    DAX ptes. Instead of special-casing the DAX + 4k zero page case, we will
    simplify our DAX PTE page fault sequence so that it matches our DAX PMD
    sequence, and get rid of the dax_pfn_mkwrite() helper. We will instead
    use dax_iomap_fault() to handle write-protection faults.

    This means that insert_pfn() needs to follow the lead of
    insert_pfn_pmd() and allow us to pass in a 'mkwrite' flag. If 'mkwrite'
    is set insert_pfn() will do the work that was previously done by
    wp_page_reuse() as part of the dax_pfn_mkwrite() call path.

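    Inside insert_pfn() the new behaviour is roughly (an illustrative
    sketch of the 'mkwrite' idea, not the exact upstream hunk):

        if (mkwrite && !pte_none(*pte)) {
                /* Upgrade the existing, read-only entry in place. */
                entry = maybe_mkwrite(pte_mkdirty(*pte), vma);
                if (ptep_set_access_flags(vma, addr, pte, entry, 1))
                        update_mmu_cache(vma, addr, pte);
                goto out_unlock;
        }
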
    Link: http://lkml.kernel.org/r/20170724170616.25810-2-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Acked-by: Kirill A. Shutemov
    Cc: "Darrick J. Wong"
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Christoph Hellwig
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Ingo Molnar
    Cc: Jonathan Corbet
    Cc: Matthew Wilcox
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     

01 Sep, 2017

1 commit

    Replace all mmu_notifier_invalidate_page() calls by *_invalidate_range()
    and make sure they are bracketed by calls to *_invalidate_range_start()/end().

    Note that because we cannot presume the pmd value or pte value, we have
    to assume the worst and unconditionally report an invalidation as
    happening.

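    The replacement pattern is essentially (sketch):

        mmu_notifier_invalidate_range_start(mm, start, end);
        /* ... clear or update the PTEs/PMD ... */
        mmu_notifier_invalidate_range_end(mm, start, end);
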
    Signed-off-by: Jérôme Glisse
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Bernhard Held
    Cc: Adam Borowski
    Cc: Andrea Arcangeli
    Cc: Radim Krčmář
    Cc: Wanpeng Li
    Cc: Paolo Bonzini
    Cc: Takashi Iwai
    Cc: Nadav Amit
    Cc: Mike Galbraith
    Cc: Kirill A. Shutemov
    Cc: axie
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

19 Aug, 2017

2 commits

    Wenwei Tao has noticed that our current assumption, namely that the oom
    victim is dying and never makes any visible changes after it dies, so
    that the oom_reaper can tear it down, is not entirely true.

    __task_will_free_mem considers a task dying when SIGNAL_GROUP_EXIT is
    set, but do_group_exit sends SIGKILL to all threads _after_ the flag is
    set. So there is a race window during which some threads won't have
    fatal_signal_pending while the oom_reaper could start unmapping the
    address space. Moreover, some paths might not check for fatal signals
    before each PF/g-u-p/copy_from_user.

    We already have protection for oom_reaper vs. PF races by checking
    MMF_UNSTABLE. This has, however, been checked only for kernel threads
    (use_mm users), which can outlive the oom victim. A simple fix would be
    to extend the current check in handle_mm_fault to all tasks, but that
    wouldn't be sufficient, because the current check assumes that a kernel
    thread would bail out after EFAULT from get_user*/copy_from_user and
    never re-read the same address, which would succeed because the PF path
    has established the page tables already. This seems to be the case for
    the only existing use_mm user currently (the virtio driver), but it is
    rather fragile in general.

    This is even more fragile for more complex paths such as
    generic_perform_write, which can re-read the same address multiple times
    (e.g. iov_iter_copy_from_user_atomic failing and then
    iov_iter_fault_in_readable on retry).

    Therefore we have to implement MMF_UNSTABLE protection in a robust way
    and never make potentially corrupted content visible. That requires
    hooking deeper into the PF path and checking for the flag _every time_
    before a pte for anonymous memory is established (that means all
    !VM_SHARED mappings).

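    The check boils down to something like this small helper (sketch),
    called right before a new anonymous pte is established:

        static inline int check_stable_address_space(struct mm_struct *mm)
        {
                if (unlikely(test_bit(MMF_UNSTABLE, &mm->flags)))
                        return VM_FAULT_SIGBUS;
                return 0;
        }
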
    The corruption can be triggered artificially
    (http://lkml.kernel.org/r/201708040646.v746kkhC024636@www262.sakura.ne.jp)
    but there doesn't seem to be any real life bug report. The race window
    should be quite tight to trigger most of the time.

    Link: http://lkml.kernel.org/r/20170807113839.16695-3-mhocko@kernel.org
    Fixes: aac453635549 ("mm, oom: introduce oom reaper")
    Signed-off-by: Michal Hocko
    Reported-by: Wenwei Tao
    Tested-by: Tetsuo Handa
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Argangeli
    Cc: David Rientjes
    Cc: Oleg Nesterov
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    Tetsuo Handa has noticed that the MMF_UNSTABLE SIGBUS path in
    handle_mm_fault causes a lockdep splat:

    Out of memory: Kill process 1056 (a.out) score 603 or sacrifice child
    Killed process 1056 (a.out) total-vm:4268108kB, anon-rss:2246048kB, file-rss:0kB, shmem-rss:0kB
    a.out (1169) used greatest stack depth: 11664 bytes left
    DEBUG_LOCKS_WARN_ON(depth
    Signed-off-by: Michal Hocko
    Acked-by: David Rientjes
    Cc: Andrea Argangeli
    Cc: "Kirill A. Shutemov"
    Cc: Oleg Nesterov
    Cc: Wenwei Tao
    Cc: [4.9+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

11 Aug, 2017

2 commits

    Nadav reported that parallel MADV_DONTNEED on the same range has a stale
    TLB problem; Mel fixed it [1] and found the same problem in MADV_FREE [2].

    Quote from Mel Gorman:
    "The race in question is CPU 0 running madv_free and updating some PTEs
    while CPU 1 is also running madv_free and looking at the same PTEs.
    CPU 1 may have writable TLB entries for a page but fail the pte_dirty
    check (because CPU 0 has updated it already) and potentially fail to
    flush.

    Hence, when madv_free on CPU 1 returns, there are still potentially
    writable TLB entries and the underlying PTE is still present so that a
    subsequent write does not necessarily propagate the dirty bit to the
    underlying PTE any more. Reclaim at some unknown time at the future
    may then see that the PTE is still clean and discard the page even
    though a write has happened in the meantime. I think this is possible
    but I could have missed some protection in madv_free that prevents it
    happening."

    This patch aims to solve both problems at once and is also ready for
    another problem with the KSM, MADV_FREE and soft-dirty story [3].

    The TLB batch API (tlb_[gather|finish]_mmu) uses [inc|dec]_tlb_flush_pending
    and mm_tlb_flush_pending so that, when tlb_finish_mmu is called, we can
    detect that parallel threads are going on. In that case, forcefully
    flush the TLB to prevent the user from accessing memory via a stale TLB
    entry, even though the gather failed to collect any page table entries.

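    Conceptually, tlb_finish_mmu() ends up doing (sketch):

        void tlb_finish_mmu(struct mmu_gather *tlb,
                            unsigned long start, unsigned long end)
        {
                /*
                 * Another thread may have batched a flush for the same mm
                 * while we were gathering; if so, flush even though our own
                 * gather may be empty, so no stale TLB entry survives.
                 */
                bool force = mm_tlb_flush_nested(tlb->mm);

                arch_tlb_finish_mmu(tlb, start, end, force);
                dec_tlb_flush_pending(tlb->mm);
        }
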
    I confirmed this patch works with the test program Nadav gave [4], so
    this patch supersedes "mm: Always flush VMA ranges affected by
    zap_page_range v2" in the current mmotm.

    NOTE:

    This patch modifies the arch-specific TLB gathering interface (x86, ia64,
    s390, sh, um). Most architectures are straightforward, but s390 needs
    care because tlb_flush_mmu works only if mm->context.flush_mm is set to
    non-zero, which happens only when a pte entry really is cleared by
    ptep_get_and_clear and friends. However, this problem never changes the
    pte entries; we still need to flush to prevent memory accesses through a
    stale TLB.

    [1] http://lkml.kernel.org/r/20170725101230.5v7gvnjmcnkzzql3@techsingularity.net
    [2] http://lkml.kernel.org/r/20170725100722.2dxnmgypmwnrfawp@suse.de
    [3] http://lkml.kernel.org/r/BD3A0EBE-ECF4-41D4-87FA-C755EA9AB6BD@gmail.com
    [4] https://patchwork.kernel.org/patch/9861621/

    [minchan@kernel.org: decrease tlb flush pending count in tlb_finish_mmu]
    Link: http://lkml.kernel.org/r/20170808080821.GA31730@bbox
    Link: http://lkml.kernel.org/r/20170802000818.4760-7-namit@vmware.com
    Signed-off-by: Minchan Kim
    Signed-off-by: Nadav Amit
    Reported-by: Nadav Amit
    Reported-by: Mel Gorman
    Acked-by: Mel Gorman
    Cc: Ingo Molnar
    Cc: Russell King
    Cc: Tony Luck
    Cc: Martin Schwidefsky
    Cc: "David S. Miller"
    Cc: Heiko Carstens
    Cc: Yoshinori Sato
    Cc: Jeff Dike
    Cc: Andrea Arcangeli
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Nadav Amit
    Cc: Rik van Riel
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    This patch is a preparatory patch for solving race problems caused by
    TLB batching. For that, we will increase/decrease the TLB flush pending
    count of mm_struct whenever tlb_[gather|finish]_mmu is called.

    Before making that change, this patch separates the architecture-specific
    part, renames it to arch_tlb_[gather|finish]_mmu, and has the generic
    part just call it.

    It shouldn't change any behavior.

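    The split is mechanical; the generic entry point becomes a thin wrapper
    (sketch):

        void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
                            unsigned long start, unsigned long end)
        {
                arch_tlb_gather_mmu(tlb, mm, start, end);
        }
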
    Link: http://lkml.kernel.org/r/20170802000818.4760-5-namit@vmware.com
    Signed-off-by: Minchan Kim
    Signed-off-by: Nadav Amit
    Acked-by: Mel Gorman
    Cc: Ingo Molnar
    Cc: Russell King
    Cc: Tony Luck
    Cc: Martin Schwidefsky
    Cc: "David S. Miller"
    Cc: Heiko Carstens
    Cc: Yoshinori Sato
    Cc: Jeff Dike
    Cc: Andrea Arcangeli
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Nadav Amit
    Cc: Rik van Riel
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

03 Aug, 2017

1 commit

    Nadav Amit identified a theoretical race between page reclaim and
    mprotect due to TLB flushes being batched outside of the PTL being held.

    He described the race as follows:

    CPU0                                    CPU1
    ----                                    ----
                                            user accesses memory using RW PTE
                                            [PTE now cached in TLB]
    try_to_unmap_one()
    ==> ptep_get_and_clear()
    ==> set_tlb_ubc_flush_pending()
                                            mprotect(addr, PROT_READ)
                                            ==> change_pte_range()
                                            ==> [ PTE non-present - no flush ]

                                            user writes using cached RW PTE
    ...

    try_to_unmap_flush()

    The same type of race exists for reads when protecting for PROT_NONE and
    also exists for operations that can leave an old TLB entry behind such
    as munmap, mremap and madvise.

    For some operations like mprotect, it's not necessarily a data integrity
    issue, but it is a correctness issue, as there is a window where an
    mprotect that limits access still allows access. For munmap, it's
    potentially a data integrity issue, although the race is massive, as an
    munmap, mmap and return to userspace must all complete between the
    window when reclaim drops the PTL and flushes the TLB. However, it's
    theoretically possible, so handle this issue by flushing the mm if
    reclaim is potentially currently batching TLB flushes.

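    The guard amounts to a small rmap-side helper (sketch):

        void flush_tlb_batched_pending(struct mm_struct *mm)
        {
                if (mm->tlb_flush_batched) {
                        flush_tlb_mm(mm);
                        /* keep the clear ordered after the flush */
                        barrier();
                        mm->tlb_flush_batched = false;
                }
        }
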
    Other instances where a flush is required for a present pte should be
    ok, as either the page lock is held, preventing parallel reclaim, or a
    page reference count is elevated, preventing a parallel free leading to
    corruption. In the case of page_mkclean there isn't an obvious path
    that userspace could take advantage of without using the operations that
    are guarded by this patch. Other users such as gup, in a race with
    reclaim, look just at PTEs. Huge page variants should be ok as they
    don't race with reclaim. mincore only looks at PTEs. userfault also
    should be ok, as if a parallel reclaim takes place, it will either fault
    the page back in or read some of the data before the flush occurs,
    triggering a fault.

    Note that a variant of this patch was acked by Andy Lutomirski but this
    was for the x86 parts on top of his PCID work which didn't make the 4.13
    merge window as expected. His ack is dropped from this version and
    there will be a follow-on patch on top of PCID that will include his
    ack.

    [akpm@linux-foundation.org: tweak comments]
    [akpm@linux-foundation.org: fix spello]
    Link: http://lkml.kernel.org/r/20170717155523.emckq2esjro6hf3z@suse.de
    Reported-by: Nadav Amit
    Signed-off-by: Mel Gorman
    Cc: Andy Lutomirski
    Cc: [v4.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

13 Jul, 2017

1 commit

  • With gcc 4.1.2:

    mm/memory.o: In function `create_huge_pmd':
    memory.c:(.text+0x93e): undefined reference to `do_huge_pmd_anonymous_page'

    Interestingly, create_huge_pmd() is emitted in the assembler output, but
    never called.

    Converting transparent_hugepage_enabled() from a macro to a static
    inline function reduced the ability of the compiler to remove unused
    code.

    Fix this by marking create_huge_pmd() inline.

    Fixes: 16981d763501c0e0 ("mm: improve readability of transparent_hugepage_enabled()")
    Link: http://lkml.kernel.org/r/1499842660-10665-1-git-send-email-geert@linux-m68k.org
    Signed-off-by: Geert Uytterhoeven
    Acked-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     

11 Jul, 2017

1 commit

  • The preferred strategy to define debugfs attributes is to use the
    DEFINE_DEBUGFS_ATTRIBUTE() macro and to use debugfs_create_file_unsafe().

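    In mm/memory.c this applies to the fault_around_bytes knob; roughly
    (sketch):

        DEFINE_DEBUGFS_ATTRIBUTE(fault_around_bytes_fops,
                                 fault_around_bytes_get,
                                 fault_around_bytes_set, "%llu\n");

        debugfs_create_file_unsafe("fault_around_bytes", 0644, NULL, NULL,
                                   &fault_around_bytes_fops);
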
    Link: http://lkml.kernel.org/r/20170528145948.32127-1-y.pronenko@gmail.com
    Signed-off-by: Yevgen Pronenko
    Cc: "Kirill A . Shutemov"
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yevgen Pronenko
     

07 Jul, 2017

2 commits

  • Track the following reclaim counters for every memory cgroup: PGREFILL,
    PGSCAN, PGSTEAL, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE and PGLAZYFREED.

    These values are exposed using the memory.stat interface of cgroup v2.

    The meaning of each value is the same as for global counters, available
    using /proc/vmstat.

    Also, for consistency, rename mem_cgroup_count_vm_event() to
    count_memcg_event_mm().

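    A typical call site after the rename looks like (sketch):

        count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
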
    Link: http://lkml.kernel.org/r/1494530183-30808-1-git-send-email-guro@fb.com
    Signed-off-by: Roman Gushchin
    Suggested-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Cc: Li Zefan
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • pte_offset_map_lock() finds and takes ptl, and returns pte. But some
    callers return without unlocking the ptl when pte == NULL, which seems
    weird.

    Git history said that !pte check in change_pte_range() was introduced in
    commit 1ad9f620c3a2 ("mm: numa: recheck for transhuge pages under lock
    during protection changes") and still remains after commit 175ad4f1e7a2
    ("mm: mprotect: use pmd_trans_unstable instead of taking the pmd_lock")
    which partially reverts 1ad9f620c3a2. So I think that it's just dead
    code.

    Many other callers of pte_offset_map_lock() never check for a NULL
    return, so let's do likewise.

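    The usual pattern is simply (sketch):

        pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
        /* ... walk or modify the ptes ... */
        pte_unmap_unlock(pte, ptl);
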
    Link: http://lkml.kernel.org/r/1495089737-1292-1-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

19 Jun, 2017

1 commit

  • Stack guard page is a useful feature to reduce a risk of stack smashing
    into a different mapping. We have been using a single page gap which
    is sufficient to prevent having stack adjacent to a different mapping.
    But this seems to be insufficient in the light of the stack usage in
    userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
    used functions. Others use constructs like gid_t buffer[NGROUPS_MAX]
    which is 256kB, or stack strings with MAX_ARG_STRLEN.

    This will become especially dangerous for suid binaries and the default
    no limit for the stack size limit because those applications can be
    tricked to consume a large portion of the stack and a single glibc call
    could jump over the guard page. These attacks are not theoretical,
    unfortunately.

    Make those attacks less probable by increasing the stack guard gap
    to 1MB (on systems with 4k pages; but make it depend on the page size
    because systems with larger base pages might cap stack allocations in
    the PAGE_SIZE units) which should cover larger alloca() and VLA stack
    allocations. It is obviously not a full fix because the problem is
    somehow inherent, but it should reduce attack space a lot.

    One could argue that the gap size should be configurable from userspace,
    but that can be done later when somebody finds that the new 1MB is wrong
    for some special case applications. For now, add a kernel command line
    option (stack_guard_gap) to specify the stack gap size (in page units).

    Implementation wise, first delete all the old code for stack guard page:
    because although we could get away with accounting one extra page in a
    stack vma, accounting a larger gap can break userspace - case in point,
    a program run with "ulimit -S -v 20000" failed when the 1MB gap was
    counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
    and strict non-overcommit mode.

    Instead of keeping gap inside the stack vma, maintain the stack guard
    gap as a gap between vmas: using vm_start_gap() in place of vm_start
    (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
    places which need to respect the gap - mainly arch_get_unmapped_area(),
    and the vma tree's subtree_gap support for that.

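    The gap-aware helper looks roughly like (sketch):

        static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
        {
                unsigned long vm_start = vma->vm_start;

                if (vma->vm_flags & VM_GROWSDOWN) {
                        vm_start -= stack_guard_gap;
                        if (vm_start > vma->vm_start)   /* underflow check */
                                vm_start = 0;
                }
                return vm_start;
        }
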
    Original-patch-by: Oleg Nesterov
    Original-patch-by: Michal Hocko
    Signed-off-by: Hugh Dickins
    Acked-by: Michal Hocko
    Tested-by: Helge Deller # parisc
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

03 Jun, 2017

1 commit

  • When the pmd_devmap() checks were added by 5c7fb56e5e3f ("mm, dax:
    dax-pmd vs thp-pmd vs hugetlbfs-pmd") to add better support for DAX huge
    pages, they were all added to the end of if() statements after existing
    pmd_trans_huge() checks. So, things like:

    - if (pmd_trans_huge(*pmd))
    + if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))

    When further checks were added after pmd_trans_unstable() checks by
    commit 7267ec008b5c ("mm: postpone page table allocation until we have
    page to map") they were also added at the end of the conditional:

    + if (pmd_trans_unstable(fe->pmd) || pmd_devmap(*fe->pmd))

    This ordering is fine for pmd_trans_huge(), but doesn't work for
    pmd_trans_unstable(). This is because DAX huge pages trip the bad_pmd()
    check inside of pmd_none_or_trans_huge_or_clear_bad() (called by
    pmd_trans_unstable()), which prints out a warning and returns 1. So, we
    do end up doing the right thing, but only after spamming dmesg with
    suspicious looking messages:

    mm/pgtable-generic.c:39: bad pmd ffff8808daa49b88(84000001006000a5)

    Reorder these checks in a helper so that pmd_devmap() is checked first,
    avoiding the error messages, and add a comment explaining why the
    ordering is important.

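    The helper is essentially (sketch):

        /*
         * Check pmd_devmap() first: a DAX huge pmd would otherwise trip the
         * bad_pmd() warning inside pmd_trans_unstable().
         */
        static inline int pmd_devmap_trans_unstable(pmd_t *pmd)
        {
                return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
        }
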
    Fixes: commit 7267ec008b5c ("mm: postpone page table allocation until we have page to map")
    Link: http://lkml.kernel.org/r/20170522215749.23516-1-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Cc: Pawel Lebioda
    Cc: "Darrick J. Wong"
    Cc: Alexander Viro
    Cc: Christoph Hellwig
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: Matthew Wilcox
    Cc: "Kirill A . Shutemov"
    Cc: Dave Jiang
    Cc: Xiong Zhou
    Cc: Eryu Guan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     

02 Apr, 2017

1 commit


29 Mar, 2017

1 commit


10 Mar, 2017

2 commits


02 Mar, 2017

4 commits

  • We are going to split a new header out of <linux/sched.h>, which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder file that just maps to <linux/sched.h> to
    make this patch obviously correct and bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • …sched/numa_balancing.h>

    We are going to split <linux/sched/numa_balancing.h> out of <linux/sched.h>, which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder <linux/sched/numa_balancing.h> file that just
    maps to <linux/sched.h> to make this patch obviously correct and
    bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     
  • We are going to split a new header out of <linux/sched.h>, which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder file that just maps to <linux/sched.h> to
    make this patch obviously correct and bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder <linux/sched/mm.h> file that just maps to
    <linux/sched.h> to make this patch obviously correct and bisectable.

    The APIs that are going to be moved first are:

    mm_alloc()
    __mmdrop()
    mmdrop()
    mmdrop_async_fn()
    mmdrop_async()
    mmget_not_zero()
    mmput()
    mmput_async()
    get_task_mm()
    mm_access()
    mm_release()

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

25 Feb, 2017

1 commit

  • Patch series "Numabalancing preserve write fix", v2.

    This patch series addresses an issue w.r.t. THP migration and the
    autonuma preserve-write feature. migrate_misplaced_transhuge_page()
    cannot deal with concurrent modification of the page. It does a page
    copy without following the migration pte sequence. IIUC, this was done
    to keep the migration simpler, and at the time of implementation we
    didn't have THP page cache, which would have required a more elaborate
    migration scheme. That means THP autonuma migration expects the
    protnone-with-saved-write to be done such that both kernel and user
    cannot update the page content. This patch series enables archs like
    ppc64 to do that. We are good with the hash translation mode with the
    current code, because we never create a hardware page table entry for a
    protnone pte.

    This patch (of 2):

    Autonuma preserves the write permission across a NUMA fault to avoid
    taking a write fault after a NUMA fault (commit b191f9b106ea "mm: numa:
    preserve PTE write permissions across a NUMA hinting fault").
    Architectures can implement protnone in different ways, and some may
    choose to implement it by clearing the Read/Write/Exec bits of the pte.
    Setting the write bit on such a pte can result in wrong behaviour. Fix
    this up by allowing the arch to override how the write bit is saved on a
    protnone pte.

    [aneesh.kumar@linux.vnet.ibm.com: don't mark pte saved write in case of dirty_accountable]
    Link: http://lkml.kernel.org/r/1487942884-16517-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    [aneesh.kumar@linux.vnet.ibm.com: v3]
    Link: http://lkml.kernel.org/r/1487498625-10891-2-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/1487050314-3892-2-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Acked-by: Michael Neuling
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V