06 Jan, 2017

1 commit

  • commit 84d77d3f06e7e8dea057d10e8ec77ad71f721be3 upstream.

    It is a reasonable expectation that if an executable file is not
    readable there will be no way for a user without special privileges to
    read the file. This is enforced in ptrace_attach but if ptrace
    is already attached before exec there is no enforcement for read-only
    executables.

    As the only way to read such an mm is through access_process_vm,
    spin off a variant called ptrace_access_vm that will fail if the
    target process is not being ptraced by the current process, or if
    the current process did not have sufficient privileges to read the
    target process's mm when ptracing began.

    In the ptrace implementations, replace access_process_vm with
    ptrace_access_vm. Several ptrace sites still use access_process_vm,
    as they read the target executable's instructions (for kernel
    consumption) or register stacks; it does not appear necessary to add
    a permission check to those calls.

    This bug has always existed in Linux.

    Fixes: v1.0
    Reported-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
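    A minimal sketch of the kind of gate described above, layered in front
    of the existing remote-access path. The privilege test shown as
    may_read_unreadable_exec() is a hypothetical placeholder, not the
    upstream check, and error handling is reduced to the essentials:

        /* Illustrative sketch only -- not the mainline implementation. */
        int ptrace_access_vm(struct task_struct *tsk, unsigned long addr,
                             void *buf, int len, unsigned int gup_flags)
        {
                struct mm_struct *mm = get_task_mm(tsk);
                int ret;

                if (!mm)
                        return 0;

                /*
                 * Proceed only when the caller is the active tracer and
                 * held sufficient privilege over this mm when the ptrace
                 * attachment was established.
                 */
                if (ptrace_parent(tsk) != current ||
                    !may_read_unreadable_exec(current, mm)) {
                        mmput(mm);
                        return 0;
                }

                ret = __access_remote_vm(tsk, mm, addr, buf, len, gup_flags);
                mmput(mm);
                return ret;
        }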
     

19 Oct, 2016

4 commits

  • This removes the 'write' argument from access_process_vm() and replaces
    it with 'gup_flags' as use of this function previously silently implied
    FOLL_FORCE, whereas after this patch callers explicitly pass this flag.

    We make this explicit as use of FOLL_FORCE can result in surprising
    behaviour (and hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Jesper Nilsson
    Acked-by: Michal Hocko
    Acked-by: Michael Ellerman
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
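    A before/after sketch of the caller-side conversion this series
    performs; the surrounding task lookup is elided and the variable names
    are illustrative:

        /* Before: boolean 'write' argument, FOLL_FORCE applied implicitly. */
        copied = access_process_vm(tsk, addr, buf, len, 1);

        /* After: the caller spells out exactly which gup flags it wants. */
        copied = access_process_vm(tsk, addr, buf, len,
                                   FOLL_FORCE | FOLL_WRITE);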
     
  • This removes the 'write' argument from access_remote_vm() and replaces
    it with 'gup_flags' as use of this function previously silently implied
    FOLL_FORCE, whereas after this patch callers explicitly pass this flag.

    We make this explicit as use of FOLL_FORCE can result in surprising
    behaviour (and hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the 'write' argument from __access_remote_vm() and replaces
    it with 'gup_flags' as use of this function previously silently implied
    FOLL_FORCE, whereas after this patch callers explicitly pass this flag.

    We make this explicit as use of FOLL_FORCE can result in surprising
    behaviour (and hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the 'write' and 'force' from get_user_pages_remote() and
    replaces them with 'gup_flags' to make the use of FOLL_FORCE explicit in
    callers as use of this flag can result in surprising behaviour (and
    hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Reviewed-by: Jan Kara
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
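    The analogous conversion for get_user_pages_remote() folds the two
    booleans into one flags word; a hedged sketch of a typical call site
    (argument names are illustrative):

        /* Before: separate 'write' and 'force' booleans. */
        ret = get_user_pages_remote(tsk, mm, start, nr_pages,
                                    1 /* write */, 1 /* force */,
                                    pages, vmas);

        /* After: the same request expressed with explicit gup flags. */
        ret = get_user_pages_remote(tsk, mm, start, nr_pages,
                                    FOLL_WRITE | FOLL_FORCE,
                                    pages, vmas);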
     

08 Oct, 2016

2 commits

    vm_insert_mixed(), unlike vm_insert_pfn_prot() and vmf_insert_pfn_pmd(),
    fails to check the pgprot_t it uses for the mapping against the one
    recorded in the memtype tracking tree. Add the missing call to
    track_pfn_insert() to preclude cases where incompatible aliased mappings
    are established for a given physical address range.

    Link: http://lkml.kernel.org/r/147328717909.35069.14256589123570653697.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Cc: David Airlie
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
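    A simplified sketch of where the missing call lands; the real
    vm_insert_mixed() also dispatches between insert_page() and insert_pfn()
    depending on whether a struct page backs the pfn, which is omitted here:

        int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
                            pfn_t pfn)
        {
                pgprot_t pgprot = vma->vm_page_prot;

                if (addr < vma->vm_start || addr >= vma->vm_end)
                        return -EFAULT;

                /* The added check: consult the memtype tracking tree,
                 * as vm_insert_pfn_prot() already does. */
                track_pfn_insert(vma, &pgprot, pfn);

                return insert_pfn(vma, addr, pfn, pgprot);
        }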
     
  • There are only few use_mm() users in the kernel right now. Most of them
    write to the target memory but vhost driver relies on
    copy_from_user/get_user from a kernel thread context. This makes it
    impossible to reap the memory of an oom victim which shares the mm with
    the vhost kernel thread because it could see a zero page unexpectedly
    and theoretically make an incorrect decision visible outside of the
    killed task context.

    To quote Michael S. Tsirkin:
    : Getting an error from __get_user and friends is handled gracefully.
    : Getting zero instead of a real value will cause userspace
    : memory corruption.

    The vhost kernel thread is bound to an open fd of the vhost device which
    is not tied to the mm owner's life cycle in general. The device fd can
    be inherited or passed over to another process which means that we
    really have to be careful about unexpected memory corruption because
    unlike for normal oom victims the result will be visible outside of the
    oom victim context.

    Make sure that no kthread context (users of use_mm) can ever see
    corrupted data because of the oom reaper and hook into the page fault
    path by checking MMF_UNSTABLE mm flag. __oom_reap_task_mm will set the
    flag before it starts unmapping the address space while the flag is
    checked after the page fault has been handled. If the flag is set then
    SIGBUS is triggered so any g-u-p user will get an error code.

    Regular tasks do not need this protection because all which share the mm
    are killed when the mm is reaped and so the corruption will not outlive
    them.

    This patch shouldn't have any visible effect at this moment because the
    OOM killer doesn't invoke oom reaper for tasks with mm shared with
    kthreads yet.

    Link: http://lkml.kernel.org/r/1472119394-11342-9-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: "Michael S. Tsirkin"
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
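    A sketch of the check this adds at the end of the fault path; the exact
    placement and surrounding conditions in mainline may differ:

        /*
         * The mm has already been reaped: a kthread faulting through
         * use_mm() could otherwise see a fresh zero page where user data
         * used to be, so fail the access with SIGBUS instead.
         */
        if (unlikely((current->flags & PF_KTHREAD) &&
                     !(ret & VM_FAULT_ERROR) &&
                     test_bit(MMF_UNSTABLE, &mm->flags)))
                ret = VM_FAULT_SIGBUS;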
     

26 Sep, 2016

1 commit

  • The NUMA balancing logic uses an arch-specific PROT_NONE page table flag
    defined by pte_protnone() or pmd_protnone() to mark PTEs or huge page
    PMDs respectively as requiring balancing upon a subsequent page fault.
    User-defined PROT_NONE memory regions which also have this flag set will
    not normally invoke the NUMA balancing code as do_page_fault() will send
    a segfault to the process before handle_mm_fault() is even called.

    However if access_remote_vm() is invoked to access a PROT_NONE region of
    memory, handle_mm_fault() is called via faultin_page() and
    __get_user_pages() without any access checks being performed, meaning
    the NUMA balancing logic is incorrectly invoked on a non-NUMA memory
    region.

    A simple means of triggering this problem is to access PROT_NONE mmap'd
    memory using /proc/self/mem which reliably results in the NUMA handling
    functions being invoked when CONFIG_NUMA_BALANCING is set.

    This issue was reported in bugzilla (issue 99101) which includes some
    simple repro code.

    There are BUG_ON() checks in do_numa_page() and do_huge_pmd_numa_page()
    added at commit c0e7cad to avoid accidentally provoking strange
    behaviour by attempting to apply NUMA balancing to pages that are in
    fact PROT_NONE. The BUG_ON()'s are consistently triggered by the repro.

    This patch moves the PROT_NONE check into mm/memory.c rather than
    invoking BUG_ON() as faulting in these pages via faultin_page() is a
    valid reason for reaching the NUMA check with the PROT_NONE page table
    flag set and is therefore not always a bug.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=99101
    Reported-by: Trevor Saunders
    Signed-off-by: Lorenzo Stoakes
    Acked-by: Rik van Riel
    Cc: Andrew Morton
    Cc: Mel Gorman
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
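    One way to express the relocated check; pte_is_numa_hint() is a
    hypothetical helper used only to keep the sketch self-contained:

        /*
         * Only treat a protnone entry as a NUMA hint when the VMA itself
         * grants some access; a genuine PROT_NONE region must not reach
         * do_numa_page() via faultin_page()/__get_user_pages().
         */
        static bool pte_is_numa_hint(struct vm_area_struct *vma, pte_t pte)
        {
                return pte_protnone(pte) &&
                       (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC));
        }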
     

14 Sep, 2016

1 commit

  • Commit:

    4d9424669946 ("mm: convert p[te|md]_mknonnuma and remaining page table manipulations")

    changed NUMA balancing from _PAGE_NUMA to using PROT_NONE, and was quickly
    found to introduce a regression with NUMA grouping.

    It was followed up by these commits:

    53da3bc2ba9e ("mm: fix up numa read-only thread grouping logic")
    bea66fbd11af ("mm: numa: group related processes based on VMA flags instead of page table flags")
    b191f9b106ea ("mm: numa: preserve PTE write permissions across a NUMA hinting fault")

    The first two of those commits try alternate approaches to NUMA
    grouping, which apparently do not work as well as looking at the PTE
    write permissions.

    The third patch preserves the PTE write permissions across a NUMA
    protection fault. However, it forgets to revert the condition for
    whether or not to group tasks together back to what it was before
    v3.19, even though the information is now preserved in the page tables
    once again.

    This patch brings the NUMA grouping heuristic back to what it was
    before commit 4d9424669946, which the changelogs of subsequent
    commits suggest worked best.

    We have all the information again. We should probably use it.

    Signed-off-by: Rik van Riel
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: aarcange@redhat.com
    Cc: linux-mm@kvack.org
    Cc: mgorman@suse.de
    Link: http://lkml.kernel.org/r/20160908213053.07c992a9@annuminas.surriel.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
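    A hedged sketch of the restored shape of the grouping condition in
    do_numa_page(), keyed off the preserved PTE write bit rather than the
    VMA flags:

        /*
         * Pages that were not writable at fault time are likely shared
         * read-only data (e.g. library text); do not group tasks based
         * on them.
         */
        if (!pte_write(pte))
                flags |= TNF_NO_GROUP;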
     

03 Aug, 2016

2 commits

    Every swap-in anonymous page starts at the head of the inactive lru
    list. It is then activated unconditionally when the VM decides to
    reclaim, because the page table entry for the page usually has the
    accessed bit set. Thus, its window for getting a new reference is
    2 * NR_inactive + NR_active, while for other pages it is
    NR_inactive + NR_active.

    It's not fair that such pages have a greater chance of being
    referenced than other newly allocated pages, which start at the head
    of the active lru list.

    Johannes:

    : The page can still have a valid copy on the swap device, so prefering to
    : reclaim that page over a fresh one could make sense. But as you point
    : out, having it start inactive instead of active actually ends up giving it
    : *more* LRU time, and that seems to be without justification.

    Rik:

    : The reason newly read in swap cache pages start on the inactive list is
    : that we do some amount of read-around, and do not know which pages will
    : get used.
    :
    : However, immediately activating the ones that DO get used, like your patch
    : does, is the right thing to do.

    Link: http://lkml.kernel.org/r/1469762740-17860-1-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Acked-by: Johannes Weiner
    Acked-by: Rik van Riel
    Cc: Nadav Amit
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
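    A sketch of the idea as it would sit in do_swap_page(), right after the
    swapped-in page has been mapped and its rmap established; the precise
    placement in mainline may differ:

        /*
         * The faulting task has genuinely used the swapped-in page now,
         * so promote it to the active list like any other freshly used
         * anonymous page instead of leaving it at the inactive head.
         */
        activate_page(page);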
     
  • I ran into this:

    BUG: sleeping function called from invalid context at mm/page_alloc.c:3784
    in_atomic(): 0, irqs_disabled(): 0, pid: 1434, name: trinity-c1
    2 locks held by trinity-c1/1434:
    #0: (&mm->mmap_sem){......}, at: [] __do_page_fault+0x1ce/0x8f0
    #1: (rcu_read_lock){......}, at: [] filemap_map_pages+0xd6/0xdd0

    CPU: 0 PID: 1434 Comm: trinity-c1 Not tainted 4.7.0+ #58
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
    Call Trace:
    dump_stack+0x65/0x84
    panic+0x185/0x2dd
    ___might_sleep+0x51c/0x600
    __might_sleep+0x90/0x1a0
    __alloc_pages_nodemask+0x5b1/0x2160
    alloc_pages_current+0xcc/0x370
    pte_alloc_one+0x12/0x90
    __pte_alloc+0x1d/0x200
    alloc_set_pte+0xe3e/0x14a0
    filemap_map_pages+0x42b/0xdd0
    handle_mm_fault+0x17d5/0x28b0
    __do_page_fault+0x310/0x8f0
    trace_do_page_fault+0x18d/0x310
    do_async_page_fault+0x27/0xa0
    async_page_fault+0x28/0x30

    The important bits from the above is that filemap_map_pages() is calling
    into the page allocator while holding rcu_read_lock (sleeping is not
    allowed inside RCU read-side critical sections).

    According to Kirill Shutemov, the prefaulting code in do_fault_around()
    is supposed to take care of this, but missing error handling means that
    the allocation failure can go unnoticed.

    We don't need to return VM_FAULT_OOM (or any other error) here, since we
    can just let the normal fault path try again.

    Fixes: 7267ec008b5c ("mm: postpone page table allocation until we have page to map")
    Link: http://lkml.kernel.org/r/1469708107-11868-1-git-send-email-vegard.nossum@oracle.com
    Signed-off-by: Vegard Nossum
    Acked-by: Kirill A. Shutemov
    Cc: "Hillf Danton"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vegard Nossum
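    The missing error handling amounts to bailing out of the faultaround
    path when the preallocation fails and letting the normal fault path
    retry; a hedged sketch of do_fault_around()'s preallocation step:

        if (pmd_none(*fe->pmd)) {
                fe->prealloc_pte = pte_alloc_one(fe->vma->vm_mm, fe->address);
                if (!fe->prealloc_pte)
                        goto out;       /* fall back to the normal fault path */
                smp_wmb();              /* see the comment in __pte_alloc() */
        }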
     

27 Jul, 2016

12 commits

    For file mappings, we don't deposit page tables on THP allocation
    because it's not strictly required to implement split_huge_pmd(): we
    can just clear the pmd and let subsequent page faults reconstruct the
    page table.

    But Power makes use of the deposited page table to address an MMU quirk.

    Let's hide THP page cache, including huge tmpfs, under a separate
    config option, so it can be forbidden on Power.

    We can revert the patch later once a solution for Power is found.

    Link: http://lkml.kernel.org/r/1466021202-61880-36-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Here's basic implementation of huge pages support for shmem/tmpfs.

    It's all pretty straightforward:

    - shmem_getpage() allocates a huge page if it can and tries to insert
    it into the radix tree with shmem_add_to_page_cache();

    - shmem_add_to_page_cache() puts the page onto the radix tree if
    there's space for it;

    - shmem_undo_range() removes huge pages if they are fully within the
    range. Partial truncation of a huge page zeroes out that part of the
    THP.

    This has a visible effect on fallocate(FALLOC_FL_PUNCH_HOLE)
    behaviour. As we don't really create a hole in this case,
    lseek(SEEK_HOLE) may return inconsistent results depending on which
    pages happened to be allocated.

    - no need to change shmem_fault(): core mm will map a compound page
    as huge if the VMA is suitable;

    Link: http://lkml.kernel.org/r/1466021202-61880-30-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    File COW for THP is handled at the pte level: just split the pmd.

    It's not clear how beneficial allocating huge pages on COW faults
    would be, and it would require some code to make it work.

    I think at some point we can consider teaching khugepaged to collapse
    pages in COW mappings, but allocating huge pages on fault is probably
    overkill.

    Link: http://lkml.kernel.org/r/1466021202-61880-16-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    THP_FILE_ALLOC: how many times a huge page was allocated and put into
    the page cache.

    THP_FILE_MAPPED: how many times a file huge page was mapped.

    Link: http://lkml.kernel.org/r/1466021202-61880-13-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    With postponed page table allocation we have a chance to set up huge
    pages. do_set_pte() calls do_set_pmd() if the following criteria are
    met:

    - the page is compound;
    - the pmd entry is pmd_none();
    - the vma has suitable size and alignment;

    Link: http://lkml.kernel.org/r/1466021202-61880-12-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
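    A hedged sketch of how the criteria can be tested in the fault path
    before falling back to a regular 4k pte mapping (do_set_pmd() itself is
    assumed to verify the vma size and alignment):

        /* Try a huge pmd mapping first when the page allows it. */
        if (IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE) &&
            pmd_none(*fe->pmd) && PageTransCompound(page)) {
                int ret = do_set_pmd(fe, page);

                if (ret != VM_FAULT_FALLBACK)
                        return ret;     /* mapped as huge, or failed hard */
        }
        /* otherwise fall through and map a single pte */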
     
    Naive approach: on mapping/unmapping the page as compound we update
    ->_mapcount on each 4k page. That's not efficient, but it's not
    obvious how we can optimize this; we can look into it later.

    The PG_double_map optimization doesn't work for file pages since the
    lifecycle of file pages differs from that of anon pages: a file page
    can be mapped again at any time.

    Link: http://lkml.kernel.org/r/1466021202-61880-11-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    The idea (and most of the code) is borrowed again from Hugh's patchset
    on huge tmpfs[1].

    Instead of allocating the pte page table upfront, we postpone this
    until we have a page to map in hand. This approach opens the
    possibility of mapping the page as huge if the filesystem supports it.

    Compared to Hugh's patch, I've pushed the page table allocation a bit
    further: into do_set_pte(). This way we can postpone the allocation
    even in the faultaround case, without moving do_fault_around() after
    __do_fault().

    do_set_pte() got renamed to alloc_set_pte() as it can allocate a page
    table if required.

    [1] http://lkml.kernel.org/r/alpine.LSU.2.11.1502202015090.14414@eggly.anvils

    Link: http://lkml.kernel.org/r/1466021202-61880-10-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The idea borrowed from Peter's patch from patchset on speculative page
    faults[1]:

    Instead of passing around the endless list of function arguments,
    replace the lot with a single structure so we can change context without
    endless function signature changes.

    The changes are mostly mechanical, with the exception of the
    faultaround code: filemap_map_pages() got reworked a bit.

    This patch is preparation for the next one.

    [1] http://lkml.kernel.org/r/20141020222841.302891540@infradead.org

    Link: http://lkml.kernel.org/r/1466021202-61880-9-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
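    The structure introduced here looks roughly like the following; the
    field list is a sketch of the idea rather than a guaranteed copy of the
    upstream layout:

        struct fault_env {
                struct vm_area_struct *vma;     /* target VMA */
                unsigned long address;          /* faulting virtual address */
                unsigned int flags;             /* FAULT_FLAG_* */
                pmd_t *pmd;                     /* pmd entry for 'address' */
                pte_t *pte;                     /* pte, once it is mapped */
                spinlock_t *ptl;                /* page table lock for 'pte' */
                pgtable_t prealloc_pte;         /* preallocated page table */
        };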
     
  • We always have vma->vm_mm around.

    Link: http://lkml.kernel.org/r/1466021202-61880-8-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    This patch makes use of swapin readahead to improve the THP collapse
    rate. When khugepaged scans pages, a few of them may be in the swap
    area.

    With the patch, THP can collapse 4kB pages into a THP when there are
    up to max_ptes_swap swap ptes in a 2MB range.

    The patch was tested with a test program that allocates 400B of
    memory, writes to it, and then sleeps. I forced the system to swap
    out all of it. Afterwards, the test program touches the area by
    writing to it, skipping one page in every 20 pages of the area.

    Without the patch, the system did not swap in readahead. The THP rate
    was 65% of the program's memory and did not change over time.

    With this patch, after 10 minutes of waiting khugepaged had collapsed
    99% of the program's memory.

    [kirill.shutemov@linux.intel.com: trivial cleanup of exit path of the function]
    [kirill.shutemov@linux.intel.com: __collapse_huge_page_swapin(): drop unused 'pte' parameter]
    [kirill.shutemov@linux.intel.com: do not hold anon_vma lock during swap in]
    Signed-off-by: Ebru Akagunduz
    Acked-by: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: Joonsoo Kim
    Cc: Xie XiuQi
    Cc: Cyrill Gorcunov
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Cc: Aneesh Kumar K.V
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ebru Akagunduz
     
    This allows an arch which needs to do special handling with respect to
    different page sizes when flushing the tlb to implement the same in
    the mmu gather.

    Link: http://lkml.kernel.org/r/1465049193-22197-3-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Michael Ellerman
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
    This updates the generic and arch-specific implementations to return
    true if we need to do a tlb flush. That means if __tlb_remove_page()
    indicates a flush is needed, the page we are trying to remove needs to
    be tracked and added again after the flush. We need to track it
    because we have already updated the pte to none and can't just loop
    back.

    This change is done to enable us to do a tlb_flush when we try to
    flush a range that consists of different page sizes. For
    architectures like ppc64, we can do a range-based tlb flush and we
    need to track the page size for that. When we try to remove a huge
    page, we will force a tlb flush and start a new mmu gather.

    [aneesh.kumar@linux.vnet.ibm.com: mm-change-the-interface-for-__tlb_remove_page-v3]
    Link: http://lkml.kernel.org/r/1465049193-22197-2-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/1464860389-29019-2-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Michael Ellerman
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
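    A hedged sketch of the caller-side pattern the boolean return implies:
    when the gather refuses the page, flush first and then queue the page
    again so it is still freed afterwards:

        if (unlikely(__tlb_remove_page(tlb, page))) {
                /*
                 * The page was not queued because a flush is needed.
                 * The pte is already cleared, so we cannot loop back;
                 * flush now and re-add the page to the gather.
                 */
                tlb_flush_mmu(tlb);
                __tlb_remove_page(tlb, page);
        }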
     

15 Jul, 2016

1 commit

  • The VM_BUG_ON_PAGE in page_move_anon_rmap() is more trouble than it's
    worth: the syzkaller fuzzer hit it again. It's still wrong for some THP
    cases, because linear_page_index() was never intended to apply to
    addresses before the start of a vma.

    That's easily fixed with a signed long cast inside linear_page_index();
    and Dmitry has tested such a patch, to verify the false positive. But
    why extend linear_page_index() just for this case, when the avoidance
    in page_move_anon_rmap() has already grown ugly and there's no reason
    for the check at all (nothing else there uses address or index)?

    Remove address arg from page_move_anon_rmap(), remove VM_BUG_ON_PAGE,
    remove CONFIG_DEBUG_VM PageTransHuge adjustment.

    And one more thing: should the compound_head(page) be done inside or
    outside page_move_anon_rmap()? It's usually pushed down to the lowest
    level nowadays (and mm/memory.c shows no other explicit use of it), so I
    think it's better done in page_move_anon_rmap() than by caller.

    Fixes: 0798d3c022dc ("mm: thp: avoid false positive VM_BUG_ON_PAGE in page_move_anon_rmap()")
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1607120444540.12528@eggly.anvils
    Signed-off-by: Hugh Dickins
    Reported-by: Dmitry Vyukov
    Acked-by: Kirill A. Shutemov
    Cc: Mika Westerberg
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: [4.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
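    After the change, the call site in the write-protect fault path reduces
    to the two-argument form; a sketch, where 'old_page' is the page being
    reused:

        /* The address argument is gone; compound_head() is applied
         * inside page_move_anon_rmap() itself. */
        page_move_anon_rmap(old_page, vma);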
     

25 Jun, 2016

2 commits

  • This reverts commit d0834a6c2c5b0c76cfb806bd7dba6556d8b4edbb.

    After revert of 5c0a85fad949 ("mm: make faultaround produce old ptes")
    faultaround doesn't have dependencies on hardware accessed bit, so let's
    revert this one too.

    Link: http://lkml.kernel.org/r/1465893750-44080-3-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Reported-by: "Huang, Ying"
    Cc: Linus Torvalds
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Vinayak Menon
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • This reverts commit 5c0a85fad949212b3e059692deecdeed74ae7ec7.

    The commit causes ~6% regression in unixbench.

    Let's revert it for now and consider other solution for reclaim problem
    later.

    Link: http://lkml.kernel.org/r/1465893750-44080-2-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Reported-by: "Huang, Ying"
    Cc: Linus Torvalds
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Vinayak Menon
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

27 May, 2016

1 commit

  • Pull DAX locking updates from Ross Zwisler:
    "Filesystem DAX locking for 4.7

    - We use a bit in an exceptional radix tree entry as a lock bit and
    use it similarly to how page lock is used for normal faults. This
    fixes races between hole instantiation and read faults of the same
    index.

    - Filesystem DAX PMD faults are disabled, and will be re-enabled when
    PMD locking is implemented"

    * tag 'dax-locking-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    dax: Remove i_mmap_lock protection
    dax: Use radix tree entry lock to protect cow faults
    dax: New fault locking
    dax: Allow DAX code to replace exceptional entries
    dax: Define DAX lock bit for radix tree exceptional entry
    dax: Make huge page handling depend of CONFIG_BROKEN
    dax: Fix condition for filling of PMD holes

    Linus Torvalds
     

21 May, 2016

3 commits

    fault_around aims to reduce minor faults of file-backed pages via
    speculative ahead-of-time pte mapping, relying on the readahead logic.
    However, on architectures without a hardware accessed bit the benefit
    is highly limited, because they have to emulate the young bit with
    minor faults for reclaim's page aging algorithm. IOW, we cannot
    reduce minor faults on those architectures.

    I did a quick test on my ARM machine.

    512M file mmap sequential every word read on eSATA drive 4 times.
    stddev is stable.

    = fault_around 4096 =
    elapsed time(usec): 6747645

    = fault_around 65536 =
    elapsed time(usec): 6709263

    0.5% gain.

    Even when I tested it with eMMC there is no gain because I guess with
    slow storage the major fault is the dominant factor.

    Also, fault_around has the side effect of shrinking slab more
    aggressively and causing higher vmpressure, so if such speculation
    fails, it can evict more slab, which can result in page I/O (e.g.,
    inode cache). In the end, it would void any benefit of fault_around.

    So let's make the default "disabled" on those architectures.

    Link: http://lkml.kernel.org/r/20160518014229.GB21538@bbox
    Signed-off-by: Minchan Kim
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Currently, faultaround code produces young pte. This can screw up
    vmscan behaviour[1], as it makes vmscan think that these pages are hot
    and not push them out on first round.

    During sparse file access faultaround gets more pages mapped and all of
    them are young. Under memory pressure, this makes vmscan swap out anon
    pages instead, or to drop other page cache pages which otherwise stay
    resident.

    Modify faultaround to produce old ptes, so they can easily be reclaimed
    under memory pressure.

    This can to some extent defeat the purpose of faultaround on machines
    without a hardware accessed bit, as it will not help us with reducing
    the number of minor page faults.

    We may want to disable faultaround on such machines altogether, but
    that's subject for separate patchset.

    Minchan:
    "I tested 512M mmap sequential word read test on non-HW access bit
    system (i.e., ARM) and confirmed it doesn't increase minor fault any
    more.

    old: 4096 fault_around
    minor fault: 131291
    elapsed time: 6747645 usec

    new: 65536 fault_around
    minor fault: 131291
    elapsed time: 6709263 usec

    0.56% benefit"

    [1] https://lkml.kernel.org/r/1460992636-711-1-git-send-email-vinmenon@codeaurora.org

    Link: http://lkml.kernel.org/r/1463488366-47723-1-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Acked-by: Minchan Kim
    Tested-by: Minchan Kim
    Acked-by: Rik van Riel
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Vinayak Menon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    We use generic hooks in remap_pfn_range() to help archs track pfnmap
    regions. The code is something like:

    int remap_pfn_range()
    {
            ...
            track_pfn_remap(vma, &prot, pfn, addr, PAGE_ALIGN(size));
            ...
            pfn -= addr >> PAGE_SHIFT;
            ...
            untrack_pfn(vma, pfn, PAGE_ALIGN(size));
            ...
    }

    Here we can easily see that pfn is modified but not restored before
    untrack_pfn() is called. That's incorrect.

    There are no known runtime effects - this is from inspection.

    Signed-off-by: Yongji Xie
    Cc: Kirill A. Shutemov
    Cc: Jerome Marchand
    Cc: Ingo Molnar
    Cc: Vlastimil Babka
    Cc: Dave Hansen
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Cc: Andy Lutomirski
    Cc: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yongji Xie
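    A hedged sketch of the fix: remember the pfn the caller passed in so
    the cleanup path untracks the same range that was tracked, regardless
    of how the working copy was adjusted. The page-table population step
    is collapsed into a hypothetical fill_page_tables() placeholder:

        int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
                            unsigned long pfn, unsigned long size,
                            pgprot_t prot)
        {
                unsigned long remap_pfn = pfn;  /* keep the untouched value */
                int err;

                err = track_pfn_remap(vma, &prot, remap_pfn, addr,
                                      PAGE_ALIGN(size));
                if (err)
                        return err;

                /* population (elided) rebases the working pfn */
                pfn -= addr >> PAGE_SHIFT;
                err = fill_page_tables(vma, addr, pfn, size, prot);

                /* on failure, undo the tracking with the original pfn */
                if (err)
                        untrack_pfn(vma, remap_pfn, PAGE_ALIGN(size));
                return err;
        }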
     

20 May, 2016

2 commits

    Currently, faults are protected against truncate by filesystem-specific
    i_mmap_sem and, in the case of a hole page, by the page lock. Cow
    faults are protected by DAX radix tree entry locking. So there's no
    need for i_mmap_lock in the DAX code. Remove it.

    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Ross Zwisler

    Jan Kara
     
    When doing cow faults, we cannot directly fill in the PTE as we do for
    other faults, as we rely on generic code to do proper accounting of
    the cowed page. We also have no page to lock to protect against races
    with truncate, as other faults have, and we need the protection to
    extend until the moment the generic code inserts the cowed page into
    the PTE, at which point the fs-specific i_mmap_sem no longer protects
    us. So far we relied on using i_mmap_lock for the protection, however
    that is completely special to cow faults. To make fault locking more
    uniform, use the DAX entry lock instead.

    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Ross Zwisler

    Jan Kara
     

13 May, 2016

1 commit

    This will provide full accuracy to the mapcount calculation in the
    write protect faults, so page pinning will not get broken by false
    positive copy-on-writes.

    total_mapcount() isn't the right calculation needed in
    reuse_swap_page(), so this introduces a page_trans_huge_mapcount()
    that is effectively the fully accurate return value of page_mapcount()
    when dealing with Transparent Hugepages; however we only use
    page_trans_huge_mapcount() during COW faults, where it is strictly
    needed, due to its higher runtime cost.

    This also provides, at practically zero cost, the total_mapcount
    information which is needed to know if we can still relocate the page
    anon_vma to the local vma. If page_trans_huge_mapcount() returns 1 we
    can reuse the page no matter if it's a pte or a pmd_trans_huge
    triggering the fault, but we can only relocate the page anon_vma to
    the local vma->anon_vma if we're sure it's only this "vma" mapping the
    whole THP physical range.

    Kirill A. Shutemov discovered the problem with moving the page
    anon_vma to the local vma->anon_vma in a previous version of this
    patch and another problem in the way page_move_anon_rmap() was called.

    Andrew Morton discovered that CONFIG_SWAP=n wouldn't build in a
    previous version, because reuse_swap_page must be a macro to call
    page_trans_huge_mapcount from swap.h, so this uses a macro again
    instead of an inline function. With this change at least it's a less
    dangerous usage than it was before, because "page" is used only once
    now, while with the previous code reuse_swap_page(page++) would have
    called page_mapcount on page+1 and it would have increased page twice
    instead of just once.

    Dean Luick noticed an uninitialized variable that could result in a
    rmap inefficiency for the non-THP case in a previous version.

    Mike Marciniszyn said:

    : Our RDMA tests are seeing an issue with memory locking that bisects to
    : commit 61f5d698cc97 ("mm: re-enable THP")
    :
    : The test program registers two rather large MRs (512M) and RDMA
    : writes data to a passive peer using the first and RDMA reads it back
    : into the second MR and compares that data. The sizes are chosen randomly
    : between 0 and 1024 bytes.
    :
    : The test will get through a few (
    Reviewed-by: "Kirill A. Shutemov"
    Reviewed-by: Dean Luick
    Tested-by: Alex Williamson
    Tested-by: Mike Marciniszyn
    Tested-by: Josh Collier
    Cc: Marc Haber
    Cc: [4.5]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

06 May, 2016

1 commit

  • zap_pmd_range()'s CONFIG_DEBUG_VM !rwsem_is_locked(&mmap_sem) BUG() will
    be invalid with huge pagecache, in whatever way it is implemented:
    truncation of a hugely-mapped file to an unhugely-aligned size would
    easily hit it.

    (Although anon THP could in principle apply khugepaged to private file
    mappings, which are not excluded by the MADV_HUGEPAGE restrictions, in
    practice there's a vm_ops check which excludes them, so it never hits
    this BUG() - there's no interface to "truncate" an anonymous mapping.)

    We could complicate the test, to check i_mmap_rwsem also when there's a
    vm_file; but my inclination was to make zap_pmd_range() more readable by
    simply deleting this check. A search has shown no report of the issue
    in the years since commit e0897d75f0b2 ("mm, thp: print useful
    information when mmap_sem is unlocked in zap_pmd_range") expanded it
    from VM_BUG_ON() - though I cannot point to what commit I would say then
    fixed the issue.

    But there are a couple of other patches now floating around, neither yet
    in the tree: let's agree to retain the check as a VM_BUG_ON_VMA(), as
    Matthew Wilcox has done; but subject to a vma_is_anonymous() check, as
    Kirill Shutemov has done. And let's get this in, without waiting for
    any particular huge pagecache implementation to reach the tree.

    Matthew said "We can reproduce this BUG() in the current Linus tree with
    DAX PMDs".

    Signed-off-by: Hugh Dickins
    Tested-by: Matthew Wilcox
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Andres Lagar-Cavilla
    Cc: Yang Shi
    Cc: Ning Qu
    Cc: Mel Gorman
    Cc: Andres Lagar-Cavilla
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
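    The direction agreed on above would look roughly like this inside
    zap_pmd_range(); a sketch combining the two suggestions, not a quote of
    the final code:

        if (next - addr != HPAGE_PMD_SIZE) {
                /*
                 * Only anonymous mappings are expected to hold mmap_sem
                 * here; huge pagecache truncation legitimately does not.
                 */
                VM_BUG_ON_VMA(vma_is_anonymous(vma) &&
                              !rwsem_is_locked(&tlb->mm->mmap_sem), vma);
                split_huge_pmd(vma, pmd, addr);
        }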
     

29 Apr, 2016

1 commit

  • In gather_pte_stats() a THP pmd is cast into a pte, which is wrong
    because the layouts may differ depending on the architecture. On s390
    this will lead to inaccurate numa_maps accounting in /proc because of
    misguided pte_present() and pte_dirty() checks on the fake pte.

    On other architectures pte_present() and pte_dirty() may work by chance,
    but there may be an issue with direct-access (dax) mappings w/o
    underlying struct pages when HAVE_PTE_SPECIAL is set and THP is
    available. In vm_normal_page() the fake pte will be checked with
    pte_special() and because there is no "special" bit in a pmd, this will
    always return false and the VM_PFNMAP | VM_MIXEDMAP checking will be
    skipped. On dax mappings w/o struct pages, an invalid struct page
    pointer would then be returned that can crash the kernel.

    This patch fixes the numa_maps THP handling by introducing new "_pmd"
    variants of the can_gather_numa_stats() and vm_normal_page() functions.

    Signed-off-by: Gerald Schaefer
    Cc: Naoya Horiguchi
    Cc: "Kirill A . Shutemov"
    Cc: Konstantin Khlebnikov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Jerome Marchand
    Cc: Johannes Weiner
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Dan Williams
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Michael Holzheu
    Cc: [4.3+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
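    The shape of the new pmd-aware helper, as a hedged sketch of its
    declaration; the companion can_gather_numa_stats_pmd() wraps it for the
    numa_maps accounting:

        /*
         * pmd-aware twin of vm_normal_page(): it never consults pte-only
         * bits such as the 'special' bit, which a pmd does not have.
         */
        struct page *vm_normal_page_pmd(struct vm_area_struct *vma,
                                        unsigned long addr, pmd_t pmd);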
     

05 Apr, 2016

2 commits

  • Mostly direct substitution with occasional adjustment or removing
    outdated comments.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
    time ago with the promise that one day it would be possible to
    implement the page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely to.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE, and it's a constant source of confusion whether
    PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    the script below. For some reason, coccinelle doesn't patch header
    files; I've called spatch on them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation
    will also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

26 Mar, 2016

1 commit

  • This patch (of 5):

    This is based on the idea from Mel Gorman discussed during LSFMM 2015
    and independently brought up by Oleg Nesterov.

    The OOM killer currently allows killing only a single task in the hope
    that the task will terminate in a reasonable time and free up its
    memory. Such a task (oom victim) gets access to memory reserves via
    mark_oom_victim to allow forward progress should there be a need for
    additional memory during the exit path.

    It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
    construct workloads which break the core assumption mentioned above and
    the OOM victim might take an unbounded amount of time to exit because it
    might be blocked in the uninterruptible state waiting for an event (e.g.
    lock) which is blocked by another task looping in the page allocator.

    This patch reduces the probability of such a lockup by introducing a
    specialized kernel thread (oom_reaper) which tries to reclaim additional
    memory by preemptively reaping the anonymous or swapped out memory
    owned by the oom victim, under the assumption that such memory won't
    be needed when its owner is killed and kicked out of userspace anyway.
    There is one notable exception to this, though: if the OOM victim was
    in the process of coredumping, the result would be incomplete. This
    is considered a reasonable constraint because the overall system
    health is more important than debuggability of a particular
    application.

    A kernel thread has been chosen because we need a reliable way of
    invocation so workqueue context is not appropriate because all the
    workers might be busy (e.g. allocating memory). Kswapd, which sounds
    like another good fit, is not appropriate either because it might get
    blocked on locks during reclaim as well.

    oom_reaper has to take mmap_sem on the target task for reading so the
    solution is not 100% because the semaphore might be held or blocked for
    write but the probability is reduced considerably wrt. basically any
    lock blocking forward progress as described above. In order to prevent
    from blocking on the lock without any forward progress we are using only
    a trylock and retry 10 times with a short sleep in between. Users of
    mmap_sem which need it for write should be carefully reviewed to use
    _killable waiting as much as possible and reduce allocations requests
    done with the lock held to absolute minimum to reduce the risk even
    further.

    The API between oom killer and oom reaper is quite trivial.
    wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only a
    NULL->mm transition, and oom_reaper clears this atomically once it is
    done with the work. This means that only a single mm_struct can be
    reaped at a time. As the operation is potentially disruptive we are
    trying to limit it to the necessary minimum and the reaper blocks any
    updates while
    it operates on an mm. mm_struct is pinned by mm_count to allow parallel
    exit_mmap and a race is detected by atomic_inc_not_zero(mm_users).

    Signed-off-by: Michal Hocko
    Suggested-by: Oleg Nesterov
    Suggested-by: Mel Gorman
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Cc: Mel Gorman
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Cc: Andrea Argangeli
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
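    A minimal sketch of the hand-off described above, with the retry loop,
    mm pinning, and the reaping itself omitted:

        static struct mm_struct *mm_to_reap;    /* NULL when the reaper is idle */
        static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);

        static void wake_oom_reaper(struct mm_struct *mm)
        {
                /*
                 * Only the NULL -> mm transition hands work to the reaper;
                 * a second victim is simply skipped while one is pending.
                 */
                if (!cmpxchg(&mm_to_reap, NULL, mm))
                        wake_up(&oom_reaper_wait);
        }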
     

21 Mar, 2016

1 commit

  • Pull x86 protection key support from Ingo Molnar:
    "This tree adds support for a new memory protection hardware feature
    that is available in upcoming Intel CPUs: 'protection keys' (pkeys).

    There's a background article at LWN.net:

    https://lwn.net/Articles/643797/

    The gist is that protection keys allow the encoding of
    user-controllable permission masks in the pte. So instead of having a
    fixed protection mask in the pte (which needs a system call to change
    and works on a per page basis), the user can map a (handful of)
    protection mask variants and can change the masks at runtime relatively
    cheaply, without having to change every single page in the affected
    virtual memory range.

    This allows the dynamic switching of the protection bits of large
    amounts of virtual memory, via user-space instructions. It also
    allows more precise control of MMU permission bits: for example the
    executable bit is separate from the read bit (see more about that
    below).

    This tree adds the MM infrastructure and low level x86 glue needed for
    that, plus it adds a high level API to make use of protection keys -
    if a user-space application calls:

    mmap(..., PROT_EXEC);

    or

    mprotect(ptr, sz, PROT_EXEC);

    (note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice
    this special case, and will set a special protection key on this
    memory range. It also sets the appropriate bits in the Protection
    Keys User Rights (PKRU) register so that the memory becomes unreadable
    and unwritable.

    So using protection keys the kernel is able to implement 'true'
    PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies
    PROT_READ as well. Unreadable executable mappings have security
    advantages: they cannot be read via information leaks to figure out
    ASLR details, nor can they be scanned for ROP gadgets - and they
    cannot be used by exploits for data purposes either.

    We know about no user-space code that relies on pure PROT_EXEC
    mappings today, but binary loaders could start making use of this new
    feature to map binaries and libraries in a more secure fashion.

    There is other pending pkeys work that offers more high level system
    call APIs to manage protection keys - but those are not part of this
    pull request.

    Right now there's a Kconfig that controls this feature
    (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled
    (like most x86 CPU feature enablement code that has no runtime
    overhead), but it's not user-configurable at the moment. If there's
    any serious problem with this then we can make it configurable and/or
    flip the default"

    * 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
    x86/mm/pkeys: Fix mismerge of protection keys CPUID bits
    mm/pkeys: Fix siginfo ABI breakage caused by new u64 field
    x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA
    mm/core, x86/mm/pkeys: Add execute-only protection keys support
    x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
    x86/mm/pkeys: Allow kernel to modify user pkey rights register
    x86/fpu: Allow setting of XSAVE state
    x86/mm: Factor out LDT init from context init
    mm/core, x86/mm/pkeys: Add arch_validate_pkey()
    mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
    x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
    x86/mm/pkeys: Add Kconfig prompt to existing config option
    x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
    x86/mm/pkeys: Dump PKRU with other kernel registers
    mm/core, x86/mm/pkeys: Differentiate instruction fetches
    x86/mm/pkeys: Optimize fault handling in access_error()
    mm/core: Do not enforce PKEY permissions on remote mm access
    um, pkeys: Add UML arch_*_access_permitted() methods
    mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
    x86/mm/gup: Simplify get_user_pages() PTE bit handling
    ...

    Linus Torvalds
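    A small user-space illustration of the PROT_EXEC-only case described
    above; on CPUs without protection keys the mapping simply stays
    readable, so treat this as a demonstration of the API rather than a
    guarantee:

        #include <stdio.h>
        #include <sys/mman.h>

        int main(void)
        {
                size_t len = 4096;

                /*
                 * PROT_EXEC without PROT_READ/PROT_WRITE: on pkeys-capable
                 * hardware the kernel backs this mapping with an
                 * execute-only protection key.
                 */
                void *p = mmap(NULL, len, PROT_EXEC,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (p == MAP_FAILED) {
                        perror("mmap");
                        return 1;
                }

                /*
                 * Reading through 'p' would fault where execute-only is
                 * enforced, so the pointer is only printed, not read.
                 */
                printf("execute-only mapping at %p\n", p);

                munmap(p, len);
                return 0;
        }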