19 Feb, 2016

2 commits

  • Protection keys provide new page-based protection in hardware.
    But, they have an interesting attribute: they only affect data
    accesses and never affect instruction fetches. That means that
    if we set up some memory which is set as "access-disabled" via
    protection keys, we can still execute from it.

    This patch uses protection keys to set up mappings to do just that.
    If a user calls:

    mmap(..., PROT_EXEC);
    or
    mprotect(ptr, sz, PROT_EXEC);

    (note PROT_EXEC-only without PROT_READ/WRITE), the kernel will
    notice this, and set a special protection key on the memory. It
    also sets the appropriate bits in the Protection Keys User Rights
    (PKRU) register so that the memory becomes unreadable and
    unwritable.

    I haven't found any userspace that does this today. With this
    facility in place, we expect userspace to move to use it
    eventually. Userspace _could_ start doing this today. Any
    PROT_EXEC calls get converted to PROT_READ inside the kernel, and
    would transparently be upgraded to "true" PROT_EXEC with this
    code. IOW, userspace never has to do any PROT_EXEC runtime
    detection.
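
    As an illustration only (not part of the patch), a minimal userspace
    sketch of the behaviour described above: the code is written while the
    page is still writable, then the mapping is switched to PROT_EXEC-only.
    Execution keeps working because protection keys never affect
    instruction fetches; on pkeys-capable hardware a data read of the same
    page would fault.

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        size_t sz = sysconf(_SC_PAGESIZE);

        /* Build the "code" while the page is still writable */
        unsigned char *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return 1;
        memset(p, 0xc3, sz);            /* 0xc3 is the x86 'ret' opcode */

        /* Drop read/write; with pkeys this becomes truly execute-only */
        if (mprotect(p, sz, PROT_EXEC))
            return 1;

        ((void (*)(void))p)();          /* instruction fetch still allowed */
        printf("executed from a PROT_EXEC-only mapping\n");
        /* A read such as 'p[0]' would SIGSEGV once pkeys enforce the key */
        return 0;
    }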

    This feature provides enhanced protection against leaking
    executable memory contents. This helps thwart attacks which are
    attempting to find ROP gadgets on the fly.

    But, the security provided by this approach is not comprehensive.
    The PKRU register which controls access permissions is a normal
    user register writable from unprivileged userspace. An attacker
    who can execute the 'wrpkru' instruction can easily disable the
    protection provided by this feature.

    The protection key that is used for execute-only support is
    permanently dedicated at compile time. This is fine for now
    because there is currently no API to set a protection key other
    than this one.

    Despite there being a constant PKRU value across the entire
    system, we do not set it unless this feature is in use in a
    process. That is to preserve the PKRU XSAVE 'init state',
    which can lead to faster context switches.

    PKRU *is* a user register and the kernel is modifying it. That
    means that code doing:

    pkru = rdpkru();
    pkru |= 0x100;
    mmap(..., PROT_EXEC);
    wrpkru(pkru);

    could lose the bits in PKRU that enforce execute-only
    permissions. To avoid this, we suggest avoiding ever calling
    mmap() or mprotect() when the PKRU value is expected to be
    unstable.
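
    For illustration, a hedged userspace sketch of that recommendation,
    using the compiler's PKRU intrinsics (GCC/Clang built with -mpku; the
    helper name below is ours, not the kernel's): do the mmap() first, then
    read-modify-write PKRU, so any execute-only bits the kernel set are
    already part of the value being preserved.

    #include <stddef.h>
    #include <immintrin.h>
    #include <sys/mman.h>

    static void *map_exec_only_then_touch_pkru(size_t sz)
    {
        /* Let the kernel pick the execute-only protection key first */
        void *p = mmap(NULL, sz, PROT_EXEC,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return NULL;

        unsigned int pkru = _rdpkru_u32();  /* snapshot taken after mmap() */
        pkru |= 0x100;                      /* the application's own change */
        _wrpkru(pkru);                      /* execute-only bits are kept */
        return p;
    }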

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Aneesh Kumar K.V
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Chen Gang
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: David Hildenbrand
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Kees Cook
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Piotr Kwapulinski
    Cc: Rik van Riel
    Cc: Stephen Smalley
    Cc: Vladimir Murzin
    Cc: Will Deacon
    Cc: keescook@google.com
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160212210240.CB4BB5CA@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     
  • This plumbs a protection key through calc_vm_flag_bits(). We
    could have done this in calc_vm_prot_bits(), but I did not feel
    super strongly which way to go. It was pretty arbitrary which
    one to use.
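
    To make the plumbing concrete, here is a rough userspace model (the
    VM_* values, bit position and helper name are illustrative assumptions,
    not the kernel's definitions) of what the calc_vm_prot_bits()/
    calc_vm_flag_bits() family does: translate caller-visible PROT_*/MAP_*
    request bits into the internal VM_* flag word, now with the 4-bit
    protection key folded in alongside them.

    #include <stdint.h>
    #include <sys/mman.h>

    #define VM_READ        0x1ULL      /* illustrative values only */
    #define VM_WRITE       0x2ULL
    #define VM_EXEC        0x4ULL
    #define VM_PKEY_SHIFT  32          /* assumed position of the pkey bits */

    static uint64_t calc_vm_pkey_bits_model(unsigned long prot, int pkey)
    {
        return ((prot & PROT_READ)  ? VM_READ  : 0) |
               ((prot & PROT_WRITE) ? VM_WRITE : 0) |
               ((prot & PROT_EXEC)  ? VM_EXEC  : 0) |
               ((uint64_t)(pkey & 0xf) << VM_PKEY_SHIFT);
    }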

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Arve Hjønnevåg
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Chen Gang
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: David Airlie
    Cc: Denys Vlasenko
    Cc: Eric W. Biederman
    Cc: Geliang Tang
    Cc: Greg Kroah-Hartman
    Cc: H. Peter Anvin
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Leon Romanovsky
    Cc: Linus Torvalds
    Cc: Masahiro Yamada
    Cc: Maxime Coquelin
    Cc: Mel Gorman
    Cc: Michael Ellerman
    Cc: Oleg Nesterov
    Cc: Paul Gortmaker
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Riley Andrews
    Cc: Vladimir Davydov
    Cc: devel@driverdev.osuosl.org
    Cc: linux-api@vger.kernel.org
    Cc: linux-arch@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Link: http://lkml.kernel.org/r/20160212210231.E6F1F0D6@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     

12 Feb, 2016

1 commit

  • DAX implements split_huge_pmd() by clearing pmd. This simple approach
    reduces memory overhead, as we don't need to deposit page table on huge
    page mapping to make split_huge_pmd() never-fail. PTE table can be
    allocated and populated later on page fault from backing store.

    But one side effect is that we have to check if the pmd is pmd_none()
    after split_huge_pmd(). In most places we do this already to deal with
    parallel MADV_DONTNEED.

    But I found two call sites which are not affected by MADV_DONTNEED (due
    to down_write(mmap_sem)), but need the check to work with DAX properly.

    Signed-off-by: Kirill A. Shutemov
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Cc: Andrea Arcangeli
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

16 Jan, 2016

2 commits

    A dax-huge-page mapping, while it uses some thp helpers, is ultimately
    not a transparent huge page. The distinction is especially important in
    the get_user_pages() path. pmd_devmap() is used to distinguish dax-pmds
    from pmd_huge() and pmd_trans_huge(), which have slightly different
    semantics.

    Explicitly mark the pmd_trans_huge() helpers that dax needs by adding
    pmd_devmap() checks.

    [kirill.shutemov@linux.intel.com: fix regression in handling mlocked pages in __split_huge_pmd()]
    Signed-off-by: Dan Williams
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Matthew Wilcox
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • We are going to decouple splitting THP PMD from splitting underlying
    compound page.

    This patch renames split_huge_page_pmd*() functions to split_huge_pmd*()
    to reflect the fact that it doesn't imply page splitting, only PMD.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

15 Jan, 2016

1 commit

    When inspecting some vague code inside the prctl(PR_SET_MM_MEM) call
    (which tests the RLIMIT_DATA value to figure out whether we're allowed
    to assign new @start_brk, @brk, @start_data and @end_data in mm_struct),
    it became clear that RLIMIT_DATA, as currently implemented, doesn't do
    anything useful, because most user-space libraries use the mmap()
    syscall for dynamic memory allocations.

    Linus suggested converting the RLIMIT_DATA rlimit into something
    suitable for anonymous memory accounting. But in this patch we go
    further, and the changes are bundled together as:

    * keep vma counting if CONFIG_PROC_FS=n, will be used for limits
    * replace mm->shared_vm with better defined mm->data_vm
    * account anonymous executable areas as executable
    * account file-backed growsdown/up areas as stack
    * drop struct file* argument from vm_stat_account
    * enforce RLIMIT_DATA for size of data areas

    This way the code looks cleaner: code/stack/data classification now
    depends only on the vm_flags state:

    VM_EXEC & ~VM_WRITE -> code (VmExe + VmLib in proc)
    VM_GROWSUP | VM_GROWSDOWN -> stack (VmStk)
    VM_WRITE & ~VM_SHARED & !stack -> data (VmData)

    The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called
    "shared", but that might be a strange beast such as a read-only private
    or VM_IO area.

    - RLIMIT_AS limits whole address space "VmSize"
    - RLIMIT_STACK limits stack "VmStk" (but each vma individually)
    - RLIMIT_DATA now limits "VmData"
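
    A quick way to observe the new accounting (illustration only; the exact
    behaviour depends on kernel version and on how much VmData the process
    already uses):

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/resource.h>

    int main(void)
    {
        /* Cap VmData at 8 MiB, then ask for 64 MiB of private writable
         * (i.e. "data") memory via mmap() rather than brk(). */
        struct rlimit rl = { 8 << 20, 8 << 20 };
        if (setrlimit(RLIMIT_DATA, &rl))
            return 1;

        void *p = mmap(NULL, 64 << 20, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            printf("mmap: %s (RLIMIT_DATA now covers VmData)\n",
                   strerror(errno));
        else
            printf("mmap succeeded (limit not enforced on this kernel)\n");
        return 0;
    }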

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Cyrill Gorcunov
    Cc: Quentin Casasnovas
    Cc: Vegard Nossum
    Acked-by: Linus Torvalds
    Cc: Willy Tarreau
    Cc: Andy Lutomirski
    Cc: Kees Cook
    Cc: Vladimir Davydov
    Cc: Pavel Emelyanov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

05 Sep, 2015

1 commit

  • vma->vm_userfaultfd_ctx is yet another vma parameter that vma_merge
    must be aware about so that we can merge vmas back like they were
    originally before arming the userfaultfd on some memory range.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

25 Jun, 2015

1 commit

  • On mlock(2) we trigger COW on private writable VMA to avoid faults in
    future.

    mm/gup.c:
    840 long populate_vma_page_range(struct vm_area_struct *vma,
    841                 unsigned long start, unsigned long end, int *nonblocking)
    842 {
    ...
    855          * We want to touch writable mappings with a write fault in order
    856          * to break COW, except for shared mappings because these don't COW
    857          * and we would not want to dirty them for nothing.
    858          */
    859         if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
    860                 gup_flags |= FOLL_WRITE;

    But we miss this case when we make VM_LOCKED VMA writeable via
    mprotect(2). The test case:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/resource.h>
    #include <sys/stat.h>
    #include <sys/time.h>
    #include <unistd.h>

    #define PAGE_SIZE 4096

    int main(int argc, char **argv)
    {
        struct rusage usage;
        long before;
        char *p;
        int fd;

        /* Create a file and populate first page of page cache */
        fd = open("/tmp", O_TMPFILE | O_RDWR, S_IRUSR | S_IWUSR);
        write(fd, "1", 1);

        /* Create a *read-only* *private* mapping of the file */
        p = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_PRIVATE, fd, 0);

        /*
         * Since the mapping is read-only, mlock() will populate the mapping
         * with PTEs pointing to page cache without triggering COW.
         */
        mlock(p, PAGE_SIZE);

        /*
         * Mapping became read-write, but it's still populated with PTEs
         * pointing to page cache.
         */
        mprotect(p, PAGE_SIZE, PROT_READ | PROT_WRITE);

        getrusage(RUSAGE_SELF, &usage);
        before = usage.ru_minflt;

        /* Trigger COW: fault in mlock()ed VMA. */
        *p = 1;

        getrusage(RUSAGE_SELF, &usage);
        printf("faults: %ld\n", usage.ru_minflt - before);

        return 0;
    }

    $ ./test
    faults: 1

    Let's fix it by triggering population of the VMA in mprotect_fixup() in
    this case. We don't care about population errors, just as we don't in
    other similar cases, e.g. mremap.

    [akpm@linux-foundation.org: tweak comment text]
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

26 Mar, 2015

1 commit

  • Protecting a PTE to trap a NUMA hinting fault clears the writable bit
    and further faults are needed after trapping a NUMA hinting fault to set
    the writable bit again. This patch preserves the writable bit when
    trapping NUMA hinting faults. The impact is obvious from the number of
    minor faults trapped during the basis balancing benchmark and the system
    CPU usage;

    autonumabench
    4.0.0-rc4 4.0.0-rc4
    baseline preserve
    Time System-NUMA01 107.13 ( 0.00%) 103.13 ( 3.73%)
    Time System-NUMA01_THEADLOCAL 131.87 ( 0.00%) 83.30 ( 36.83%)
    Time System-NUMA02 8.95 ( 0.00%) 10.72 (-19.78%)
    Time System-NUMA02_SMT 4.57 ( 0.00%) 3.99 ( 12.69%)
    Time Elapsed-NUMA01 515.78 ( 0.00%) 517.26 ( -0.29%)
    Time Elapsed-NUMA01_THEADLOCAL 384.10 ( 0.00%) 384.31 ( -0.05%)
    Time Elapsed-NUMA02 48.86 ( 0.00%) 48.78 ( 0.16%)
    Time Elapsed-NUMA02_SMT 47.98 ( 0.00%) 48.12 ( -0.29%)

    4.0.0-rc4 4.0.0-rc4
    baseline preserve
    User 44383.95 43971.89
    System 252.61 201.24
    Elapsed 998.68 1000.94

    Minor Faults 2597249 1981230
    Major Faults 365 364

    There is a similar drop in system CPU usage using Dave Chinner's xfsrepair
    workload

    4.0.0-rc4 4.0.0-rc4
    baseline preserve
    Amean real-xfsrepair 454.14 ( 0.00%) 442.36 ( 2.60%)
    Amean syst-xfsrepair 277.20 ( 0.00%) 204.68 ( 26.16%)

    The patch looks hacky but the alternatives looked worse. The tidiest was
    to rewalk the page tables after a hinting fault but it was more complex
    than this approach and the performance was worse. It's not generally
    safe to just mark the page writable during the fault if it's a write
    fault as it may have been read-only for COW so that approach was
    discarded.

    Signed-off-by: Mel Gorman
    Reported-by: Dave Chinner
    Tested-by: Dave Chinner
    Cc: Ingo Molnar
    Cc: Aneesh Kumar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

13 Feb, 2015

4 commits

    If a PTE or PMD is already marked NUMA when scanning to mark entries for
    NUMA hinting, then it is not necessary to update the entry and incur a
    TLB flush penalty. Avoid the unnecessary overhead where possible.

    Signed-off-by: Mel Gorman
    Cc: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Dave Jones
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Kirill Shutemov
    Cc: Linus Torvalds
    Cc: Paul Mackerras
    Cc: Rik van Riel
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Faults on the huge zero page are pointless and there is a BUG_ON to catch
    them during fault time. This patch reintroduces a check that avoids
    marking the zero page PAGE_NONE.

    Signed-off-by: Mel Gorman
    Cc: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Dave Jones
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Kirill Shutemov
    Cc: Linus Torvalds
    Cc: Paul Mackerras
    Cc: Rik van Riel
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • With PROT_NONE, the traditional page table manipulation functions are
    sufficient.

    [andre.przywara@arm.com: fix compiler warning in pmdp_invalidate()]
    [akpm@linux-foundation.org: fix build with STRICT_MM_TYPECHECKS]
    Signed-off-by: Mel Gorman
    Acked-by: Linus Torvalds
    Acked-by: Aneesh Kumar
    Tested-by: Sasha Levin
    Cc: Benjamin Herrenschmidt
    Cc: Dave Jones
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Kirill Shutemov
    Cc: Paul Mackerras
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Convert existing users of pte_numa and friends to the new helper. Note
    that the kernel is broken after this patch is applied until the other page
    table modifiers are also altered. This patch layout is to make review
    easier.

    Signed-off-by: Mel Gorman
    Acked-by: Linus Torvalds
    Acked-by: Aneesh Kumar
    Acked-by: Benjamin Herrenschmidt
    Tested-by: Sasha Levin
    Cc: Dave Jones
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Kirill Shutemov
    Cc: Paul Mackerras
    Cc: Rik van Riel
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

11 Feb, 2015

1 commit


14 Oct, 2014

1 commit

  • For VMAs that don't want write notifications, PTEs created for read faults
    have their write bit set. If the read fault happens after VM_SOFTDIRTY is
    cleared, then the PTE's softdirty bit will remain clear after subsequent
    writes.

    Here's a simple code snippet to demonstrate the bug:

    char* m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
                   MAP_ANONYMOUS | MAP_SHARED, -1, 0);
    system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
    assert(*m == '\0');      /* new PTE allows write access */
    assert(!soft_dirty(m));  /* not yet dirtied */
    *m = 'x';                /* should dirty the page */
    assert(soft_dirty(m));   /* fails */
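
    The soft_dirty() helper in the snippet is not defined there; a minimal
    sketch of one way to implement it, assuming the usual pagemap layout
    where bit 55 of a /proc/self/pagemap entry reports the soft-dirty state:

    #include <fcntl.h>
    #include <stdint.h>
    #include <unistd.h>

    static int soft_dirty(void *addr)
    {
        uint64_t entry = 0;
        long pagesize = sysconf(_SC_PAGESIZE);
        int fd = open("/proc/self/pagemap", O_RDONLY);

        if (fd < 0)
            return -1;
        /* one 64-bit entry per virtual page */
        pread(fd, &entry, sizeof(entry),
              ((uintptr_t)addr / pagesize) * sizeof(entry));
        close(fd);
        return (entry >> 55) & 1;    /* bit 55: pte is soft-dirty */
    }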

    With this patch, write notifications are enabled when VM_SOFTDIRTY is
    cleared. Furthermore, to avoid unnecessary faults, write notifications
    are disabled when VM_SOFTDIRTY is set.

    As a side effect of enabling and disabling write notifications with
    care, this patch fixes a bug in mprotect where vm_page_prot bits set by
    drivers were zapped on mprotect. An analogous bug was fixed in mmap by
    commit c9d0bf241451 ("mm: uncached vma support with writenotify").

    Signed-off-by: Peter Feiner
    Reported-by: Peter Feiner
    Suggested-by: Kirill A. Shutemov
    Cc: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Jamie Liu
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: Bjorn Helgaas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Feiner
     

08 Apr, 2014

3 commits

  • The NUMA scanning code can end up iterating over many gigabytes of
    unpopulated memory, especially in the case of a freshly started KVM
    guest with lots of memory.

    This results in the mmu notifier code being called even when there are
    no mapped pages in a virtual address range. The amount of time wasted
    can be enough to trigger soft lockup warnings with very large KVM
    guests.

    This patch moves the mmu notifier call to the pmd level, which
    represents 1GB areas of memory on x86-64. Furthermore, the mmu notifier
    code is only called from the address in the PMD where present mappings
    are first encountered.
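
    For reference, the 1GB figure follows from x86-64 page-table geometry
    with 4KB base pages (a quick back-of-the-envelope check, not code from
    the patch):

    #include <stdio.h>

    int main(void)
    {
        unsigned long entries_per_pmd_table = 512;           /* 9 index bits */
        unsigned long bytes_per_pmd_entry   = 512 * 4096UL;  /* 2 MiB */

        printf("%lu MiB per PMD page table\n",
               (entries_per_pmd_table * bytes_per_pmd_entry) >> 20);
        /* prints "1024 MiB per PMD page table", i.e. 1 GiB */
        return 0;
    }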

    The hugetlbfs code is left alone for now; hugetlb mappings are not
    relocatable, and as such are left alone by the NUMA code, and should
    never trigger this problem to begin with.

    Signed-off-by: Rik van Riel
    Acked-by: David Rientjes
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Reported-by: Xing Gang
    Tested-by: Chegu Vinod
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Sasha reported the following bug using trinity

    kernel BUG at mm/mprotect.c:149!
    invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Dumping ftrace buffer:
    (ftrace buffer empty)
    Modules linked in:
    CPU: 20 PID: 26219 Comm: trinity-c216 Tainted: G W 3.14.0-rc5-next-20140305-sasha-00011-ge06f5f3-dirty #105
    task: ffff8800b6c80000 ti: ffff880228436000 task.ti: ffff880228436000
    RIP: change_protection_range+0x3b3/0x500
    Call Trace:
    change_protection+0x25/0x30
    change_prot_numa+0x1b/0x30
    task_numa_work+0x279/0x360
    task_work_run+0xae/0xf0
    do_notify_resume+0x8e/0xe0
    retint_signal+0x4d/0x92

    The VM_BUG_ON was added in -mm by the patch "mm,numa: reorganize
    change_pmd_range". The race existed without the patch but was just
    harder to hit.

    The problem is that a transhuge check is made without holding the PTL.
    It's possible at the time of the check that a parallel fault clears the
    pmd and inserts a new one which then triggers the VM_BUG_ON check. This
    patch removes the VM_BUG_ON but fixes the race by rechecking transhuge
    under the PTL when marking page tables for NUMA hinting and bailing if a
    race occurred. It is not a problem for calls to mprotect() as they hold
    mmap_sem for write.

    Signed-off-by: Mel Gorman
    Reported-by: Sasha Levin
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Reorganize the order of ifs in change_pmd_range a little, in preparation
    for the next patch.

    [akpm@linux-foundation.org: fix indenting, per David]
    Signed-off-by: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Reported-by: Xing Gang
    Tested-by: Chegu Vinod
    Acked-by: David Rientjes
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

17 Feb, 2014

2 commits

    Archs like ppc64 don't do a tlb flush in the set_pte/pmd functions when
    using a hash table MMU, for various reasons (the flush is handled as
    part of the PTE modification when necessary).

    ppc64 thus doesn't implement flush_tlb_range for hash-based MMUs.

    Additionally, ppc64 requires the tlb flushing to be batched within ptl
    locks.

    The reason to do that is to ensure that the hash page table is in sync
    with the linux page table.

    We track the hpte index in the linux pte and if we clear them without
    flushing hash and drop the ptl lock, we can have another cpu update the
    pte and end up with a duplicate entry in the hash table, which is fatal.

    We also want to keep set_pte_at simpler by not requiring it to do a hash
    flush for performance reasons. We do that by assuming that set_pte_at()
    is never *ever* called on a PTE that is already valid.

    This was the case until the NUMA code went in, which broke that
    assumption.

    Fix that by introducing a new pair of helpers to set _PAGE_NUMA in a
    way similar to ptep/pmdp_set_wrprotect(), with a generic implementation
    using set_pte_at() and a powerpc specific one using the appropriate
    mechanism needed to keep the hash table in sync.

    Acked-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Benjamin Herrenschmidt

    Aneesh Kumar K.V
     
    So move it within the if block

    Acked-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Benjamin Herrenschmidt

    Aneesh Kumar K.V
     

22 Jan, 2014

1 commit

  • KSM pages can be shared between tasks that are not necessarily related
    to each other from a NUMA perspective. This patch causes those pages to
    be ignored by automatic NUMA balancing so they do not migrate and do not
    cause unrelated tasks to be grouped together.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

19 Dec, 2013

3 commits

  • There are a few subtle races, between change_protection_range (used by
    mprotect and change_prot_numa) on one side, and NUMA page migration and
    compaction on the other side.

    The basic race is that there is a time window between when the PTE gets
    made non-present (PROT_NONE or NUMA), and the TLB is flushed.

    During that time, a CPU may continue writing to the page.

    This is fine most of the time, however compaction or the NUMA migration
    code may come in, and migrate the page away.

    When that happens, the CPU may continue writing, through the cached
    translation, to what is no longer the current memory location of the
    process.

    This only affects x86, which has a somewhat optimistic pte_accessible.
    All other architectures appear to be safe, and will either always flush,
    or flush whenever there is a valid mapping, even with no permissions
    (SPARC).

    The basic race looks like this:

    CPU A CPU B CPU C

    load TLB entry
    make entry PTE/PMD_NUMA
    fault on entry
    read/write old page
    start migrating page
    change PTE/PMD to new page
    read/write old page [*]
    flush TLB
    reload TLB from new entry
    read/write new page
    lose data

    [*] the old page may belong to a new user at this point!

    The obvious fix is to flush remote TLB entries, by making pte_accessible
    aware of the fact that PROT_NONE and PROT_NUMA memory may still be
    accessible if there is a TLB flush pending for the mm.

    This should fix both NUMA migration and compaction.

    [mgorman@suse.de: fix build]
    Signed-off-by: Rik van Riel
    Signed-off-by: Mel Gorman
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
    On a protection change it is no longer clear if the page should still be
    accessible. This patch clears the NUMA hinting fault bits on a
    protection change.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    The TLB must be flushed if the PTE is updated, but change_pte_range is
    clearing the PTE while marking PTEs pte_numa without necessarily
    flushing the TLB if it reinserts the same entry. Without the flush, it
    is conceivable that two processors could have different TLB entries for
    the same virtual address, and at the very least it would generate
    spurious faults.

    This patch only unmaps the pages in change_pte_range for a full
    protection change.

    [riel@redhat.com: write pte_numa pte back to the page tables]
    Signed-off-by: Mel Gorman
    Signed-off-by: Rik van Riel
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc: Chegu Vinod
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

13 Nov, 2013

1 commit

  • Commit 0255d4918480 ("mm: Account for a THP NUMA hinting update as one
    PTE update") was added to account for the number of PTE updates when
    marking pages prot_numa. task_numa_work was using the old return value
    to track how much address space had been updated. Altering the return
    value causes the scanner to do more work than it is configured or
    documented to in a single unit of work.

    This patch reverts that commit and accounts for the number of THP
    updates separately in vmstat. It is up to the administrator to
    interpret the pair of values correctly. This is a straight-forward
    operation and likely to only be of interest when actively debugging NUMA
    balancing problems.

    The impact of this patch is that the NUMA PTE scanner will scan slower
    when THP is enabled and workloads may converge slower as a result. On
    the flip side, system CPU usage should be lower than recent tests
    reported. This is an illustrative example of a short single-JVM specjbb
    test:

    specjbb
    3.12.0 3.12.0
    vanilla acctupdates
    TPut 1 26143.00 ( 0.00%) 25747.00 ( -1.51%)
    TPut 7 185257.00 ( 0.00%) 183202.00 ( -1.11%)
    TPut 13 329760.00 ( 0.00%) 346577.00 ( 5.10%)
    TPut 19 442502.00 ( 0.00%) 460146.00 ( 3.99%)
    TPut 25 540634.00 ( 0.00%) 549053.00 ( 1.56%)
    TPut 31 512098.00 ( 0.00%) 519611.00 ( 1.47%)
    TPut 37 461276.00 ( 0.00%) 474973.00 ( 2.97%)
    TPut 43 403089.00 ( 0.00%) 414172.00 ( 2.75%)

                      3.12.0       3.12.0
                     vanilla  acctupdates
    User             5169.64      5184.14
    System            100.45        80.02
    Elapsed           252.75       251.85

    Performance is similar but note the reduction in system CPU time. While
    this showed a performance gain, it will not be universal but at least
    it'll be behaving as documented. The vmstats are obviously different but
    here is an obvious interpretation of them from mmtests.

                                    3.12.0      3.12.0
                                   vanilla acctupdates
    NUMA page range updates        1408326    11043064
    NUMA huge PMD updates                0       21040
    NUMA PTE updates               1408326      291624

    "NUMA page range updates" == nr_pte_updates and is the value returned to
    the NUMA pte scanner. NUMA huge PMD updates were the number of THP
    updates which in combination can be used to calculate how many ptes were
    updated from userspace.

    Signed-off-by: Mel Gorman
    Reported-by: Alex Thorlton
    Reviewed-by: Rik van Riel
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

01 Nov, 2013

1 commit

  • Resolve cherry-picking conflicts:

    Conflicts:
    mm/huge_memory.c
    mm/memory.c
    mm/mprotect.c

    See this upstream merge commit for more details:

    52469b4fcd4f Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

29 Oct, 2013

1 commit

    A THP PMD update is accounted for as 512 pages updated in vmstat. This
    is a large difference when estimating the cost of automatic NUMA
    balancing and can be misleading when comparing results that had
    collapsed versus split THP. This patch addresses the accounting issue.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Cc:
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-10-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     

17 Oct, 2013

1 commit

    If page migration is turned on in the config and the page is migrating,
    we may lose the soft dirty bit. If fork and mprotect are called on
    migrating pages (once migration is complete), the pages do not obtain
    the soft dirty bit in the corresponding pte entries. Fix it by adding an
    appropriate test on swap entries.

    Signed-off-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Andy Lutomirski
    Cc: Matt Mackall
    Cc: Xiao Guangrong
    Cc: Marcelo Tosatti
    Cc: KOSAKI Motohiro
    Cc: Stephen Rothwell
    Cc: Peter Zijlstra
    Cc: "Aneesh Kumar K.V"
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     

09 Oct, 2013

8 commits

  • With the THP migration races closed it is still possible to occasionally
    see corruption. The problem is related to handling PMD pages in batch.
    When a page fault is handled it can be assumed that the page being
    faulted will also be flushed from the TLB. The same flushing does not
    happen when handling PMD pages in batch. Fixing this is straightforward,
    but there are a number of reasons not to:

    1. Multiple TLB flushes may have to be sent depending on what pages get
    migrated
    2. The handling of PMDs in batch means that faults get accounted to
    the task that is handling the fault. While care is taken to only
    mark PMDs where the last CPU and PID match it can still have problems
    due to PID truncation when matching PIDs.
    3. Batching on the PMD level may reduce faults but setting pmd_numa
    requires taking a heavy lock that can contend with THP migration
    and handling the fault requires the release/acquisition of the PTL
    for every page migrated. It's still pretty heavy.

    PMD batch handling is not something that people have ever been happy
    with. This patch removes it and later patches will deal with the
    additional fault overhead using more intelligent migrate-rate
    adaptation.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-48-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
    Change the per-page last fault tracking to use cpu,pid instead of
    nid,pid. This will allow us to try and look up the alternate task more
    easily. Note that even though it is the cpu that is stored in the page
    flags, the mpol_misplaced decision is still based on the node.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Link: http://lkml.kernel.org/r/1381141781-10992-43-git-send-email-mgorman@suse.de
    [ Fixed build failure on 32-bit systems. ]
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    Base page PMD faulting is meant to batch-handle NUMA hinting faults from
    PTEs. However, even if no PTE faults would ever be handled within a
    range, the kernel still traps PMD hinting faults. This patch avoids the
    overhead.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-37-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
  • Ideally it would be possible to distinguish between NUMA hinting faults that
    are private to a task and those that are shared. If treated identically
    there is a risk that shared pages bounce between nodes depending on
    the order they are referenced by tasks. Ultimately what is desirable is
    that task private pages remain local to the task while shared pages are
    interleaved between sharing tasks running on different nodes to give good
    average performance. This is further complicated by THP as even
    applications that partition their data may not be partitioning on a huge
    page boundary.

    To start with, this patch assumes that multi-threaded or multi-process
    applications partition their data and that in general the private accesses
    are more important for cpu->memory locality in the general case. Also,
    no new infrastructure is required to treat private pages properly but
    interleaving for shared pages requires additional infrastructure.

    To detect private accesses the pid of the last accessing task is required
    but the storage requirements are high. This patch borrows heavily from
    Ingo Molnar's patch "numa, mm, sched: Implement last-CPU+PID hash tracking"
    to encode some bits from the last accessing task in the page flags as
    well as the node information. Collisions will occur but it is better than
    just depending on the node information. Node information is then used to
    determine if a page needs to migrate. The PID information is used to detect
    private/shared accesses. The preferred NUMA node is selected based on where
    the maximum number of approximately private faults were measured. Shared
    faults are not taken into consideration for a few reasons.

    First, if there are many tasks sharing the page then they'll all move
    towards the same node. The node will be compute overloaded and then
    scheduled away later only to bounce back again. Alternatively the shared
    tasks would just bounce around nodes because the fault information is
    effectively noise. Either way accounting for shared faults the same as
    private faults can result in lower performance overall.

    The second reason is based on a hypothetical workload that has a small
    number of very important, heavily accessed private pages but a large shared
    array. The shared array would dominate the number of faults and be selected
    as a preferred node even though it's the wrong decision.

    The third reason is that multiple threads in a process will race each
    other to fault the shared page making the fault information unreliable.

    Signed-off-by: Mel Gorman
    [ Fix compilation error when !NUMA_BALANCING. ]
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-30-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
    Currently automatic NUMA balancing is unable to distinguish between
    falsely shared versus private pages except by ignoring pages with an
    elevated page_mapcount entirely. This avoids shared pages bouncing
    between the nodes whose tasks are using them, but it ignored quite a lot
    of data.

    This patch kicks away the training wheels, as support for identifying
    shared/private pages is now in place. The ordering is so that the impact
    of the shared/private detection can be easily measured. Note that the
    patch does not migrate shared, file-backed pages within vmas marked
    VM_EXEC, as these are generally shared library pages. Migrating such
    pages is not beneficial as there is an expectation they are read-shared
    between caches and iTLB and iCache pressure is generally low.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-28-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
  • NUMA PTE scanning is expensive both in terms of the scanning itself and
    the TLB flush if there are any updates. The TLB flush is avoided if no
    PTEs are updated but there is a bug where transhuge PMDs are considered
    to be updated even if they were already pmd_numa. This patch addresses
    the problem and TLB flushes should be reduced.

    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Reviewed-by: Rik van Riel
    Signed-off-by: Mel Gorman
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-12-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
  • NUMA PTE scanning is expensive both in terms of the scanning itself and
    the TLB flush if there are any updates. Currently non-present PTEs are
    accounted for as an update and incurring a TLB flush where it is only
    necessary for anonymous migration entries. This patch addresses the
    problem and should reduce TLB flushes.

    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Reviewed-by: Rik van Riel
    Signed-off-by: Mel Gorman
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-11-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
    A THP PMD update is accounted for as 512 pages updated in vmstat. This
    is a large difference when estimating the cost of automatic NUMA
    balancing and can be misleading when comparing results that had
    collapsed versus split THP. This patch addresses the accounting issue.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-10-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     

19 Dec, 2012

1 commit


17 Dec, 2012

1 commit

  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for these workloads driven by perf, which is bad
    but is potentially explained by the lack of scheduler smarts. Thomas'
    results show balancenuma improves on mainline but falls far short of
    numacore or autonuma. Srikar's results indicate we all suffer on a
    large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers which is not reported by
    the tool by default and sometimes missed in the reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore batch
    handles PTEs but I no longer think this is the case. It's possible
    numacore is just able to trigger it due to higher rates of migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

13 Dec, 2012

1 commit

  • Pass vma instead of mm and add address parameter.

    In most cases we already have the vma on the stack. We provide
    split_huge_page_pmd_mm() for the few cases when we have the mm, but not
    the vma.

    This change is preparation to huge zero pmd splitting implementation.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: "H. Peter Anvin"
    Cc: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

11 Dec, 2012

1 commit

    To say that the PMD handling code was incorrectly transferred from
    autonuma is an understatement. The intention was to handle a PMD's worth
    of pages in the same fault and effectively batch the taking of the PTL
    and page migration. The copied version instead has the effect of
    clearing a number of pte_numa PTE entries, and whether any page
    migration takes place depends on racing. This just happens to work in
    some cases.

    This patch handles pte_numa faults in batch when a pmd_numa fault is
    handled. The pages are migrated if they are currently misplaced.
    Essentially this is making an assumption that NUMA locality is
    on a PMD boundary but that could be addressed by only setting
    pmd_numa if all the pages within that PMD are on the same node
    if necessary.

    Signed-off-by: Mel Gorman

    Mel Gorman