11 Dec, 2014

1 commit

  • Like the small zero page, the huge zero page should not be accounted
    as a normal page in the smaps report.

    For small pages we rely on vm_normal_page() to filter out the zero page,
    but vm_normal_page() is not designed to handle pmds. We only get here due
    to a hackish cast of pmd to pte in smaps_pte_range() -- the pte and pmd
    formats are not necessarily compatible on each and every architecture.

    Let's add a separate codepath to handle pmds. follow_trans_huge_pmd()
    will detect the huge zero page for us.

    We need a pmd_dirty() helper to do this properly. The patch adds it
    to the THP-enabled architectures which don't yet have one.
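
    For reference, a minimal sketch of what such a helper looks like on x86,
    where the dirty bit lives in the pmd flags (the patch adds an equivalent
    to the other THP-enabled architectures):

    static inline int pmd_dirty(pmd_t pmd)
    {
            return pmd_flags(pmd) & _PAGE_DIRTY;
    }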

    [akpm@linux-foundation.org: use do_div to fix 32-bit build]
    Signed-off-by: "Kirill A. Shutemov"
    Reported-by: Fengguang Wu
    Tested-by: Fengwei Yin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

18 Nov, 2014

1 commit

  • MPX-enabled applications using large swaths of memory can
    potentially have large numbers of bounds tables in the process
    address space to save bounds information. These tables can take
    up huge amounts of memory (as much as 80% of the memory on the
    system) even if we clean them up aggressively. In the worst-case
    scenario, the tables can be 4x the size of the data structure
    being tracked. IOW, a 1-page structure can require 4 bounds-table
    pages.

    Being this huge, our expectation is that folks using MPX are
    going to be keen on figuring out how much memory is being
    dedicated to it. So we need a way to track memory use for MPX.

    If we want to specifically track MPX VMAs we need to be able to
    distinguish them from normal VMAs, and keep them from getting
    merged with normal VMAs. A new VM_ flag set only on MPX VMAs does
    both of those things. With this flag, MPX bounds-table VMAs can
    be distinguished from other VMAs, and userspace can also walk
    /proc/$pid/smaps to get memory usage for MPX.

    In addition to this flag, we also introduce a special ->vm_ops
    specific to MPX VMAs (see the patch "add MPX specific mmap
    interface"), but currently different ->vm_ops do not by
    themselves prevent VMA merging, so we still need this flag.

    We understand that VM_ flags are scarce and are open to other
    options.
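
    As a hedged illustration of the smaps angle, userspace could total the
    bounds-table memory by scanning /proc/self/smaps and summing the Size:
    of mappings whose VmFlags line carries the flag's two-letter mnemonic --
    assumed here to be "mp"; the real mnemonic is whatever the patch
    registers for the new VM_ flag:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            FILE *f = fopen("/proc/self/smaps", "r");
            char line[256];
            long total_kb = 0, cur_kb = 0;

            if (!f)
                    return 1;
            while (fgets(line, sizeof(line), f)) {
                    /* remember the size of the mapping we are inside */
                    if (sscanf(line, "Size: %ld kB", &cur_kb) == 1)
                            continue;
                    /* count it if its VmFlags carry the MPX mnemonic */
                    if (!strncmp(line, "VmFlags:", 8) && strstr(line, " mp "))
                            total_kb += cur_kb;
            }
            fclose(f);
            printf("MPX bounds tables: %ld kB\n", total_kb);
            return 0;
    }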

    Signed-off-by: Qiaowei Ren
    Signed-off-by: Dave Hansen
    Cc: linux-mm@kvack.org
    Cc: linux-mips@linux-mips.org
    Cc: Dave Hansen
    Link: http://lkml.kernel.org/r/20141114151825.565625B3@viggo.jf.intel.com
    Signed-off-by: Thomas Gleixner

    Qiaowei Ren
     

14 Oct, 2014

1 commit

  • For VMAs that don't want write notifications, PTEs created for read faults
    have their write bit set. If the read fault happens after VM_SOFTDIRTY is
    cleared, then the PTE's softdirty bit will remain clear after subsequent
    writes.

    Here's a simple code snippet to demonstrate the bug:

    char *m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
                   MAP_ANONYMOUS | MAP_SHARED, -1, 0);
    system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
    assert(*m == '\0');     /* new PTE allows write access */
    assert(!soft_dirty(m));
    *m = 'x';               /* should dirty the page */
    assert(soft_dirty(m));  /* fails */
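
    (soft_dirty() is not spelled out above; a minimal sketch of it, assuming
    the documented pagemap format where bit 55 of an entry is the soft-dirty
    bit -- needs <fcntl.h>, <stdint.h>, <unistd.h>:)

    static int soft_dirty(void *addr)
    {
            uint64_t pme = 0;
            int fd = open("/proc/self/pagemap", O_RDONLY);

            pread(fd, &pme, 8, (uintptr_t) addr / getpagesize() * 8);
            close(fd);
            return (pme >> 55) & 1;
    }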

    With this patch, write notifications are enabled when VM_SOFTDIRTY is
    cleared. Furthermore, to avoid unnecessary faults, write notifications
    are disabled when VM_SOFTDIRTY is set.

    As a side effect of enabling and disabling write notifications with
    care, this patch fixes a bug in mprotect where vm_page_prot bits set by
    drivers were zapped on mprotect. An analogous bug was fixed in mmap by
    commit c9d0bf241451 ("mm: uncached vma support with writenotify").

    Signed-off-by: Peter Feiner
    Reported-by: Peter Feiner
    Suggested-by: Kirill A. Shutemov
    Cc: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Jamie Liu
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: Bjorn Helgaas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Feiner
     

10 Oct, 2014

15 commits

  • If a /proc/pid/pagemap read spans [a VMA, an unmapped region, then a
    VM_SOFTDIRTY VMA], the virtual pages in the unmapped region are reported
    as softdirty. Here's a program to demonstrate the bug:

    #include <assert.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            const uint64_t PAGEMAP_SOFTDIRTY = 1ul << 55;
            uint64_t pme[3];
            int fd = open("/proc/self/pagemap", O_RDONLY);
            char *m = mmap(NULL, 3 * getpagesize(), PROT_READ,
                           MAP_ANONYMOUS | MAP_SHARED, -1, 0);

            munmap(m + getpagesize(), getpagesize());
            pread(fd, pme, 24, (unsigned long) m / getpagesize() * 8);
            assert(pme[0] & PAGEMAP_SOFTDIRTY);    /* passes */
            assert(!(pme[1] & PAGEMAP_SOFTDIRTY)); /* fails */
            assert(pme[2] & PAGEMAP_SOFTDIRTY);    /* passes */
            return 0;
    }

    (Note that all pages in new VMAs are softdirty until cleared).

    Tested:
    Used the program given above. I'm going to include this code in
    a selftest in the future.

    [n-horiguchi@ah.jp.nec.com: prevent pagemap_pte_range() from overrunning]
    Signed-off-by: Peter Feiner
    Cc: "Kirill A. Shutemov"
    Cc: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Jamie Liu
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Feiner
     
  • 9e7814404b77 "hold task->mempolicy while numa_maps scans." fixed the
    race with the exiting task but this is not enough.

    The current code assumes that get_vma_policy(task) should either see
    task->mempolicy == NULL or it should be equal to ->task_mempolicy saved
    by hold_task_mempolicy(), so we can never race with __mpol_put(). But
    this can only work if we can't race with do_set_mempolicy(), and thus
    we can't race with another do_set_mempolicy() or do_exit() after that.

    However, do_set_mempolicy()->down_write(mmap_sem) can not prevent this
    race. This task can exec, change its ->mm, and call do_set_mempolicy()
    after that; in this case they take 2 different locks.

    Change hold_task_mempolicy() to use get_task_policy(), which never
    returns NULL, and change show_numa_map() to use __get_vma_policy() or
    fall back to proc_priv->task_mempolicy.

    Note: this is the minimal fix, we will clean up this code later. I think
    hold_task_mempolicy() and release_task_mempolicy() should die; we can
    move this logic into show_numa_map(). Or at least we can move
    get_task_policy() outside of ->mmap_sem and the !CONFIG_NUMA code.

    Signed-off-by: Oleg Nesterov
    Cc: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Alexander Viro
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Cc: "Kirill A. Shutemov"
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Andi Kleen
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • - Rename vm_is_stack() to task_of_stack() and change it to return
    "struct task_struct *" rather than the global (and thus wrong in
    general) pid_t.

    - Add the new pid_of_stack() helper which calls task_of_stack() and
    uses the right namespace to report the correct pid_t.

    Unfortunately we need to define this helper twice, in task_mmu.c
    and in task_nommu.c. Perhaps it makes sense to add fs/proc/util.c
    and move at least pid_of_stack/task_of_stack there to avoid the
    code duplication.

    - Change show_map_vma() and show_numa_map() to use the new helper.
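
    Roughly, the new helper looks like this (a sketch from the description
    above, assuming the proc_maps_private->inode introduced earlier in this
    series):

    static pid_t pid_of_stack(struct proc_maps_private *priv,
                              struct vm_area_struct *vma, bool is_pid)
    {
            struct inode *inode = priv->inode;
            struct task_struct *task;
            pid_t ret = 0;

            rcu_read_lock();
            task = pid_task(proc_pid(inode), PIDTYPE_PID);
            if (task) {
                    task = task_of_stack(task, vma, is_pid);
                    if (task)
                            /* report the pid in the inode's pid namespace */
                            ret = task_pid_nr_ns(task, inode->i_sb->s_fs_info);
            }
            rcu_read_unlock();
            return ret;
    }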

    Signed-off-by: Oleg Nesterov
    Cc: Alexander Viro
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Cc: Greg Ungerer
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • m_start() can use get_proc_task() instead, and "struct inode *"
    provides more potentially useful info, see the next changes.

    Signed-off-by: Oleg Nesterov
    Cc: Alexander Viro
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Cc: Greg Ungerer
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Change the main loop in m_start() to update m->version. Mostly for
    consistency, but this can help to avoid the same loop if the very
    1st ->show() fails due to seq_overflow().

    Signed-off-by: Oleg Nesterov
    Cc: Kirill A. Shutemov
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Add the "last_addr" optimization back. Like before, every ->show()
    method checks !seq_overflow() and sets m->version = vma->vm_start.

    However, it also checks that m_next_vma(vma) != NULL, otherwise it
    sets m->version = -1 for the lockless "EOF" fast-path in m_start().

    m_start() can simply do find_vma() + m_next_vma() if last_addr is
    not zero; the code looks clear and simple, and this case is cleanly
    separated from the "scan vmas" path, as the sketch below shows.
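
    A sketch of the fast path this adds to m_start() (using the m_next_vma()
    helper from the previous patch):

    /* the lockless EOF fast-path: ->show() saw the last vma */
    if (last_addr == -1UL)
            return NULL;

    if (last_addr) {
            vma = find_vma(mm, last_addr);
            if (vma && (vma = m_next_vma(priv, vma)))
                    return vma;
    }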

    Signed-off-by: Oleg Nesterov
    Cc: Kirill A. Shutemov
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Extract the tail_vma/vm_next calculation from m_next() into the new
    trivial helper, m_next_vma().
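
    The helper, as described (a sketch):

    static struct vm_area_struct *
    m_next_vma(struct proc_maps_private *priv, struct vm_area_struct *vma)
    {
            if (vma == priv->tail_vma)
                    return NULL;
            /* the gate vma, if any, comes after the last real vma */
            return vma->vm_next ?: priv->tail_vma;
    }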

    Signed-off-by: Oleg Nesterov
    Cc: Kirill A. Shutemov
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Now that m->version is gone we can cleanup m_start(). In particular,

    - Remove the "unsigned long" typecast; m->index can't be negative
    or exceed ->map_count. But let's use "unsigned int pos" to make
    it clear that "pos < map_count" is safe.

    - Remove the unnecessary "vma != NULL" check in the main loop. It
    can't be NULL unless we have a vm bug.

    - This also means that the "pos < map_count" case can simply return
    the valid vma and avoid the "goto" and subsequent checks, as the
    sketch below shows.
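
    A sketch of what the main loop boils down to after these changes
    (simplified, error paths omitted):

    unsigned int pos = *ppos;

    if (pos < mm->map_count) {
            struct vm_area_struct *vma = mm->mmap;

            while (pos-- > 0)
                    vma = vma->vm_next;
            return vma;
    }

    /* pos == map_count stands for the gate vma, anything past it is EOF */
    return pos == mm->map_count ? priv->tail_vma : NULL;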

    Signed-off-by: Oleg Nesterov
    Cc: Kirill A. Shutemov
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • m_start() carefully documents, checks, and sets "m->version = -1" if
    we are going to return NULL. The only problem is that we will never be
    called again if m_start() returns NULL, so this is simply pointless
    and misleading.

    Otoh, the ->show() methods set m->version = 0 if vma == tail_vma, and
    this is just wrong: we want -1 in this case. And in fact we also want
    -1 if ->vm_next == NULL and ->tail_vma == NULL.

    And it is not used consistently; the "scan vmas" loop in m_start()
    should update last_addr too.

    Finally, imo the whole "last_addr" logic in m_start() looks horrible.
    find_vma(last_addr) is called unconditionally even if we are not going
    to use the result. But the main problem is that this code participates
    in the tail_vma-or-NULL mess, and this looks simply unfixable.

    Remove this optimization. We will add it back after some cleanups.

    Signed-off-by: Oleg Nesterov
    Cc: Kirill A. Shutemov
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • 1. There is no reason to reset ->tail_vma in m_start(); if we return
    IS_ERR_OR_NULL() it won't be used.

    2. m_start() also clears priv->task to ensure that m_stop() won't use
    the stale pointer if we fail before get_task_struct(). But this is
    ugly and confusing; move this initialization into m_stop().

    Signed-off-by: Oleg Nesterov
    Acked-by: Kirill A. Shutemov
    Acked-by: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • 1. Kill the first "vma != NULL" check. Firstly, this is not possible:
    m_next() won't be called if ->start() or the previous ->next()
    returned NULL.

    And if it were possible, the 2nd "vma != tail_vma" check would be
    buggy; we should not wrongly return ->tail_vma.

    2. Make this function readable. The logic is very simple: check
    "vma != tail_vma" once and return "vm_next || tail_vma", as sketched
    below.
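
    The simplified function, roughly (a sketch; vma_stop() releases the mm
    as in the surrounding patches):

    static void *m_next(struct seq_file *m, void *v, loff_t *pos)
    {
            struct proc_maps_private *priv = m->private;
            struct vm_area_struct *tail_vma = priv->tail_vma;
            struct vm_area_struct *vma = v, *next = NULL;

            (*pos)++;
            if (vma != tail_vma)
                    next = vma->vm_next ?: tail_vma;
            if (!next)
                    vma_stop(priv);
            return next;
    }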

    Signed-off-by: Oleg Nesterov
    Acked-by: Kirill A. Shutemov
    Acked-by: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • m_start() drops ->mmap_sem and does mmput() if it returns the vsyscall
    vma. This is because in this case m_stop()->vma_stop() obviously
    can't use gate_vma->vm_mm.

    Now that we have proc_maps_private->mm we can simplify this logic:

    - Change m_start() to return with ->mmap_sem held unless it returns
    IS_ERR_OR_NULL().

    - Change vma_stop() to use priv->mm and avoid the ugly vma checks,
    this makes "vm_area_struct *vma" unnecessary.

    - This also allows m_start() to use vma_stop().

    - Cleanup m_next() to follow the new locking rule.

    Note: m_stop() looks very ugly, and this temporarily uglifies it
    even more. Fixed by the next change.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Oleg Nesterov
    Acked-by: Kirill A. Shutemov
    Acked-by: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • A simple test-case from Kirill Shutemov

    cat /proc/self/maps >/dev/null
    chmod +x /proc/self/net/packet
    exec /proc/self/net/packet

    makes lockdep unhappy, cat/exec take seq_file->lock + cred_guard_mutex in
    the opposite order.

    It's a false positive and probably we should not allow "chmod +x" on proc
    files. Still I think that we should avoid mm_access() and cred_guard_mutex
    in sys_read() paths, security checking should happen at open time. Besides,
    this doesn't even look right if the task changes its ->mm between m_stop()
    and m_start().

    Add the new "mm_struct *mm" member into struct proc_maps_private and change
    proc_maps_open() to initialize it using proc_mem_open(). Change m_start() to
    use priv->mm if atomic_inc_not_zero(mm_users) succeeds or return NULL (eof)
    otherwise.

    The only complication is that proc_maps_open() users should additionally
    do mmdrop() in fop->release(); add the new proc_map_release() helper for
    that.

    Note: this is a user-visible change. If the task execs after open("maps"),
    the new ->mm won't be visible via this file. I hope this is fine, and this
    matches /proc/pid/mem behaviour.
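
    The resulting scheme, roughly (a sketch; error handling trimmed):

    /* proc_maps_open(): stash the mm once, at open time */
    priv->mm = proc_mem_open(inode, PTRACE_MODE_READ);

    /* m_start(): only use the mm if it is still alive */
    mm = priv->mm;
    if (!mm || !atomic_inc_not_zero(&mm->mm_users))
            return NULL;    /* eof */

    /* proc_map_release(): drop the reference taken at open time */
    if (priv->mm)
            mmdrop(priv->mm);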

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Oleg Nesterov
    Reported-by: "Kirill A. Shutemov"
    Acked-by: Kirill A. Shutemov
    Acked-by: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • do_maps_open() and numa_maps_open() are overcomplicated, they could use
    __seq_open_private(). Plus they do the same thing; only sizeof(*priv)
    differs.

    Change them to use a new simple helper, proc_maps_open(ops, psize). This
    simplifies the code and allows us to do the next changes.
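
    A sketch of the helper (the exact fields initialized may differ):

    static int proc_maps_open(struct inode *inode, struct file *file,
                              const struct seq_operations *ops, int psize)
    {
            struct proc_maps_private *priv =
                    __seq_open_private(file, ops, psize);

            if (!priv)
                    return -ENOMEM;

            priv->pid = proc_pid(inode);
            return 0;
    }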

    Signed-off-by: Oleg Nesterov
    Acked-by: Kirill A. Shutemov
    Acked-by: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • get_gate_vma(priv->task->mm) looks ugly and wrong: task->mm can be NULL,
    or it can be changed by exec right after mm_access().

    And in theory this race is not harmless, the task can exec and then later
    exit and free the new mm_struct. In this case get_task_mm(oldmm) can't
    help, get_gate_vma(task->mm) can read the freed/unmapped memory.

    I think that priv->task should simply die and hold_task_mempolicy() logic
    can be simplified. tail_vma logic asks for cleanups too.

    Signed-off-by: Oleg Nesterov
    Acked-by: Kirill A. Shutemov
    Acked-by: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

26 Sep, 2014

1 commit

  • In PTE holes that contain VM_SOFTDIRTY VMAs, unmapped addresses before
    VM_SOFTDIRTY VMAs are reported as softdirty by /proc/pid/pagemap. This
    bug was introduced in commit 68b5a6524856 ("mm: softdirty: respect
    VM_SOFTDIRTY in PTE holes"). That commit made /proc/pid/pagemap look at
    VM_SOFTDIRTY in PTE holes but neglected to observe the start of VMAs
    returned by find_vma.
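
    A simplified sketch of the fixed hole walk (error handling dropped;
    pme_clean and pme_sd are stand-ins for the not-present pagemap entries
    without and with __PM_SOFT_DIRTY): only addresses inside the VMA inherit
    VM_SOFTDIRTY, while the gap before vma->vm_start stays clean:

    while (addr < end) {
            struct vm_area_struct *vma = find_vma(walk->mm, addr);
            unsigned long hole_end = vma ? min(end, vma->vm_start) : end;

            /* a true hole before the next VMA: never soft-dirty */
            for (; addr < hole_end; addr += PAGE_SIZE)
                    add_to_pagemap(addr, &pme_clean, pm);

            if (!vma)
                    break;

            /* unpopulated page tables inside the VMA: honour VM_SOFTDIRTY */
            for (; addr < min(end, vma->vm_end); addr += PAGE_SIZE)
                    add_to_pagemap(addr, vma->vm_flags & VM_SOFTDIRTY ?
                                         &pme_sd : &pme_clean, pm);
    }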

    Tested:
    Wrote a selftest that creates a PMD-sized VMA then unmaps the first
    page and asserts that the page is not softdirty. I'm going to send the
    pagemap selftest in a later commit.

    Signed-off-by: Peter Feiner
    Cc: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: "Kirill A. Shutemov"
    Cc: Jamie Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Feiner
     

07 Aug, 2014

1 commit

  • After a VMA is created with the VM_SOFTDIRTY flag set, /proc/pid/pagemap
    should report that the VMA's virtual pages are soft-dirty until
    VM_SOFTDIRTY is cleared (i.e., by the next write of "4" to
    /proc/pid/clear_refs). However, pagemap ignores the VM_SOFTDIRTY flag
    for virtual addresses that fall in PTE holes (i.e., virtual addresses
    that don't have a PMD, PUD, or PGD allocated yet).

    To observe this bug, use mmap to create a VMA large enough such that
    there's a good chance that the VMA will occupy an unused PMD, then test
    the soft-dirty bit on its pages. In practice, I found that a VMA that
    covered a PMD's worth of address space was big enough.

    This patch adds the necessary VMA lookup to the PTE hole callback in
    /proc/pid/pagemap's page walk and sets soft-dirty according to the VMAs'
    VM_SOFTDIRTY flag.

    Signed-off-by: Peter Feiner
    Acked-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Hugh Dickins
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Feiner
     

09 Jun, 2014

1 commit

  • Now that 3.15 is released, this merges the 'next' branch into 'master',
    bringing us to the normal situation where my 'master' branch is the
    merge window.

    * accumulated work in next: (6809 commits)
    ufs: sb mutex merge + mutex_destroy
    powerpc: update comments for generic idle conversion
    cris: update comments for generic idle conversion
    idle: remove cpu_idle() forward declarations
    nbd: zero from and len fields in NBD_CMD_DISCONNECT.
    mm: convert some level-less printks to pr_*
    MAINTAINERS: adi-buildroot-devel is moderated
    MAINTAINERS: add linux-api for review of API/ABI changes
    mm/kmemleak-test.c: use pr_fmt for logging
    fs/dlm/debug_fs.c: replace seq_printf by seq_puts
    fs/dlm/lockspace.c: convert simple_str to kstr
    fs/dlm/config.c: convert simple_str to kstr
    mm: mark remap_file_pages() syscall as deprecated
    mm: memcontrol: remove unnecessary memcg argument from soft limit functions
    mm: memcontrol: clean up memcg zoneinfo lookup
    mm/memblock.c: call kmemleak directly from memblock_(alloc|free)
    mm/mempool.c: update the kmemleak stack trace for mempool allocations
    lib/radix-tree.c: update the kmemleak stack trace for radix tree allocations
    mm: introduce kmemleak_update_trace()
    mm/kmemleak.c: use %u to print ->checksum
    ...

    Linus Torvalds
     

07 Jun, 2014

2 commits

  • Signed-off-by: Fabian Frederick
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • The page table walker doesn't check for non-present hugetlb entries in
    the common path, so hugetlb_entry() callbacks must check it. The reason
    for this behavior is that some callers want to handle it in their own
    way.

    [ I think that reason is bogus, btw - it should just do what the regular
    code does, which is to call the "pte_hole()" function for such hugetlb
    entries - Linus ]

    However, some callers don't check it now, which causes unpredictable
    results, for example when we have a race between migrating a hugepage
    and reading /proc/pid/numa_maps. This patch fixes it by adding
    !pte_present checks to the buggy callbacks.

    This bug has existed for years and became visible with the introduction
    of hugepage migration.

    ChangeLog v2:
    - fix if condition (check !pte_present() instead of pte_present())

    Reported-by: Sasha Levin
    Signed-off-by: Naoya Horiguchi
    Cc: Rik van Riel
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    [ Backported to 3.15. Signed-off-by: Josh Boyer ]
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

05 Jun, 2014

2 commits

  • Pull x86 vdso updates from Peter Anvin:
    "Vdso cleanups and improvements largely from Andy Lutomirski. This
    makes the vdso a lot less 'special'"

    * 'x86/vdso' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/vdso, build: Make LE access macros clearer, host-safe
    x86/vdso, build: Fix cross-compilation from big-endian architectures
    x86/vdso, build: When vdso2c fails, unlink the output
    x86, vdso: Fix an OOPS accessing the HPET mapping w/o an HPET
    x86, mm: Replace arch_vma_name with vm_ops->name for vsyscalls
    x86, mm: Improve _install_special_mapping and fix x86 vdso naming
    mm, fs: Add vm_ops->name as an alternative to arch_vma_name
    x86, vdso: Fix an OOPS accessing the HPET mapping w/o an HPET
    x86, vdso: Remove vestiges of VDSO_PRELINK and some outdated comments
    x86, vdso: Move the vvar and hpet mappings next to the 64-bit vDSO
    x86, vdso: Move the 32-bit vdso special pages after the text
    x86, vdso: Reimplement vdso.so preparation in build-time C
    x86, vdso: Move syscall and sysenter setup into kernel/cpu/common.c
    x86, vdso: Clean up 32-bit vs 64-bit vdso params
    x86, mm: Ensure correct alignment of the fixmap

    Linus Torvalds
     
  • clear_refs_write() is called earlier than clear_soft_dirty() and it is
    more natural to clear VM_SOFTDIRTY (which belongs to the VMA entry, not
    the PTEs) that early, instead of clearing it way deeper in the call
    chain.

    Signed-off-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     

21 May, 2014

1 commit

  • arch_vma_name sucks. It's a silly hack, and it's annoying to
    implement correctly. In fact, AFAICS, even the straightforward x86
    implementation is incorrect (I suspect that it breaks if the vdso
    mapping is split or gets remapped).

    This adds a new vm_ops->name operation that can replace it. The
    followup patches will remove all uses of arch_vma_name on x86,
    fixing a couple of annoyances in the process.

    Signed-off-by: Andy Lutomirski
    Link: http://lkml.kernel.org/r/2eee21791bb36a0a408c5c2bdb382a9e6a41ca4a.1400538962.git.luto@amacapital.net
    Signed-off-by: H. Peter Anvin

    Andy Lutomirski
     

08 Apr, 2014

1 commit

  • This patch is a continuation of efforts trying to optimize find_vma(),
    avoiding potentially expensive rbtree walks to locate a vma upon faults.
    The original approach (https://lkml.org/lkml/2013/11/1/410), where the
    largest vma was also cached, ended up being too specific and random,
    thus further comparison with other approaches was needed. There are
    two things to consider when dealing with this: the cache hit rate and
    the latency of find_vma(). Improving the hit rate does not necessarily
    translate into finding the vma any faster, as the overhead of any fancy
    caching scheme can be too high to consider.

    We currently cache the last used vma for the whole address space, which
    provides a nice optimization, reducing the total cycles spent in
    find_vma() by as much as 2.5x for workloads with good locality. On the
    other hand, this simple scheme is pretty much useless for workloads with
    poor locality.
    Analyzing ebizzy runs shows that, no matter how many threads are
    running, the mmap_cache hit rate is less than 2%, and in many situations
    below 1%.

    The proposed approach is to replace this scheme with a small per-thread
    cache, maximizing hit rates at a very low maintenance cost.
    Invalidations are performed by simply bumping up a 32-bit sequence
    number. The only expensive operation is in the rare case of a seq
    number overflow, where all caches that share the same address space are
    flushed. Upon a miss, the proposed replacement policy is based on the
    page number that contains the virtual address in question. Concretely,
    the following results are seen on an 80 core, 8 socket x86-64 box:

    1) System bootup: Most programs are single threaded, so the per-thread
    scheme improves the ~50% baseline hit rate just by adding a few more
    slots to the cache.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |   50.61% |            19.90 |
    | patched        |   73.45% |            13.58 |
    +----------------+----------+------------------+

    2) Kernel build: This one is already pretty good with the current
    approach as we're dealing with good locality.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |   75.28% |            11.03 |
    | patched        |   88.09% |             9.31 |
    +----------------+----------+------------------+

    3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |   70.66% |            17.14 |
    | patched        |   91.15% |            12.57 |
    +----------------+----------+------------------+

    4) Ebizzy: There's a fair amount of variation from run to run, but this
    approach always shows nearly perfect hit rates, while the baseline is
    just about non-existent. The cycle count can fluctuate anywhere from
    ~60 to ~116 billion for the baseline scheme, but this approach reduces
    it considerably. For instance, with 80 threads:

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |    1.06% |            91.54 |
    | patched        |   99.97% |            14.18 |
    +----------------+----------+------------------+
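
    A sketch of the lookup and the page-number replacement policy described
    above (constants and helpers as in the patch, modulo details):

    #define VMACACHE_BITS        2
    #define VMACACHE_SIZE        (1U << VMACACHE_BITS)
    #define VMACACHE_MASK        (VMACACHE_SIZE - 1)
    #define VMACACHE_HASH(addr)  ((addr >> PAGE_SHIFT) & VMACACHE_MASK)

    struct vm_area_struct *vmacache_find(struct mm_struct *mm,
                                         unsigned long addr)
    {
            int i;

            /* seqnum mismatch means the cache was invalidated: flush it */
            if (!vmacache_valid(mm))
                    return NULL;

            for (i = 0; i < VMACACHE_SIZE; i++) {
                    struct vm_area_struct *vma = current->vmacache[i];

                    if (vma && vma->vm_start <= addr && vma->vm_end > addr)
                            return vma;
            }
            return NULL;
    }

    void vmacache_update(unsigned long addr, struct vm_area_struct *newvma)
    {
            /* on a miss, replace the slot picked by the faulting page number */
            if (vmacache_valid_mm(newvma->vm_mm))
                    current->vmacache[VMACACHE_HASH(addr)] = newvma;
    }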

    [akpm@linux-foundation.org: fix nommu build, per Davidlohr]
    [akpm@linux-foundation.org: document vmacache_valid() logic]
    [akpm@linux-foundation.org: attempt to untangle header files]
    [akpm@linux-foundation.org: add vmacache_find() BUG_ON]
    [hughd@google.com: add vmacache_valid_mm() (from Oleg)]
    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: adjust and enhance comments]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Acked-by: Linus Torvalds
    Reviewed-by: Michel Lespinasse
    Cc: Oleg Nesterov
    Tested-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

15 Nov, 2013

3 commits

  • All seq_printf() users here use "%n" to calculate the padding size;
    convert them to the seq_setwidth()/seq_pad() pair.

    Signed-off-by: Tetsuo Handa
    Signed-off-by: Kees Cook
    Cc: Joe Perches
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • With split ptlock it's important to know which lock
    pmd_trans_huge_lock() took. This patch adds one more parameter to the
    function to return the lock.

    In most places migration to the new api is trivial. The exception is
    move_huge_pmd(): we need to take two locks if the pmd tables are
    different.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • With split page table lock for PMD level we can't hold mm->page_table_lock
    while updating nr_ptes.

    Let's convert it to atomic_long_t to avoid races.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: Naoya Horiguchi
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

13 Nov, 2013

2 commits

  • This flag shows that the VMA is "newly created" and thus represents
    "dirty" in the task's VM.

    You can clear it by "echo 4 > /proc/pid/clear_refs."

    Signed-off-by: Naoya Horiguchi
    Cc: Wu Fengguang
    Cc: Pavel Emelyanov
    Acked-by: Cyrill Gorcunov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • mpol_to_str() should not fail. Currently, it either fails because the
    string buffer is too small or because a string hasn't been defined for a
    mempolicy mode.

    If a new mempolicy mode is introduced and no string is defined for it,
    just warn and return "unknown".

    If the buffer is too small, just truncate the string and return, the
    same behavior as snprintf().

    This also fixes a bug where there was no NULL-byte termination when
    doing *p++ = '=' or *p++ = ':' and maxlen had been reached.

    Signed-off-by: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Chen Gang
    Cc: Rik van Riel
    Cc: Dave Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

17 Oct, 2013

1 commit

  • If a page we are inspecting is in swap we may occasionally report it as
    having the soft dirty bit set (even if it is clean). The pte_soft_dirty()
    helper should be called on present ptes only.
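
    The shape of the fix, sketched: pick the helper that matches the pte
    state instead of calling pte_soft_dirty() unconditionally:

    if (pte_present(pte)) {
            if (pte_soft_dirty(pte))
                    flags2 |= __PM_SOFT_DIRTY;
    } else if (is_swap_pte(pte)) {
            if (pte_swp_soft_dirty(pte))
                    flags2 |= __PM_SOFT_DIRTY;
    }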

    Signed-off-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Andy Lutomirski
    Cc: Matt Mackall
    Cc: Xiao Guangrong
    Cc: Marcelo Tosatti
    Cc: KOSAKI Motohiro
    Cc: Stephen Rothwell
    Cc: Peter Zijlstra
    Cc: "Aneesh Kumar K.V"
    Reviewed-by: Naoya Horiguchi
    Cc: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     

12 Sep, 2013

2 commits

  • mpol_to_str() may fail and leave the buffer unfilled (e.g. on -EINVAL),
    so we need to check for this, or the buffer may not be zero-terminated
    and the next seq_printf() will cause an issue.

    The failure return needs to come after mpol_cond_put() to match
    get_vma_policy().

    Signed-off-by: Chen Gang
    Cc: Cyrill Gorcunov
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • Pavel reported that if a vma area gets unmapped and then mapped (or
    expanded) in place, the soft dirty tracker won't be able to recognize
    this situation since it works on the pte level and ptes get zapped on
    unmap, losing the soft dirty bit of course.

    So to resolve this situation we need to track actions on the vma level;
    this is where the VM_SOFTDIRTY flag comes in. When a new vma area is
    created (or an old one expanded) we set this bit, and keep it there
    until the application asks to clear the soft dirty bit.

    Thus when a user space application tracks memory changes it can now
    detect whether a vma area has been renewed.

    Reported-by: Pavel Emelyanov
    Signed-off-by: Cyrill Gorcunov
    Cc: Andy Lutomirski
    Cc: Matt Mackall
    Cc: Xiao Guangrong
    Cc: Marcelo Tosatti
    Cc: KOSAKI Motohiro
    Cc: Stephen Rothwell
    Cc: Peter Zijlstra
    Cc: "Aneesh Kumar K.V"
    Cc: Rob Landley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     

14 Aug, 2013

3 commits

  • Recently we met quite a lot of random kernel panic issues after enabling
    CONFIG_PROC_PAGE_MONITOR. After debugging, we found they had something
    to do with the following bug in pagemap:

    In struct pagemapread:

    struct pagemapread {
            int pos, len;
            pagemap_entry_t *buffer;
            bool v2;
    };

    pos counts PM_ENTRY_BYTES-sized entries in buffer, but len is the size
    of buffer in bytes, so comparing pos and len in add_page_map() to check
    whether the buffer is full is a mistake, and it can lead to a buffer
    overflow and random kernel panics.

    Correct len to be the total number of PM_ENTRY_BYTES-sized entries that
    fit in buffer.
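
    The fix, sketched: with len counting entries too, the full-buffer check
    in add_page_map() compares like with like:

    #define PM_ENTRY_BYTES  sizeof(pagemap_entry_t)

    /* allocation: pm.len is now a count of entries, not bytes */
    pm.len = PAGEMAP_WALK_SIZE >> PAGE_SHIFT;
    pm.buffer = kmalloc(pm.len * PM_ENTRY_BYTES, GFP_TEMPORARY);

    /* buffer-full check: both sides count entries */
    if (pm->pos >= pm->len)
            return PM_END_OF_BUFFER;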

    [akpm@linux-foundation.org: document pagemapread.pos and .len units, fix PM_ENTRY_BYTES definition]
    Signed-off-by: Yonghua Zheng
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    yonghua zheng
     
  • Andy reported that if a file page gets reclaimed we lose the soft-dirty
    bit if it was there, so save the _PAGE_BIT_SOFT_DIRTY bit when the page
    address gets encoded into the pte entry. Thus when a #pf happens on such
    a non-present pte we can restore it back.

    Reported-by: Andy Lutomirski
    Signed-off-by: Cyrill Gorcunov
    Acked-by: Pavel Emelyanov
    Cc: Matt Mackall
    Cc: Xiao Guangrong
    Cc: Marcelo Tosatti
    Cc: KOSAKI Motohiro
    Cc: Stephen Rothwell
    Cc: Peter Zijlstra
    Cc: "Aneesh Kumar K.V"
    Cc: Minchan Kim
    Cc: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Andy Lutomirski reported that if a page with the _PAGE_SOFT_DIRTY bit
    set gets swapped out, the bit is lost and no longer available when the
    pte is read back.

    To resolve this we introduce the _PTE_SWP_SOFT_DIRTY bit, which is saved
    in the pte entry for the page being swapped out. When such a page is to
    be read back from the swap cache we check for the bit's presence and, if
    it's there, we clear it and restore the former _PAGE_SOFT_DIRTY bit.

    One of the problems was to find a place in the pte entry where we can
    save the _PTE_SWP_SOFT_DIRTY bit while the page is in swap. _PAGE_PSE
    was chosen for that: it doesn't intersect with the swap entry format
    stored in the pte.
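
    On x86 the new helpers boil down to flag twiddling on the swap pte (a
    sketch; the bit is an alias for _PAGE_PSE as explained above):

    #define _PAGE_SWP_SOFT_DIRTY    _PAGE_PSE

    static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
    {
            return pte_set_flags(pte, _PAGE_SWP_SOFT_DIRTY);
    }

    static inline int pte_swp_soft_dirty(pte_t pte)
    {
            return pte_flags(pte) & _PAGE_SWP_SOFT_DIRTY;
    }

    static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
    {
            return pte_clear_flags(pte, _PAGE_SWP_SOFT_DIRTY);
    }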

    Reported-by: Andy Lutomirski
    Signed-off-by: Cyrill Gorcunov
    Acked-by: Pavel Emelyanov
    Cc: Matt Mackall
    Cc: Xiao Guangrong
    Cc: Marcelo Tosatti
    Cc: KOSAKI Motohiro
    Cc: Stephen Rothwell
    Cc: Peter Zijlstra
    Cc: "Aneesh Kumar K.V"
    Reviewed-by: Minchan Kim
    Reviewed-by: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     

04 Jul, 2013

2 commits

  • In order to reuse bits from pagemap entries gracefully, we leave the
    entries as is, but on pagemap open we emit a warning in dmesg that bits
    55-60 are about to change in a couple of releases. Next, if a user
    issues the soft-dirty clear command via the clear_refs file (it was
    disabled before v3.9) we assume that he's aware of the new pagemap
    format, note that fact, and report the bits in pagemap in the new
    manner.

    The "migration strategy" then looks like this:

    1. existing users are not affected -- they don't touch the soft-dirty
    feature, thus see old bits in pagemap, but are warned and have time to
    fix themselves
    2. those who use soft-dirty know about the new pagemap format
    3. some time soon we get rid of any signs of page-shift in pagemap, as
    well as this trick with clear-soft-dirty affecting the pagemap format.

    Signed-off-by: Pavel Emelyanov
    Cc: Matt Mackall
    Cc: Xiao Guangrong
    Cc: Glauber Costa
    Cc: Marcelo Tosatti
    Cc: KOSAKI Motohiro
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • The soft-dirty bit is a bit on a PTE which helps to track which pages a
    task writes to. In order to do this tracking one should

    1. Clear the soft-dirty bits from the PTEs ("echo 4 > /proc/PID/clear_refs")
    2. Wait some time.
    3. Read the soft-dirty bits (the 55th bit in /proc/PID/pagemap2 entries)

    To do this tracking, the writable bit is cleared from PTEs when the
    soft-dirty bit is, as sketched below. Thus, after this, when the task
    tries to modify a page at some virtual address, the #PF occurs and the
    kernel sets the soft-dirty bit on the respective PTE.

    Note that although all of the task's address space is marked as r/o
    after the soft-dirty bits are cleared, the #PF-s that occur after that
    are processed fast. This is so since the pages are still mapped to
    physical memory, and thus all the kernel does is find this out and put
    back the writable, dirty and soft-dirty bits on the PTE.

    Another thing to note is that when mremap moves PTEs they are marked
    with soft-dirty as well, since from the user perspective mremap modifies
    the virtual memory at mremap's new address.
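
    The write-protect-and-clear step, sketched for x86 (clear_refs drives
    this for every pte in the task's VMAs):

    static inline void clear_soft_dirty(struct vm_area_struct *vma,
                                        unsigned long addr, pte_t *pte)
    {
            pte_t ptent = *pte;

            /* drop the write bit so the next write faults ... */
            ptent = pte_wrprotect(ptent);
            /* ... and the fault handler re-marks the pte soft-dirty */
            ptent = pte_clear_flags(ptent, _PAGE_SOFT_DIRTY);
            set_pte_at(vma->vm_mm, addr, pte, ptent);
    }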

    Signed-off-by: Pavel Emelyanov
    Cc: Matt Mackall
    Cc: Xiao Guangrong
    Cc: Glauber Costa
    Cc: Marcelo Tosatti
    Cc: KOSAKI Motohiro
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov