24 Mar, 2012

1 commit

  • The motivation for this patchset was that I was looking for a way for a
    qemu-kvm process to exclude the guest memory from its core dump, which
    can be quite large. There are already a number of filter flags in
    /proc/<pid>/coredump_filter; however, these allow one to specify 'types'
    of kernel memory, not specific address ranges (which is needed in this
    case).

    Since there are no more vma flags available, the first patch eliminates
    the need for the 'VM_ALWAYSDUMP' flag. The flag is used internally by
    the kernel to mark vdso and vsyscall pages. However, it is simple
    enough to check if a vma covers a vdso or vsyscall page without the need
    for this flag.

    The second patch then replaces the 'VM_ALWAYSDUMP' flag with a new
    'VM_NODUMP' flag, which can be set by userspace using new madvise flags:
    'MADV_DONTDUMP', and unset via 'MADV_DODUMP'. The core dump filters
    continue to work the same as before unless 'MADV_DONTDUMP' is set on the
    region.
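
    From userspace the new flags are plain madvise() advice values. Below is
    a minimal, self-contained sketch of how a process could mark a region
    (the region size and error handling are illustrative, not from the
    patch; the #define fallbacks cover older headers):

    #include <stdio.h>
    #include <sys/mman.h>

    #ifndef MADV_DONTDUMP
    #define MADV_DONTDUMP 16
    #define MADV_DODUMP   17
    #endif

    int main(void)
    {
            size_t len = 64UL << 20;        /* stand-in for guest RAM */
            void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (buf == MAP_FAILED)
                    return 1;
            /* Exclude the region from any core dump of this process. */
            if (madvise(buf, len, MADV_DONTDUMP))
                    perror("madvise(MADV_DONTDUMP)");
            /* ... later, MADV_DODUMP makes it dumpable again. */
            if (madvise(buf, len, MADV_DODUMP))
                    perror("madvise(MADV_DODUMP)");
            return 0;
    }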

    The qemu code which implements this feature is at:

    http://people.redhat.com/~jbaron/qemu-dump/qemu-dump.patch

    In my testing the qemu core dump shrunk from 383MB -> 13MB with this
    patch.

    I also believe that the 'MADV_DONTDUMP' flag might be useful for
    security sensitive apps, which might want to select which areas are
    dumped.

    This patch:

    The VM_ALWAYSDUMP flag is currently used by the coredump code to
    indicate that a vma is part of a vsyscall or vdso section. However, we
    can determine if a vma is in one of these sections by checking it against
    the gate_vma and checking for a non-NULL return value from
    arch_vma_name(), thus freeing a valuable vma bit.
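
    A sketch of the replacement check (modeled on the description above; the
    helper name and its exact placement in the coredump code are
    assumptions):

    static bool always_dump_vma(struct vm_area_struct *vma)
    {
            /* Any vsyscall mapping? */
            if (vma == get_gate_vma(vma->vm_mm))
                    return true;
            /* arch_vma_name() returns non-NULL for vdso-style mappings. */
            if (arch_vma_name(vma))
                    return true;
            return false;
    }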

    Signed-off-by: Jason Baron
    Acked-by: Roland McGrath
    Cc: Chris Metcalf
    Cc: Avi Kivity
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Baron
     

23 Mar, 2012

1 commit

  • Merge first batch of patches from Andrew Morton:
    "A few misc things and all the MM queue"

    * emailed from Andrew Morton: (92 commits)
    memcg: avoid THP split in task migration
    thp: add HPAGE_PMD_* definitions for !CONFIG_TRANSPARENT_HUGEPAGE
    memcg: clean up existing move charge code
    mm/memcontrol.c: remove unnecessary 'break' in mem_cgroup_read()
    mm/memcontrol.c: remove redundant BUG_ON() in mem_cgroup_usage_unregister_event()
    mm/memcontrol.c: s/stealed/stolen/
    memcg: fix performance of mem_cgroup_begin_update_page_stat()
    memcg: remove PCG_FILE_MAPPED
    memcg: use new logic for page stat accounting
    memcg: remove PCG_MOVE_LOCK flag from page_cgroup
    memcg: simplify move_account() check
    memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat)
    memcg: kill dead prev_priority stubs
    memcg: remove PCG_CACHE page_cgroup flag
    memcg: let css_get_next() rely upon rcu_read_lock()
    cgroup: revert ss_id_lock to spinlock
    idr: make idr_get_next() good for rcu_read_lock()
    memcg: remove unnecessary thp check in page stat accounting
    memcg: remove redundant returns
    memcg: enum lru_list lru
    ...

    Linus Torvalds
     

22 Mar, 2012

5 commits

  • There's no difference between sync_mm_rss() and __sync_task_rss_stat(),
    so fold the latter into the former.

    Signed-off-by: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • sync_mm_rss() can only be used for current to avoid race conditions in
    iterating and clearing its per-task counters. Remove the task argument
    from it and its helper function, __sync_task_rss_stat(), to avoid the
    impression that it can be used safely for anything other than current.
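
    A sketch of the resulting signature change (the prototypes are assumed
    from the description):

    /* before: tempting, but only safe when task == current */
    void sync_mm_rss(struct task_struct *task, struct mm_struct *mm);

    /* after: the task is implicitly current */
    void sync_mm_rss(struct mm_struct *mm);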

    Signed-off-by: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Make get_mm_counter() always static inline; it is simple enough for
    that. Also remove the unused set_mm_counter().

    bloat-o-meter:

    add/remove: 0/1 grow/shrink: 4/12 up/down: 99/-341 (-242)
    function                   old     new   delta
    try_to_unmap_one           886     952     +66
    sys_remap_file_pages      1214    1230     +16
    dup_mm                    1684    1700     +16
    do_exit                   2277    2278      +1
    zap_page_range             208     205      -3
    unmap_region               304     296      -8
    static.oom_kill_process    554     546      -8
    try_to_unmap_file         1716    1700     -16
    getrusage                  925     909     -16
    flush_old_exec            1704    1688     -16
    static.dump_header         416     390     -26
    acct_update_integrals      218     187     -31
    do_task_stat              2986    2954     -32
    get_mm_counter              34       -     -34
    xacct_add_tsk              371     334     -37
    task_statm                 172     118     -54
    task_mem                   383     323     -60

    try_to_unmap_one() grows because update_hiwater_rss() is now completely
    inlined.

    Signed-off-by: Konstantin Khlebnikov
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • In some cases it may happen that pmd_none_or_clear_bad() is called with
    the mmap_sem held in read mode. In those cases the huge page faults can
    allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a
    false positive from pmd_bad() that will not like to see a pmd
    materializing as trans huge.

    It's not khugepaged causing the problem; khugepaged holds the mmap_sem
    in write mode (and all those sites must hold the mmap_sem in read mode
    to prevent pagetables from going away from under them; during code
    review it seems vm86 mode on 32bit kernels requires that too, unless
    it's restricted to 1 thread per process or UP builds). The race is only
    with the huge pagefaults that can convert a pmd_none() into a
    pmd_trans_huge().

    Effectively all these pmd_none_or_clear_bad() sites running with
    mmap_sem in read mode are somewhat speculative with the page faults, and
    the result is always undefined when they run simultaneously. This is
    probably why it wasn't common to run into this. For example if the
    madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
    fault, the hugepage will not be zapped, if the page fault runs first it
    will be zapped.

    Altering pmd_bad() not to error out if it finds hugepmds won't be enough
    to fix this, because zap_pmd_range would then proceed to call
    zap_pte_range (which would be incorrect if the pmd becomes a
    pmd_trans_huge()).

    The simplest way to fix this is to read the pmd in the local stack
    (regardless of what we read, no need of actual CPU barriers, only
    compiler barrier needed), and be sure it is not changing under the code
    that computes its value. Even if the real pmd is changing under the
    value we hold on the stack, we don't care. If we actually end up in
    zap_pte_range it means the pmd was not none already and it was not huge,
    and it can't become huge from under us (khugepaged locking explained
    above).

    All we need is to enforce that there is no way anymore that in a code
    path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad
    can run into a hugepmd. The overhead of a barrier() is just a compiler
    tweak and should not be measurable (I only added it for THP builds). I
    don't exclude different compiler versions may have prevented the race
    too by caching the value of *pmd on the stack (that hasn't been
    verified, but it wouldn't be impossible considering
    pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines
    and there's no external function called in between pmd_trans_huge and
    pmd_none_or_clear_bad).

    if (pmd_trans_huge(*pmd)) {
            if (next-addr != HPAGE_PMD_SIZE) {
                    VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
                    split_huge_page_pmd(vma->vm_mm, pmd);
            } else if (zap_huge_pmd(tlb, vma, pmd, addr))
                    continue;
            /* fall through */
    }
    if (pmd_none_or_clear_bad(pmd))
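
    A minimal sketch of the stabilized check described above (the helper
    name and CONFIG guard follow the description; the details are
    assumptions):

    static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
    {
            /*
             * Read once into a local; the barrier keeps the compiler
             * from re-reading *pmd in the checks below.
             */
            pmd_t pmdval = *pmd;
    #ifdef CONFIG_TRANSPARENT_HUGEPAGE
            barrier();
    #endif
            if (pmd_none(pmdval))
                    return 1;
            if (unlikely(pmd_bad(pmdval))) {
                    /* A hugepmd materializing under us is not "bad". */
                    if (!pmd_trans_huge(pmdval))
                            pmd_clear_bad(pmd);
                    return 1;
            }
            return 0;
    }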

    Because this race condition could be exercised without special
    privileges this was reported in CVE-2012-1179.

    The race was identified and fully explained by Ulrich who debugged it.
    I'm quoting his accurate explanation below, for reference.

    ====== start quote =======
    mapcount 0 page_mapcount 1
    kernel BUG at mm/huge_memory.c:1384!

    At some point prior to the panic, a "bad pmd ..." message similar to the
    following is logged on the console:

    mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).

    The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
    the page's PMD table entry.

    143 void pmd_clear_bad(pmd_t *pmd)
    144 {
    -> 145 pmd_ERROR(*pmd);
    146 pmd_clear(pmd);
    147 }

    After the PMD table entry has been cleared, there is an inconsistency
    between the actual number of PMD table entries that are mapping the page
    and the page's map count (_mapcount field in struct page). When the page
    is subsequently reclaimed, __split_huge_page() detects this inconsistency.

    1381 if (mapcount != page_mapcount(page))
    1382 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
    1383 mapcount, page_mapcount(page));
    -> 1384 BUG_ON(mapcount != page_mapcount(page));

    The root cause of the problem is a race of two threads in a multithreaded
    process. Thread B incurs a page fault on a virtual address that has never
    been accessed (PMD entry is zero) while Thread A is executing an madvise()
    system call on a virtual address within the same 2 MB (huge page) range.

              virtual address space
             .---------------------.
             |                     |
             |                     |
           .-|---------------------|
           | |                     |
           | |                     |<-- B(fault)
           | |/////////////////////|-.
      page | |/////////////////////|  > A(range)
           | |/////////////////////|-'
           | |                     |
           | |                     |
           '-|---------------------|
             |                     |
             |                     |
             '---------------------'

    - Thread A is executing an madvise(..., MADV_DONTNEED) system call
    on the virtual address range "A(range)" shown in the picture.

    sys_madvise
      // Acquire the semaphore in shared mode.
      down_read(&current->mm->mmap_sem)
      ...
      madvise_vma
        switch (behavior)
        case MADV_DONTNEED:
          madvise_dontneed
            zap_page_range
              unmap_vmas
                unmap_page_range
                  zap_pud_range
                    zap_pmd_range
                      //
                      // Assume that this huge page has never been accessed.
                      // I.e. content of the PMD entry is zero (not mapped).
                      //
                      if (pmd_trans_huge(*pmd)) {
                          // We don't get here due to the above assumption.
                      }
                      //
                      // Assume that Thread B incurred a page fault and
      .-------------> // sneaks in here as shown below.
      |               //
      |               if (pmd_none_or_clear_bad(pmd))
      |               {
      |                   if (unlikely(pmd_bad(*pmd)))
      |                       pmd_clear_bad
      |                       {
      |                           pmd_ERROR
      |                           // Log "bad pmd ..." message here.
      |                           pmd_clear
      |                           // Clear the page's PMD entry.
      |                           // Thread B incremented the map count
      |                           // in page_add_new_anon_rmap(), but
      |                           // now the page is no longer mapped
      |                           // by a PMD entry (-> inconsistency).
      |                       }
      |               }
      |
      v
    - Thread B is handling a page fault on virtual address "B(fault)" shown
    in the picture.

    ...
    do_page_fault
      __do_page_fault
        // Acquire the semaphore in shared mode.
        down_read_trylock(&mm->mmap_sem)
        ...
        handle_mm_fault
          if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
            // We get here due to the above assumption (PMD entry is zero).
            do_huge_pmd_anonymous_page
              alloc_hugepage_vma
                // Allocate a new transparent huge page here.
              ...
              __do_huge_pmd_anonymous_page
                ...
                spin_lock(&mm->page_table_lock)
                ...
                page_add_new_anon_rmap
                  // Here we increment the page's map count (starts at -1).
                  atomic_set(&page->_mapcount, 0)
                set_pmd_at
                  // Here we set the page's PMD entry which will be cleared
                  // when Thread A calls pmd_clear_bad().
                ...
                spin_unlock(&mm->page_table_lock)

    The mmap_sem does not prevent the race because both threads are acquiring
    it in shared mode (down_read). Thread B holds the page_table_lock while
    the page's map count and PMD table entry are updated. However, Thread A
    does not synchronize on that lock.

    ====== end quote =======

    [akpm@linux-foundation.org: checkpatch fixes]
    Reported-by: Ulrich Obergfell
    Signed-off-by: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Dave Jones
    Acked-by: Larry Woodman
    Acked-by: Rik van Riel
    Cc: [2.6.38+]
    Cc: Mark Salter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Pull munmap/truncate race fixes from Al Viro:
    "Fixes for racy use of unmap_vmas() on truncate-related codepaths"

    * 'vm' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    VM: make zap_page_range() callers that act on a single VMA use separate helper
    VM: make unmap_vmas() return void
    VM: don't bother with feeding upper limit to tlb_finish_mmu() in exit_mmap()
    VM: make zap_page_range() return void
    VM: can't go through the inner loop in unmap_vmas() more than once...
    VM: unmap_page_range() can return void

    Linus Torvalds
     

21 Mar, 2012

5 commits


20 Mar, 2012

1 commit


24 Jan, 2012

1 commit

  • Memory migration fills a pte with a migration entry and it doesn't
    update the rss counters. Then it replaces the migration entry with the
    new page (or the old one if migration failed). But between these two
    passes this pte can be unmapped, or a task can fork a child and it will
    get a copy of this migration entry. Nobody accounts for this in the rss
    counters.

    This patch properly adjusts the rss counters for migration entries in
    zap_pte_range() and copy_one_pte(). Thus we avoid extra atomic
    operations on the migration fast-path.
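
    A sketch of the zap_pte_range() side of the accounting (modeled on the
    description; the exact branch structure is an assumption):

    if (!pte_present(ptent)) {
            swp_entry_t entry = pte_to_swp_entry(ptent);

            if (is_migration_entry(entry)) {
                    struct page *page = migration_entry_to_page(entry);

                    /* The migration entry still pins an rss page. */
                    if (PageAnon(page))
                            rss[MM_ANONPAGES]--;
                    else
                            rss[MM_FILEPAGES]--;
            }
    }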

    Signed-off-by: Konstantin Khlebnikov
    Cc: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

13 Jan, 2012

1 commit

  • We have tlb_remove_tlb_entry to indicate that a pte tlb entry should be
    flushed, but no corresponding API for a pmd entry. This isn't a
    problem so far because THP is only for x86 currently and tlb_flush()
    under x86 will flush the entire TLB. But this is confusing and could be
    missed if THP is ported to another arch.
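
    A sketch of the new pmd-level hook, by analogy with the existing
    pte-level macro (the exact form is an assumption):

    #define tlb_remove_pmd_tlb_entry(tlb, pmdp, address)            \
            do {                                                    \
                    tlb->need_flush = 1;                            \
                    __tlb_remove_pmd_tlb_entry(tlb, pmdp, address); \
            } while (0)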

    Also convert tlb->need_flush = 1 to a VM_BUG_ON(!tlb->need_flush) in
    __tlb_remove_page() as suggested by Andrea Arcangeli. The
    __tlb_remove_page() function is supposed to be called after
    tlb_remove_xxx_tlb_entry() and we can catch any misuse.

    Signed-off-by: Shaohua Li
    Reviewed-by: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

07 Nov, 2011

1 commit

  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of
    net: inet_timewait_sock doesnt need
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     

03 Nov, 2011

1 commit

  • Michel, while working on the working set estimation code, noticed that
    calling get_page_unless_zero() on a random pfn_to_page(random_pfn)
    wasn't safe if the pfn ended up being a tail page of a transparent
    hugepage under splitting by __split_huge_page_refcount().

    He then found that the problem could also theoretically materialize with
    page_cache_get_speculative() during the speculative radix tree lookups
    that use get_page_unless_zero() in SMP, if the radix tree page is freed
    and reallocated and get_user_pages is called on it before
    page_cache_get_speculative has a chance to call get_page_unless_zero().

    So the best way to fix the problem is to keep page_tail->_count zero at
    all times. This will guarantee that get_page_unless_zero() can never
    succeed on any tail page. page_tail->_mapcount is guaranteed zero and
    is unused for all tail pages of a compound page, so we can simply
    account the tail page references there and transfer them to
    tail_page->_count in __split_huge_page_refcount() (in addition to the
    head_page->_mapcount).

    While debugging this s/_count/_mapcount/ change I also noticed that
    get_page() is called by direct-io.c on pages returned by
    get_user_pages(). That wasn't entirely safe because the two atomic_incs
    in get_page() weren't atomic. Other get_user_pages() users, like
    secondary-MMU page faults establishing shadow pagetables, would never
    call any superfluous get_page() after get_user_pages() returns. It's
    safer to make get_page() universally safe for tail pages and to use
    get_page_foll() within follow_page() (inside get_user_pages()).
    get_page_foll() is safe to do the refcounting for tail pages without
    taking any locks because it is run within PT lock protected critical
    sections (PT lock for pte and page_table_lock for pmd_trans_huge).
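
    A sketch of get_page_foll() for the tail-page case (modeled on the
    description; the locking comments summarize the paragraph above, and
    the exact structure is an assumption):

    static inline void get_page_foll(struct page *page)
    {
            if (unlikely(PageTail(page))) {
                    /*
                     * The caller holds the PT lock, so the compound page
                     * cannot be split from under us; no compound_lock is
                     * needed. Pin the head page and account the tail
                     * reference in _mapcount, keeping tail _count at 0.
                     */
                    atomic_inc(&page->first_page->_count);
                    atomic_inc(&page->_mapcount);
            } else {
                    VM_BUG_ON(atomic_read(&page->_count) <= 0);
                    atomic_inc(&page->_count);
            }
    }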

    The standard get_page() as invoked by direct-io instead will now take
    the compound_lock, but still only for tail pages. The direct-io paths
    are usually I/O bound, and the compound_lock is per THP so very
    fine-grained, so there's no risk of scalability issues with it. A simple
    direct-io benchmark with the full lockdep prove-locking and spinlock
    debugging infrastructure enabled shows identical performance and no
    overhead. So it's worth it. Ideally direct-io should stop calling
    get_page() on pages returned by get_user_pages(). The spinlock in
    get_page() is already optimized away for no-THP builds, but doing
    get_page() on tail pages returned by GUP is generally a rare operation
    and usually only run in I/O paths.

    This new refcounting on page_tail->_mapcount in addition to avoiding new
    RCU critical sections will also allow the working set estimation code to
    work without any further complexity associated to the tail page
    refcounting with THP.

    Signed-off-by: Andrea Arcangeli
    Reported-by: Michel Lespinasse
    Reviewed-by: Michel Lespinasse
    Reviewed-by: Minchan Kim
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Benjamin Herrenschmidt
    Cc: David Gibson
    Cc:
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

31 Oct, 2011

1 commit


26 Jul, 2011

3 commits

  • I haven't reproduced it myself but the fail scenario is that on such
    machines (notably ARM and some embedded powerpc), if you manage to hit
    that futex path on a writable page whose dirty bit has gone from the PTE,
    you'll livelock inside the kernel from what I can tell.

    It will go in a loop of trying the atomic access, failing, trying gup to
    "fix it up", getting success from gup, going back to the atomic access,
    failing again because dirty wasn't fixed, etc...

    So I think you essentially hang in the kernel.

    The scenario is probably rare-ish because the affected architectures are
    embedded and tend to not swap much (if at all), so we probably rarely
    hit the case where dirty is missing or young is missing, but I think
    Shan has a piece of SW that can reliably reproduce it using a shared
    writable mapping & fork or something like that.

    On archs that use SW tracking of dirty & young, a page without dirty is
    effectively mapped read-only, and a page without young is inaccessible
    in the PTE.

    Additionally, some architectures might lazily flush the TLB when relaxing
    write protection (by doing only a local flush), and expect a fault to
    invalidate the stale entry if it's still present on another processor.

    The futex code assumes that if the "in_atomic()" access -EFAULTs, it can
    "fix it up" by calling get_user_pages(), which would then be equivalent
    to taking the fault.

    However that isn't the case. get_user_pages() will not call
    handle_mm_fault() in the case where the PTE seems to have the right
    permissions, regardless of the dirty and young state. It will eventually
    update those bits ... in the struct page, but not in the PTE.

    Additionally, it will not handle the lazy TLB flushing that can be
    required by some architectures in the fault case.

    Basically, gup is the wrong interface for the job. The patch provides a
    more appropriate one which boils down to just calling handle_mm_fault()
    since what we are trying to do is simulate a real page fault.

    The futex code currently attempts to write to user memory within a
    pagefault disabled section, and if that fails, tries to fix it up using
    get_user_pages().

    This doesn't work on archs where the dirty and young bits are maintained
    by software, since they will gate access permission in the TLB, and will
    not be updated by gup().

    In addition, there's an expectation on some archs that a spurious write
    fault triggers a local TLB flush, and that is missing from the picture as
    well.

    I decided that adding those "features" to gup() would be too much for this
    already too complex function, and instead added a new simpler
    fixup_user_fault() which is essentially a wrapper around handle_mm_fault()
    which the futex code can call.
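
    A sketch of the new helper (the signature follows the description; the
    error mapping and use of find_extend_vma() are assumptions):

    int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
                         unsigned long address, unsigned int fault_flags)
    {
            struct vm_area_struct *vma;
            int ret;

            vma = find_extend_vma(mm, address);
            if (!vma || address < vma->vm_start)
                    return -EFAULT;

            /*
             * Simulate a real fault: sets dirty/young and flushes stale
             * TLB entries on archs that need a spurious-fault flush.
             */
            ret = handle_mm_fault(mm, vma, address, fault_flags);
            if (ret & VM_FAULT_ERROR)
                    return (ret & VM_FAULT_OOM) ? -ENOMEM : -EFAULT;
            if (ret & VM_FAULT_MAJOR)
                    tsk->maj_flt++;
            else
                    tsk->min_flt++;
            return 0;
    }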

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix some nits Darren saw, fiddle comment layout]
    Signed-off-by: Benjamin Herrenschmidt
    Reported-by: Shan Hai
    Tested-by: Shan Hai
    Cc: David Laight
    Acked-by: Peter Zijlstra
    Cc: Darren Hart
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     
  • Currently we keep the faulted page locked throughout the whole
    __do_fault call (except for the page_mkwrite code path) after calling
    the file system's fault code. If we do early COW, we allocate a new page
    which has to be charged to a memcg (mem_cgroup_newpage_charge).

    This function, however, might block for an unbounded amount of time if
    the memcg oom killer is disabled or a fork-bomb is running, because the
    only way out of the OOM situation is either an external event or an
    OOM-situation fix.

    In the end we are keeping the faulted page locked and blocking other
    processes from faulting it in, which is not good at all because we are
    basically punishing a potentially unrelated process for an OOM condition
    in a different group (I have seen a stuck system because of ld-2.11.1.so
    being locked).

    We can test this easily:

    % cgcreate -g memory:A
    % cgset -r memory.limit_in_bytes=64M A
    % cgset -r memory.memsw.limit_in_bytes=64M A
    % cd kernel_dir; cgexec -g memory:A make -j

    Then the whole system will be live-locked until you kill 'make -j'
    by hand (or push reboot...). This is because some important pages in
    a shared library are locked.

    Considering it again, the new page need not be allocated with
    lock_page() held. Moreover, the usual page allocation may dive into a
    long memory reclaim loop while holding lock_page() and can cause very
    long latency.

    There are 3 ways:

    1. Do allocation/charge before lock_page().
       Pros: simple; page allocation can be handled in the same manner.
             This will reduce the holding time of lock_page() in general.
       Cons: we do page allocation even if ->fault() returns an error.

    2. Do charge after unlock_page(). Even if the charge fails, it's just
       OOM.
       Pros: no impact to the non-memcg path.
       Cons: the implementation requires special care of the LRU and we
             need to modify page_add_new_anon_rmap()...

    3. Do unlock->charge->lock again.
       Pros: no impact to the non-memcg path.
       Cons: this may kill the LOCK_PAGE_RETRY optimization. We need to
             release the lock and get it again...

    This patch moves the "charge" and the memory allocation for the COW page
    before lock_page() (way 1). Then we can avoid scanning the LRU while
    holding a lock on a page, and the latency under lock_page() will be
    reduced.

    Then the above livelock disappears.
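
    A sketch of the reordering in __do_fault() (way 1; the variable names
    and error paths are assumptions):

    /*
     * Early-COW case: allocate and charge the new page before calling
     * ->fault(), which hands us back a locked page. Any memcg OOM wait
     * now happens without holding that page lock.
     */
    if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
            cow_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
            if (!cow_page)
                    return VM_FAULT_OOM;
            if (mem_cgroup_newpage_charge(cow_page, mm, GFP_KERNEL)) {
                    page_cache_release(cow_page);
                    return VM_FAULT_OOM;
            }
    }

    ret = vma->vm_ops->fault(vma, &vmf);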

    [akpm@linux-foundation.org: fix code layout]
    Signed-off-by: KAMEZAWA Hiroyuki
    Reported-by: Lutz Vieweg
    Original-idea-by: Michal Hocko
    Cc: Michal Hocko
    Cc: Ying Han
    Cc: Johannes Weiner
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • ZAP_BLOCK_SIZE became unused in the preemptible-mmu_gather work ("mm:
    Remove i_mmap_lock lockbreak"). So zap it.

    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

09 Jul, 2011

1 commit


28 Jun, 2011

1 commit


16 Jun, 2011

2 commits

  • Running a ktest.pl test, I hit the following bug on x86_32:

    ------------[ cut here ]------------
    WARNING: at arch/x86/mm/highmem_32.c:81 __kunmap_atomic+0x64/0xc1()
    Hardware name:
    Modules linked in:
    Pid: 93, comm: sh Not tainted 2.6.39-test+ #1
    Call Trace:
    [] warn_slowpath_common+0x7c/0x91
    [] ? __kunmap_atomic+0x64/0xc1
    [] ? __kunmap_atomic+0x64/0xc1
    [] warn_slowpath_null+0x22/0x24
    [] __kunmap_atomic+0x64/0xc1
    [] unmap_vmas+0x43a/0x4e0
    [] exit_mmap+0x91/0xd2
    [] mmput+0x43/0xad
    [] exit_mm+0x111/0x119
    [] do_exit+0x1ff/0x5fa
    [] ? set_current_blocked+0x3c/0x40
    [] ? sigprocmask+0x7e/0x8e
    [] do_group_exit+0x65/0x88
    [] sys_exit_group+0x18/0x1c
    [] sysenter_do_call+0x12/0x38
    ---[ end trace 8055f74ea3c0eb62 ]---

    Running a ktest.pl git bisect, found the culprit: commit e303297e6c3a
    ("mm: extended batches for generic mmu_gather")

    But although this was the commit triggering the bug, it was not the one
    originally responsible for the bug. That was commit d16dfc550f53 ("mm:
    mmu_gather rework").

    The code in zap_pte_range() has something that looks like the following:

    pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
    do {
            [...]
    } while (pte++, addr += PAGE_SIZE, addr != end);
    pte_unmap_unlock(pte - 1, ptl);

    The pte starts off pointing at the first element in the page table
    directory that was returned by the pte_offset_map_lock(). When it's done
    with the page, pte will be pointing to anything between the next entry and
    the first entry of the next page inclusive. By doing a pte - 1, this puts
    the pte back onto the original page, which is all that pte_unmap_unlock()
    needs.

    In most archs (64 bit), this is not an issue as the pte is ignored in
    the pte_unmap_unlock(). But on 32 bit archs, where things may be
    kmapped, it is essential that the pte passed to pte_unmap_unlock()
    resides on the same page that was given by pte_offset_map_lock().

    The problem came in d16dfc55 ("mm: mmu_gather rework") where it introduced
    a "break;" from the while loop. This alone did not seem to easily trigger
    the bug. But the modifications made by e303297e6 caused that "break;" to
    be hit on the first iteration, before the pte++.

    The pte not being incremented will now cause pte_unmap_unlock(pte - 1) to
    be pointing to the previous page. This will cause the wrong page to be
    unmapped, and also trigger the warning above.

    The simple solution is to just save the pointer given by
    pte_offset_map_lock() and use it in the unlock.
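
    A sketch of the fix (the saved-pointer name is an assumption):

    pte_t *start_pte;
    pte_t *pte;

    start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
    pte = start_pte;
    do {
            [...]
    } while (pte++, addr += PAGE_SIZE, addr != end);
    /*
     * Unlock with the saved pointer; pte itself may have stopped early
     * on a "break" and may no longer sit on the mapped page.
     */
    pte_unmap_unlock(start_pte, ptl);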

    Signed-off-by: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     
  • Fix new kernel-doc warnings in mm/memory.c:

    Warning(mm/memory.c:1327): No description found for parameter 'tlb'
    Warning(mm/memory.c:1327): Excess function parameter 'tlbp' description in 'unmap_vmas'

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

27 May, 2011

2 commits

  • Two new stats in per-memcg memory.stat which tracks the number of page
    faults and number of major page faults.

    "pgfault"
    "pgmajfault"

    They are different from "pgpgin"/"pgpgout" stat which count number of
    pages charged/discharged to the cgroup and have no meaning of reading/
    writing page to disk.

    It is valuable to track the two stats for both measuring application's
    performance as well as the efficiency of the kernel page reclaim path.
    Counting pagefaults per process is useful, but we also need the aggregated
    value since processes are monitored and controlled in cgroup basis in
    memcg.
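
    A sketch of where the counters are bumped (the placement in
    handle_mm_fault() and in the major-fault paths is assumed from the
    description):

    /* in handle_mm_fault() */
    count_vm_event(PGFAULT);
    mem_cgroup_count_vm_event(mm, PGFAULT);

    /* in the major-fault paths, e.g. filemap_fault() */
    count_vm_event(PGMAJFAULT);
    mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);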

    Functional test: check the total number of pgfault/pgmajfault of all
    memcgs and compare with global vmstat value:

    $ cat /proc/vmstat | grep fault
    pgfault 1070751
    pgmajfault 553

    $ cat /dev/cgroup/memory.stat | grep fault
    pgfault 1071138
    pgmajfault 553
    total_pgfault 1071142
    total_pgmajfault 553

    $ cat /dev/cgroup/A/memory.stat | grep fault
    pgfault 199
    pgmajfault 0
    total_pgfault 199
    total_pgmajfault 0

    Performance test: run the page fault test (pft) with 16 threads
    faulting in 15G of anon pages in a 16G container. There is no
    regression noticed in the "flt/cpu/s" figure.

    Sample output from pft:

    TAG pft:anon-sys-default:
    Gb   Thr  CLine  User    System   Wall    flt/cpu/s   fault/wsec
    15   16   1      0.67s   233.41s  14.76s  16798.546   266356.260

    +---------------------------------------------------------------------+
        N         Min         Max       Median         Avg      Stddev
    x  10   16682.962   17344.027   16913.524   16928.812    166.5362
    +  10   16695.568   16923.896   16820.604   16824.652   84.816568
    No difference proven at 95.0% confidence

    [akpm@linux-foundation.org: fix build]
    [hughd@google.com: shmem fix]
    Signed-off-by: Ying Han
    Acked-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Cc: Daisuke Nishimura
    Acked-by: Balbir Singh
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • The type of vma->vm_flags is 'unsigned long', not 'int' nor
    'unsigned int'. This patch fixes such misuse.

    Signed-off-by: KOSAKI Motohiro
    [ Changed to use a typedef - we'll extend it to cover more cases
    later, since there has been discussion about making it a 64-bit
    type.. - Linus ]
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

25 May, 2011

7 commits

  • Some of these functions have grown beyond inline sanity, move them
    out-of-line.

    Signed-off-by: Peter Zijlstra
    Requested-by: Andrew Morton
    Requested-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Straightforward conversion of i_mmap_lock to a mutex.

    Signed-off-by: Peter Zijlstra
    Acked-by: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Hugh says:
    "The only significant loser, I think, would be page reclaim (when
    concurrent with truncation): could spin for a long time waiting for
    the i_mmap_mutex it expects would soon be dropped?"

    Counter points:
    - cpu contention makes the spin stop (need_resched())
    - zap pages should be freeing pages at a higher rate than reclaim
    ever can

    I think the simplification of the truncate code is definitely worth it.

    Effectively reverts: 2aa15890f3c ("mm: prevent concurrent
    unmap_mapping_range() on the same inode") and takes out the code that
    caused its problem.

    Signed-off-by: Peter Zijlstra
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Instead of using a single batch (the small on-stack, or an allocated
    page), try and extend the batch every time it runs out and only flush once
    either the extend fails or we're done.
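
    A sketch of the extension step (modeled on the description; the names
    are assumptions):

    static int tlb_next_batch(struct mmu_gather *tlb)
    {
            struct mmu_gather_batch *batch = tlb->active;

            /* Reuse a previously allocated batch if one is chained on. */
            if (batch->next) {
                    tlb->active = batch->next;
                    return 1;
            }
            /* Otherwise try to extend; on failure the caller flushes. */
            batch = (void *)__get_free_pages(GFP_NOWAIT | __GFP_NOWARN, 0);
            if (!batch)
                    return 0;
            batch->next = NULL;
            batch->nr   = 0;
            batch->max  = MAX_GATHER_BATCH;
            tlb->active->next = batch;
            tlb->active = batch;
            return 1;
    }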

    Signed-off-by: Peter Zijlstra
    Requested-by: Nick Piggin
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • In case other architectures require RCU freed page-tables to implement
    gup_fast() and software filled hashes and similar things, provide the
    means to do so by moving the logic into generic code.

    Signed-off-by: Peter Zijlstra
    Requested-by: David Miller
    Cc: Benjamin Herrenschmidt
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Rework the existing mmu_gather infrastructure.

    The direct purpose of these patches was to allow preemptible mmu_gather,
    but even without that I think these patches provide an improvement to the
    status quo.

    The first 9 patches rework the mmu_gather infrastructure. For review
    purpose I've split them into generic and per-arch patches with the last of
    those a generic cleanup.

    The next patch provides generic RCU page-table freeing, and the followup
    is a patch converting s390 to use this. I've also got 4 patches from
    DaveM lined up (not included in this series) that uses this to implement
    gup_fast() for sparc64.

    Then there is one patch that extends the generic mmu_gather batching.

    After that follow the mm preemptibility patches, these make part of the mm
    a lot more preemptible. It converts i_mmap_lock and anon_vma->lock to
    mutexes which together with the mmu_gather rework makes mmu_gather
    preemptible as well.

    Making i_mmap_lock a mutex also enables a clean-up of the truncate code.

    This also allows for preemptible mmu_notifiers, something that XPMEM I
    think wants.

    Furthermore, it removes the new and universally detested unmap_mutex.

    This patch:

    Remove the first obstacle towards a fully preemptible mmu_gather.

    The current scheme assumes mmu_gather is always done with preemption
    disabled and uses per-cpu storage for the page batches. Change this to
    try and allocate a page for batching and in case of failure, use a small
    on-stack array to make some progress.
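
    A sketch of that scheme (the field names are assumptions):

    struct mmu_gather {
            struct mm_struct *mm;
            unsigned int     nr;
            unsigned int     max;
            struct page      **pages;
            struct page      *local[8];     /* small on-stack fallback */
    };

    static inline void tlb_alloc_page(struct mmu_gather *tlb)
    {
            unsigned long addr;

            addr = __get_free_pages(GFP_NOWAIT | __GFP_NOWARN, 0);
            if (addr) {
                    tlb->pages = (void *)addr;
                    tlb->max = PAGE_SIZE / sizeof(struct page *);
            }
            /* else: keep tlb->pages == tlb->local and flush more often. */
    }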

    Preemptible mmu_gather is desired in general and usable once i_mmap_lock
    becomes a mutex. Doing it before the mutex conversion saves us from
    having to rework the code by moving the mmu_gather bits inside the
    pte_lock.

    Also avoid flushing the tlb batches from under the pte lock, this is
    useful even without the i_mmap_lock conversion as it significantly reduces
    pte lock hold times.

    [akpm@linux-foundation.org: fix comment tpyo]
    Signed-off-by: Peter Zijlstra
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Currently we have expand_upwards exported while expand_downwards is
    accessible only via expand_stack or expand_stack_downwards.

    check_stack_guard_page is a nice example of the asymmetry. It uses
    expand_stack for VM_GROWSDOWN while expand_upwards is called for the
    VM_GROWSUP case.

    Let's clean this up by exporting both functions and make those names
    consistent. Let's use expand_{upwards,downwards} because expanding
    doesn't always involve stack manipulation (an example is
    ia64_do_page_fault which uses expand_upwards for registers backing store
    expansion). expand_downwards has to be defined for both
    CONFIG_STACK_GROWS{UP,DOWN} because get_arg_page calls the downwards
    version in the early process initialization phase for growsup
    configuration.

    Signed-off-by: Michal Hocko
    Acked-by: Hugh Dickins
    Cc: James Bottomley
    Cc: "Luck, Tony"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

10 May, 2011

1 commit

  • The Linux kernel excludes the guard page when performing mlock on a VMA
    with a down-growing stack. However, some architectures have an
    up-growing stack and locking the guard page should be excluded in this
    case too.

    This patch fixes lvm2 on PA-RISC (and possibly other architectures with
    an up-growing stack). lvm2 calculates the number of used pages when
    locking and when unlocking and reports an internal error if the numbers
    mismatch.

    [ Patch changed fairly extensively to also fix /proc/<pid>/maps for the
    grows-up case, and to move things around a bit to clean it all up and
    share the infrastructure with the /proc bits.

    Tested on ia64 that has both grow-up and grow-down segments - Linus ]
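
    A simplified sketch of the now-symmetric guard-page test (upstream also
    checks the neighbouring vma; this reduction is an assumption):

    static inline int stack_guard_page(struct vm_area_struct *vma,
                                       unsigned long addr)
    {
            /* Grows-down stacks guard their first page ... */
            if ((vma->vm_flags & VM_GROWSDOWN) && vma->vm_start == addr)
                    return 1;
            /* ... grows-up stacks guard their last page. */
            if ((vma->vm_flags & VM_GROWSUP) &&
                vma->vm_end == addr + PAGE_SIZE)
                    return 1;
            return 0;
    }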

    Signed-off-by: Mikulas Patocka
    Tested-by: Tony Luck
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
     

05 May, 2011

1 commit

  • The logic in __get_user_pages() used to skip the stack guard page lookup
    whenever the caller wasn't interested in seeing what the actual page
    was. But Michel Lespinasse points out that there are cases where we
    don't care about the physical page itself (so 'pages' may be NULL), but
    do want to make sure a page is mapped into the virtual address space.

    So using the existence of the "pages" array as an indication of whether
    to look up the guard page or not isn't actually so great, and we really
    should just use the FOLL_MLOCK bit. But because that bit was only set
    for the VM_LOCKED case (and not all vma's necessarily have it, even for
    mlock()), we couldn't do that originally.

    Fix that by moving the VM_LOCKED check deeper into the call-chain, which
    actually simplifies many things. Now mlock() gets simpler, and we can
    also check for FOLL_MLOCK in __get_user_pages() and the code ends up
    much more straightforward.

    Reported-and-reviewed-by: Michel Lespinasse
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

29 Apr, 2011

1 commit

  • With transparent hugepage support, handle_mm_fault() has to be careful
    that a normal PMD has been established before handling a PTE fault. To
    achieve this, it used __pte_alloc() directly instead of pte_alloc_map as
    pte_alloc_map is unsafe to run against a huge PMD. pte_offset_map() is
    called once it is known the PMD is safe.

    pte_alloc_map() is smart enough to check if a PTE is already present
    before calling __pte_alloc, but this check was lost. As a consequence,
    PTEs may be allocated unnecessarily and the page table lock taken. This
    useless PTE does get cleaned up, but it's a performance hit which is
    visible in page_test from aim9.

    This patch simply re-adds the check normally done by pte_alloc_map to
    check if the PTE needs to be allocated before taking the page table
    lock. The effect is noticeable in page_test from aim9.

    AIM9
                  2.6.38-vanilla    2.6.38-checkptenone
    creat-clo    446.10 ( 0.00%)       424.47 (-5.10%)
    page_test     38.10 ( 0.00%)        42.04 ( 9.37%)
    brk_test      52.45 ( 0.00%)        51.57 (-1.71%)
    exec_test    382.00 ( 0.00%)       456.90 (16.39%)
    fork_test     60.11 ( 0.00%)        67.79 (11.34%)
    MMTests Statistics: duration
    Total Elapsed Time (seconds)   611.90   612.22
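
    A sketch of the re-added check in handle_mm_fault() (the placement is
    assumed from the description):

    /*
     * Only allocate a PTE page when the PMD is genuinely empty,
     * mirroring the check pte_alloc_map() used to do.
     */
    if (unlikely(pmd_none(*pmd)) &&
        unlikely(__pte_alloc(mm, vma, pmd, address)))
            return VM_FAULT_OOM;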

    (While this affects 2.6.38, it is a performance rather than a functional
    bug and normally outside the rules for -stable. While the big
    performance differences are in a microbench, the difference in fork and
    exec performance may be significant enough that -stable wants to
    consider the patch.)

    Reported-by: Raz Ben Yehuda
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Reviewed-by: Andrea Arcangeli
    Reviewed-by: Minchan Kim
    Acked-by: Johannes Weiner
    Cc: [2.6.38.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

15 Apr, 2011

1 commit

  • In __access_remote_vm() we need to check that we have found the right
    vma, not the following vma, before we try to access it. Otherwise we
    might call the vma's access routine with an address which does not fall
    inside the vma.
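
    The fix boils down to one extra condition in the fallback path (a
    sketch; the surrounding loop is omitted):

    vma = find_vma(mm, addr);
    if (!vma || vma->vm_start > addr)
            break;  /* find_vma() gave us the *following* vma */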

    It was discovered on a current kernel but with an unreleased driver;
    from memory it was strace leading to a kernel bad access, but it
    obviously depends on what the access implementation does.

    Looking at other access implementations I only see:

    $ git grep -A 5 vm_operations|grep access
    arch/powerpc/platforms/cell/spufs/file.c- .access = spufs_mem_mmap_access,
    arch/x86/pci/i386.c- .access = generic_access_phys,
    drivers/char/mem.c- .access = generic_access_phys
    fs/sysfs/bin.c- .access = bin_access,

    The spufs one looks like it might behave badly given the wrong vma, it
    assumes vma->vm_file->private_data is a spu_context, and looks like it
    would probably blow up pretty quickly if it wasn't.

    generic_access_phys() only uses the vma to check vm_flags and get the
    mm, and then walks page tables using the address. So it should bail on
    the vm_flags check, or at worst let you access some other VM_IO mapping.

    And bin_access() just proxies to another access implementation.

    Signed-off-by: Michael Ellerman
    Reviewed-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Ellerman
     

13 Apr, 2011

1 commit

  • Commit 53a7706d5ed8 ("mlock: do not hold mmap_sem for extended periods
    of time") changed mlock() to care about the exact number of pages that
    __get_user_pages() had brought it. Before, it would only care about
    errors.

    And that doesn't work, because we also handled one page specially in
    __mlock_vma_pages_range(), namely the stack guard page. So when that
    case was handled, the number of pages that the function returned was off
    by one. In particular, it could be zero, and then the caller would end
    up not making any progress at all.

    Rather than try to fix up that off-by-one error for the mlock case
    specially, this just moves the logic to handle the stack guard page
    into __get_user_pages() itself, thus making all the counts come out
    right automatically.

    Reported-by: Robert Święcki
    Cc: Hugh Dickins
    Cc: Oleg Nesterov
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

28 Mar, 2011

1 commit

  • Fix mm/memory.c incorrect kernel-doc function notation:

    Warning(mm/memory.c:3718): Cannot understand * @access_remote_vm - access another process' address space
    on line 3718 - I thought it was a doc line

    Signed-off-by: Randy Dunlap
    Signed-off-by: Linus Torvalds

    Randy Dunlap