13 Oct, 2014

1 commit

  • Pull RCU updates from Ingo Molnar:
    "The main changes in this cycle were:

    - changes related to No-CBs CPUs and NO_HZ_FULL

    - RCU-tasks implementation

    - torture-test updates

    - miscellaneous fixes

    - locktorture updates

    - RCU documentation updates"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (81 commits)
    workqueue: Use cond_resched_rcu_qs macro
    workqueue: Add quiescent state between work items
    locktorture: Cleanup header usage
    locktorture: Cannot hold read and write lock
    locktorture: Fix __acquire annotation for spinlock irq
    locktorture: Support rwlocks
    rcu: Eliminate deadlock between CPU hotplug and expedited grace periods
    locktorture: Document boot/module parameters
    rcutorture: Rename rcutorture_runnable parameter
    locktorture: Add test scenario for rwsem_lock
    locktorture: Add test scenario for mutex_lock
    locktorture: Make torture scripting account for new _runnable name
    locktorture: Introduce torture context
    locktorture: Support rwsems
    locktorture: Add infrastructure for torturing read locks
    torture: Address race in module cleanup
    locktorture: Make statistics generic
    locktorture: Teach about lock debugging
    locktorture: Support mutexes
    locktorture: Add documentation
    ...

    Linus Torvalds
     

10 Oct, 2014

2 commits

  • Dump the contents of the relevant mm_struct when we hit the bug condition.

    Signed-off-by: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     
  • Trivially convert a few VM_BUG_ON calls to VM_BUG_ON_VMA to extract
    more information when they trigger.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Sasha Levin
    Reviewed-by: Naoya Horiguchi
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Vlastimil Babka
    Cc: Michel Lespinasse
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

08 Sep, 2014

1 commit

  • RCU-tasks requires the occasional voluntary context switch
    from CPU-bound in-kernel tasks. In some cases, this requires
    instrumenting cond_resched(). However, there is some reluctance
    to countenance unconditionally instrumenting cond_resched() (see
    http://lwn.net/Articles/603252/), so this commit creates a separate
    cond_resched_rcu_qs() that may be used in place of cond_resched() in
    locations prone to long-duration in-kernel looping.

    This commit currently instruments only RCU-tasks. Future possibilities
    include also instrumenting RCU, RCU-bh, and RCU-sched in order to reduce
    IPI usage.
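
    A minimal sketch of how a CPU-bound kernel loop might use the new helper
    (the item type and the per-item work here are hypothetical):

        static void process_items(struct my_item *items, unsigned long n)
        {
                unsigned long i;

                for (i = 0; i < n; i++) {
                        handle_one_item(&items[i]);     /* hypothetical work */
                        /* yield if needed and report an RCU-tasks quiescent state */
                        cond_resched_rcu_qs();
                }
        }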

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

07 Aug, 2014

1 commit

  • Add a comment describing the circumstances in which
    __lock_page_or_retry() will or will not release the mmap_sem when
    returning 0.

    Add comments to lock_page_or_retry()'s callers (filemap_fault(),
    do_swap_page()) noting the impact on VM_FAULT_RETRY returns.

    Add comments further up the call tree, particularly replacing the false "We
    return with mmap_sem still held" comments.

    Signed-off-by: Paul Cassella
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Cassella
     

08 Apr, 2014

1 commit

  • A BUG_ON(!PageLocked) was triggered in mlock_vma_page() by Sasha Levin
    fuzzing with trinity. The call site try_to_unmap_cluster() does not lock
    the pages other than its check_page parameter (which is already locked).

    The BUG_ON in mlock_vma_page() is not documented and its purpose is
    somewhat unclear, but apparently it serializes against page migration,
    which could otherwise fail to transfer the PG_mlocked flag. This would
    not be fatal, as the page would be eventually encountered again, but
    NR_MLOCK accounting would become distorted nevertheless. This patch adds
    a comment to the BUG_ON in mlock_vma_page() and munlock_vma_page() to that
    effect.

    The call site try_to_unmap_cluster() is fixed so that for page !=
    check_page, trylock_page() is attempted (to avoid possible deadlocks as we
    already have check_page locked) and mlock_vma_page() is performed only
    upon success. If the page lock cannot be obtained, the page is left
    without PG_mlocked, which is again not a problem in the whole unevictable
    memory design.
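
    A rough sketch of the fixed pattern at the try_to_unmap_cluster() call
    site, assuming check_page is already locked by the caller:

        if (page == check_page) {
                /* already locked by our caller */
                mlock_vma_page(page);
        } else if (trylock_page(page)) {
                /* never block on a second page lock: avoids deadlocks */
                mlock_vma_page(page);
                unlock_page(page);
        }
        /* if trylock_page() fails, the page is simply left without PG_mlocked */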

    Signed-off-by: Vlastimil Babka
    Signed-off-by: Bob Liu
    Reported-by: Sasha Levin
    Cc: Wanpeng Li
    Cc: Michel Lespinasse
    Cc: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

24 Jan, 2014

2 commits

  • Most of the VM_BUG_ON assertions are performed on a page. Usually, when
    one of these assertions fails we'll get a BUG_ON with a call stack and
    the registers.

    Based on recent requests to add a small piece of code that dumps the page
    at various VM_BUG_ON sites, I've noticed that the page dump is quite
    useful to people debugging issues in mm.

    This patch adds a VM_BUG_ON_PAGE(cond, page) which, beyond doing what
    VM_BUG_ON() does, also dumps the page before executing the actual
    BUG_ON.
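
    A sketch of what such a macro can look like (the exact name and arguments
    of the dump helper are assumptions here, not taken from the patch):

        #ifdef CONFIG_DEBUG_VM
        #define VM_BUG_ON_PAGE(cond, page)                              \
                do {                                                    \
                        if (unlikely(cond)) {                           \
                                dump_page(page); /* print struct page */\
                                BUG();                                  \
                        }                                               \
                } while (0)
        #else
        #define VM_BUG_ON_PAGE(cond, page) BUILD_BUG_ON_INVALID(cond)
        #endif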

    [akpm@linux-foundation.org: fix up includes]
    Signed-off-by: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     
  • Since commit ff6a6da60b89 ("mm: accelerate munlock() treatment of THP
    pages") munlock skips tail pages of a munlocked THP page. There is some
    attempt to prevent bad consequences of racing with a THP page split, but
    code inspection indicates that there are two problems that may lead to a
    non-fatal, yet wrong outcome.

    First, __split_huge_page_refcount() copies flags, including PageMlocked,
    from the head page to the tail pages. Clearing PageMlocked by
    munlock_vma_page() in the middle of this operation might leave some of
    the tail pages with the PageMlocked flag set. As the head page still
    appears to be a THP page until all tail pages are processed,
    munlock_vma_page() might think it munlocked the whole THP page and skip
    all the former tail pages. Before ff6a6da60, those pages would be
    cleared in further iterations of munlock_vma_pages_range(), but NR_MLOCK
    would still become undercounted (related to the next point).

    Second, NR_MLOCK accounting is based on a call to hpage_nr_pages() after
    the PageMlocked flag is cleared. The accounting might also become
    inconsistent due to a race with __split_huge_page_refcount():

    - undercount when HUGE_PMD_NR is subtracted, but some tail pages are
    left with PageMlocked set and counted again (only possible before
    ff6a6da60)

    - overcount when hpage_nr_pages() sees a normal page (split has already
    finished), but the parallel split has meanwhile cleared PageMlocked from
    additional tail pages

    This patch prevents both problems via extending the scope of lru_lock in
    munlock_vma_page(). This is convenient because:

    - __split_huge_page_refcount() takes lru_lock for its whole operation

    - munlock_vma_page() typically takes lru_lock anyway for page isolation

    As this becomes a second function where page isolation is done with
    lru_lock already held, factor this out into a new
    __munlock_isolate_lru_page() function and clean up the surrounding code.
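
    A sketch of such a factored-out helper, assuming it is only ever called
    with lru_lock already held:

        static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
        {
                if (PageLRU(page)) {
                        struct lruvec *lruvec;

                        lruvec = mem_cgroup_page_lruvec(page, page_zone(page));
                        if (getpage)
                                get_page(page);
                        ClearPageLRU(page);
                        del_page_from_lru_list(page, lruvec, page_lru(page));
                        return true;
                }
                return false;
        }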

    [akpm@linux-foundation.org: avoid a coding-style ugly]
    Signed-off-by: Vlastimil Babka
    Cc: Sasha Levin
    Cc: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

22 Jan, 2014

1 commit

  • All mlock related syscalls prepare lock limits, lengths and start
    parameters with the mmap_sem held. Move this logic outside of the
    critical region. For the case of mlock, continue incrementing the
    amount already locked by mm->locked_vm with the rwsem taken.
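
    A sketch of the resulting ordering in the mlock() syscall (declarations
    and error handling omitted):

        len = PAGE_ALIGN(len + (start & ~PAGE_MASK));
        start &= PAGE_MASK;
        lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
        locked = len >> PAGE_SHIFT;

        down_write(&current->mm->mmap_sem);
        locked += current->mm->locked_vm;       /* still needs the rwsem */
        if ((locked <= lock_limit) || capable(CAP_IPC_LOCK))
                error = do_mlock(start, len, 1);
        up_write(&current->mm->mmap_sem);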

    Signed-off-by: Davidlohr Bueso
    Cc: Rik van Riel
    Reviewed-by: Michel Lespinasse
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

03 Jan, 2014

2 commits

  • Commit 7225522bb429 ("mm: munlock: batch non-THP page isolation and
    munlock+putback using pagevec") introduced __munlock_pagevec() to speed
    up munlock by holding lru_lock over multiple isolated pages. Pages that
    fail to be isolated are put_page()d immediately, also within the lock.

    This can lead to deadlock when __munlock_pagevec() becomes the holder of
    the last page pin and put_page() leads to __page_cache_release() which
    also locks lru_lock. The deadlock has been observed by Sasha Levin
    using trinity.

    This patch avoids the deadlock by deferring put_page() operations until
    lru_lock is released. Another pagevec (which is also used by later
    phases of the function) is reused to gather the pages for the put_page()
    operation.
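
    A rough sketch of the deferred-release pattern, assuming a second
    on-stack pagevec named pvec_putback; munlock_try_isolate() is an
    illustrative stand-in for the existing PageMlocked-clearing and LRU
    isolation code:

        struct pagevec pvec_putback;
        int i, nr = pagevec_count(pvec);

        pagevec_init(&pvec_putback, 0);

        spin_lock_irq(&zone->lru_lock);
        for (i = 0; i < nr; i++) {
                struct page *page = pvec->pages[i];

                if (!munlock_try_isolate(page)) {
                        /* defer put_page() until lru_lock is dropped */
                        pagevec_add(&pvec_putback, page);
                        pvec->pages[i] = NULL;
                }
        }
        spin_unlock_irq(&zone->lru_lock);

        /* safe now: dropping the last page pin may itself take lru_lock */
        pagevec_release(&pvec_putback);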

    Signed-off-by: Vlastimil Babka
    Reported-by: Sasha Levin
    Cc: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Since commit ff6a6da60b89 ("mm: accelerate munlock() treatment of THP
    pages") munlock skips tail pages of a munlocked THP page. However, when
    the head page already has PageMlocked unset, it will not skip the tail
    pages.

    Commit 7225522bb429 ("mm: munlock: batch non-THP page isolation and
    munlock+putback using pagevec") has added a PageTransHuge() check which
    contains VM_BUG_ON(PageTail(page)). Sasha Levin found this triggered
    using trinity, on the first tail page of a THP page without PageMlocked
    flag.

    This patch fixes the issue by skipping tail pages also in the case when
    the PageMlocked flag is unset. There is still a possibility of a race
    with a THP page split between clearing PageMlocked and determining how
    many pages to skip. The race might result in former tail pages not being
    skipped, which is however no longer a bug, as during the skip the
    PageTail flags are cleared.

    However, this race also affects the correctness of NR_MLOCK accounting,
    which is to be fixed in a separate patch.

    Signed-off-by: Vlastimil Babka
    Reported-by: Sasha Levin
    Cc: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Bob Liu
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

01 Oct, 2013

1 commit

  • The function __munlock_pagevec_fill() introduced in commit 7a8010cd3627
    ("mm: munlock: manual pte walk in fast path instead of
    follow_page_mask()") uses pmd_addr_end() for restricting its operation
    within current page table.

    This is insufficient on architectures/configurations where pmd is folded
    and pmd_addr_end() just returns the end of the full range to be walked.
    In this case, it allows pte++ to walk off the end of a page table
    resulting in unpredictable behaviour.

    This patch fixes the function by using pgd_addr_end() and pud_addr_end()
    before pmd_addr_end(), which will yield correct page table boundary on
    all configurations. This is similar to what existing page walkers do
    when walking each level of the page table.

    Additionally, the patch clarifies a comment for the get_locked_pte() call
    in the function.
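
    The corrected boundary computation looks roughly like this; clamping at
    every level keeps the pte walk inside a single page table even when
    intermediate levels are folded:

        /* never walk past the page table containing 'start' */
        end = pgd_addr_end(start, end);
        end = pud_addr_end(start, end);
        end = pmd_addr_end(start, end);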

    Signed-off-by: Vlastimil Babka
    Reported-by: Fengguang Wu
    Reviewed-by: Bob Liu
    Cc: Jörn Engel
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

25 Sep, 2013

1 commit

  • There is a loop in do_mlockall() that lacks a preemption point, which
    means that the following can happen on non-preemptible builds of the
    kernel. Dave Jones reports:

    "My fuzz tester keeps hitting this. Every instance shows the non-irq
    stack came in from mlockall. I'm only seeing this on one box, but
    that has more ram (8gb) than my other machines, which might explain
    it.

    INFO: rcu_preempt self-detected stall on CPU { 3} (t=6500 jiffies g=470344 c=470343 q=0)
    sending NMI to all CPUs:
    NMI backtrace for cpu 3
    CPU: 3 PID: 29664 Comm: trinity-child2 Not tainted 3.11.0-rc1+ #32
    Call Trace:
    lru_add_drain_all+0x15/0x20
    SyS_mlockall+0xa5/0x1a0
    tracesys+0xdd/0xe2"

    This commit addresses this problem by inserting the required preemption
    point.
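
    The fix amounts to a voluntary preemption point in do_mlockall()'s vma
    loop, roughly:

        for (vma = current->mm->mmap; vma; vma = prev->vm_next) {
                vm_flags_t newflags;

                newflags = vma->vm_flags & ~VM_LOCKED;
                if (flags & MCL_CURRENT)
                        newflags |= VM_LOCKED;

                /* Ignore errors */
                mlock_fixup(vma, &prev, vma->vm_start, vma->vm_end, newflags);
                cond_resched();         /* the added preemption point */
        }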

    Reported-by: Dave Jones
    Signed-off-by: Paul E. McKenney
    Cc: KOSAKI Motohiro
    Cc: Michel Lespinasse
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul E. McKenney
     

12 Sep, 2013

6 commits

  • Currently munlock_vma_pages_range() calls follow_page_mask() to obtain
    each individual struct page. This entails repeated full page table
    translations and the page table lock being taken for each page separately.

    This patch avoids the costly follow_page_mask() where possible, by
    iterating over ptes within a single pmd under a single page table lock.
    The first pte is obtained by get_locked_pte() for the non-THP page
    acquired by the initial follow_page_mask(). The rest of the on-stack
    pagevec for munlock is filled up using the pte walk, as long as
    pte_present() and vm_normal_page() are sufficient to obtain the struct
    page.

    After this patch, a 14% speedup was measured for munlocking a 56GB large
    memory area with THP disabled.
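
    A rough sketch of the fill loop, assuming 'start' and 'end' have already
    been clamped to a single page table and 'pvec' is the on-stack munlock
    pagevec (the first page was added by the caller):

        pte_t *pte;
        spinlock_t *ptl;

        pte = get_locked_pte(vma->vm_mm, start, &ptl);
        while (start + PAGE_SIZE < end) {
                struct page *page = NULL;

                pte++;
                start += PAGE_SIZE;
                if (pte_present(*pte))
                        page = vm_normal_page(vma, start, *pte);
                if (!page || PageTransCompound(page))
                        break;          /* fall back to follow_page_mask() */
                if (!pagevec_add(pvec, page))
                        break;          /* pagevec is full */
        }
        pte_unmap_unlock(pte, ptl);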

    Signed-off-by: Vlastimil Babka
    Cc: Jörn Engel
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The performance of the fast path in munlock_vma_pages_range() can be
    further improved by avoiding the atomic operations of a redundant
    get_page()/put_page() pair.

    When calling get_page() during page isolation, we already have the pin
    from follow_page_mask(). This pin will then be returned by
    __pagevec_lru_add(), after which we do not reference the pages anymore.

    After this patch, an 8% speedup was measured for munlocking a 56GB large
    memory area with THP disabled.

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Jörn Engel
    Acked-by: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • After introducing batching by pagevecs into munlock_vma_pages_range(), we
    can further improve performance by bypassing the copying into the per-cpu
    pagevec and the get_page/put_page pair associated with that. Instead we
    perform the LRU putback directly from our pagevec. However, this is
    possible only for single-mapped pages that are evictable after munlock.
    Unevictable pages require rechecking after putting them on the
    unevictable list, so for those we fall back to putback_lru_page(), which
    handles that.

    After this patch, a 13% speedup was measured for munlocking a 56GB large
    memory area with THP disabled.

    [akpm@linux-foundation.org: clarify comment]
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Jörn Engel
    Acked-by: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Depending on the previous patch, which introduced batched isolation in
    munlock_vma_pages_range(), we can also batch the updates of the NR_MLOCK
    page stats. After the whole pagevec is processed for page isolation, the
    stats are updated only once, with the number of successful isolations.
    There were however no measurable performance gains.

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Jörn Engel
    Acked-by: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Currently, munlock_vma_pages_range() calls munlock_vma_page() on each
    page in a loop, which results in repeated taking and releasing of the
    lru_lock spinlock for isolating pages one by one. This patch batches the
    munlock operations using an on-stack pagevec, so that isolation is done
    under a single lru_lock. For THP pages, the old behavior is preserved as
    they might be split while putting them into the pagevec. After this
    patch, a 9% speedup was measured for munlocking a 56GB large memory area
    with THP disabled.

    A new function __munlock_pagevec() is introduced that takes a pagevec
    and:

    1) clears PageMlocked and isolates all pages under lru_lock. Zone page
       stats can also be updated using the variant which assumes disabled
       interrupts.

    2) finishes the munlock and lru putback on all pages under their
       lock_page. Note that previously, lock_page also covered the
       PageMlocked clearing and page isolation, but it is not needed for
       those operations.
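
    In outline, the two phases look something like this (a sketch; helper
    and variable names are illustrative, not necessarily the ones used in
    the patch):

        int i, nr = pagevec_count(pvec);

        /* Phase 1: clear PageMlocked and isolate, under a single lru_lock */
        spin_lock_irq(&zone->lru_lock);
        for (i = 0; i < nr; i++) {
                struct page *page = pvec->pages[i];

                if (TestClearPageMlocked(page))
                        __mod_zone_page_state(zone, NR_MLOCK, -1);
                /* ... isolation from the LRU list elided ... */
        }
        spin_unlock_irq(&zone->lru_lock);

        /* Phase 2: finish munlock and LRU putback under each page's lock */
        for (i = 0; i < nr; i++) {
                struct page *page = pvec->pages[i];

                lock_page(page);
                __munlock_isolated_page(page);  /* illustrative helper */
                unlock_page(page);
                put_page(page);
        }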

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Jörn Engel
    Acked-by: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • In munlock_vma_pages_range(), lru_add_drain() is currently called in a
    loop before each munlock_vma_page() call.

    This is suboptimal for performance when munlocking many pages. The
    benefits of per-cpu pagevec for batching the LRU putback are removed since
    the pagevec only holds at most one page from the previous loop's
    iteration.

    The lru_add_drain() call also does not serve any purpose for correctness
    - it does not even drain the pagevecs of all CPUs. The munlock code
    already expects and handles situations where a page cannot be isolated
    from the LRU (e.g. because it is on some per-cpu pagevec).

    The history of the (uncommented) call also suggests that it appeared
    there as an oversight rather than intentionally. Before commit ff6a6da6
    ("mm: accelerate munlock() treatment of THP pages") the call happened
    only once upon entering the function. That commit moved the call into
    the while loop. So while the other changes in the commit improved
    munlock performance for THP pages, it introduced the above-mentioned
    suboptimal per-cpu pagevec usage.

    Further in history, before commit 408e82b7 ("mm: munlock use
    follow_page"), munlock_vma_pages_range() was just a wrapper around
    __mlock_vma_pages_range which performed both mlock and munlock depending
    on a flag. However, before ba470de4 ("mmap: handle mlocked pages during
    map, remap, unmap") the function handled only mlock, not munlock. The
    lru_add_drain call thus comes from the implementation in commit b291f000
    ("mlock: mlocked pages are unevictable") and was intended only for
    mlocking, not munlocking. The original intention of draining the LRU
    pagevec at mlock time was to ensure the pages were on the LRU before the
    lock operation so that they could be placed on the unevictable list
    immediately. There is very little motivation to do the same in the
    munlock path, particularly not for every single page.

    This patch therefore removes the call completely. After removing the
    call, a 10% speedup was measured for munlock() of a 56GB large memory area
    with THP disabled.

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Jörn Engel
    Acked-by: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

29 Mar, 2013

1 commit

  • This reverts commit 186930500985 ("mm: introduce VM_POPULATE flag to
    better deal with racy userspace programs").

    VM_POPULATE only has any effect when userspace plays racy games with
    vmas by trying to unmap and remap memory regions that mmap or mlock are
    operating on.

    Also, the only effect of VM_POPULATE when userspace plays such games is
    that it avoids populating new memory regions that get remapped into the
    address range that was being operated on by the original mmap or mlock
    calls.

    Let's remove VM_POPULATE as there isn't any strong argument to mandate a
    new vm_flag.

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

28 Feb, 2013

1 commit

  • munlock_vma_pages_range() was always incrementing addresses by PAGE_SIZE
    at a time. When munlocking THP pages (or the huge zero page), this
    resulted in taking the mm->page_table_lock 512 times in a row.

    We can do better by making use of the page_mask returned by
    follow_page_mask (for the huge zero page case), or the size of the page
    munlock_vma_page() operated on (for the true THP page case).
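
    The core of the resulting loop in munlock_vma_pages_range() then advances
    by the full extent of the page just handled, roughly (simplified; the
    lru_add_drain() call and error handling are omitted):

        page = follow_page_mask(vma, start, FOLL_GET | FOLL_DUMP, &page_mask);
        if (page && !IS_ERR(page)) {
                lock_page(page);
                /* reports a mask covering the (possibly huge) page handled */
                page_mask = munlock_vma_page(page);
                unlock_page(page);
                put_page(page);
        }
        page_increm = 1 + (~(start >> PAGE_SHIFT) & page_mask);
        start += page_increm * PAGE_SIZE;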

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

24 Feb, 2013

5 commits

  • Use long type for page counts in mm_populate() so as to avoid integer
    overflow when running the following test code:

    int main(void)
    {
            void *p = mmap(NULL, 0x100000000000, PROT_READ,
                           MAP_PRIVATE | MAP_ANON, -1, 0);
            printf("p: %p\n", p);
            mlockall(MCL_CURRENT);
            printf("done\n");
            return 0;
    }

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • The fact that mlock calls get_user_pages, and get_user_pages might call
    mlock when expanding a stack, looks like a potential recursion.

    However, mlock makes sure the requested range is already contained
    within a vma, so no stack expansion will actually happen from mlock.

    Should this ever change: the stack expansion mlocks only the newly
    expanded range and so will not result in recursive expansion.

    Signed-off-by: Johannes Weiner
    Reported-by: Al Viro
    Cc: Hugh Dickins
    Acked-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The mm_populate() code populates user mappings without constantly
    holding the mmap_sem. This makes it susceptible to racy userspace
    programs: the user mappings may change while mm_populate() is running,
    and in this case mm_populate() may end up populating the new mapping
    instead of the old one.

    In order to reduce the possibility of userspace getting surprised by
    this behavior, this change introduces the VM_POPULATE vma flag which
    gets set on vmas we want mm_populate() to work on. This way
    mm_populate() may still end up populating the new mapping after such a
    race, but only if the new mapping is also one that the user has
    requested (using MAP_SHARED, MAP_LOCKED or mlock) to be populated.

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Tested-by: Andy Lutomirski
    Cc: Greg Ungerer
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • In find_extend_vma(), we don't need mlock_vma_pages_range() to verify
    the vma type - we know we're working with a stack. So, we can call
    directly into __mlock_vma_pages_range(), and remove the last
    make_pages_present() call site.

    Note that we don't use mm_populate() here, so we can't release the
    mmap_sem while allocating new stack pages. This is deemed acceptable,
    because the stack vmas grow by a bounded number of pages at a time, and
    these are anon pages so we don't have to read from disk to populate
    them.

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Tested-by: Andy Lutomirski
    Cc: Greg Ungerer
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • When creating new mappings using the MAP_POPULATE / MAP_LOCKED flags (or
    with MCL_FUTURE in effect), we want to populate the pages within the
    newly created vmas. This may take a while as we may have to read pages
    from disk, so ideally we want to do this outside of the write-locked
    mmap_sem region.

    This change introduces mm_populate(), which is used to defer populating
    such mappings until after the mmap_sem write lock has been released.
    This is implemented as a generalization of the former do_mlock_pages(),
    which accomplished the same task but was used only during mlock() /
    mlockall().
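
    A sketch of the intended calling pattern, for instance in the common
    mmap path (do_mmap_pgoff() reports back how much to populate):

        down_write(&mm->mmap_sem);
        addr = do_mmap_pgoff(file, addr, len, prot, flags, pgoff, &populate);
        up_write(&mm->mmap_sem);

        if (populate)
                mm_populate(addr, populate);    /* may sleep, fault in pages */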

    Signed-off-by: Michel Lespinasse
    Reported-by: Andy Lutomirski
    Acked-by: Rik van Riel
    Tested-by: Andy Lutomirski
    Cc: Greg Ungerer
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

13 Feb, 2013

1 commit

  • With commit 8e72033f2a48 ("thp: make MADV_HUGEPAGE check for
    mm->def_flags") the VM_NOHUGEPAGE flag may be set on s390 in
    mm->def_flags for certain processes, to prevent future thp mappings.
    This would be overwritten by do_mlockall(), which sets it back to 0 with
    an optional VM_LOCKED flag set.

    To fix this, instead of overwriting mm->def_flags in do_mlockall(), only
    the VM_LOCKED flag should be set or cleared.
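
    A sketch of the corrected logic in do_mlockall(), which only touches the
    VM_LOCKED bit and leaves flags such as VM_NOHUGEPAGE alone:

        current->mm->def_flags &= ~VM_LOCKED;
        if (flags & MCL_FUTURE)
                current->mm->def_flags |= VM_LOCKED;

        if (flags == MCL_FUTURE)
                goto out;
        /* MCL_CURRENT: walk and mlock_fixup() the existing vmas as before */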

    Signed-off-by: Gerald Schaefer
    Reported-by: Vivek Goyal
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     

09 Oct, 2012

3 commits

  • NR_MLOCK is only accounted in single page units: there's no logic to
    handle transparent hugepages. This patch checks the appropriate number of
    pages to adjust the statistics by so that the correct amount of memory is
    reflected.

    Currently:

    $ grep Mlocked /proc/meminfo
    Mlocked: 19636 kB

    #define MAP_SIZE (4 << 30)  /* 4GB */

    void *ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
    mlock(ptr, MAP_SIZE);

    $ grep Mlocked /proc/meminfo
    Mlocked: 29844 kB

    munlock(ptr, MAP_SIZE);

    $ grep Mlocked /proc/meminfo
    Mlocked: 19636 kB

    And with this patch:

    $ grep Mlock /proc/meminfo
    Mlocked: 19636 kB

    mlock(ptr, MAP_SIZE);

    $ grep Mlock /proc/meminfo
    Mlocked: 4213664 kB

    munlock(ptr, MAP_SIZE);

    $ grep Mlock /proc/meminfo
    Mlocked: 19636 kB
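
    In code, the accounting change amounts to adjusting NR_MLOCK by the
    compound page size instead of 1, along these lines (a sketch, not the
    patch itself):

        /* e.g. when a page becomes mlocked */
        if (!TestSetPageMlocked(page))
                mod_zone_page_state(page_zone(page), NR_MLOCK,
                                    hpage_nr_pages(page));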

    Signed-off-by: David Rientjes
    Reported-by: Hugh Dickins
    Acked-by: Hugh Dickins
    Reviewed-by: Andrea Arcangeli
    Cc: Naoya Horiguchi
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Reviewed-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • We had thought that pages could no longer get freed while still marked as
    mlocked; but Johannes Weiner posted this program to demonstrate that
    truncating an mlocked private file mapping containing COWed pages is still
    mishandled:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/mman.h>

    int main(void)
    {
            char *map;
            int fd;

            system("grep mlockfreed /proc/vmstat");
            fd = open("chigurh", O_CREAT|O_EXCL|O_RDWR);
            unlink("chigurh");
            ftruncate(fd, 4096);
            map = mmap(NULL, 4096, PROT_WRITE, MAP_PRIVATE, fd, 0);
            map[0] = 11;
            mlock(map, sizeof(fd));
            ftruncate(fd, 0);
            close(fd);
            munlock(map, sizeof(fd));
            munmap(map, 4096);
            system("grep mlockfreed /proc/vmstat");
            return 0;
    }

    The anon COWed pages are not caught by truncation's clear_page_mlock() of
    the pagecache pages; but unmap_mapping_range() unmaps them, so we ought to
    look out for them there in page_remove_rmap(). Indeed, why should
    truncation or invalidation be doing the clear_page_mlock() when removing
    from pagecache? mlock is a property of mapping in userspace, not a
    property of pagecache: an mlocked unmapped page is nonsensical.

    Reported-by: Johannes Weiner
    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Michel Lespinasse
    Cc: Ying Han
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • A long time ago, in v2.4, VM_RESERVED kept the swapout process off the
    VMA; it has since lost its original meaning but still has some effects:

     |         effect         | alternative flags
    -+------------------------+---------------------------------------------
    1| account as reserved_vm | VM_IO
    2| skip in core dump      | VM_IO, VM_DONTDUMP
    3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
    4| do not mlock           | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP

    This patch removes the reserved_vm counter from mm_struct. Seems like
    nobody cares about it; it is not exported to userspace directly and only
    reduces the total_vm shown in proc.

    Thus VM_RESERVED can be replaced with VM_IO or with the pair
    VM_DONTEXPAND | VM_DONTDUMP.

    remap_pfn_range() and io_remap_pfn_range() set VM_IO | VM_DONTEXPAND | VM_DONTDUMP.
    remap_vmalloc_range() sets VM_DONTEXPAND | VM_DONTDUMP.

    [akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

07 Mar, 2012

1 commit

  • Several users of "find_vma_prev()" were not in fact interested in the
    previous vma if there was no primary vma to be found either. And in
    those cases, we're much better off just using the regular "find_vma()",
    and then "prev" can be looked up by just checking vma->vm_prev.

    The find_vma_prev() semantics are fairly subtle (see Mikulas' recent
    commit 83cd904d271b: "mm: fix find_vma_prev"), and the whole "return
    prev by reference" means that it generates worse code too.

    Thus this "let's avoid using this inconvenient and clearly too subtle
    interface when we don't really have to" patch.
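
    A sketch of the replacement pattern at such call sites:

        vma = find_vma(mm, addr);
        if (!vma)
                goto out;               /* handle the not-found case as before */
        prev = vma->vm_prev;            /* previous vma, only when it is needed */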

    Cc: Mikulas Patocka
    Cc: KOSAKI Motohiro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

07 Nov, 2011

1 commit

  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of
    net: inet_timewait_sock doesnt need
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     

01 Nov, 2011

2 commits

  • A process spent 30 minutes exiting, just munlocking the pages of a large
    anonymous area that had been alternately mprotected into page-sized vmas:
    for every single page there's an anon_vma walk through all the other
    little vmas to find the right one.

    A general fix to that would be a lot more complicated (use prio_tree on
    anon_vma?), but there's one very simple thing we can do to speed up the
    common case: if a page to be munlocked is mapped only once, then it is our
    vma that it is mapped into, and there's no need whatever to walk through
    all the others.

    Okay, there is a very remote race in munlock_vma_pages_range(): if,
    between its follow_page() and lock_page(), another process were to
    munlock the same page, page reclaim were to remove it from our vma, and
    another process were to mlock it again, then we would find it with
    page_mapcount 1, yet it is still mlocked in another process. But never
    mind, that's much less likely than
    the down_read_trylock() failure which munlocking already tolerates (in
    try_to_unmap_one()): in due course page reclaim will discover and move the
    page to unevictable instead.
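
    A sketch of the fast path, grounded in the description above and assuming
    the page has already been isolated from the LRU:

        int ret = SWAP_AGAIN;

        /*
         * If the page is mapped just once, that mapping is ours, so there
         * is no need to walk the rmap looking for other mlocking vmas.
         */
        if (page_mapcount(page) > 1)
                ret = try_to_munlock(page);

        /* SWAP_MLOCK here means some other vma still mlocks the page */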

    [akpm@linux-foundation.org: add comment]
    Signed-off-by: Hugh Dickins
    Cc: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • MCL_FUTURE does not move pages between LRU lists, and draining the
    per-cpu LRU pagevecs is a nasty activity. Avoid doing it unnecessarily.

    Signed-off-by: Christoph Lameter
    Cc: David Rientjes
    Reviewed-by: Minchan Kim
    Acked-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

27 May, 2011

1 commit

  • The type of vma->vm_flags is 'unsigned long'. Neither 'int' nor
    'unsigned int'. This patch fixes such misuse.

    Signed-off-by: KOSAKI Motohiro
    [ Changed to use a typedef - we'll extend it to cover more cases
    later, since there has been discussion about making it a 64-bit
    type.. - Linus ]
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

05 May, 2011

1 commit

  • The logic in __get_user_pages() used to skip the stack guard page lookup
    whenever the caller wasn't interested in seeing what the actual page
    was. But Michel Lespinasse points out that there are cases where we
    don't care about the physical page itself (so 'pages' may be NULL), but
    do want to make sure a page is mapped into the virtual address space.

    So using the existence of the "pages" array as an indication of whether
    to look up the guard page or not isn't actually so great, and we really
    should just use the FOLL_MLOCK bit. But because that bit was only set
    for the VM_LOCKED case (and not all vma's necessarily have it, even for
    mlock()), we couldn't do that originally.

    Fix that by moving the VM_LOCKED check deeper into the call-chain, which
    actually simplifies many things. Now mlock() gets simpler, and we can
    also check for FOLL_MLOCK in __get_user_pages() and the code ends up
    much more straightforward.

    Reported-and-reviewed-by: Michel Lespinasse
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

13 Apr, 2011

1 commit

  • Commit 53a7706d5ed8 ("mlock: do not hold mmap_sem for extended periods
    of time") changed mlock() to care about the exact number of pages that
    __get_user_pages() had brought it. Before, it would only care about
    errors.

    And that doesn't work, because we also handled one page specially in
    __mlock_vma_pages_range(), namely the stack guard page. So when that
    case was handled, the number of pages that the function returned was off
    by one. In particular, it could be zero, and then the caller would end
    up not making any progress at all.

    Rather than try to fix up that off-by-one error for the mlock case
    specially, this just moves the logic to handle the stack guard page into
    __get_user_pages() itself, thus making all the counts come out right
    automatically.

    Reported-by: Robert Święcki
    Cc: Hugh Dickins
    Cc: Oleg Nesterov
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

24 Mar, 2011

1 commit

  • Morally, the presence of a gate vma is more an attribute of a particular mm than
    a particular task. Moreover, dropping the dependency on task_struct will help
    make both existing and future operations on mm's more flexible and convenient.

    Signed-off-by: Stephen Wilson
    Reviewed-by: Michel Lespinasse
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Signed-off-by: Al Viro

    Stephen Wilson
     

02 Feb, 2011

1 commit

  • As Tao Ma noticed, change 5ecfda0 breaks blktrace. This is because
    blktrace mmaps a file with PROT_WRITE permissions but without PROT_READ,
    so my attempt to not unnecessarily break COW during mlock ended up
    causing mlock to fail with a permission problem.

    I am proposing to let mlock ignore vma protection in all cases except
    PROT_NONE. In particular, mlock should not fail for PROT_WRITE regions
    (as in the blktrace case, which broke at 5ecfda0) or for PROT_EXEC
    regions (which seem to me like they were always broken).
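
    A sketch of how the rule can be expressed when setting up the
    get_user_pages() flags in the mlock path (force access for any vma whose
    protection is not PROT_NONE):

        /*
         * We want mlock to succeed for regions that have any permissions
         * other than PROT_NONE.
         */
        if (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC))
                gup_flags |= FOLL_FORCE;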

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Signed-off-by: Linus Torvalds

    Michel Lespinasse