01 Jul, 2014

1 commit

  • commit 7f39dda9d86fb4f4f17af0de170decf125726f8c upstream.

    Trinity reports BUG:

    sleeping function called from invalid context at kernel/locking/rwsem.c:47
    in_atomic(): 0, irqs_disabled(): 0, pid: 5787, name: trinity-c27

    __might_sleep < down_write < __put_anon_vma < page_get_anon_vma <
    migrate_pages < compact_zone < compact_zone_order < try_to_compact_pages ..

    Right, since conversion to mutex then rwsem, we should not put_anon_vma()
    from inside an rcu_read_lock()ed section: fix the two places that did so.
    And add might_sleep() to anon_vma_free(), as suggested by Peter Zijlstra.
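
    A minimal userspace sketch of the rule this fix enforces, with made-up
    names: nosleep_enter()/nosleep_exit() stand in for rcu_read_lock()/
    rcu_read_unlock(), put_obj() for put_anon_vma(), and the assert() plays
    the role of the added might_sleep() check. Dropping the last reference
    may block, so it happens only after leaving the no-sleep section.

    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>

    static int in_nosleep_section;       /* toy rcu_read_lock() depth */

    static void nosleep_enter(void) { in_nosleep_section++; }   /* "rcu_read_lock()"   */
    static void nosleep_exit(void)  { in_nosleep_section--; }   /* "rcu_read_unlock()" */

    struct obj { int refcount; };

    static void put_obj(struct obj *obj)
    {
        assert(!in_nosleep_section);     /* the might_sleep()-style check */
        if (--obj->refcount == 0)
            free(obj);                   /* the real put may take a rwsem here */
    }

    int main(void)
    {
        struct obj *obj = malloc(sizeof(*obj));

        obj->refcount = 2;

        nosleep_enter();
        /* look the object up and pin it while inside the section ... */
        nosleep_exit();

        put_obj(obj);                    /* ... but drop references only out here */
        put_obj(obj);
        printf("ok\n");
        return 0;
    }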

    Fixes: 88c22088bf23 ("mm: optimize page_lock_anon_vma() fast-path")
    Reported-by: Dave Jones
    Signed-off-by: Hugh Dickins
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hugh Dickins
     

12 Jun, 2014

1 commit

  • commit 624483f3ea82598ab0f62f1bdb9177f531ab1892 upstream.

    While working on the address sanitizer for the kernel I discovered a
    use-after-free bug in __put_anon_vma.

    For the last anon_vma, anon_vma->root is freed before the child anon_vma.
    Later, anon_vma_free(anon_vma) references the already-freed anon_vma->root
    to check its rwsem.

    This fixes it by freeing the child anon_vma before freeing
    anon_vma->root.
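
    A stripped-down userspace illustration of the ordering the patch restores
    (the struct names are invented; this is not the kernel code): the child's
    teardown still dereferences its root, so the root has to be freed last.

    #include <stdio.h>
    #include <stdlib.h>

    struct root  { int refcount; };
    struct child { struct root *root; };

    static void child_free(struct child *c)
    {
        c->root->refcount--;        /* use-after-free if the root went first */
        free(c);
    }

    int main(void)
    {
        struct root  *r = malloc(sizeof(*r));
        struct child *c = malloc(sizeof(*c));

        r->refcount = 1;
        c->root = r;

        child_free(c);              /* free the child first ...   */
        printf("root refcount now %d\n", r->refcount);
        free(r);                    /* ... and only then its root */
        return 0;
    }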

    Signed-off-by: Andrey Ryabinin
    Acked-by: Peter Zijlstra
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrey Ryabinin
     

06 May, 2014

1 commit

  • commit 57e68e9cd65b4b8eb4045a1e0d0746458502554c upstream.

    A BUG_ON(!PageLocked) was triggered in mlock_vma_page() by Sasha Levin
    fuzzing with trinity. The call site try_to_unmap_cluster() does not lock
    the pages other than its check_page parameter (which is already locked).

    The BUG_ON in mlock_vma_page() is not documented and its purpose is
    somewhat unclear, but apparently it serializes against page migration,
    which could otherwise fail to transfer the PG_mlocked flag. This would
    not be fatal, as the page would be eventually encountered again, but
    NR_MLOCK accounting would become distorted nevertheless. This patch adds
    a comment to the BUG_ON in mlock_vma_page() and munlock_vma_page() to that
    effect.

    The call site try_to_unmap_cluster() is fixed so that for page !=
    check_page, trylock_page() is attempted (to avoid possible deadlocks as we
    already have check_page locked) and mlock_vma_page() is performed only
    upon success. If the page lock cannot be obtained, the page is left
    without PG_mlocked, which is again not a problem in the whole unevictable
    memory design.

    Signed-off-by: Vlastimil Babka
    Signed-off-by: Bob Liu
    Reported-by: Sasha Levin
    Cc: Wanpeng Li
    Cc: Michel Lespinasse
    Cc: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     

10 Jan, 2014

1 commit

  • commit 98398c32f6687ee1e1f3ae084effb4b75adb0747 upstream.

    In __page_check_address(), if address's pud is not present,
    huge_pte_offset() will return NULL; we should check the return value.

    Signed-off-by: Jianguo Wu
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: qiuxishi
    Cc: Hanjun Guo
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jianguo Wu
     

30 Apr, 2013

1 commit

  • We have to recompute pgoff if the given page is huge, since the result
    based on HPAGE_SIZE is not appropriate for scanning the vma interval tree,
    as shown by commit 36e4f20af833 ("hugetlb: do not use vma_hugecache_offset()
    for vma_prio_tree_foreach").

    Signed-off-by: Hillf Danton
    Cc: Michal Hocko
    Cc: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     

24 Feb, 2013

1 commit

  • The comment in commit 4fc3f1d66b1e ("mm/rmap, migration: Make
    rmap_walk_anon() and try_to_unmap_anon() more scalable") says:

    | Rename anon_vma_[un]lock() => anon_vma_[un]lock_write(),
    | to make it clearer that it's an exclusive write-lock in
    | that case - suggested by Rik van Riel.

    But that commit renamed only anon_vma_lock(), not anon_vma_unlock().

    Signed-off-by: Konstantin Khlebnikov
    Cc: Ingo Molnar
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

14 Feb, 2013

1 commit

  • The s390 architecture is unique with respect to dirty page detection:
    it uses the change bit in the per-page storage key to track page
    modifications. All other architectures track dirty bits by means
    of page table entries. This property of s390 has caused numerous
    problems in the past, e.g. see git commit ef5d437f71afdf4a
    "mm: fix XFS oops due to dirty pages without buffers on s390".

    To avoid future issues with per-page dirty bits, convert s390 to a
    fault-based software dirty bit detection mechanism. All user page table
    entries which are marked as clean will be hardware read-only, even if
    the pte is supposed to be writable. A write by the user process will
    trigger a protection fault, which will cause the user pte to be marked
    as dirty and the hardware read-only bit to be removed.
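
    The same fault-based dirty tracking idea can be sketched in userspace
    with mprotect() and a SIGSEGV handler. This is only an analogy of the
    mechanism just described, not the s390 code; it relies on Linux
    restarting the faulting write once the handler has made the page
    writable again (and on mprotect() being usable from the handler, a
    common trick even though it is not formally async-signal-safe).

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/mman.h>

    static char *page;
    static long page_size;
    static volatile sig_atomic_t page_dirty;      /* the software dirty bit */

    static void wp_handler(int sig, siginfo_t *si, void *uc)
    {
        (void)sig; (void)uc;
        if ((char *)si->si_addr >= page && (char *)si->si_addr < page + page_size) {
            page_dirty = 1;                       /* record the first write */
            mprotect(page, page_size, PROT_READ | PROT_WRITE);
        } else {
            _exit(1);                             /* not our page: bail out */
        }
    }

    int main(void)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = wp_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        page_size = sysconf(_SC_PAGESIZE);
        page = mmap(NULL, page_size, PROT_READ,   /* "clean" means read-only */
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        page[0] = 1;      /* first write faults; the handler marks it dirty */
        printf("dirty=%d\n", (int)page_dirty);
        return 0;
    }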

    With this change the dirty bit in the storage key is irrelevant
    for Linux as a host, but the storage key is still required for
    KVM guests. The effect is that page_test_and_clear_dirty and the
    related code can be removed. The referenced bit in the storage
    key is still used by the page_test_and_clear_young primitive to
    provide page age information.

    For page cache pages of mappings with mapping_cap_account_dirty
    there will not be any change in behavior as the dirty bit tracking
    already uses read-only ptes to control the amount of dirty pages.
    Only for swap cache pages and pages of mappings without
    mapping_cap_account_dirty can there be additional protection faults.
    To avoid an excessive number of additional faults the mk_pte
    primitive checks for PageDirty if the pgprot value allows for writes
    and pre-dirties the pte. That avoids all additional faults for
    tmpfs and shmem pages until these pages are added to the swap cache.

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     

17 Dec, 2012

1 commit

  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for these workloads driven by perf, which is bad
    but is potentially explained by the lack of scheduler smarts. Thomas'
    results show balancenuma improves on mainline but falls far short of
    numacore or autonuma. Srikar's results indicate we all suffer on a
    large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers which is not reported by
    the tool by default and sometimes missed in reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore batch
    handles PTEs but I no longer think this is the case. It's possible
    numacore is just able to trigger it due to higher rates of migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

13 Dec, 2012

1 commit

  • Memory error handling on hugepages can break an RSS counter, which emits a
    message like "Bad rss-counter state mm:ffff88040abecac0 idx:1 val:-1".
    This is because PageAnon returns true for hugepage (this behavior is
    necessary for reverse mapping to work on hugetlbfs).

    [akpm@linux-foundation.org: clean up code layout]
    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Cc: Wu Fengguang
    Cc: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

12 Dec, 2012

2 commits

  • Add comments noting that the dirty bit in the storage key gets set whenever
    the page content is changed. Hopefully anyone who uses this function will
    have a look at one of the two places where we comment on this.

    Signed-off-by: Jan Kara
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Several places need to find the pmd by (mm_struct, address), so introduce
    a function to simplify it.

    [akpm@linux-foundation.org: fix warning]
    Signed-off-by: Bob Liu
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Ni zhan Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
     

11 Dec, 2012

2 commits

  • rmap_walk_anon() and try_to_unmap_anon() appear to be too
    careful about locking the anon vma: while they need protection
    against anon vma list modifications, they do not need exclusive
    access to the list itself.

    Transforming this exclusive lock to a read-locked rwsem removes
    a global lock from the hot path of page-migration intense
    threaded workloads which can cause pathological performance like
    this:

    96.43% process 0 [kernel.kallsyms] [k] perf_trace_sched_switch
    |
    --- perf_trace_sched_switch
    __schedule
    schedule
    schedule_preempt_disabled
    __mutex_lock_common.isra.6
    __mutex_lock_slowpath
    mutex_lock
    |
    |--50.61%-- rmap_walk
    | move_to_new_page
    | migrate_pages
    | migrate_misplaced_page
    | __do_numa_page.isra.69
    | handle_pte_fault
    | handle_mm_fault
    | __do_page_fault
    | do_page_fault
    | page_fault
    | __memset_sse2
    | |
    | --100.00%-- worker_thread
    | |
    | --100.00%-- start_thread
    |
    --49.39%-- page_lock_anon_vma
    try_to_unmap_anon
    try_to_unmap
    migrate_pages
    migrate_misplaced_page
    __do_numa_page.isra.69
    handle_pte_fault
    handle_mm_fault
    __do_page_fault
    do_page_fault
    page_fault
    __memset_sse2
    |
    --100.00%-- worker_thread
    start_thread

    With this change applied the profile is now nicely flat
    and there's no anon-vma related scheduling/blocking.

    Rename anon_vma_[un]lock() => anon_vma_[un]lock_write(),
    to make it clearer that it's an exclusive write-lock in
    that case - suggested by Rik van Riel.
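
    In userspace terms the change looks roughly like the sketch below, with a
    POSIX rwlock standing in for the kernel rw_semaphore and illustrative
    names throughout: walkers take the lock shared and can run concurrently,
    while list modification still takes it exclusively (the _write variant).

    #include <pthread.h>
    #include <stdio.h>

    static pthread_rwlock_t anon_vma_rwsem = PTHREAD_RWLOCK_INITIALIZER;
    static int list_length = 3;                  /* stand-in for the avc list */

    static void *rmap_walk_reader(void *arg)
    {
        (void)arg;
        pthread_rwlock_rdlock(&anon_vma_rwsem);  /* many walkers at once */
        for (int i = 0; i < list_length; i++)
            ;                                    /* "visit" each list entry */
        pthread_rwlock_unlock(&anon_vma_rwsem);
        return NULL;
    }

    static void lock_write_and_modify(void)
    {
        pthread_rwlock_wrlock(&anon_vma_rwsem);  /* exclusive, as before */
        list_length++;
        pthread_rwlock_unlock(&anon_vma_rwsem);
    }

    int main(void)                               /* build with -pthread */
    {
        pthread_t t[4];

        for (int i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, rmap_walk_reader, NULL);
        lock_write_and_modify();
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        printf("list length now %d\n", list_length);
        return 0;
    }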

    Suggested-by: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Paul Turner
    Cc: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Signed-off-by: Ingo Molnar
    Signed-off-by: Mel Gorman

    Ingo Molnar
     
  • Convert the struct anon_vma::mutex to an rwsem, which will help
    in solving a page-migration scalability problem. (Addressed in
    a separate patch.)

    The conversion is simple and straightforward: in every case
    where we mutex_lock()ed we'll now down_write().

    Suggested-by: Linus Torvalds
    Reviewed-by: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Paul Turner
    Cc: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Signed-off-by: Ingo Molnar
    Signed-off-by: Mel Gorman

    Ingo Molnar
     

26 Oct, 2012

1 commit

  • On s390 any write to a page (even from kernel itself) sets architecture
    specific page dirty bit. Thus when a page is written to via buffered
    write, HW dirty bit gets set and when we later map and unmap the page,
    page_remove_rmap() finds the dirty bit and calls set_page_dirty().

    Dirtying of a page which shouldn't be dirty can cause all sorts of
    problems for filesystems. The bug we observed in practice is that
    buffers from the page get freed, so when the page gets later marked as
    dirty and writeback writes it, XFS crashes due to an assertion
    BUG_ON(!PagePrivate(page)) in page_buffers() called from
    xfs_count_page_state().

    A similar problem can also happen when a zero_user_segment() call from
    xfs_vm_writepage() (or block_write_full_page(), for that matter) sets the
    hardware dirty bit during writeback; later the buffers get freed and the
    page is unmapped.

    Fix the issue by ignoring s390 HW dirty bit for page cache pages of
    mappings with mapping_cap_account_dirty(). This is safe because for
    such mappings when a page gets marked as writeable in PTE it is also
    marked dirty in do_wp_page() or do_page_fault(). When the dirty bit is
    cleared by clear_page_dirty_for_io(), the page gets writeprotected in
    page_mkclean(). So pagecache page is writeable if and only if it is
    dirty.

    Thanks to Hugh Dickins for pointing out mapping has to have
    mapping_cap_account_dirty() for things to work and proposing a cleaned
    up variant of the patch.

    The patch has survived about two hours of running fsx-linux on tmpfs
    while heavily swapping, and several days of running on our build machines
    where the original problem was triggered.

    Signed-off-by: Jan Kara
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Heiko Carstens
    Cc: [3.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

09 Oct, 2012

7 commits

  • In order to allow sleeping during mmu notifier calls, we need to avoid
    invoking them under the page table spinlock. This patch solves the
    problem by calling invalidate_page notification after releasing the lock
    (but before freeing the page itself), or by wrapping the page invalidation
    with calls to invalidate_range_begin and invalidate_range_end.

    To prevent accidental changes to the invalidate_range_end arguments after
    the call to invalidate_range_begin, the patch introduces a convention of
    saving the arguments in consistently named locals:

    unsigned long mmun_start; /* For mmu_notifiers */
    unsigned long mmun_end; /* For mmu_notifiers */

    ...

    mmun_start = ...
    mmun_end = ...
    mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);

    ...

    mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);

    The patch changes code to use this convention for all calls to
    mmu_notifier_invalidate_range_start/end, except those where the calls are
    close enough so that anyone who glances at the code can see the values
    aren't changing.

    This patchset is a preliminary step towards on-demand paging design to be
    added to the RDMA stack.

    Why do we want on-demand paging for Infiniband?

    Applications register memory with an RDMA adapter using system calls,
    and subsequently post IO operations that refer to the corresponding
    virtual addresses directly to HW. Until now, this was achieved by
    pinning the memory during the registration calls. The goal of on demand
    paging is to avoid pinning the pages of registered memory regions (MRs).
    This will allow users the same flexibility they get when swapping any
    other part of their process's address space. Instead of requiring the
    entire MR to fit in physical memory, we can allow the MR to be larger,
    and only fit the current working set in physical memory.

    Why should anyone care? What problems are users currently experiencing?

    This can make programming with RDMA much simpler. Today, developers
    that are working with more data than their RAM can hold need either to
    deregister and reregister memory regions throughout their process's
    life, or keep a single memory region and copy the data to it. On demand
    paging will allow these developers to register a single MR at the
    beginning of their process's life, and let the operating system manage
    which pages need to be fetched at a given time. In the future, we
    might be able to provide a single memory access key for each process
    that would provide the entire process's address space as one large memory
    region, and the developers wouldn't need to register memory regions at
    all.

    Is there any prospect that any other subsystems will utilise these
    infrastructural changes? If so, which and how, etc?

    As for other subsystems, I understand that XPMEM wanted to sleep in
    MMU notifiers, as Christoph Lameter wrote at
    http://lkml.indiana.edu/hypermail/linux/kernel/0802.1/0460.html and
    perhaps Andrea knows about other use cases.

    Scheduling in mmu notifications is required since we need to sync the
    hardware with the secondary page tables change. A TLB flush of an IO
    device is inherently slower than a CPU TLB flush, so our design works by
    sending the invalidation request to the device, and waiting for an
    interrupt before exiting the mmu notifier handler.

    Avi said:

    kvm may be a buyer. kvm::mmu_lock, which serializes guest page
    faults, also protects long operations such as destroying large ranges.
    It would be good to convert it into a mutex, but as it is used inside
    mmu notifiers, this cannot be done.

    (there are alternatives, such as keeping the spinlock and using a
    generation counter to do the teardown in O(1), which is what the "may"
    is doing up there).

    [akpm@linux-foundation.org: possible speed tweak in hugetlb_cow(), cleanups]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Haggai Eran
    Cc: Peter Zijlstra
    Cc: Xiao Guangrong
    Cc: Or Gerlitz
    Cc: Haggai Eran
    Cc: Shachar Raindel
    Cc: Liran Liss
    Cc: Christoph Lameter
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sagi Grimberg
     
  • We had thought that pages could no longer get freed while still marked as
    mlocked; but Johannes Weiner posted this program to demonstrate that
    truncating an mlocked private file mapping containing COWed pages is still
    mishandled:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/mman.h>

    int main(void)
    {
        char *map;
        int fd;

        system("grep mlockfreed /proc/vmstat");
        fd = open("chigurh", O_CREAT|O_EXCL|O_RDWR, 0600);
        unlink("chigurh");
        ftruncate(fd, 4096);
        map = mmap(NULL, 4096, PROT_WRITE, MAP_PRIVATE, fd, 0);
        map[0] = 11;
        mlock(map, sizeof(fd));
        ftruncate(fd, 0);
        close(fd);
        munlock(map, sizeof(fd));
        munmap(map, 4096);
        system("grep mlockfreed /proc/vmstat");
        return 0;
    }

    The anon COWed pages are not caught by truncation's clear_page_mlock() of
    the pagecache pages; but unmap_mapping_range() unmaps them, so we ought to
    look out for them there in page_remove_rmap(). Indeed, why should
    truncation or invalidation be doing the clear_page_mlock() when removing
    from pagecache? mlock is a property of mapping in userspace, not a
    property of pagecache: an mlocked unmapped page is nonsensical.

    Reported-by: Johannes Weiner
    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Michel Lespinasse
    Cc: Ying Han
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • page_evictable(page, vma) is an irritant: almost all its callers pass
    NULL for vma. Remove the vma arg and use mlocked_vma_newpage(vma, page)
    explicitly in the couple of places it's needed. But in those places we
    don't even need page_evictable() itself! They're dealing with a freshly
    allocated anonymous page, which has no "mapping" and cannot be mlocked yet.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Michel Lespinasse
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • In file and anon rmap, we use interval trees to find potentially relevant
    vmas and then call vma_address() to find the virtual address the given
    page might be found at in these vmas. vma_address() used to include a
    check that the returned address falls within the limits of the vma, but
    this check isn't necessary now that we always use interval trees in rmap:
    the interval tree just doesn't return any vmas which this check would find
    to be irrelevant. As a result, we can replace the use of -EFAULT error
    code (which then needed to be checked in every call site) with a
    VM_BUG_ON().
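
    A rough, userspace-compilable sketch of the vma_address() shape described
    here (struct and numbers are made up): the linear pgoff-to-address mapping
    is unchanged, but the bounds check becomes an assertion instead of an
    -EFAULT return, because the interval-tree lookup already guarantees that
    the vma covers the index.

    #include <assert.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12

    struct vma { unsigned long vm_start, vm_end, vm_pgoff; };

    static unsigned long vma_address(const struct vma *vma, unsigned long pgoff)
    {
        unsigned long address =
            vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);

        /* Was: return -EFAULT if outside [vm_start, vm_end); now a bug trap. */
        assert(address >= vma->vm_start && address < vma->vm_end);
        return address;
    }

    int main(void)
    {
        struct vma v = { 0x400000, 0x600000, 16 };   /* made-up vma */

        printf("page index 20 maps to %#lx\n", vma_address(&v, 20));
        return 0;
    }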

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Daniel Santos
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • When a large VMA (anon or private file mapping) is first touched, which
    will populate its anon_vma field, and then split into many regions through
    the use of mprotect(), the original anon_vma ends up linking all of the
    vmas on a linked list. This can cause rmap to become inefficient, as we
    have to walk potentially thousands of irrelevant vmas before finding the
    one a given anon page might fall into.

    By replacing the same_anon_vma linked list with an interval tree (where
    each avc's interval is determined by its vma's start and last pgoffs), we
    can make rmap efficient for this use case again.

    While the change is large, all of its pieces are fairly simple.

    Most places that were walking the same_anon_vma list were looking for a
    known pgoff, so they can just use the anon_vma_interval_tree_foreach()
    interval tree iterator instead. The exception here is ksm, where the
    page's index is not known. It would probably be possible to rework ksm so
    that the index would be known, but for now I have decided to keep things
    simple and just walk the entirety of the interval tree there.

    When updating vma's that already have an anon_vma assigned, we must take
    care to re-index the corresponding avc's on their interval tree. This is
    done through the use of anon_vma_interval_tree_pre_update_vma() and
    anon_vma_interval_tree_post_update_vma(), which remove the avc's from
    their interval tree before the update and re-insert them after the update.
    The anon_vma stays locked during the update, so there is no chance that
    rmap would miss the vmas that are being updated.
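
    A toy model of the resulting lookup semantics, with invented names and
    numbers and a plain array scan standing in for the augmented rbtree: each
    entry carries its vma's [start, last] pgoff interval, and a walk for one
    page index visits only the vmas whose interval contains it.

    #include <stdio.h>

    struct avc { unsigned long start, last; const char *vma_name; };

    static void rmap_walk_for_pgoff(const struct avc *tree, int n, unsigned long pgoff)
    {
        for (int i = 0; i < n; i++)
            if (tree[i].start <= pgoff && pgoff <= tree[i].last)
                printf("  visiting %s\n", tree[i].vma_name);
    }

    int main(void)
    {
        /* Many small vmas, e.g. produced by mprotect() splits of one mapping. */
        const struct avc tree[] = {
            {  0, 15, "vma A" },
            { 16, 31, "vma B" },
            { 32, 47, "vma C" },
            { 48, 63, "vma D" },
        };

        printf("walk for pgoff 33:\n");
        rmap_walk_for_pgoff(tree, 4, 33);       /* only vma C is relevant */
        return 0;
    }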

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Daniel Santos
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • mremap() had a clever optimization where move_ptes() did not take the
    anon_vma lock to avoid a race with anon rmap users such as page migration.
    Instead, the avc's were ordered in such a way that the origin vma was
    always visited by rmap before the destination. This ordering and the use
    of page table locks made rmap usage safe. However, we want to replace the use
    of linked lists in anon rmap with an interval tree, and this will make it
    harder to impose such ordering as the interval tree will always be sorted
    by the avc->vma->vm_pgoff value. For now, let's replace the
    anon_vma_moveto_tail() ordering function with proper anon_vma locking in
    move_ptes(). Once we have the anon interval tree in place, we will
    re-introduce an optimization to avoid taking these locks in the most
    common cases.

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Daniel Santos
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Implement an interval tree as a replacement for the VMA prio_tree. The
    algorithms are similar to lib/interval_tree.c; however that code can't be
    directly reused as the interval endpoints are not explicitly stored in the
    VMA. So instead, the common algorithm is moved into a template and the
    details (node type, how to get interval endpoints from the node, etc) are
    filled in using the C preprocessor.

    Once the interval tree functions are available, using them as a
    replacement to the VMA prio tree is a relatively simple, mechanical job.
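
    The "template filled in by the C preprocessor" technique reads roughly
    like the toy below; this illustrates the approach rather than the actual
    lib/interval_tree code. The generic computation is written once, and the
    node type plus endpoint accessors are supplied by the macro user.

    #include <stdio.h>

    /* Define a function NAME that computes the interval span of a NODE type,
     * given expressions for its start and last endpoints.                   */
    #define DEFINE_SPAN_FN(name, type, start_expr, last_expr)                \
    static unsigned long name(const type *node)                              \
    {                                                                        \
        return (last_expr) - (start_expr) + 1;                               \
    }

    struct vma_like { unsigned long pgoff_start, pgoff_last; };

    DEFINE_SPAN_FN(vma_span, struct vma_like,
                   node->pgoff_start, node->pgoff_last)

    int main(void)
    {
        struct vma_like v = { 10, 25 };

        printf("span: %lu pages\n", vma_span(&v));   /* prints 16 */
        return 0;
    }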

    Signed-off-by: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Hillf Danton
    Cc: Peter Zijlstra
    Cc: Catalin Marinas
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

30 May, 2012

1 commit

  • The swap token code no longer fits in with the current VM model. It
    does not play well with cgroups or the better NUMA placement code in
    development, since we have only one swap token globally.

    It also has the potential to mess with scalability of the system, by
    increasing the number of non-reclaimable pages on the active and
    inactive anon LRU lists.

    Last but not least, the swap token code has been broken for a year
    without complaints, as reported by Konstantin Khlebnikov. This suggests
    we no longer have much use for it.

    The days of sub-1G memory systems with heavy use of swap are over. If
    we ever need thrashing reducing code in the future, we will have to
    implement something that does scale.

    Signed-off-by: Rik van Riel
    Cc: Konstantin Khlebnikov
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Acked-by: Bob Picco
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

22 Mar, 2012

3 commits

  • Now, per-memcg page stats are recorded in a per-page_cgroup flag by
    duplicating the page's status into that flag. The reason is that memcg has
    a feature to move a page from one group to another, and we have a race
    between "move" and "page stat accounting".

    Under current logic, assume CPU-A and CPU-B. CPU-A does "move" and CPU-B
    does "page stat accounting".

    When CPU-A goes 1st,

            CPU-A                              CPU-B
                                               update "struct page" info.
        move_lock_mem_cgroup(memcg)
        see pc->flags
        copy page stat to new group
        overwrite pc->mem_cgroup.
        move_unlock_mem_cgroup(memcg)
                                               move_lock_mem_cgroup(mem)
                                               set pc->flags
                                               update page stat accounting
                                               move_unlock_mem_cgroup(mem)

    stat accounting is guarded by move_lock_mem_cgroup() and "move" logic
    (CPU-A) doesn't see changes in "struct page" information.

    But it's costly to have the same information both in 'struct page' and
    'struct page_cgroup'. And, there is a potential problem.

    For example, assume we have PG_dirty accounting in memcg.
    PG_..is a flag for struct page.
    PCG_ is a flag for struct page_cgroup.
    (This is just an example. The same problem can be found in any
    kind of page stat accounting.)

            CPU-A                              CPU-B
        TestSet PG_dirty
        (delay)                                TestClear PG_dirty
                                               if (TestClear(PCG_dirty))
                                                   memcg->nr_dirty--
        if (TestSet(PCG_dirty))
            memcg->nr_dirty++

    Here, memcg->nr_dirty = +1, which is wrong. This race was reported by Greg
    Thelen. For now only FILE_MAPPED is supported and, fortunately, it is
    serialized by the page table lock, so this is not a real bug _now_.

    If this potential problem is caused by having duplicated information in
    struct page and struct page_cgroup, we may be able to fix it by using the
    original 'struct page' information. But then we'll have a problem with
    "move account".

    Assume we use only PG_dirty.

            CPU-A                              CPU-B
        TestSet PG_dirty
        (delay)                                move_lock_mem_cgroup()
                                               if (PageDirty(page))
                                                   new_memcg->nr_dirty++
                                               pc->mem_cgroup = new_memcg;
                                               move_unlock_mem_cgroup()
        move_lock_mem_cgroup()
        memcg = pc->mem_cgroup
        new_memcg->nr_dirty++

    The accounting information may be double-counted. This was the original
    reason to have the PCG_xxx flags, but it seems PCG_xxx has another problem.

    I think we need a bigger lock, used as:

        move_lock_mem_cgroup(page)
            TestSetPageDirty(page)
            update page stats (without any checks)
        move_unlock_mem_cgroup(page)

    This fixes both problems, and we don't have to duplicate the page flag in
    page_cgroup. Please note: move_lock_mem_cgroup() is held only when there is
    a possibility of "account move" in the system. So, on most paths, the
    status update will go without atomic locks.

    This patch introduces mem_cgroup_begin_update_page_stat() and
    mem_cgroup_end_update_page_stat(); both should be called when modifying
    'struct page' information that memcg takes care of, as:

        mem_cgroup_begin_update_page_stat()
            modify page information
            mem_cgroup_update_page_stat()
            => never check any 'struct page' info, just update counters.
        mem_cgroup_end_update_page_stat()

    This patch is slow because we need to call begin_update_page_stat()/
    end_update_page_stat() regardless of whether the accounted state will
    change. A following patch adds an easy optimization and reduces the cost.

    [akpm@linux-foundation.org: s/lock/locked/]
    [hughd@google.com: fix deadlock by avoiding stat lock when anon]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Greg Thelen
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: KOSAKI Motohiro
    Cc: Ying Han
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Reduce code duplication by calling anon_vma_chain_link() from
    anon_vma_prepare().

    Also move anon_vma_chain_link() to a more suitable location in the file.

    Signed-off-by: Kautuk Consul
    Acked-by: Peter Zijlstra
    Cc: Hugh Dickins
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kautuk Consul
     
  • Since commit 2a11c8ea20bf ("kconfig: Introduce IS_ENABLED(),
    IS_BUILTIN() and IS_MODULE()") there is a generic grep-friendly method
    for checking config options in C expressions.
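
    For reference, here is a self-contained imitation of the builtin half of
    that helper (the real IS_ENABLED() lives in the kernel headers and also
    handles the =m case); it shows why the check is grep-friendly and can sit
    inside ordinary C expressions instead of #ifdef blocks:

    #include <stdio.h>

    /* Simplified copy of the kconfig helper machinery. */
    #define __ARG_PLACEHOLDER_1 0,
    #define __take_second_arg(ignored, val, ...) val
    #define ____is_defined(arg1_or_junk) __take_second_arg(arg1_or_junk 1, 0)
    #define ___is_defined(val)           ____is_defined(__ARG_PLACEHOLDER_##val)
    #define __is_defined(x)              ___is_defined(x)
    #define IS_ENABLED(option)           __is_defined(option)

    #define CONFIG_KSM 1                 /* pretend this came from kconfig */

    int main(void)
    {
        if (IS_ENABLED(CONFIG_KSM))                    /* always compiled */
            printf("KSM support is configured in\n");
        if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))  /* undefined => 0  */
            printf("THP is not configured in this build\n");
        return 0;
    }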

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

13 Jan, 2012

1 commit


11 Jan, 2012

1 commit

  • migrate was doing an rmap_walk with speculative lock-less access on
    pagetables. That could lead to it not serializing properly against mremap
    PT locks. But a second problem remains in the order of vmas in the
    same_anon_vma list used by the rmap_walk.

    If vma_merge succeeds in copy_vma, the src vma could be placed after the
    dst vma in the same_anon_vma list. That could still lead to migrate
    missing some pte.

    This patch adds an anon_vma_moveto_tail() function to force the dst vma at
    the end of the list before mremap starts to solve the problem.

    If the mremap is very large and there are lots of parents or children
    sharing the anon_vma root lock, this should still scale better than taking
    the anon_vma root lock around every pte copy for practically the whole
    duration of the mremap.

    Update: Hugh noticed that special care is needed in the error path, where
    move_page_tables goes in the reverse direction; a second
    anon_vma_moveto_tail() call is needed there.

    This program exercises the anon_vma_moveto_tail:

    ===

    /* The changelog omitted the headers and the SIZE definition; the value
     * below is only a stand-in (any large multiple of 4 MB will do). */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/time.h>

    #define SIZE (256UL*1024*1024)

    int main()
    {
        static struct timeval oldstamp, newstamp;
        long diffsec;
        char *p, *p2, *p3, *p4;

        if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
            perror("memalign"), exit(1);
        if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
            perror("memalign"), exit(1);
        if (posix_memalign((void **)&p3, 2*1024*1024, SIZE))
            perror("memalign"), exit(1);

        memset(p, 0xff, SIZE);
        printf("%p\n", p);
        memset(p2, 0xff, SIZE);
        memset(p3, 0x77, 4096);
        if (memcmp(p, p2, SIZE))
            printf("error\n");
        p4 = mremap(p+SIZE/2, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
        if (p4 != p3)
            perror("mremap"), exit(1);
        p4 = mremap(p4, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p+SIZE/2);
        if (p4 != p+SIZE/2)
            perror("mremap"), exit(1);
        if (memcmp(p, p2, SIZE))
            printf("error\n");
        printf("ok\n");

        return 0;
    }
    ===

    $ perf probe -a anon_vma_moveto_tail
    Add new event:
    probe:anon_vma_moveto_tail (on anon_vma_moveto_tail)

    You can now use it on all perf tools, such as:

    perf record -e probe:anon_vma_moveto_tail -aR sleep 1

    $ perf record -e probe:anon_vma_moveto_tail -aR ./anon_vma_moveto_tail
    0x7f2ca2800000
    ok
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.043 MB perf.data (~1860 samples) ]
    $ perf report --stdio
    100.00% anon_vma_moveto [kernel.kallsyms] [k] anon_vma_moveto_tail

    Signed-off-by: Andrea Arcangeli
    Reported-by: Nai Xia
    Acked-by: Mel Gorman
    Cc: Hugh Dickins
    Cc: Pawel Sikora
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

07 Nov, 2011

1 commit

  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of
    net: inet_timewait_sock doesnt need
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     

01 Nov, 2011

1 commit


31 Oct, 2011

1 commit


27 Jul, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback: (27 commits)
    mm: properly reflect task dirty limits in dirty_exceeded logic
    writeback: don't busy retry writeback on new/freeing inodes
    writeback: scale IO chunk size up to half device bandwidth
    writeback: trace global_dirty_state
    writeback: introduce max-pause and pass-good dirty limits
    writeback: introduce smoothed global dirty limit
    writeback: consolidate variable names in balance_dirty_pages()
    writeback: show bdi write bandwidth in debugfs
    writeback: bdi write bandwidth estimation
    writeback: account per-bdi accumulated written pages
    writeback: make writeback_control.nr_to_write straight
    writeback: skip tmpfs early in balance_dirty_pages_ratelimited_nr()
    writeback: trace event writeback_queue_io
    writeback: trace event writeback_single_inode
    writeback: remove .nonblocking and .encountered_congestion
    writeback: remove writeback_control.more_io
    writeback: skip balance_dirty_pages() for in-memory fs
    writeback: add bdi_dirty_limit() kernel-doc
    writeback: avoid extra sync work at enqueue time
    writeback: elevate queue_io() into wb_writeback()
    ...

    Fix up trivial conflicts in fs/fs-writeback.c and mm/filemap.c

    Linus Torvalds
     

24 Jul, 2011

1 commit

  • On x86 a page without a mapper is by definition not referenced / old.
    The s390 architecture keeps the reference bit in the storage key and
    the current code will check the storage key for page without a mapper.
    This leads to an interesting effect: the first time an s390 system
    needs to write pages to swap it only finds referenced pages. This
    causes a lot of pages to get added and written to the swap device.
    To avoid this behaviour, change page_referenced to query the storage
    key only if there is a mapper of the page.

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     

21 Jul, 2011

1 commit

  • i_alloc_sem is a rather special rw_semaphore. It's the last one that may
    be released by a non-owner, and its write side is always mirrored by
    real exclusion. Its intended use is to wait for all pending direct I/O
    requests to finish before starting a truncate.

    Replace it with a hand-grown construct:

    - exclusion for truncates is already guaranteed by i_mutex, so it can
    simply fall away
    - the reader side is replaced by an i_dio_count member in struct inode
    that counts the number of pending direct I/O requests. Truncate can't
    proceed as long as it's non-zero
    - when i_dio_count reaches zero we wake up a pending truncate using
    wake_up_bit on a new bit in i_flags
    - new references to i_dio_count can't appear while we are waiting for
    it to read zero because the direct I/O count always needs i_mutex
    (or an equivalent like XFS's i_iolock) for starting a new operation.

    This scheme is much simpler, and saves the space of a spinlock_t and a
    struct list_head in struct inode (typically 160 bits on a non-debug 64-bit
    system).
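
    The counting scheme can be sketched in userspace as below, with a
    mutex/condvar pair standing in for the kernel's wake_up_bit machinery
    (names are illustrative, and the i_mutex that keeps new direct I/O from
    starting while we wait is assumed rather than modelled):

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  zero = PTHREAD_COND_INITIALIZER;
    static int i_dio_count;                   /* pending direct I/O requests */

    static void dio_begin(void)
    {
        pthread_mutex_lock(&lock);
        i_dio_count++;
        pthread_mutex_unlock(&lock);
    }

    static void dio_end(void)
    {
        pthread_mutex_lock(&lock);
        if (--i_dio_count == 0)
            pthread_cond_broadcast(&zero);    /* wake a waiting truncate */
        pthread_mutex_unlock(&lock);
    }

    static void truncate_wait_for_dio(void)
    {
        pthread_mutex_lock(&lock);
        while (i_dio_count != 0)
            pthread_cond_wait(&zero, &lock);
        pthread_mutex_unlock(&lock);
    }

    static void *dio_worker(void *arg)
    {
        (void)arg;
        dio_begin();
        usleep(1000);                         /* direct I/O "in flight" */
        dio_end();
        return NULL;
    }

    int main(void)                            /* build with -pthread */
    {
        pthread_t t;

        pthread_create(&t, NULL, dio_worker, NULL);
        truncate_wait_for_dio();              /* proceeds once the count is 0 */
        printf("truncate may proceed\n");
        pthread_join(t, NULL);
        return 0;
    }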

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

28 Jun, 2011

1 commit


18 Jun, 2011

3 commits

  • Hugh Dickins points out that lockdep (correctly) spots a potential
    deadlock on the anon_vma lock, because we now do a GFP_KERNEL allocation
    of anon_vma_chain while doing anon_vma_clone(). The problem is that
    page reclaim will want to take the anon_vma lock of any anonymous pages
    that it will try to reclaim.

    So re-organize the code in anon_vma_clone() slightly: first do just a
    GFP_NOWAIT allocation, which will usually work fine. But if that fails,
    let's just drop the lock and re-do the allocation, now with GFP_KERNEL.

    End result: not only do we avoid the locking problem, this also ends up
    getting better concurrency in case the allocation does need to block.
    Tim Chen reports that with all these anon_vma locking tweaks, we're now
    almost back up to the spinlock performance.
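
    The shape of the fix, transplanted to userspace with invented names (a
    tiny static pool plays the GFP_NOWAIT role and malloc() the GFP_KERNEL
    one), is roughly:

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    static pthread_mutex_t anon_vma_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *alloc_nowait(void)           /* never blocks, may fail */
    {
        static char pool[64];
        static int used;

        return used ? NULL : (used = 1, (void *)pool);
    }

    static void *alloc_may_block(void)        /* only safe with the lock dropped */
    {
        return malloc(64);
    }

    static void *clone_chain_entry(void)
    {
        void *avc;

        pthread_mutex_lock(&anon_vma_lock);
        avc = alloc_nowait();                 /* the GFP_NOWAIT attempt */
        if (!avc) {
            pthread_mutex_unlock(&anon_vma_lock);
            avc = alloc_may_block();          /* the GFP_KERNEL fallback */
            pthread_mutex_lock(&anon_vma_lock);
        }
        /* ... link avc into the chain while holding the lock ... */
        pthread_mutex_unlock(&anon_vma_lock);
        return avc;
    }

    int main(void)
    {
        printf("first entry %p, second entry %p\n",
               clone_chain_entry(), clone_chain_entry());
        return 0;
    }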

    Reported-and-tested-by: Hugh Dickins
    Tested-by: Tim Chen
    Cc: Peter Zijlstra
    Cc: Andi Kleen
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • This matches the anon_vma_clone() case, and uses the same lock helper
    functions. Because of the need to potentially release the anon_vma's,
    it's a bit more complex, though.

    We traverse the 'vma->anon_vma_chain' in two phases: the first loop gets
    the anon_vma lock (with the helper function that only takes the lock
    once for the whole loop), and removes any entries that don't need any
    more processing.

    The second phase just traverses the remaining list entries (without
    holding the anon_vma lock), and does any actual freeing of the
    anon_vma's that is required.

    Signed-off-by: Peter Zijlstra
    Tested-by: Hugh Dickins
    Tested-by: Tim Chen
    Cc: Andi Kleen
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • In anon_vma_clone() we traverse the vma->anon_vma_chain of the source
    vma, locking the anon_vma for each entry.

    But they are all going to have the same root entry, which means that
    we're locking and unlocking the same lock over and over again. Which is
    expensive in locked operations, but can get _really_ expensive when that
    root entry sees any kind of lock contention.

    In fact, Tim Chen reports a big performance regression due to this: when
    we switched to use a mutex instead of a spinlock, the contention case
    gets much worse.

    So to alleviate this all, this commit creates a small helper function
    (lock_anon_vma_root()) that can be used to take the lock just once
    rather than taking and releasing it over and over again.

    We still have the same "take the lock and release" it behavior in the
    exit path (in unlink_anon_vmas()), but that one is a bit harder to fix
    since we're actually freeing the anon_vma entries as we go, and that
    will touch the lock too.
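
    A userspace sketch of the batching, with invented types and a POSIX mutex
    in place of the anon_vma lock: since every entry in the chain shares one
    root lock, it is taken once around the whole traversal instead of once
    per entry.

    #include <pthread.h>
    #include <stdio.h>

    struct root  { pthread_mutex_t lock; };
    struct entry { struct root *root; int payload; };

    static void unlink_chain(struct entry *chain, int n)
    {
        struct root *root = n ? chain[0].root : NULL;

        if (root)
            pthread_mutex_lock(&root->lock);    /* once, not n times */
        for (int i = 0; i < n; i++)
            chain[i].payload = 0;               /* "unlink" each entry */
        if (root)
            pthread_mutex_unlock(&root->lock);
    }

    int main(void)
    {
        struct root r;
        struct entry chain[3] = { { &r, 1 }, { &r, 2 }, { &r, 3 } };

        pthread_mutex_init(&r.lock, NULL);
        unlink_chain(chain, 3);
        printf("payloads now %d %d %d\n",
               chain[0].payload, chain[1].payload, chain[2].payload);
        return 0;
    }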

    Reported-and-tested-by: Tim Chen
    Tested-by: Hugh Dickins
    Cc: Peter Zijlstra
    Cc: Andi Kleen
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

08 Jun, 2011

1 commit

  • Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
    as it's currently the most contended lock in the system for metadata
    heavy workloads. It won't help for single-filesystem workloads for
    which we'll need the I/O-less balance_dirty_pages, but at least we
    can dedicate a cpu to spinning on each bdi now for larger systems.

    Based on earlier patches from Nick Piggin and Dave Chinner.

    It reduces lock contentions to 1/4 in this test case:
    10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram

    lock_stat version 0.3
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    class name con-bounces contentions waittime-min waittime-max waittime-total acq-bounces acquisitions holdtime-min holdtime-max holdtime-total
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    vanilla 2.6.39-rc3:
    inode_wb_list_lock: 42590 44433 0.12 147.74 144127.35 252274 886792 0.08 121.34 917211.23
    ------------------
    inode_wb_list_lock 2 [] bdev_inode_switch_bdi+0x29/0x85
    inode_wb_list_lock 34 [] inode_wb_list_del+0x22/0x49
    inode_wb_list_lock 12893 [] __mark_inode_dirty+0x170/0x1d0
    inode_wb_list_lock 10702 [] writeback_single_inode+0x16d/0x20a
    ------------------
    inode_wb_list_lock 2 [] bdev_inode_switch_bdi+0x29/0x85
    inode_wb_list_lock 19 [] inode_wb_list_del+0x22/0x49
    inode_wb_list_lock 5550 [] __mark_inode_dirty+0x170/0x1d0
    inode_wb_list_lock 8511 [] writeback_sb_inodes+0x10f/0x157

    2.6.39-rc3 + patch:
    &(&wb->list_lock)->rlock: 11383 11657 0.14 151.69 40429.51 90825 527918 0.11 145.90 556843.37
    ------------------------
    &(&wb->list_lock)->rlock 10 [] inode_wb_list_del+0x5f/0x86
    &(&wb->list_lock)->rlock 1493 [] writeback_inodes_wb+0x3d/0x150
    &(&wb->list_lock)->rlock 3652 [] writeback_sb_inodes+0x123/0x16f
    &(&wb->list_lock)->rlock 1412 [] writeback_single_inode+0x17f/0x223
    ------------------------
    &(&wb->list_lock)->rlock 3 [] bdi_lock_two+0x46/0x4b
    &(&wb->list_lock)->rlock 6 [] inode_wb_list_del+0x5f/0x86
    &(&wb->list_lock)->rlock 2061 [] __mark_inode_dirty+0x173/0x1cf
    &(&wb->list_lock)->rlock 2629 [] writeback_sb_inodes+0x123/0x16f

    [hughd@google.com: fix recursive lock when bdi_lock_two() is called with new the same as old]
    [akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment]

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Wu Fengguang

    Christoph Hellwig
     

30 May, 2011

1 commit


29 May, 2011

1 commit

  • On one machine I've been getting hangs, a page fault's anon_vma_prepare()
    waiting in anon_vma_lock(), other processes waiting for that page's lock.

    This is a replay of last year's f18194275c39 "mm: fix hang on
    anon_vma->root->lock".

    The new page_lock_anon_vma() places too much faith in its refcount: when
    it has acquired the mutex_trylock(), it's possible that a racing task in
    anon_vma_alloc() has just reallocated the struct anon_vma, set refcount
    to 1, and is about to reset its anon_vma->root.

    Fix this by saving anon_vma->root, and relying on the usual page_mapped()
    check instead of a refcount check: if page is still mapped, the anon_vma
    is still ours; if page is not still mapped, we're no longer interested.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins