24 Feb, 2013

8 commits

  • The comment in commit 4fc3f1d66b1e ("mm/rmap, migration: Make
    rmap_walk_anon() and try_to_unmap_anon() more scalable") says:

    | Rename anon_vma_[un]lock() => anon_vma_[un]lock_write(),
    | to make it clearer that it's an exclusive write-lock in
    | that case - suggested by Rik van Riel.

    But that commit renames only anon_vma_lock().

    Signed-off-by: Konstantin Khlebnikov
    Cc: Ingo Molnar
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • swap_lock is heavily contended when I test swapping to 3 fast SSDs (it
    is even slightly slower than swapping to 2 such SSDs). The main
    contention comes from swap_info_get(). This patch tries to close the
    gap by adding a new per-partition lock.

    Global data like nr_swapfiles, total_swap_pages, least_priority and
    swap_list are still protected by swap_lock.

    nr_swap_pages is atomic now, so it can be changed without swap_lock. In
    theory, it's possible that get_swap_page() finds no swap pages even
    though free swap pages actually exist, but that doesn't sound like a
    big problem.

    Access to partition-specific data (such as scan_swap_map() and so on)
    is protected only by swap_info_struct.lock.

    Changing swap_info_struct.flags requires holding both swap_lock and
    swap_info_struct.lock, because scan_swap_map() will check it. Reading
    the flags is OK with either of the locks held.

    If both swap_lock and swap_info_struct.lock must be held, we always
    take the former first to avoid deadlock (see the lock-ordering sketch
    after this entry).

    swap_entry_free() can change swap_list. To delete that code, we add a
    new highest_priority_index. Whenever get_swap_page() is called, we
    check it. If it's valid, we use it.

    It's a pity that get_swap_page() still holds swap_lock. But in
    practice, swap_lock isn't heavily contended in my tests with this patch
    (or rather, there are other much heavier bottlenecks, such as TLB
    flush). Incidentally, it looks like get_swap_page() doesn't really need
    the lock: we never free swap_info[] and we check the SWAP_WRITEOK flag.
    The only risk without the lock is that we could swap out to some
    low-priority swap device, but we can quickly recover after several
    rounds of swapping, so it doesn't sound like a big deal to me. But I'd
    prefer to fix this if it's a real problem.

    "swap: make each swap partition have one address_space" improved the
    swapout speed from 1.7G/s to 2G/s. This patch further improves the
    speed to 2.3G/s, around a 15% improvement. It's a multi-process test,
    so TLB flush isn't the biggest bottleneck before the patches.

    [arnd@arndb.de: fix it for nommu]
    [hughd@google.com: add missing unlock]
    [minchan@kernel.org: get rid of lockdep whinge on sys_swapon]
    Signed-off-by: Shaohua Li
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Greg Kroah-Hartman
    Cc: Seth Jennings
    Cc: Konrad Rzeszutek Wilk
    Cc: Xiao Guangrong
    Cc: Dan Magenheimer
    Cc: Stephen Rothwell
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Hugh Dickins
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
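
    A minimal sketch of the lock ordering described in the entry above,
    assuming the per-partition lock lives in swap_info_struct.lock as the
    entry says; the helper name is hypothetical and this is not the actual
    patch code:

    /*
     * When both locks are needed (e.g. to change si->flags, which
     * scan_swap_map() checks), the global swap_lock is always taken
     * before the per-partition lock to avoid ABBA deadlocks. Readers
     * of the flags may hold either lock.
     */
    static void swap_mark_writeok(struct swap_info_struct *si)
    {
            spin_lock(&swap_lock);          /* global: swap_list, priorities, ... */
            spin_lock(&si->lock);           /* partition: scan_swap_map() state */
            si->flags |= SWP_WRITEOK;
            spin_unlock(&si->lock);
            spin_unlock(&swap_lock);
    }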
     
  • do_mmap_pgoff() rounds up the desired size to the next PAGE_SIZE
    multiple; however, there was no equivalent code in mm_populate(), which
    caused issues.

    This could be fixed by introducing the same rounding in mm_populate();
    however, I think it's preferable to make do_mmap_pgoff() return
    populate as a size rather than as a boolean, so we don't have to
    duplicate the size rounding logic in mm_populate().

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Tested-by: Andy Lutomirski
    Cc: Greg Ungerer
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • The vm_populate() code populates user mappings without constantly
    holding the mmap_sem. This makes it susceptible to racy userspace
    programs: the user mappings may change while vm_populate() is running,
    and in this case vm_populate() may end up populating the new mapping
    instead of the old one.

    In order to reduce the possibility of userspace getting surprised by
    this behavior, this change introduces the VM_POPULATE vma flag which
    gets set on vmas we want vm_populate() to work on. This way
    vm_populate() may still end up populating the new mapping after such a
    race, but only if the new mapping is also one that the user has
    requested (using MAP_SHARED, MAP_LOCKED or mlock) to be populated.

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Tested-by: Andy Lutomirski
    Cc: Greg Ungerer
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • In find_extend_vma(), we don't need mlock_vma_pages_range() to verify
    the vma type - we know we're working with a stack. So, we can call
    directly into __mlock_vma_pages_range(), and remove the last
    make_pages_present() call site.

    Note that we don't use mm_populate() here, so we can't release the
    mmap_sem while allocating new stack pages. This is deemed acceptable,
    because the stack vmas grow by a bounded number of pages at a time, and
    these are anon pages so we don't have to read from disk to populate
    them.

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Tested-by: Andy Lutomirski
    Cc: Greg Ungerer
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • After the MAP_POPULATE handling has been moved to mmap_region() call
    sites, the only remaining use of the flags argument is to pass the
    MAP_NORESERVE flag. This can be just as easily handled by
    do_mmap_pgoff(), so do that and remove the mmap_region() flags
    parameter.

    [akpm@linux-foundation.org: remove double parens]
    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Tested-by: Andy Lutomirski
    Cc: Greg Ungerer
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Tested-by: Andy Lutomirski
    Cc: Greg Ungerer
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • When creating new mappings using the MAP_POPULATE / MAP_LOCKED flags (or
    with MCL_FUTURE in effect), we want to populate the pages within the
    newly created vmas. This may take a while as we may have to read pages
    from disk, so ideally we want to do this outside of the write-locked
    mmap_sem region.

    This change introduces mm_populate(), which is used to defer populating
    such mappings until after the mmap_sem write lock has been released (a
    usage sketch follows this entry). This is implemented as a
    generalization of the former do_mlock_pages(), which accomplished the
    same task but was used during mlock() / mlockall().

    Signed-off-by: Michel Lespinasse
    Reported-by: Andy Lutomirski
    Acked-by: Rik van Riel
    Tested-by: Andy Lutomirski
    Cc: Greg Ungerer
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
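
    A minimal sketch of the deferred-population pattern this entry (and the
    earlier "return populate as a size" entry) describes; simplified, and
    not the actual vm_mmap_pgoff() code:

    unsigned long populate = 0;
    unsigned long ret;

    down_write(&mm->mmap_sem);
    /* do_mmap_pgoff() now reports via 'populate' how many bytes
     * (already rounded up to PAGE_SIZE) should be faulted in */
    ret = do_mmap_pgoff(file, addr, len, prot, flags, pgoff, &populate);
    up_write(&mm->mmap_sem);

    /* populate only after the write lock is dropped, since this may
     * have to read pages in from disk */
    if (!IS_ERR_VALUE(ret) && populate)
            mm_populate(ret, populate);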
     

20 Feb, 2013

1 commit

  • Pull scheduler changes from Ingo Molnar:
    "Main changes:

    - scheduler side full-dynticks (user-space execution is undisturbed
    and receives no timer IRQs) preparation changes that convert the
    cputime accounting code to be full-dynticks ready, from Frederic
    Weisbecker.

    - Initial sched.h split-up changes, by Clark Williams

    - select_idle_sibling() performance improvement by Mike Galbraith:

    " 1 tbench pair (worst case) in a 10 core + SMT package:

    pre 15.22 MB/sec 1 procs
    post 252.01 MB/sec 1 procs "

    - sched_rr_get_interval() ABI fix/change. We think this detail is not
    used by apps (so it's not an ABI in practice), but let's keep it
    under observation.

    - misc RT scheduling cleanups, optimizations"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
    sched/rt: Add header to
    cputime: Remove irqsave from seqlock readers
    sched, powerpc: Fix sched.h split-up build failure
    cputime: Restore CPU_ACCOUNTING config defaults for PPC64
    sched/rt: Move rt specific bits into new header file
    sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice
    sched: Move sched.h sysctl bits into separate header
    sched: Fix signedness bug in yield_to()
    sched: Fix select_idle_sibling() bouncing cow syndrome
    sched/rt: Further simplify pick_rt_task()
    sched/rt: Do not account zero delta_exec in update_curr_rt()
    cputime: Safely read cputime of full dynticks CPUs
    kvm: Prepare to add generic guest entry/exit callbacks
    cputime: Use accessors to read task cputime stats
    cputime: Allow dynamic switch between tick/virtual based cputime accounting
    cputime: Generic on-demand virtual cputime accounting
    cputime: Move default nsecs_to_cputime() to jiffies based cputime file
    cputime: Librarize per nsecs resolution cputime definitions
    cputime: Avoid multiplication overflow on utime scaling
    context_tracking: Export context state for generic vtime
    ...

    Fix up conflict in kernel/context_tracking.c due to comment additions.

    Linus Torvalds
     

08 Feb, 2013

1 commit

  • Move the sysctl-related bits from include/linux/sched.h into
    a new file: include/linux/sched/sysctl.h. Then update source
    files requiring access to those bits by including the new
    header file.

    Signed-off-by: Clark Williams
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/20130207094659.06dced96@riff.lan
    Signed-off-by: Ingo Molnar

    Clark Williams
     

05 Feb, 2013

1 commit

  • We use rwsem since commit 5a505085f043 ("mm/rmap: Convert the struct
    anon_vma::mutex to an rwsem"). Most of the comments have been converted
    to the new rwsem lock; just two more were missed, as found by:

    $ git grep 'anon_vma->mutex'

    Signed-off-by: Yuanhan Liu
    Acked-by: Ingo Molnar
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yuanhan Liu
     

12 Jan, 2013

1 commit

  • Commit 5a505085f043 ("mm/rmap: Convert the struct anon_vma::mutex to an
    rwsem") turned anon_vma mutex to rwsem.

    However, the properly annotated nested locking in mm_take_all_locks()
    has been converted from

    mutex_lock_nest_lock(&anon_vma->root->mutex, &mm->mmap_sem);

    to

    down_write(&anon_vma->root->rwsem);

    which is incomplete, and causes the false positive report from lockdep
    below.

    Annotate the fact that mmap_sem is used as an outer lock to serialize
    taking all of the anon_vma rwsems at once, no matter the order, using
    the down_write_nest_lock() primitive (see the sketch after this entry).

    This patch fixes this lockdep report:

    =============================================
    [ INFO: possible recursive locking detected ]
    3.8.0-rc2-00036-g5f73896 #171 Not tainted
    ---------------------------------------------
    qemu-kvm/2315 is trying to acquire lock:
    (&anon_vma->rwsem){+.+...}, at: mm_take_all_locks+0x149/0x1b0

    but task is already holding lock:
    (&anon_vma->rwsem){+.+...}, at: mm_take_all_locks+0x149/0x1b0

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(&anon_vma->rwsem);
    lock(&anon_vma->rwsem);

    *** DEADLOCK ***

    May be due to missing lock nesting notation

    4 locks held by qemu-kvm/2315:
    #0: (&mm->mmap_sem){++++++}, at: do_mmu_notifier_register+0xfc/0x170
    #1: (mm_all_locks_mutex){+.+...}, at: mm_take_all_locks+0x36/0x1b0
    #2: (&mapping->i_mmap_mutex){+.+...}, at: mm_take_all_locks+0xc9/0x1b0
    #3: (&anon_vma->rwsem){+.+...}, at: mm_take_all_locks+0x149/0x1b0

    stack backtrace:
    Pid: 2315, comm: qemu-kvm Not tainted 3.8.0-rc2-00036-g5f73896 #171
    Call Trace:
    print_deadlock_bug+0xf2/0x100
    validate_chain+0x4f6/0x720
    __lock_acquire+0x359/0x580
    lock_acquire+0x121/0x190
    down_write+0x3f/0x70
    mm_take_all_locks+0x149/0x1b0
    do_mmu_notifier_register+0x68/0x170
    mmu_notifier_register+0xe/0x10
    kvm_create_vm+0x22b/0x330 [kvm]
    kvm_dev_ioctl+0xf8/0x1a0 [kvm]
    do_vfs_ioctl+0x9d/0x350
    sys_ioctl+0x91/0xb0
    system_call_fastpath+0x16/0x1b

    Signed-off-by: Jiri Kosina
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Mel Gorman
    Tested-by: Sedat Dilek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Kosina
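
    A minimal sketch of the annotation described in the entry above,
    simplified from the mm_take_all_locks() context; the wrapper name is
    hypothetical:

    static void take_one_anon_vma_lock(struct mm_struct *mm,
                                       struct anon_vma *anon_vma)
    {
            /*
             * Instead of a plain down_write(&anon_vma->root->rwsem),
             * tell lockdep that mmap_sem is the outer lock serializing
             * these acquisitions, so taking many anon_vma rwsems in a
             * row is no longer reported as recursive locking.
             */
            down_write_nest_lock(&anon_vma->root->rwsem, &mm->mmap_sem);
    }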
     

17 Dec, 2012

1 commit

  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for these workloads driven by perf, which is
    bad but is potentially explained by the lack of scheduler smarts.
    Thomas'
    results show balancenuma improves on mainline but falls far short of
    numacore or autonuma. Srikar's results indicate we all suffer on a
    large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers, which is not reported by
    the tool by default and is sometimes missed in reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions, but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore
    batch-handles PTEs, but I no longer think this is the case. It's
    possible
    numacore is just able to trigger it due to higher rates of migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

13 Dec, 2012

2 commits

  • expand_stack() runs with a shared mmap_sem lock. Because of this, there
    could be multiple concurrent stack expansions in the same mm, which may
    cause problems in the vma gap update code.

    I propose to solve this by taking the mm->page_table_lock around such
    vma expansions, in order to avoid the concurrency issue (see the sketch
    after this entry). We only have to worry about concurrent
    expand_stack() calls here, since we hold a shared mmap_sem lock and all
    vma modifications other than expand_stack() are done under an exclusive
    mmap_sem lock.

    I previously tried to achieve the same effect by making sure all growable
    vmas in a given mm would share the same anon_vma, which we already lock
    here. However this turned out to be difficult - all of the schemes I
    tried for refcounting the growable anon_vma and clearing it turned out
    ugly.
    So, I'm now proposing only the minimal fix.

    The overhead of taking the page table lock during stack expansion is
    expected to be small: glibc doesn't use expandable stacks for the threads
    it creates, so having multiple growable stacks is actually uncommon and we
    don't expect the page table lock to get bounced between threads.

    Signed-off-by: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
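
    A minimal sketch of the fix described in the entry above, heavily
    simplified from the upward-growth path; vma_gap_update() is the
    augmented-rbtree maintenance this lock serializes:

    spin_lock(&vma->vm_mm->page_table_lock);
    vma->vm_end = address;                   /* new, page-aligned stack end */
    if (vma->vm_next)
            vma_gap_update(vma->vm_next);    /* the following gap just shrank */
    else
            vma->vm_mm->highest_vm_end = address;
    spin_unlock(&vma->vm_mm->page_table_lock);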
     
  • While reviewing the source code, I found a comment which mentions that
    after f_op->mmap(), the vma's start address can be changed. I didn't
    verify that this is really possible, because there are so many
    f_op->mmap() implementations. But if some mmap() implementation does
    change the vma's start address, it is a potential error situation,
    because we have already prepared the prev vma, rb_link and rb_parent,
    and these are related to the original address.

    So add a WARN_ON_ONCE to detect whether this situation really happens
    (see the sketch after this entry).

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
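
    A minimal sketch of the check this entry adds; the real hunk lives in
    mmap_region(), and the variable name here is illustrative:

    unsigned long requested = vma->vm_start;   /* address prev/rb_link were computed for */
    int error;

    error = file->f_op->mmap(file, vma);
    if (!error)
            /* warn once if ->mmap() moved the vma's start address */
            WARN_ON_ONCE(requested != vma->vm_start);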
     

12 Dec, 2012

5 commits

  • Merge misc updates from Andrew Morton:
    "About half of most of MM. Going very early this time due to
    uncertainty over the coreautounifiednumasched things. I'll send the
    other half of most of MM tomorrow. The rest of MM awaits a slab merge
    from Pekka."

    * emailed patches from Andrew Morton: (71 commits)
    memory_hotplug: ensure every online node has NORMAL memory
    memory_hotplug: handle empty zone when online_movable/online_kernel
    mm, memory-hotplug: dynamic configure movable memory and portion memory
    drivers/base/node.c: cleanup node_state_attr[]
    bootmem: fix wrong call parameter for free_bootmem()
    avr32, kconfig: remove HAVE_ARCH_BOOTMEM
    mm: cma: remove watermark hacks
    mm: cma: skip watermarks check for already isolated blocks in split_free_page()
    mm, oom: fix race when specifying a thread as the oom origin
    mm, oom: change type of oom_score_adj to short
    mm: cleanup register_node()
    mm, mempolicy: remove duplicate code
    mm/vmscan.c: try_to_freeze() returns boolean
    mm: introduce putback_movable_pages()
    virtio_balloon: introduce migration primitives to balloon pages
    mm: introduce compaction and migration for ballooned pages
    mm: introduce a common interface for balloon pages mobility
    mm: redefine address_space.assoc_mapping
    mm: adjust address_space_operations.migratepage() return code
    arch/sparc/kernel/sys_sparc_64.c: s/COLOUR/COLOR/
    ...

    Linus Torvalds
     
  • Implement vm_unmapped_area() using the rb_subtree_gap and
    highest_vm_end information to search for suitable virtual address space
    gaps (a usage sketch follows this entry).

    struct vm_unmapped_area_info is used to define the desired allocation
    request:
    - lowest or highest possible address matching the remaining constraints
    - desired gap length
    - low/high address limits that the gap must fit into
    - alignment mask and offset

    Also update the generic arch_get_unmapped_area[_topdown] functions to make
    use of vm_unmapped_area() instead of implementing a brute force search.

    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Michel Lespinasse
    Reviewed-by: Rik van Riel
    Cc: Hugh Dickins
    Cc: Russell King
    Cc: Ralf Baechle
    Cc: Paul Mundt
    Cc: "David S. Miller"
    Cc: Chris Metcalf
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
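
    A minimal usage sketch for the new helper, assuming a simple bottom-up
    search; the function name is hypothetical, and real implementations of
    arch_get_unmapped_area() also handle MAP_FIXED and address hints:

    static unsigned long
    example_get_unmapped_area(struct file *filp, unsigned long addr,
                              unsigned long len, unsigned long pgoff,
                              unsigned long flags)
    {
            struct vm_unmapped_area_info info;

            info.flags = 0;                      /* bottom-up; VM_UNMAPPED_AREA_TOPDOWN for top-down */
            info.length = len;                   /* desired gap length */
            info.low_limit = TASK_UNMAPPED_BASE; /* gap must lie above this */
            info.high_limit = TASK_SIZE;         /* ... and below this */
            info.align_mask = 0;                 /* no extra alignment */
            info.align_offset = 0;

            return vm_unmapped_area(&info);      /* O(log N) search via rb_subtree_gap */
    }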
     
  • When CONFIG_DEBUG_VM_RB is enabled, check that rb_subtree_gap is correctly
    set for every vma and that mm->highest_vm_end is also correct.

    Also add an explicit 'bug' variable to track if browse_rb() detected any
    invalid condition.

    [akpm@linux-foundation.org: repair innovative coding-style inventions]
    Signed-off-by: Michel Lespinasse
    Reviewed-by: Rik van Riel
    Cc: Hugh Dickins
    Cc: Russell King
    Cc: Ralf Baechle
    Cc: Paul Mundt
    Cc: "David S. Miller"
    Cc: Chris Metcalf
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Define vma->rb_subtree_gap as the largest gap between any vma in the
    subtree rooted at that vma, and their predecessor. Or, for a recursive
    definition (sketched in code after this entry), vma->rb_subtree_gap is
    the max of:

    - vma->vm_start - vma->vm_prev->vm_end
    - the rb_subtree_gap fields of the vmas pointed to by vma->rb.rb_left and
    vma->rb.rb_right

    This will allow get_unmapped_area_* to find a free area of the right
    size in O(log(N)) time, instead of potentially having to do a linear
    walk across all the VMAs.

    Also define mm->highest_vm_end as the vm_end field of the highest vma,
    so that we can easily check if the following gap is suitable.

    This does have the potential to make unmapping VMAs more expensive,
    especially for processes with very large numbers of VMAs, where the VMA
    rbtree can grow quite deep.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Rik van Riel
    Cc: Hugh Dickins
    Cc: Russell King
    Cc: Ralf Baechle
    Cc: Paul Mundt
    Cc: "David S. Miller"
    Cc: Chris Metcalf
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
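
    A minimal sketch of the recursive computation defined above, close in
    spirit to the kernel's helper but simplified:

    static unsigned long compute_subtree_gap(struct vm_area_struct *vma)
    {
            unsigned long max, child;

            /* gap between this vma and its predecessor */
            max = vma->vm_start;
            if (vma->vm_prev)
                    max -= vma->vm_prev->vm_end;

            /* fold in the cached gaps of both rbtree children */
            if (vma->vm_rb.rb_left) {
                    child = rb_entry(vma->vm_rb.rb_left,
                                     struct vm_area_struct, vm_rb)->rb_subtree_gap;
                    if (child > max)
                            max = child;
            }
            if (vma->vm_rb.rb_right) {
                    child = rb_entry(vma->vm_rb.rb_right,
                                     struct vm_area_struct, vm_rb)->rb_subtree_gap;
                    if (child > max)
                            max = child;
            }
            return max;
    }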
     
  • There was some desire in large applications using MAP_HUGETLB or
    SHM_HUGETLB to use 1GB huge pages on some mappings, and stay with 2MB on
    others. This is useful together with NUMA policy: use 2MB interleaving
    on some mappings, but 1GB on local mappings.

    This patch extends the IPC/SHM syscall interfaces slightly to allow
    specifying the page size.

    It borrows some upper bits in the existing flag arguments and allows
    encoding the log of the desired page size in addition to the *_HUGETLB
    flag. When 0 is specified the default size is used, this makes the
    change fully compatible.

    Extending the internal hugetlb code to handle this is straightforward.
    Instead of a single mount it just keeps an array of them and selects the
    right mount based on the specified page size. When no page size is
    specified it uses the mount of the default page size.

    The change is not visible in /proc/mounts because internal mounts don't
    appear there. It also has very little overhead: the additional mounts
    just consume a super block, but not more memory when not used.

    I also exported the new flags to the user headers (they were previously
    under __KERNEL__). Right now, symbols for 1GB and 2MB are defined only
    for x86 and some other architectures. The interface should already
    work for all other architectures though. Only architectures that define
    multiple hugetlb sizes actually need it (that is currently x86, tile,
    powerpc). However tile and powerpc have user configurable hugetlb
    sizes, so it's not easy to add defines. A program on those
    architectures would need to query sysfs and use the appropriate log2
    value (see the example after this entry).

    [akpm@linux-foundation.org: cleanups]
    [rientjes@google.com: fix build]
    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Andi Kleen
    Cc: Michael Kerrisk
    Acked-by: Rik van Riel
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hillf Danton
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
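
    A minimal userspace example of the flag encoding described above,
    assuming the MAP_HUGE_SHIFT definition exported by this series (shmget()
    gets the analogous SHM_HUGE_SHIFT); the wrapper name is hypothetical:

    #include <sys/mman.h>

    #ifndef MAP_HUGE_SHIFT
    #define MAP_HUGE_SHIFT 26       /* upper flag bits carry log2(page size) */
    #endif

    static void *map_hugepages(size_t len, unsigned int page_shift)
    {
            /* page_shift == 0 keeps the default huge page size;
             * 21 selects 2MB pages and 30 selects 1GB pages on x86 */
            return mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
                        (page_shift << MAP_HUGE_SHIFT),
                        -1, 0);
    }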
     

11 Dec, 2012

2 commits

  • rmap_walk_anon() and try_to_unmap_anon() appear to be too careful
    about locking the anon vma: while they need protection against anon vma
    list modifications, they do not need exclusive access to the list
    itself (a sketch of the resulting read/write lock helpers follows this
    entry).

    Transforming this exclusive lock to a read-locked rwsem removes
    a global lock from the hot path of page-migration intense
    threaded workloads which can cause pathological performance like
    this:

    96.43% process 0 [kernel.kallsyms] [k] perf_trace_sched_switch
    |
    --- perf_trace_sched_switch
    __schedule
    schedule
    schedule_preempt_disabled
    __mutex_lock_common.isra.6
    __mutex_lock_slowpath
    mutex_lock
    |
    |--50.61%-- rmap_walk
    | move_to_new_page
    | migrate_pages
    | migrate_misplaced_page
    | __do_numa_page.isra.69
    | handle_pte_fault
    | handle_mm_fault
    | __do_page_fault
    | do_page_fault
    | page_fault
    | __memset_sse2
    | |
    | --100.00%-- worker_thread
    | |
    | --100.00%-- start_thread
    |
    --49.39%-- page_lock_anon_vma
    try_to_unmap_anon
    try_to_unmap
    migrate_pages
    migrate_misplaced_page
    __do_numa_page.isra.69
    handle_pte_fault
    handle_mm_fault
    __do_page_fault
    do_page_fault
    page_fault
    __memset_sse2
    |
    --100.00%-- worker_thread
    start_thread

    With this change applied the profile is now nicely flat
    and there's no anon-vma related scheduling/blocking.

    Rename anon_vma_[un]lock() => anon_vma_[un]lock_write(),
    to make it clearer that it's an exclusive write-lock in
    that case - suggested by Rik van Riel.

    Suggested-by: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Paul Turner
    Cc: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Signed-off-by: Ingo Molnar
    Signed-off-by: Mel Gorman

    Ingo Molnar
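
    A minimal sketch of the lock helpers this entry describes, mirroring
    what the renamed/new helpers in include/linux/rmap.h do after the rwsem
    conversion (see the next entry):

    /* writers (vma manipulation) still take the rwsem exclusively */
    static inline void anon_vma_lock_write(struct anon_vma *anon_vma)
    {
            down_write(&anon_vma->root->rwsem);
    }

    /* rmap walkers such as rmap_walk_anon() and try_to_unmap_anon()
     * only read the chain, so a shared lock is enough */
    static inline void anon_vma_lock_read(struct anon_vma *anon_vma)
    {
            down_read(&anon_vma->root->rwsem);
    }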
     
  • Convert the struct anon_vma::mutex to an rwsem, which will help
    in solving a page-migration scalability problem. (Addressed in
    a separate patch.)

    The conversion is simple and straightforward: in every case
    where we mutex_lock()ed we'll now down_write().

    Suggested-by: Linus Torvalds
    Reviewed-by: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Paul Turner
    Cc: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Signed-off-by: Ingo Molnar
    Signed-off-by: Mel Gorman

    Ingo Molnar
     

17 Nov, 2012

2 commits

  • Greg Kroah-Hartman
     
  • Iterating over the vma->anon_vma_chain without anon_vma_lock may cause
    NULL ptr deref in anon_vma_interval_tree_verify(), because the node in the
    chain might have been removed.

    BUG: unable to handle kernel paging request at fffffffffffffff0
    IP: [] anon_vma_interval_tree_verify+0xc/0xa0
    PGD 4e28067 PUD 4e29067 PMD 0
    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    CPU 0
    Pid: 9050, comm: trinity-child64 Tainted: G W 3.7.0-rc2-next-20121025-sasha-00001-g673f98e-dirty #77
    RIP: 0010: anon_vma_interval_tree_verify+0xc/0xa0
    Process trinity-child64 (pid: 9050, threadinfo ffff880045f80000, task ffff880048eb0000)
    Call Trace:
    validate_mm+0x58/0x1e0
    vma_adjust+0x635/0x6b0
    __split_vma.isra.22+0x161/0x220
    split_vma+0x24/0x30
    sys_madvise+0x5da/0x7b0
    tracesys+0xe1/0xe6
    RIP anon_vma_interval_tree_verify+0xc/0xa0
    CR2: fffffffffffffff0

    Figured out by Bob Liu.

    Reported-by: Sasha Levin
    Cc: Bob Liu
    Signed-off-by: Michel Lespinasse
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

16 Nov, 2012

1 commit

  • It will be useful to be able to access global memory commitment from
    device drivers. On the Hyper-V platform, the host has a policy engine to
    balance the available physical memory amongst all competing virtual
    machines hosted on a given node. This policy engine is driven by a number
    of metrics including the memory commitment reported by the guests. The
    balloon driver for Linux on Hyper-V will use this function to retrieve
    guest memory commitment. This function is also used in Xen self
    ballooning code.

    [akpm@linux-foundation.org: coding-style tweak]
    Signed-off-by: K. Y. Srinivasan
    Acked-by: David Rientjes
    Acked-by: Dan Magenheimer
    Cc: Konrad Rzeszutek Wilk
    Cc: Jeremy Fitzhardinge
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    K. Y. Srinivasan
     

09 Oct, 2012

12 commits

  • During mremap(), the destination VMA is generally placed after the
    original vma in rmap traversal order: in move_vma(), we always have
    new_pgoff >= vma->vm_pgoff, and as a result new_vma->vm_pgoff >=
    vma->vm_pgoff unless vma_merge() merged the new vma with an adjacent one.

    When the destination VMA is placed after the original in rmap traversal
    order, we can avoid taking the rmap locks in move_ptes().

    Essentially, this reintroduces the optimization that had been disabled in
    "mm anon rmap: remove anon_vma_moveto_tail". The difference is that we
    don't try to impose the rmap traversal order; instead we just rely on
    things being in the desired order in the common case and fall back to
    taking locks in the uncommon case. Also we skip the i_mmap_mutex in
    addition to the anon_vma lock: in both cases, the vmas are traversed in
    increasing vm_pgoff order with ties resolved in tree insertion order.

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Daniel Santos
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • anon_vma_clone() expects new_vma->vm_{start,end,pgoff} to be correctly set
    so that the new vma can be indexed on the anon interval tree.

    copy_vma() was failing to do that, which broke mremap().

    Signed-off-by: Michel Lespinasse
    Cc: Jiri Slaby
    Cc: Hugh Dickins
    Tested-by: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Add a CONFIG_DEBUG_VM_RB build option for the previously existing
    DEBUG_MM_RB code. Now that Andi Kleen modified it to avoid using
    recursive algorithms, we can expose it a bit more.

    Also extend this code to validate_mm() after stack expansion, and to check
    that the vma's start and last pgoffs have not changed since the nodes were
    inserted on the anon vma interval tree (as it is important that the nodes
    be reindexed after each such update).

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Daniel Santos
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • When a large VMA (anon or private file mapping) is first touched, which
    will populate its anon_vma field, and then split into many regions through
    the use of mprotect(), the original anon_vma ends up linking all of the
    vmas on a linked list. This can cause rmap to become inefficient, as we
    have to walk potentially thousands of irrelevant vmas before finding the
    one a given anon page might fall into.

    By replacing the same_anon_vma linked list with an interval tree (where
    each avc's interval is determined by its vma's start and last pgoffs), we
    can make rmap efficient for this use case again.

    While the change is large, all of its pieces are fairly simple.

    Most places that were walking the same_anon_vma list were looking for a
    known pgoff, so they can just use the anon_vma_interval_tree_foreach()
    interval tree iterator instead (see the sketch after this entry). The
    exception here is ksm, where the
    page's index is not known. It would probably be possible to rework ksm so
    that the index would be known, but for now I have decided to keep things
    simple and just walk the entirety of the interval tree there.

    When updating vma's that already have an anon_vma assigned, we must take
    care to re-index the corresponding avc's on their interval tree. This is
    done through the use of anon_vma_interval_tree_pre_update_vma() and
    anon_vma_interval_tree_post_update_vma(), which remove the avc's from
    their interval tree before the update and re-insert them after the update.
    The anon_vma stays locked during the update, so there is no chance that
    rmap would miss the vmas that are being updated.

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Daniel Santos
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
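
    A minimal sketch of an rmap walk with the interval tree iterator
    mentioned above, assuming the anon_vma lock is already held; simplified
    from what the rmap code actually does:

    struct anon_vma_chain *avc;
    pgoff_t pgoff = page->index;    /* the page's offset within the mapping */

    /* visits only the avc's whose vma covers pgoff, instead of every
     * vma on the old same_anon_vma list */
    anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
            struct vm_area_struct *vma = avc->vma;

            /* ... unmap / migrate / check the page in this vma ... */
    }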
     
  • mremap() had a clever optimization where move_ptes() did not take the
    anon_vma lock to avoid a race with anon rmap users such as page migration.
    Instead, the avc's were ordered in such a way that the origin vma was
    always visited by rmap before the destination. This ordering and the use
    of page table locks made rmap usage safe. However, we want to replace
    the use
    of linked lists in anon rmap with an interval tree, and this will make it
    harder to impose such ordering as the interval tree will always be sorted
    by the avc->vma->vm_pgoff value. For now, let's replace the
    anon_vma_moveto_tail() ordering function with proper anon_vma locking in
    move_ptes(). Once we have the anon interval tree in place, we will
    re-introduce an optimization to avoid taking these locks in the most
    common cases.

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Daniel Santos
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Implement an interval tree as a replacement for the VMA prio_tree. The
    algorithms are similar to lib/interval_tree.c; however that code can't be
    directly reused as the interval endpoints are not explicitly stored in the
    VMA. So instead, the common algorithm is moved into a template and the
    details (node type, how to get interval endpoints from the node, etc) are
    filled in using the C preprocessor.

    Once the interval tree functions are available, using them as a
    replacement for the VMA prio tree is a relatively simple, mechanical
    job (see the template sketch after this entry).

    Signed-off-by: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Hillf Danton
    Cc: Peter Zijlstra
    Cc: Catalin Marinas
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
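
    A minimal sketch of the template instantiation described above, using an
    illustrative node type rather than the actual VMA instantiation, and
    assuming the interval_tree_generic.h header added by this series:

    #include <linux/interval_tree_generic.h>

    struct demo_node {
            struct rb_node rb;
            unsigned long start, last;      /* closed interval [start, last] */
            unsigned long __subtree_last;   /* cached max 'last' in the subtree */
    };

    static inline unsigned long demo_start(struct demo_node *n) { return n->start; }
    static inline unsigned long demo_last(struct demo_node *n)  { return n->last; }

    /* expands into demo_it_insert(), demo_it_remove(), demo_it_iter_first()
     * and demo_it_iter_next(), all operating on the augmented rbtree */
    INTERVAL_TREE_DEFINE(struct demo_node, rb, unsigned long, __subtree_last,
                         demo_start, demo_last, static, demo_it)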
     
  • Fix an anon_vma locking issue in the following situation:

    - vma has no anon_vma
    - next has an anon_vma
    - vma is being shrunk / next is being expanded, due to an mprotect call

    We need to take next's anon_vma lock to avoid races with rmap users (such
    as page migration) while next is being expanded.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • People get confused by find_vma_prepare(), because it doesn't care about
    what it returns in its output args, when its callers won't be interested.

    Clarify by passing in end-of-range address too, and returning failure if
    any existing vma overlaps the new range: instead of returning an ambiguous
    vma which most callers then must check. find_vma_links() is a clearer
    name.

    This does revert 2.6.27's dfe195fb79e88 ("mm: fix uninitialized variables
    for find_vma_prepare callers"), but it looks like gcc 4.3.0 was one of
    those releases too eager to shout about uninitialized variables: only
    copy_vma() warns with 4.5.1 and 4.7.1, which a BUG on error silences.

    [hughd@google.com: fix warning, remove BUG()]
    Signed-off-by: Hugh Dickins
    Cc: Benny Halevy
    Acked-by: Hillf Danton
    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • A long time ago, in v2.4, VM_RESERVED kept the swapout process off the
    VMA; it has since lost its original meaning but still has some effects:

    | effect | alternative flags
    -+------------------------+---------------------------------------------
    1| account as reserved_vm | VM_IO
    2| skip in core dump | VM_IO, VM_DONTDUMP
    3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
    4| do not mlock | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP

    This patch removes the reserved_vm counter from mm_struct. It seems
    nobody cares about it: it is not exported to userspace directly, it
    only reduces the total_vm shown in /proc.

    Thus VM_RESERVED can be replaced with VM_IO or the pair VM_DONTEXPAND | VM_DONTDUMP.

    remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
    remap_vmalloc_range() sets VM_DONTEXPAND | VM_DONTDUMP.

    [akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Currently the kernel sets mm->exe_file during sys_execve() and then
    tracks the number of vmas with the VM_EXECUTABLE flag in
    mm->num_exe_file_vmas; as soon as this counter drops to zero, the
    kernel resets mm->exe_file to NULL. It also resets mm->exe_file at the
    last mmput(), when mm->mm_users drops to zero.

    A VMA with the VM_EXECUTABLE flag appears after mapping a file with the
    MAP_EXECUTABLE flag; such vmas can appear only at sys_execve() or after
    vma splitting, because sys_mmap ignores this flag. Usually the binfmt
    module sets mm->exe_file and mmaps executable vmas with this file; they
    hold mm->exe_file while the task is running.

    comment from v2.6.25-6245-g925d1c4 ("procfs task exe symlink"),
    where all this stuff was introduced:

    > The kernel implements readlink of /proc/pid/exe by getting the file from
    > the first executable VMA. Then the path to the file is reconstructed and
    > reported as the result.
    >
    > Because of the VMA walk the code is slightly different on nommu systems.
    > This patch avoids separate /proc/pid/exe code on nommu systems. Instead of
    > walking the VMAs to find the first executable file-backed VMA we store a
    > reference to the exec'd file in the mm_struct.
    >
    > That reference would prevent the filesystem holding the executable file
    > from being unmounted even after unmapping the VMAs. So we track the number
    > of VM_EXECUTABLE VMAs and drop the new reference when the last one is
    > unmapped. This avoids pinning the mounted filesystem.

    exe_file's vma accounting is hooked into every file mmap/munmap and vma
    split/merge just to prevent a filesystem from being hypothetically
    pinned against unmounting by an mm which has already unmapped all its
    executable files but is still alive.

    Seems like currently nobody depends on this behaviour. We can try to
    remove this logic and keep mm->exe_file until final mmput().

    mm->exe_file is still protected with mm->mmap_sem, because we want to
    change it via the new sys_prctl(PR_SET_MM_EXE_FILE). Via this syscall a
    task can also change its mm->exe_file and unpin the mountpoint
    explicitly (see the example after this entry).

    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
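
    A minimal userspace sketch of the prctl() mentioned above, assuming the
    PR_SET_MM / PR_SET_MM_EXE_FILE interface; the helper name is
    hypothetical and error handling is minimal:

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/prctl.h>

    /* point mm->exe_file at another file, unpinning the old mount */
    static int set_exe_file(const char *path)
    {
            int fd = open(path, O_RDONLY);
            int ret;

            if (fd < 0)
                    return -1;
            ret = prctl(PR_SET_MM, PR_SET_MM_EXE_FILE, fd, 0, 0);
            close(fd);      /* the kernel keeps its own reference on success */
            return ret;
    }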
     
  • Move actual pte filling for non-linear file mappings into the new special
    vma operation: ->remap_pages().

    Filesystems must implement this method to get non-linear mapping
    support; if a filesystem uses filemap_fault(), then
    generic_file_remap_pages() can be used (see the sketch after this
    entry).

    Now device drivers can implement this method and obtain nonlinear vma support.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf #arch/tile
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
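
    A minimal sketch of what this entry asks filesystems (or drivers) to
    provide, assuming a file that already uses filemap_fault(); the ops
    table name is illustrative:

    static const struct vm_operations_struct demo_file_vm_ops = {
            .fault          = filemap_fault,
            /* wiring up ->remap_pages enables non-linear (remap_file_pages)
             * support; the generic helper pairs with filemap_fault() */
            .remap_pages    = generic_file_remap_pages,
    };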
     
  • Merge VM_INSERTPAGE into VM_MIXEDMAP. A VM_MIXEDMAP VMA can mix
    pure-pfn ptes, special ptes and normal ptes.

    Now copy_page_range() always copies a VM_MIXEDMAP VMA on fork, like
    VM_PFNMAP. If a driver populates the whole VMA at mmap() time, it
    probably does not expect page faults.

    This patch removes the special check from vma_wants_writenotify()
    which disables page write tracking for VMAs populated via
    vm_insert_page(). The BDI below the mapped file should not use dirty
    accounting; moreover, do_wp_page() can handle this.

    vm_insert_page() still marks the vma after its first use. Usually it
    is called from the f_op->mmap() handler under the mm->mmap_sem
    write-lock, so it is able to change vma->vm_flags. The caller must set
    VM_MIXEDMAP at mmap time if it wants to call this function from other
    places, for example from a page-fault handler (see the sketch after
    this entry).

    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
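
    A minimal sketch of the rule in the last paragraph above, with a
    hypothetical driver name: set VM_MIXEDMAP in the ->mmap() handler if
    vm_insert_page() will later be called from a fault handler rather than
    from mmap itself:

    static int demo_mmap(struct file *file, struct vm_area_struct *vma)
    {
            /* pages will be inserted lazily via vm_insert_page() from the
             * driver's .fault handler, so mark the vma up front while the
             * mmap_sem write-lock is still held */
            vma->vm_flags |= VM_MIXEDMAP;
            return 0;
    }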
     

27 Sep, 2012

1 commit


24 Aug, 2012

1 commit


22 Aug, 2012

1 commit

  • Occasionally an isolated BUG_ON(mm->nr_ptes) gets reported, indicating
    that not all the page tables allocated could be found and freed when
    exit_mmap() tore down the user address space.

    There's usually nothing we can say about it, beyond that it's probably a
    sign of some bad memory or memory corruption; though it might still
    indicate a bug in vma or page table management (and did recently reveal a
    race in THP, fixed a few months ago).

    But one overdue change we can make is from BUG_ON to WARN_ON.

    It's fairly likely that the system will crash shortly afterwards in some
    other way (for example, the BUG_ON(page_mapped(page)) in
    __delete_from_page_cache(), once an inode mapped into the lost page tables
    gets evicted); but might tell us more before that.

    Change the BUG_ON(page_mapped) to WARN_ON too? Later perhaps: I'm less
    eager, since that one has several times led to fixes.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins