03 Apr, 2009

1 commit

  • Fix a number of issues with the per-MM VMA patch:

    (1) Make mmap_pages_allocated an atomic_long_t, just in case this is used on
    a NOMMU system with more than 2G pages. Makes no difference on a 32-bit
    system.

    (2) Report vma->vm_pgoff * PAGE_SIZE as a 64-bit value, not a 32-bit value,
    lest it overflow.

    (3) Move the allocation of the vm_area_struct slab back to fork.c.

    (4) Use KMEM_CACHE() for both vm_area_struct and vm_region slabs.

    (5) Use BUG_ON() rather than if () BUG().

    (6) Make the default validate_nommu_regions() a static inline rather than a
    #define.

    (7) Make free_page_series()'s objection to pages with a refcount != 1 more
    informative.

    (8) Adjust the __put_nommu_region() banner comment to indicate that the
    semaphore must be held for writing.

    (9) Limit the number of warnings about munmaps of non-mmapped regions.
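
    As an illustration of items (5) and (6), a minimal sketch; the condition
    `cond` and the empty validator body are placeholders, not the actual
    nommu.c code:

    /* (5) assert directly instead of open-coding the branch */
    BUG_ON(cond);                   /* rather than: if (cond) BUG(); */

    /* (6) when region validation is compiled out, an empty static inline
     * still gets type-checked by the compiler, unlike a bare #define */
    static inline void validate_nommu_regions(void)
    {
    }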

    Reported-by: Andrew Morton
    Signed-off-by: David Howells
    Cc: Greg Ungerer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     

01 Apr, 2009

2 commits

  • Change the page_mkwrite prototype to take a struct vm_fault, and return
    VM_FAULT_xxx flags. There should be no functional change.

    This makes it possible to return much more detailed error information to
    the VM (and also can provide more information eg. virtual_address to the
    driver, which might be important in some special cases).

    This is required for a subsequent fix. And will also make it easier to
    merge page_mkwrite() with fault() in future.
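
    A sketch of the new hook shape, using a hypothetical filesystem handler
    example_page_mkwrite(); the body is illustrative only:

    static int example_page_mkwrite(struct vm_area_struct *vma,
                                    struct vm_fault *vmf)
    {
            struct page *page = vmf->page;  /* page about to become writable */

            lock_page(page);
            /* ... allocate blocks, extend the file, etc. ... */
            unlock_page(page);
            return 0;                       /* or VM_FAULT_SIGBUS / VM_FAULT_OOM */
    }

    static const struct vm_operations_struct example_vm_ops = {
            .fault          = filemap_fault,
            .page_mkwrite   = example_page_mkwrite,
    };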

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Trond Myklebust
    Cc: Miklos Szeredi
    Cc: Steven Whitehouse
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Artem Bityutskiy
    Cc: Felix Blyakher
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Add a helper function account_page_dirtied(). Use that from two
    callsites. reiser4 adds a function which adds a third callsite.
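
    A rough sketch of how a set_page_dirty-style path might use the helper;
    the surrounding function is hypothetical (see mm/page-writeback.c for the
    real callers):

    static int example_set_page_dirty(struct page *page)
    {
            struct address_space *mapping = page_mapping(page);

            if (!TestSetPageDirty(page)) {
                    account_page_dirtied(page, mapping); /* centralized accounting */
                    /* ... tag the page dirty in the mapping's radix tree ... */
                    return 1;
            }
            return 0;
    }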

    Signed-off-by: Edward Shishkin
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Edward Shishkin
     

14 Mar, 2009

1 commit


13 Mar, 2009

1 commit

    Impact: fix false positive PAT warnings - also fix VirtualBox hang

    Use of vma->vm_pgoff to identify the pfnmaps that are fully
    mapped at mmap time is broken. vm_pgoff is set by generic mmap
    code even for cases where drivers are setting up the mappings
    at the fault time.

    The problem was originally reported here:

    http://marc.info/?l=linux-kernel&m=123383810628583&w=2

    Change is_linear_pfn_mapping logic to overload VM_INSERTPAGE
    flag along with VM_PFNMAP to mean full PFNMAP setup at mmap
    time.
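
    A sketch of the idea only; the helper below illustrates the overload
    described above, and the exact flag names and combination used in the
    tree may differ:

    /* pure pfnmap vmas never use vm_insert_page(), so VM_INSERTPAGE is free
     * to mean "this PFNMAP was fully set up at mmap time" */
    static inline int is_linear_pfn_mapping(struct vm_area_struct *vma)
    {
            return (vma->vm_flags & VM_PFNMAP) &&
                   (vma->vm_flags & VM_INSERTPAGE);
    }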

    Problem also tracked at:

    http://bugzilla.kernel.org/show_bug.cgi?id=12800

    Reported-by: Thomas Hellstrom
    Tested-by: Frans Pop
    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Suresh Siddha
    Cc: Nick Piggin
    Cc: "ebiederm@xmission.com"
    Cc: # only for 2.6.29.1, not .28
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Pallipadi, Venkatesh
     

19 Feb, 2009

2 commits

  • What's happening is that the assertion in mm/page_alloc.c:move_freepages()
    is triggering:

    BUG_ON(page_zone(start_page) != page_zone(end_page));

    Once I knew this is what was happening, I added some annotations:

    if (unlikely(page_zone(start_page) != page_zone(end_page))) {
            printk(KERN_ERR "move_freepages: Bogus zones: "
                   "start_page[%p] end_page[%p] zone[%p]\n",
                   start_page, end_page, zone);
            printk(KERN_ERR "move_freepages: "
                   "start_zone[%p] end_zone[%p]\n",
                   page_zone(start_page), page_zone(end_page));
            printk(KERN_ERR "move_freepages: "
                   "start_pfn[0x%lx] end_pfn[0x%lx]\n",
                   page_to_pfn(start_page), page_to_pfn(end_page));
            printk(KERN_ERR "move_freepages: "
                   "start_nid[%d] end_nid[%d]\n",
                   page_to_nid(start_page), page_to_nid(end_page));
    ...

    And here's what I got:

    move_freepages: Bogus zones: start_page[2207d0000] end_page[2207dffc0] zone[fffff8103effcb00]
    move_freepages: start_zone[fffff8103effcb00] end_zone[fffff8003fffeb00]
    move_freepages: start_pfn[0x81f600] end_pfn[0x81f7ff]
    move_freepages: start_nid[1] end_nid[0]

    My memory layout on this box is:

    [ 0.000000] Zone PFN ranges:
    [ 0.000000] Normal 0x00000000 -> 0x0081ff5d
    [ 0.000000] Movable zone start PFN for each node
    [ 0.000000] early_node_map[8] active PFN ranges
    [ 0.000000] 0: 0x00000000 -> 0x00020000
    [ 0.000000] 1: 0x00800000 -> 0x0081f7ff
    [ 0.000000] 1: 0x0081f800 -> 0x0081fe50
    [ 0.000000] 1: 0x0081fed1 -> 0x0081fed8
    [ 0.000000] 1: 0x0081feda -> 0x0081fedb
    [ 0.000000] 1: 0x0081fedd -> 0x0081fee5
    [ 0.000000] 1: 0x0081fee7 -> 0x0081ff51
    [ 0.000000] 1: 0x0081ff59 -> 0x0081ff5d

    So it's a block move in that 0x81f600-->0x81f7ff region which triggers
    the problem.

    This patch:

    Declaration of early_pfn_to_nid() is scattered over per-arch include
    files, and it seems it's complicated to know when the declaration is used.
    I think it makes fix-for-memmap-init not easy.

    This patch moves all declaration to include/linux/mm.h

    After this,
    if !CONFIG_NODES_POPULATES_NODE_MAP && !CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
    -> Use static definition in include/linux/mm.h
    else if !CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
    -> Use generic definition in mm/page_alloc.c
    else
    -> per-arch back end function will be called.
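
    Roughly, the consolidated declarations in include/linux/mm.h end up
    looking like the sketch below (using CONFIG_ARCH_POPULATES_NODE_MAP,
    which the option named in the text above appears to refer to; treat the
    exact conditions as approximate):

    #if defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) || \
        defined(CONFIG_ARCH_POPULATES_NODE_MAP)
    /* per-arch back end, or the generic version in mm/page_alloc.c */
    extern int early_pfn_to_nid(unsigned long pfn);
    #else
    /* no node map at all: everything is node 0 */
    static inline int early_pfn_to_nid(unsigned long pfn)
    {
            return 0;
    }
    #endif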

    Signed-off-by: KAMEZAWA Hiroyuki
    Tested-by: KOSAKI Motohiro
    Reported-by: David Miller
    Cc: Mel Gorman
    Cc: Heiko Carstens
    Cc: [2.6.25.x, 2.6.26.x, 2.6.27.x, 2.6.28.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • YAMAMOTO-san noticed that task_dirty_inc doesn't seem to be called properly for
    cases where set_page_dirty is not used to dirty a page (eg. mark_buffer_dirty).

    Additionally, there is some inconsistency about when task_dirty_inc is
    called. It is used for dirty balancing, however it even gets called for
    __set_page_dirty_no_writeback.

    So rather than increment it in a set_page_dirty wrapper, move it down to
    exactly where the dirty page accounting stats are incremented.
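
    In other words, the bump now sits next to the dirty statistics rather
    than in the set_page_dirty() wrapper; roughly (helper and statistic names
    as they appear around that code, treated as a sketch):

    if (mapping_cap_account_dirty(mapping)) {
            __inc_zone_page_state(page, NR_FILE_DIRTY);
            __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
            task_dirty_inc(current);    /* dirty-throttling accounting moved here */
            task_io_account_write(PAGE_CACHE_SIZE);
    }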

    Cc: YAMAMOTO Takashi
    Signed-off-by: Nick Piggin
    Acked-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

18 Feb, 2009

1 commit

  • …git/tip/linux-2.6-tip

    * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86, vm86: fix preemption bug
    x86, olpc: fix model detection without OFW
    x86, hpet: fix for LS21 + HPET = boot hang
    x86: CPA avoid repeated lazy mmu flush
    x86: warn if arch_flush_lazy_mmu_cpu is called in preemptible context
    x86/paravirt: make arch_flush_lazy_mmu/cpu disable preemption
    x86, pat: fix warn_on_once() while mapping 0-1MB range with /dev/mem
    x86/cpa: make sure cpa is safe to call in lazy mmu mode
    x86, ptrace, mm: fix double-free on race

    Linus Torvalds
     

11 Feb, 2009

2 commits

  • Ptrace_detach() races with __ptrace_unlink() if the traced task is
    reaped while detaching. This might cause a double-free of the BTS
    buffer.

    Change the ptrace_detach() path to only do the memory accounting in
    ptrace_bts_detach() and leave the buffer free to ptrace_bts_untrace()
    which will be called from __ptrace_unlink().

    The fix follows a proposal from Oleg Nesterov.

    Reported-by: Oleg Nesterov
    Signed-off-by: Markus Metzger
    Signed-off-by: Ingo Molnar

    Markus Metzger
     
  • When overcommit is disabled, the core VM accounts for pages used by anonymous
    shared, private mappings and special mappings. It keeps track of VMAs that
    should be accounted for with VM_ACCOUNT and VMAs that never had a reserve
    with VM_NORESERVE.

    Overcommit for hugetlbfs is much riskier than overcommit for base pages
    due to contiguity requirements. It avoids overcommiting on both shared and
    private mappings using reservation counters that are checked and updated
    during mmap(). This ensures (within limits) that hugepages exist in the
    future when faults occur; otherwise it is too easy for applications to be
    SIGKILLed later.

    As hugetlbfs makes its own reservations of a different unit to the base page
    size, VM_ACCOUNT should never be set. Even if the units were correct, we would
    double account for the usage in the core VM and hugetlbfs. VM_NORESERVE may
    be set because an application can request no reserves be made for hugetlbfs
    at the risk of getting killed later.

    With commit fc8744adc870a8d4366908221508bb113d8b72ee, VM_NORESERVE and
    VM_ACCOUNT are getting unconditionally set for hugetlbfs-backed mappings. This
    breaks the accounting for both the core VM and hugetlbfs: it can trigger
    an OOM storm when hugepage pools are too small, and leads to lockups and
    corrupted counters otherwise. This patch brings hugetlbfs more in line with how the
    core VM treats VM_NORESERVE but prevents VM_ACCOUNT being set.
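
    A sketch of the resulting accounting test in the mmap path, with the
    hugetlbfs carve-out described above (details approximate):

    /* hugetlbfs does its own huge-page-sized accounting, so never set
     * VM_ACCOUNT on its mappings */
    static inline int accountable_mapping(struct file *file,
                                          unsigned long vm_flags)
    {
            if (file && is_file_hugepages(file))
                    return 0;

            return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
    }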

    Signed-off-by: Mel Gorman
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

14 Jan, 2009

1 commit

  • This assertion is incorrect for lockless pagecache. By definition if we
    have an unpinned page that we are trying to take a speculative reference
    to, it may become the tail of a compound page at any time (if it is
    freed, then reallocated as a compound page).

    It was still a valid assertion for the vmscan.c LRU isolation case, but
    it doesn't seem incredibly helpful... if somebody wants it, they can
    put it back directly where it applies in the vmscan code.

    Signed-off-by: Nick Piggin
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

08 Jan, 2009

1 commit

  • Make VMAs per mm_struct as for MMU-mode linux. This solves two problems:

    (1) In SYSV SHM where nattch for a segment does not reflect the number of
    shmat's (and forks) done.

    (2) In mmap() where the VMA's vm_mm is set to point to the parent mm by an
    exec'ing process when VM_EXECUTABLE is specified, regardless of the fact
    that a VMA might be shared and already have its vm_mm assigned to another
    process or a dead process.

    A new struct (vm_region) is introduced to track a mapped region and to remember
    the circumstances under which it may be shared and the vm_list_struct structure
    is discarded as it's no longer required.

    This patch makes the following additional changes:

    (1) Regions are now allocated with alloc_pages() rather than kmalloc() and
    with no recourse to __GFP_COMP, so the pages are not composite. Instead,
    each page has a reference on it held by the region. Anything else that is
    interested in such a page will have to get a reference on it to retain it.
    When the pages are released due to unmapping, each page is passed to
    put_page() and will be freed when the page usage count reaches zero.

    (2) Excess pages are trimmed after an allocation as the allocation must be
    made as a power-of-2 quantity of pages.

    (3) VMAs are added to the parent MM's R/B tree and mmap lists. As an MM may
    end up with overlapping VMAs within the tree, the VMA struct address is
    appended to the sort key.

    (4) Non-anonymous VMAs are now added to the backing inode's prio list.

    (5) Holes may be punched in anonymous VMAs with munmap(), releasing parts of
    the backing region. The VMA and region structs will be split if
    necessary.

    (6) sys_shmdt() only releases one attachment to a SYSV IPC shared memory
    segment instead of all the attachments at that address. Multiple
    shmat()'s return the same address under NOMMU-mode instead of different
    virtual addresses as under MMU-mode.

    (7) Core dumping for ELF-FDPIC requires fewer exceptions for NOMMU-mode.

    (8) /proc/maps is now the global list of mapped regions, and may list bits
    that aren't actually mapped anywhere.

    (9) /proc/meminfo gains a line (tagged "MmapCopy") that indicates the amount
    of RAM currently allocated by mmap to hold mappable regions that can't be
    mapped directly. These are copies of the backing device or file if not
    anonymous.

    These changes make NOMMU mode more similar to MMU mode. The downside is that
    NOMMU mode now needs somewhat more memory for tracking than it did without
    this patch (VMAs are no longer shared, and there are now region structs).
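
    For reference, the new region tracking structure looks roughly like this
    (field list approximate):

    struct vm_region {
            struct rb_node  vm_rb;          /* link in global region tree */
            unsigned long   vm_flags;       /* VMA vm_flags */
            unsigned long   vm_start;       /* start address of region */
            unsigned long   vm_end;         /* region initialised to here */
            unsigned long   vm_top;         /* region allocated to here */
            unsigned long   vm_pgoff;       /* offset into vm_file at vm_start */
            struct file     *vm_file;       /* the backing file or NULL */
            int             vm_usage;       /* region usage count */
    };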

    Signed-off-by: David Howells
    Tested-by: Mike Frysinger
    Acked-by: Paul Mundt

    David Howells
     

07 Jan, 2009

1 commit

  • Rather than have the pagefault handler kill a process directly if it gets
    a VM_FAULT_OOM, have it call into the OOM killer.

    With increasingly sophisticated oom behaviour (cpusets, memory cgroups,
    oom killing throttling, oom priority adjustment or selective disabling,
    panic on oom, etc), it's silly to unconditionally kill the faulting
    process at page fault time. Create a hook for pagefault oom path to call
    into instead.

    Only converted x86 and uml so far.
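
    On the arch side the conversion is small; a sketch of an x86-style fault
    path after the change (error handling elided):

    fault = handle_mm_fault(mm, vma, address, write);
    if (unlikely(fault & VM_FAULT_OOM)) {
            up_read(&mm->mmap_sem);
            pagefault_out_of_memory();  /* let the OOM killer decide whom to kill */
            return;
    }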

    [akpm@linux-foundation.org: make __out_of_memory() static]
    [akpm@linux-foundation.org: fix comment]
    Signed-off-by: Nick Piggin
    Cc: Jeff Dike
    Acked-by: Ingo Molnar
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

29 Dec, 2008

1 commit

  • …el/git/tip/linux-2.6-tip

    * 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (241 commits)
    sched, trace: update trace_sched_wakeup()
    tracing/ftrace: don't trace on early stage of a secondary cpu boot, v3
    Revert "x86: disable X86_PTRACE_BTS"
    ring-buffer: prevent false positive warning
    ring-buffer: fix dangling commit race
    ftrace: enable format arguments checking
    x86, bts: memory accounting
    x86, bts: add fork and exit handling
    ftrace: introduce tracing_reset_online_cpus() helper
    tracing: fix warnings in kernel/trace/trace_sched_switch.c
    tracing: fix warning in kernel/trace/trace.c
    tracing/ring-buffer: remove unused ring_buffer size
    trace: fix task state printout
    ftrace: add not to regex on filtering functions
    trace: better use of stack_trace_enabled for boot up code
    trace: add a way to enable or disable the stack tracer
    x86: entry_64 - introduce FTRACE_ frame macro v2
    tracing/ftrace: add the printk-msg-only option
    tracing/ftrace: use preempt_enable_no_resched_notrace in ring_buffer_time_stamp()
    x86, bts: correctly report invalid bts records
    ...

    Fixed up trivial conflict in scripts/recordmcount.pl due to SH bits
    being already partly merged by the SH merge.

    Linus Torvalds
     

20 Dec, 2008

5 commits

  • Impact: move the BTS buffer accounting to the mlock bucket

    Add alloc_locked_buffer() and free_locked_buffer() functions to mm/mlock.c
    to allocate a buffer and account the locked memory to current.

    Account the memory for the BTS buffer to the tracer.
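
    The new mm/mlock.c entry points, plus a hypothetical caller
    (example_setup_bts() is illustrative only; signatures are a sketch):

    void *alloc_locked_buffer(size_t size);
    void free_locked_buffer(void *buffer, size_t size);

    static int example_setup_bts(size_t size)
    {
            void *buffer = alloc_locked_buffer(size); /* charged against current's
                                                       * locked-memory limit */
            if (!buffer)
                    return -ENOMEM;
            /* ... point the BTS hardware at the buffer ... */
            return 0;
    }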

    Signed-off-by: Markus Metzger
    Signed-off-by: Ingo Molnar

    Markus Metzger
     
  • Impact: Cleanup and branch hints only.

    Move the track and untrack pfn stub routines from memory.c to asm-generic.
    Also add unlikely to pfnmap related calls in fork and exit path.

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Suresh Siddha
    Signed-off-by: H. Peter Anvin

    venkatesh.pallipadi@intel.com
     
  • Impact: Cleanup - removes a new function in favor of a recently modified older one.

    Replace follow_pfnmap_pte in pat code with follow_phys. follow_phys also
    returns the protection, eliminating the need for a pte_pgprot call. Using
    follow_phys also eliminates the need for pte_pa.

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Suresh Siddha
    Signed-off-by: H. Peter Anvin

    venkatesh.pallipadi@intel.com
     
  • Impact: Changes and globalizes an existing static interface.

    Follow_phys does similar things as follow_pfnmap_pte. Make a minor change
    to follow_phys so that it can be used in place of follow_pfnmap_pte.
    Returning the physical address with 0 as the error value does not work in
    follow_phys, as a mapping of physical address 0 may actually exist in the pte.

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Suresh Siddha
    Signed-off-by: H. Peter Anvin

    venkatesh.pallipadi@intel.com
     
  • Impact: Documentation only

    Incremental patches to address the review comments from Nick Piggin
    for v3 version of x86 PAT pfnmap changes patchset here

    http://lkml.indiana.edu/hypermail/linux/kernel/0812.2/01330.html

    This patch:

    Clarify is_linear_pfn_mapping() and its usage.

    It is used by x86 PAT code for performance reasons. Identifying pfnmap
    as linear over entire vma helps speedup reserve and free of memtype
    for the region.

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Suresh Siddha
    Signed-off-by: H. Peter Anvin

    venkatesh.pallipadi@intel.com
     

19 Dec, 2008

3 commits

  • Impact: Introduces new hooks, which are currently null.

    Introduce generic hooks in remap_pfn_range and vm_insert_pfn and
    corresponding copy and free routines with reserve and free tracking.
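
    The hooks introduced (default implementations are no-ops; treat the exact
    signatures as approximate):

    /* called when a new pfn range is mapped via remap_pfn_range()/vm_insert_pfn() */
    int track_pfn_vma_new(struct vm_area_struct *vma, pgprot_t *prot,
                          unsigned long pfn, unsigned long size);

    /* called when a pfnmap vma is duplicated at fork time */
    int track_pfn_vma_copy(struct vm_area_struct *vma);

    /* called when the mapping is torn down */
    void untrack_pfn_vma(struct vm_area_struct *vma, unsigned long pfn,
                         unsigned long size);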

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Suresh Siddha
    Signed-off-by: H. Peter Anvin

    venkatesh.pallipadi@intel.com
     
  • Impact: New currently unused interface.

    Add a generic interface to follow pfn in a pfnmap vma range. This is used by
    one of the subsequent x86 PAT related patch to keep track of memory types
    for vma regions across vma copy and free.

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Suresh Siddha
    Signed-off-by: H. Peter Anvin

    venkatesh.pallipadi@intel.com
     
  • Impact: Code transformation, new functions added should have no effect.

    Drivers use mmap followed by pgprot_* and remap_pfn_range or vm_insert_pfn,
    in order to export reserved memory to userspace. Currently, such mappings are
    not tracked and hence not kept consistent with other mappings (/dev/mem,
    pci resource, ioremap) of the same memory that may exist in the system.

    The following patchset adds x86 PAT attribute tracking and untracking for
    pfnmap related APIs.

    First three patches in the patchset are changing the generic mm code to fit
    in this tracking. Last four patches are x86 specific to make things work
    with x86 PAT code. The patchset also introduces the pgprot_writecombine interface,
    which gives writecombine mapping when enabled, falling back to
    pgprot_noncached otherwise.

    This patch:

    While working on x86 PAT, we faced some hurdles with tracking
    remap_pfn_range() regions, as we do not have any information to say
    whether that PFNMAP mapping is linear for the entire vma range or
    whether it consists of smaller-granularity regions within the vma.

    A simple solution to this is to use vm_pgoff as an indicator for
    linear mapping over the vma region. Currently, remap_pfn_range
    only sets vm_pgoff for COW mappings. The patch below changes the
    logic and sets vm_pgoff irrespective of COW. This will still not
    be enough for the case where pfn is zero (vma region mapped to
    physical address zero). But, for all the other cases, we can look at
    pfnmap VMAs and say whether the mapping is for the entire vma region
    or not.
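
    So the test for a fully linear pfnmap becomes roughly:

    /* non-zero vm_pgoff now implies the whole vma was pfn-mapped at mmap
     * time (with the pfn-zero caveat noted above) */
    static inline int is_linear_pfn_mapping(struct vm_area_struct *vma)
    {
            return (vma->vm_flags & VM_PFNMAP) && vma->vm_pgoff;
    }

    Note that this is exactly the test the 13 Mar, 2009 entry above later
    replaces, because vm_pgoff turned out to be set by the generic mmap code
    in other cases as well.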

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Suresh Siddha
    Signed-off-by: H. Peter Anvin

    venkatesh.pallipadi@intel.com
     

20 Oct, 2008

2 commits

  • Make sure that mlocked pages also live on the unevictable LRU, so kswapd
    will not scan them over and over again.

    This is achieved through various strategies:

    1) add yet another page flag--PG_mlocked--to indicate that
    the page is locked for efficient testing in vmscan and,
    optionally, fault path. This allows early culling of
    unevictable pages, preventing them from getting to
    page_referenced()/try_to_unmap(). Also allows separate
    accounting of mlock'd pages, as Nick's original patch
    did.

    Note: Nick's original mlock patch used a PG_mlocked
    flag. I had removed this in favor of the PG_unevictable
    flag + an mlock_count [new page struct member]. I
    restored the PG_mlocked flag to eliminate the new
    count field.

    2) add the mlock/unevictable infrastructure to mm/mlock.c,
    with internal APIs in mm/internal.h. This is a rework
    of Nick's original patch to these files, taking into
    account that mlocked pages are now kept on unevictable
    LRU list.

    3) update vmscan.c:page_evictable() to check PageMlocked()
    and, if a vma is passed in, its vm_flags (see the sketch
    after this list). Note that the vma
    will only be passed in for new pages in the fault path;
    and then only if the "cull unevictable pages in fault
    path" patch is included.

    4) add try_to_unlock() to rmap.c to walk a page's rmap and
    ClearPageMlocked() if no other vmas have it mlocked.
    Reuses as much of try_to_unmap() as possible. This
    effectively replaces the use of one of the lru list links
    as an mlock count. If this mechanism lets pages in mlocked
    vmas leak through w/o PG_mlocked set [I don't know that it
    does], we should catch them later in try_to_unmap(). One
    hopes this will be rare, as it will be relatively expensive.
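
    The evictability test sketched, per point (3); close to the vmscan.c
    code, but treat the details as approximate:

    int page_evictable(struct page *page, struct vm_area_struct *vma)
    {
            if (mapping_unevictable(page_mapping(page)))
                    return 0;               /* e.g. SHM_LOCKed shmem */
            if (PageMlocked(page) || (vma && is_mlocked_vma(vma, page)))
                    return 0;               /* mlocked: keep off the normal LRUs */
            return 1;
    }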

    Original mm/internal.h, mm/rmap.c and mm/mlock.c changes:
    Signed-off-by: Nick Piggin

    splitlru: introduce __get_user_pages():

    The new munlock processing needs GUP_FLAGS_IGNORE_VMA_PERMISSIONS,
    because the current get_user_pages() can't grab PROT_NONE pages and
    therefore PROT_NONE pages can't be munlocked.

    [akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch]
    [akpm@linux-foundation.org: untangle patch interdependencies]
    [akpm@linux-foundation.org: fix things after out-of-order merging]
    [hugh@veritas.com: fix page-flags mess]
    [lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm']
    [kosaki.motohiro@jp.fujitsu.com: build fix]
    [kosaki.motohiro@jp.fujitsu.com: fix truncate race and several comments]
    [kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Matt Mackall
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Shmem segments locked into memory via shmctl(SHM_LOCKED) should not be
    kept on the normal LRU, since scanning them is a waste of time and might
    throw off kswapd's balancing algorithms. Place them on the unevictable
    LRU list instead.

    Use the AS_UNEVICTABLE flag to mark address_space of SHM_LOCKed shared
    memory regions as unevictable. Then these pages will be culled off the
    normal LRU lists during vmscan.

    Add new wrapper function to clear the mapping's unevictable state when/if
    shared memory segment is munlocked.

    Add 'scan_mapping_unevictable_page()' to mm/vmscan.c to scan all pages in
    the shmem segment's mapping [struct address_space] for evictability now
    that they're no longer locked. If so, move them to the appropriate zone
    lru list.

    Changes depend on [CONFIG_]UNEVICTABLE_LRU.
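
    A sketch of the shmctl-side handling; the wrapper name here is
    hypothetical:

    static void example_shmem_lock_mapping(struct address_space *mapping, int lock)
    {
            if (lock) {
                    mapping_set_unevictable(mapping); /* new wrapper: cull from LRUs */
            } else {
                    mapping_clear_unevictable(mapping);
                    /* pages may be evictable again: rescan and move them back */
                    scan_mapping_unevictable_pages(mapping);
            }
    }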

    [kosaki.motohiro@jp.fujitsu.com: revert shm change]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: Kosaki Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     

12 Oct, 2008

1 commit


10 Sep, 2008

1 commit


17 Aug, 2008

1 commit

  • Try to comment away a little of the confusion between mm's vm_area_struct
    vm_flags and vmalloc's vm_struct flags: based on an idea by Ulrich Drepper.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

12 Aug, 2008

1 commit


31 Jul, 2008

2 commits


29 Jul, 2008

1 commit

  • mm_take_all_locks holds off reclaim from an entire mm_struct. This allows
    mmu notifiers to register into the mm at any time with the guarantee that
    no mmu operation is in progress on the mm.

    This operation locks against the VM for all pte/vma/mm related operations
    that could ever happen on a certain mm. This includes vmtruncate,
    try_to_unmap, and all page faults.

    The caller must take the mmap_sem in write mode before calling
    mm_take_all_locks(). The caller isn't allowed to release the mmap_sem
    until mm_drop_all_locks() returns.

    mmap_sem in write mode is required in order to block all operations that
    could modify pagetables and free pages without need of altering the vma
    layout (for example populate_range() with nonlinear vmas). It's also
    needed in write mode to avoid new anon_vmas to be associated with existing
    vmas.

    A single task can't take more than one mm_take_all_locks() in a row or it
    would deadlock.

    mm_take_all_locks() and mm_drop_all_locks are expensive operations that
    may have to take thousands of locks.

    mm_take_all_locks() can fail if it's interrupted by signals.
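
    The calling convention described above, in outline (as used from the
    mmu_notifier_register() path; the error label is illustrative):

    down_write(&mm->mmap_sem);          /* must be held for write throughout */
    ret = mm_take_all_locks(mm);        /* may fail if interrupted by a signal */
    if (ret)
            goto out_unlock;

    /* ... register the mmu notifier while no mmu operation can run ... */

    mm_drop_all_locks(mm);
    out_unlock:
    up_write(&mm->mmap_sem);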

    When mmu_notifier_register returns, we must be sure that the driver is
    notified if some task is in the middle of a vmtruncate for the 'mm' where
    the mmu notifier was registered (mmu_notifier_invalidate_range_start/end
    is run around the vmtruncation but mmu_notifier_register can run after
    mmu_notifier_invalidate_range_start and before
    mmu_notifier_invalidate_range_end). Same problem for rmap paths. And
    we have to remove page pinning to avoid replicating the tlb_gather logic
    inside KVM (and GRU doesn't work well with page pinning regardless of
    needing tlb_gather), so without mm_take_all_locks when vmtruncate frees
    the page, kvm would have no way to notice that it mapped into sptes a page
    that is going into the freelist without a chance of any further
    mmu_notifier notification.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Andrea Arcangeli
    Acked-by: Linus Torvalds
    Cc: Christoph Lameter
    Cc: Jack Steiner
    Cc: Robin Holt
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Kanoj Sarcar
    Cc: Roland Dreier
    Cc: Steve Wise
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Cc: Rusty Russell
    Cc: Anthony Liguori
    Cc: Chris Wright
    Cc: Marcelo Tosatti
    Cc: Eric Dumazet
    Cc: "Paul E. McKenney"
    Cc: Izik Eidus
    Cc: Anthony Liguori
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

27 Jul, 2008

2 commits

  • This patch makes the needlessly global print_bad_pte() static.

    Signed-off-by: Adrian Bunk
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Introduce a new get_user_pages_fast mm API, which is basically a
    get_user_pages with a less general API (but still tends to be suited to
    the common case):

    - task and mm are always current and current->mm
    - force is always 0
    - pages is always non-NULL
    - don't pass back vmas

    This restricted API can be implemented in a much more scalable way on many
    architectures when the ptes are present, by walking the page tables
    locklessly (no mmap_sem or page table locks). When the ptes are not
    populated, get_user_pages_fast() could be slower.

    This is implemented locklessly on x86, and used in some key direct IO call
    sites, in later patches, which provides nearly 10% performance improvement
    on a threaded database workload.

    Lots of other code could use this too, depending on use cases (eg. grep
    drivers/). And it might inspire some new and clever ways to use it.
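
    The narrowed interface, with a hypothetical direct-IO-style caller:

    int get_user_pages_fast(unsigned long start, int nr_pages, int write,
                            struct page **pages);

    /* pin a user buffer that the device will write into */
    static int example_pin_user_buffer(unsigned long uaddr, int nr_pages,
                                       struct page **pages)
    {
            int got = get_user_pages_fast(uaddr, nr_pages, 1, pages);

            if (got < nr_pages) {
                    while (got-- > 0)
                            put_page(pages[got]);   /* release any partial pin */
                    return -EFAULT;
            }
            return 0;
    }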

    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Nick Piggin
    Cc: Dave Kleikamp
    Cc: Andy Whitcroft
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Andi Kleen
    Cc: Dave Kleikamp
    Cc: Badari Pulavarty
    Cc: Zach Brown
    Cc: Jens Axboe
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

25 Jul, 2008

6 commits

  • On 32-bit architectures PAGE_ALIGN() truncates 64-bit values to the 32-bit
    boundary. For example:

    u64 val = PAGE_ALIGN(size);

    always returns a value < 4GB even if size is greater than 4GB.

    The problem resides in PAGE_MASK definition (from include/asm-x86/page.h for
    example):

    #define PAGE_SHIFT 12
    #define PAGE_SIZE (_AC(1,UL) << PAGE_SHIFT)
    #define PAGE_MASK (~(PAGE_SIZE-1))
    ...
    #define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)

    The "~" is performed on a 32-bit value, so everything in "and" with
    PAGE_MASK greater than 4GB will be truncated to the 32-bit boundary.
    Using the ALIGN() macro seems to be the right way, because it uses
    typeof(addr) for the mask.

    Also move the PAGE_ALIGN() definitions out of include/asm-*/page.h in
    include/linux/mm.h.

    See also lkml discussion: http://lkml.org/lkml/2008/6/11/237
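
    The resulting definition, as described (ALIGN() shown for context):

    /* include/linux/kernel.h */
    #define __ALIGN_MASK(x, mask)   (((x) + (mask)) & ~(mask))
    #define ALIGN(x, a)             __ALIGN_MASK(x, (typeof(x))(a) - 1)

    /* include/linux/mm.h -- the mask is now computed in typeof(addr),
     * so a u64 addr stays 64-bit even on 32-bit architectures */
    #define PAGE_ALIGN(addr)        ALIGN(addr, PAGE_SIZE)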

    [akpm@linux-foundation.org: fix drivers/media/video/uvc/uvc_queue.c]
    [akpm@linux-foundation.org: fix v850]
    [akpm@linux-foundation.org: fix powerpc]
    [akpm@linux-foundation.org: fix arm]
    [akpm@linux-foundation.org: fix mips]
    [akpm@linux-foundation.org: fix drivers/media/video/pvrusb2/pvrusb2-dvb.c]
    [akpm@linux-foundation.org: fix drivers/mtd/maps/uclinux.c]
    [akpm@linux-foundation.org: fix powerpc]
    Signed-off-by: Andrea Righi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Righi
     
  • With Mel's hugetlb private reservation support patches applied, strict
    overcommit semantics are applied to both shared and private huge page
    mappings. This can be a problem if an application relied on unlimited
    overcommit semantics for private mappings. An example of this would be an
    application which maps a huge area with the intention of using it very
    sparsely. These application would benefit from being able to opt-out of
    the strict overcommit. It should be noted that prior to hugetlb
    supporting demand faulting all mappings were fully populated and so
    applications of this type should be rare.

    This patch stack implements the MAP_NORESERVE mmap() flag for huge page
    mappings. This flag has the same meaning as for small page mappings,
    suppressing reservations for that mapping.

    Thanks to Mel Gorman for reviewing a number of early versions of these
    patches.

    This patch:

    When a small page mapping is created with mmap() reservations are created
    by default for any memory pages required. When the region is read/write,
    the reservation is increased for every page; no reservation is needed for
    read-only regions (as they implicitly share the zero page). Reservations
    are tracked via the VM_ACCOUNT vma flag which is present when the region
    has reservation backing it. When we convert a region from read-only to
    read-write, new reservations are acquired and VM_ACCOUNT is set. However,
    when a read-only map is created with MAP_NORESERVE it is indistinguishable
    from a normal mapping. When we then convert that to read/write we are
    forced to incorrectly create reservations for it as we have no record of
    the original MAP_NORESERVE.

    This patch introduces a new vma flag VM_NORESERVE which records the
    presence of the original MAP_NORESERVE flag. This allows us to
    distinguish these two circumstances and correctly account the reserve.

    As well as fixing this FIXME in the code, this makes it much easier to
    introduce MAP_NORESERVE support for huge pages as this flag is available
    consistently for the life of the mapping. VM_ACCOUNT on the other hand is
    heavily used at the generic level in association with small pages.
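
    The mmap-side change is then just recording the request on the vma
    (a sketch of the idea, not the exact hunk):

    /* when the flags for the new vma are computed in the mmap path */
    if (flags & MAP_NORESERVE)
            vm_flags |= VM_NORESERVE;   /* new flag: remember MAP_NORESERVE */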

    Signed-off-by: Andy Whitcroft
    Cc: Mel Gorman
    Cc: Adam Litke
    Cc: Johannes Weiner
    Cc: Andy Whitcroft
    Cc: William Lee Irwin III
    Cc: Hugh Dickins
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • free_area_init_node() gets passed in the node id as well as the node
    descriptor. This is redundant as the function can trivially get the node
    descriptor itself by means of NODE_DATA() and the node's id.

    I checked all the users and NODE_DATA() seems to be usable everywhere
    from where this function is called.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The double indirection here is not needed anywhere and hence (at least)
    confusing.

    Signed-off-by: Jan Beulich
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "Luck, Tony"
    Cc: Paul Mundt
    Cc: "David S. Miller"
    Acked-by: Jeremy Fitzhardinge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Beulich
     
  • In order to be able to debug things like the X server and programs using
    the PPC Cell SPUs, the debugger needs to be able to access device memory
    through ptrace and /proc/pid/mem.

    This patch:

    Add the generic_access_phys access function and put the hooks in place
    to allow access_process_vm to access device or PPC Cell SPU memory.
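
    A driver mapping plain physical memory can then opt in with the generic
    helper on architectures that provide it; the vm_ops instance below is a
    hypothetical example:

    static const struct vm_operations_struct example_dev_vm_ops = {
            .access = generic_access_phys, /* lets ptrace and /proc/pid/mem
                                            * read/write the device memory */
    };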

    [riel@redhat.com: Add documentation for the vm_ops->access function]
    Signed-off-by: Rik van Riel
    Signed-off-by: Benjamin Herrenschmidt
    Cc: Dave Airlie
    Cc: Hugh Dickins
    Cc: Paul Mackerras
    Cc: Arnd Bergmann
    Acked-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • There are no users of nopfn in the tree. Remove it.

    [hugh@veritas.com: fix build error]
    Signed-off-by: Nick Piggin
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

15 Jul, 2008

1 commit