08 Jan, 2009

1 commit

  • Make VMAs per mm_struct as for MMU-mode linux. This solves two problems:

    (1) In SYSV SHM where nattch for a segment does not reflect the number of
    shmat's (and forks) done.

    (2) In mmap() where the VMA's vm_mm is set to point to the parent mm by an
    exec'ing process when VM_EXECUTABLE is specified, regardless of the fact
    that a VMA might be shared and already have its vm_mm assigned to another
    process or a dead process.

    A new struct (vm_region) is introduced to track a mapped region and to
    remember the circumstances under which it may be shared; the vm_list_struct
    structure is discarded as it is no longer required.

    This patch makes the following additional changes:

    (1) Regions are now allocated with alloc_pages() rather than kmalloc() and
    with no recourse to __GFP_COMP, so the pages are not composite. Instead,
    each page has a reference on it held by the region. Anything else that is
    interested in such a page will have to get a reference on it to retain it.
    When the pages are released due to unmapping, each page is passed to
    put_page() and will be freed when the page usage count reaches zero.

    (2) Excess pages are trimmed after an allocation as the allocation must be
    made as a power-of-2 quantity of pages (a sketch of this scheme follows
    the list).

    (3) VMAs are added to the parent MM's R/B tree and mmap lists. As an MM may
    end up with overlapping VMAs within the tree, the VMA struct address is
    appended to the sort key.

    (4) Non-anonymous VMAs are now added to the backing inode's prio list.

    (5) Holes may be punched in anonymous VMAs with munmap(), releasing parts of
    the backing region. The VMA and region structs will be split if
    necessary.

    (6) sys_shmdt() only releases one attachment to a SYSV IPC shared memory
    segment instead of all the attachments at that address. Multiple
    shmat()'s return the same address under NOMMU-mode instead of different
    virtual addresses as under MMU-mode.

    (7) Core dumping for ELF-FDPIC requires fewer exceptions for NOMMU-mode.

    (8) /proc/maps is now the global list of mapped regions, and may list bits
    that aren't actually mapped anywhere.

    (9) /proc/meminfo gains a line (tagged "MmapCopy") that indicates the amount
    of RAM currently allocated by mmap to hold mappable regions that can't be
    mapped directly. These are copies of the backing device or file if not
    anonymous.
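
    A minimal sketch, with an assumed function name, of the allocation-and-trim
    scheme described in points (1) and (2) above (not the patch's own code):

        static struct page *alloc_region_pages(unsigned long len)
        {
            unsigned int order = get_order(len);
            unsigned long want = PAGE_ALIGN(len) >> PAGE_SHIFT;
            unsigned long total = 1UL << order, i;
            struct page *pages;

            /* non-compound power-of-2 allocation: no __GFP_COMP */
            pages = alloc_pages(GFP_KERNEL, order);
            if (!pages)
                return NULL;

            /* give every page its own reference, held by the region */
            split_page(pages, order);

            /* trim the excess forced on us by the power-of-2 allocation */
            for (i = want; i < total; i++)
                __free_page(pages + i);

            return pages;   /* teardown later does put_page() on each kept page */
        }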

    These changes make NOMMU mode more similar to MMU mode. The downside is that
    NOMMU mode now requires somewhat more memory to track things than NOMMU
    without this patch did (VMAs are no longer shared, and there are now region
    structs).

    Signed-off-by: David Howells
    Tested-by: Mike Frysinger
    Acked-by: Paul Mundt

    David Howells
     

07 Jan, 2009

3 commits

  • When dup_mmap() ooms we can end up with mm->mmap == NULL. The error
    path does mmput() and unmap_vmas() gets a NULL vma which it
    dereferences.

    In exit_mmap() there is nothing to do at all for this case, we can
    cancel the callpath right there.
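
    A minimal sketch of the early cancel described above (shape assumed, not the
    exact patch):

        void exit_mmap(struct mm_struct *mm)
        {
            struct vm_area_struct *vma = mm->mmap;

            /* can happen if dup_mmap() OOMed before copying any vma */
            if (!vma)
                return;

            /* ... the usual unmap_vmas()/free_pgtables() teardown follows ... */
        }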

    [akpm@linux-foundation.org: add sorely-needed comment]
    Signed-off-by: Johannes Weiner
    Reported-by: Akinobu Mita
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • xacct_add_tsk() relies on do_exit()->update_hiwater_xxx() and uses
    mm->hiwater_xxx directly; this leads to two problems:

    - taskstats_user_cmd() can call fill_pid()->xacct_add_tsk() at any
    moment before the task exits, so we should check the current values of
    rss/vm anyway.

    - do_exit()->update_hiwater_xxx() calls are racy. An exiting thread can
    be preempted right before mm->hiwater_xxx = new_val, and another thread
    can use A_LOT of memory and exit in between. When the first thread
    resumes it can be the last thread in the thread group, in that case we
    report the wrong hiwater_xxx values which do not take A_LOT into
    account.

    Introduce get_mm_hiwater_rss() and get_mm_hiwater_vm() helpers and change
    xacct_add_tsk() to use them. The first helper will also be used by
    rusage->ru_maxrss accounting.
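
    A sketch of what those helpers look like, assuming they simply take the
    maximum of the recorded high-water mark and the current counters:

        static inline unsigned long get_mm_hiwater_rss(struct mm_struct *mm)
        {
            return max(mm->hiwater_rss, get_mm_rss(mm));
        }

        static inline unsigned long get_mm_hiwater_vm(struct mm_struct *mm)
        {
            return max(mm->hiwater_vm, mm->total_vm);
        }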

    Kill the do_exit()->update_hiwater_xxx() calls. Unless we are going to
    decrease rss/vm there is no point in updating mm->hiwater_xxx, and nobody
    can look at this mm_struct when exit_mmap() actually unmaps the memory.

    Signed-off-by: Oleg Nesterov
    Acked-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Fix a little of the coding style in mm/mmap.c

    [akpm@linux-foundation.org: cleanup]
    Signed-off-by: ZhenwenXu
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    ZhenwenXu
     

06 Jan, 2009

1 commit


13 Nov, 2008

1 commit

  • The STACK_GROWSUP case of stack expansion was missing a test for 'prev',
    which got removed by commit cb8f488c33539f096580e202f5438a809195008f
    ("mmap.c: deinline a few functions") by mistake.

    I found my original email in my "sent" folder. The patch in that mail
    does NOT remove !prev. That change had been added by someone else.

    Ok, I think we are not much interested in who did it, let's
    fix it for good.

    [ "It looks like this was caused by me fixing rejects. That was the
    fancy include-lots-of-context-so-it-wont-apply patch." - akpm ]

    Reported-and-bisected-by: Helge Deller
    Signed-off-by: Denys Vlasenko
    Cc: Andrew Morton
    Cc: Jiri Kosina
    Signed-off-by: Linus Torvalds

    Denys Vlasenko
     

31 Oct, 2008

1 commit

  • Junjiro R. Okajima reported a problem where knfsd crashes if you are
    using it to export shmemfs objects and run strict overcommit. In this
    situation the current->mm based modifier to the overcommit goes through a
    NULL pointer.

    We could simply check for NULL and skip the modifier but we've caught
    other real bugs in the past from mm being NULL here - cases where we did
    need a valid mm set up (eg the exec bug about a year ago).

    To preserve the checks and get the logic we want, shuffle the checking
    around and add a new helper to the vm_ security wrappers.

    Also fix a current->mm reference in nommu that should use the passed mm.
    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix build]
    Reported-by: Junjiro R. Okajima
    Acked-by: James Morris
    Signed-off-by: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Cox
     

20 Oct, 2008

3 commits

  • The __vma_link_file and expand_downwards functions are not small, yet they
    are marked inline. They probably had one callsite sometime in the past,
    but now they have more. To prevent a similar thing happening again, I also
    deinlined expand_upwards, despite it having only one callsite. Nowadays
    gcc auto-inlines such static functions anyway. In find_extend_vma, I
    removed one extra level of indirection.

    Patch is deliberately generated with -U $BIGNUM to make
    it easier to see that functions are big.

    Result:

    # size */*/mmap.o */vmlinux
       text   data    bss     dec    hex filename
       9514    188     16    9718   25f6 0.org/mm/mmap.o
       9237    188     16    9441   24e1 deinline/mm/mmap.o
    6124402 858996 389480 7372878 70804e 0.org/vmlinux
    6124113 858996 389480 7372589 707f2d deinline/vmlinux

    Signed-off-by: Denys Vlasenko
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Denys Vlasenko
     
  • Originally by Nick Piggin

    Remove mlocked pages from the LRU using the "unevictable infrastructure"
    during mmap(), munmap(), mremap() and truncate(). Try to move them back to
    the normal LRU lists on munmap() when the last mlocked mapping is removed.
    Remove PageMlocked() status when a page is truncated from its file.

    [akpm@linux-foundation.org: cleanup]
    [kamezawa.hiroyu@jp.fujitsu.com: fix double unlock_page()]
    [kosaki.motohiro@jp.fujitsu.com: split LRU: munlock rework]
    [lee.schermerhorn@hp.com: mlock: fix __mlock_vma_pages_range comment block]
    [akpm@linux-foundation.org: remove bogus kerneldoc token]
    Signed-off-by: Nick Piggin
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Make sure that mlocked pages also live on the unevictable LRU, so kswapd
    will not scan them over and over again.

    This is achieved through various strategies:

    1) add yet another page flag--PG_mlocked--to indicate that
    the page is locked for efficient testing in vmscan and,
    optionally, fault path. This allows early culling of
    unevictable pages, preventing them from getting to
    page_referenced()/try_to_unmap(). Also allows separate
    accounting of mlock'd pages, as Nick's original patch
    did.

    Note: Nick's original mlock patch used a PG_mlocked
    flag. I had removed this in favor of the PG_unevictable
    flag + an mlock_count [new page struct member]. I
    restored the PG_mlocked flag to eliminate the new
    count field.

    2) add the mlock/unevictable infrastructure to mm/mlock.c,
    with internal APIs in mm/internal.h. This is a rework
    of Nick's original patch to these files, taking into
    account that mlocked pages are now kept on unevictable
    LRU list.

    3) update vmscan.c:page_evictable() to check PageMlocked()
    and, if a vma is passed in, the vm_flags. Note that the vma
    will only be passed in for new pages in the fault path,
    and then only if the "cull unevictable pages in fault
    path" patch is included. (A sketch of this check follows
    the list.)

    4) add try_to_unlock() to rmap.c to walk a page's rmap and
    ClearPageMlocked() if no other vmas have it mlocked.
    Reuses as much of try_to_unmap() as possible. This
    effectively replaces the use of one of the lru list links
    as an mlock count. If this mechanism lets pages in mlocked
    vmas leak through w/o PG_mlocked set [I don't know that it
    does], we should catch them later in try_to_unmap(). One
    hopes this will be rare, as it will be relatively expensive.
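
    A hedged sketch of the check described in (3), assuming the simplest form
    of the test (the real function may consult more state):

        static int page_evictable(struct page *page, struct vm_area_struct *vma)
        {
            if (PageMlocked(page))
                return 0;       /* already culled to the unevictable list */
            if (vma && (vma->vm_flags & VM_LOCKED))
                return 0;       /* new page faulted into an mlocked vma */
            return 1;
        }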

    Original mm/internal.h, mm/rmap.c and mm/mlock.c changes:
    Signed-off-by: Nick Piggin

    splitlru: introduce __get_user_pages():

    The new munlock processing needs GUP_FLAGS_IGNORE_VMA_PERMISSIONS,
    because the current get_user_pages() can't grab PROT_NONE pages and
    therefore PROT_NONE pages can't be munlocked.

    [akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch]
    [akpm@linux-foundation.org: untangle patch interdependencies]
    [akpm@linux-foundation.org: fix things after out-of-order merging]
    [hugh@veritas.com: fix page-flags mess]
    [lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm']
    [kosaki.motohiro@jp.fujitsu.com: build fix]
    [kosaki.motohiro@jp.fujitsu.com: fix truncate race and several comments]
    [kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Matt Mackall
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

04 Sep, 2008

1 commit

  • Anonymous mappings should ignore the offset, but the shared anonymous
    mapping path forgot to clear it, which makes the following legitimate
    test program trigger SIGBUS.

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/mman.h>

    #define PAGE_SIZE 4096

    int main(void)
    {
        char *p;
        int i;

        p = mmap(NULL, 2 * PAGE_SIZE, PROT_READ|PROT_WRITE,
                 MAP_SHARED|MAP_ANONYMOUS, -1, PAGE_SIZE);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        for (i = 0; i < 2; i++) {
            printf("page %d\n", i);
            p[i * 4096] = i;
        }
        return 0;
    }

    Fix it.
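
    A hedged sketch of the kind of fix described, in do_mmap_pgoff()'s
    anonymous-mapping branch (simplified; surrounding checks elided):

        if (!file) {
            switch (flags & MAP_TYPE) {
            case MAP_SHARED:
                pgoff = 0;                      /* ignore the offset here too */
                vm_flags |= VM_SHARED | VM_MAYSHARE;
                break;
            case MAP_PRIVATE:
                pgoff = addr >> PAGE_SHIFT;     /* offset already ignored */
                break;
            default:
                return -EINVAL;
            }
        }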

    Signed-off-by: Tejun Heo
    Acked-by: Hugh Dickins
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

12 Aug, 2008

1 commit


11 Aug, 2008

2 commits

  • Lockdep spotted:

    =======================================================
    [ INFO: possible circular locking dependency detected ]
    2.6.27-rc1 #270
    -------------------------------------------------------
    qemu-kvm/2033 is trying to acquire lock:
    (&inode->i_data.i_mmap_lock){----}, at: [] mm_take_all_locks+0xc2/0xea

    but task is already holding lock:
    (&anon_vma->lock){----}, at: [] mm_take_all_locks+0x70/0xea

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&anon_vma->lock){----}:
    [] __lock_acquire+0x11be/0x14d2
    [] lock_acquire+0x5e/0x7a
    [] _spin_lock+0x3b/0x47
    [] vma_adjust+0x200/0x444
    [] split_vma+0x12f/0x146
    [] mprotect_fixup+0x13c/0x536
    [] sys_mprotect+0x1a9/0x21e
    [] system_call_fastpath+0x16/0x1b
    [] 0xffffffffffffffff

    -> #0 (&inode->i_data.i_mmap_lock){----}:
    [] __lock_acquire+0xedb/0x14d2
    [] lock_release_non_nested+0x1c2/0x219
    [] lock_release+0x127/0x14a
    [] _spin_unlock+0x1e/0x50
    [] mm_drop_all_locks+0x7f/0xb0
    [] do_mmu_notifier_register+0xe2/0x112
    [] mmu_notifier_register+0xe/0x10
    [] kvm_dev_ioctl+0x11e/0x287 [kvm]
    [] vfs_ioctl+0x2a/0x78
    [] do_vfs_ioctl+0x257/0x274
    [] sys_ioctl+0x55/0x78
    [] system_call_fastpath+0x16/0x1b
    [] 0xffffffffffffffff

    other info that might help us debug this:

    5 locks held by qemu-kvm/2033:
    #0: (&mm->mmap_sem){----}, at: [] do_mmu_notifier_register+0x55/0x112
    #1: (mm_all_locks_mutex){--..}, at: [] mm_take_all_locks+0x34/0xea
    #2: (&anon_vma->lock){----}, at: [] mm_take_all_locks+0x70/0xea
    #3: (&anon_vma->lock){----}, at: [] mm_take_all_locks+0x70/0xea
    #4: (&anon_vma->lock){----}, at: [] mm_take_all_locks+0x70/0xea

    stack backtrace:
    Pid: 2033, comm: qemu-kvm Not tainted 2.6.27-rc1 #270

    Call Trace:
    [] print_circular_bug_tail+0xb8/0xc3
    [] __lock_acquire+0xedb/0x14d2
    [] ? add_lock_to_list+0x7e/0xad
    [] ? mm_take_all_locks+0x70/0xea
    [] ? mm_take_all_locks+0x70/0xea
    [] lock_release_non_nested+0x1c2/0x219
    [] ? mm_take_all_locks+0xc2/0xea
    [] ? mm_take_all_locks+0xc2/0xea
    [] ? trace_hardirqs_on_caller+0x4d/0x115
    [] ? mm_drop_all_locks+0x7f/0xb0
    [] lock_release+0x127/0x14a
    [] _spin_unlock+0x1e/0x50
    [] mm_drop_all_locks+0x7f/0xb0
    [] do_mmu_notifier_register+0xe2/0x112
    [] mmu_notifier_register+0xe/0x10
    [] kvm_dev_ioctl+0x11e/0x287 [kvm]
    [] ? file_has_perm+0x83/0x8e
    [] vfs_ioctl+0x2a/0x78
    [] do_vfs_ioctl+0x257/0x274
    [] sys_ioctl+0x55/0x78
    [] system_call_fastpath+0x16/0x1b

    Which the locking hierarchy in mm/rmap.c confirms as valid.

    Fix this by first taking all the mapping->i_mmap_lock instances and then
    take all anon_vma->lock instances.
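
    A hedged sketch of the reordered loop in mm_take_all_locks() (helper names
    assumed; the real function can also fail on pending signals):

        int mm_take_all_locks(struct mm_struct *mm)
        {
            struct vm_area_struct *vma;

            mutex_lock(&mm_all_locks_mutex);

            /* pass 1: take every file mapping's i_mmap_lock first */
            for (vma = mm->mmap; vma; vma = vma->vm_next)
                if (vma->vm_file && vma->vm_file->f_mapping)
                    vm_lock_mapping(mm, vma->vm_file->f_mapping);

            /* pass 2: only then take every anon_vma->lock */
            for (vma = mm->mmap; vma; vma = vma->vm_next)
                if (vma->anon_vma)
                    vm_lock_anon_vma(mm, vma->anon_vma);

            return 0;
        }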

    Signed-off-by: Peter Zijlstra
    Acked-by: Hugh Dickins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The nesting is correct due to holding mmap_sem, use the new annotation
    to annotate this.
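
    A hedged sketch of the annotation meant here: the per-vma locks taken while
    holding mmap_sem are declared as nesting under it, so lockdep accepts many
    instances of the same lock class being held at once:

        static void vm_lock_anon_vma(struct mm_struct *mm, struct anon_vma *anon_vma)
        {
            /* all anon_vma locks nest under the write-held mmap_sem */
            spin_lock_nest_lock(&anon_vma->lock, &mm->mmap_sem);
        }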

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

06 Aug, 2008

1 commit

  • gcc 4.3.0 correctly emits the following warnings.
    When a vma covering addr is found, find_vma_prepare indeed returns without
    setting pprev, rb_link, and rb_parent.

    mm/mmap.c: In function `insert_vm_struct':
    mm/mmap.c:2085: warning: `rb_parent' may be used uninitialized in this function
    mm/mmap.c:2085: warning: `rb_link' may be used uninitialized in this function
    mm/mmap.c:2084: warning: `prev' may be used uninitialized in this function
    mm/mmap.c: In function `copy_vma':
    mm/mmap.c:2124: warning: `rb_parent' may be used uninitialized in this function
    mm/mmap.c:2124: warning: `rb_link' may be used uninitialized in this function
    mm/mmap.c:2123: warning: `prev' may be used uninitialized in this function
    mm/mmap.c: In function `do_brk':
    mm/mmap.c:1951: warning: `rb_parent' may be used uninitialized in this function
    mm/mmap.c:1951: warning: `rb_link' may be used uninitialized in this function
    mm/mmap.c:1949: warning: `prev' may be used uninitialized in this function
    mm/mmap.c: In function `mmap_region':
    mm/mmap.c:1092: warning: `rb_parent' may be used uninitialized in this function
    mm/mmap.c:1092: warning: `rb_link' may be used uninitialized in this function
    mm/mmap.c:1089: warning: `prev' may be used uninitialized in this function

    Hugh adds: in fact, none of find_vma_prepare's callers use those values
    when a vma is found to be already covering addr; it's either an error or
    an occasion to munmap and repeat. Okay, let's quieten the compiler (but I
    would prefer it if pprev, rb_link and rb_parent were meaningful in that
    case, rather than whatever's in them from descending the tree).

    Signed-off-by: Benny Halevy
    Signed-off-by: Hugh Dickins
    Cc: "Ryan Hope"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benny Halevy
     

29 Jul, 2008

2 commits

  • With KVM/GRU/XPMEM there isn't just the primary CPU MMU pointing to pages.
    There are secondary MMUs (with secondary sptes and secondary tlbs) too.
    sptes in the kvm case are shadow pagetables, but when I say spte in
    mmu-notifier context, I mean "secondary pte". In the GRU case there's no
    actual secondary pte and there's only a secondary tlb, because the GRU
    secondary MMU has no knowledge about sptes and every secondary tlb miss
    event in the MMU always generates a page fault that has to be resolved by
    the CPU (this is not the case with KVM, where a secondary tlb miss will
    walk sptes in hardware and will refill the secondary tlb transparently
    to software if the corresponding spte is present). The same way
    zap_page_range has to invalidate the pte before freeing the page, the spte
    (and secondary tlb) must also be invalidated before any page is freed and
    reused.

    Currently we take a page_count pin on every page mapped by sptes, but that
    means the pages can't be swapped whenever they're mapped by any spte
    because they're part of the guest working set. Furthermore a spte unmap
    event can immediately lead to a page to be freed when the pin is released
    (so requiring the same complex and relatively slow tlb_gather smp safe
    logic we have in zap_page_range and that can be avoided completely if the
    spte unmap event doesn't require an unpin of the page previously mapped in
    the secondary MMU).

    The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
    when the VM is swapping or freeing or doing anything on the primary MMU so
    that the secondary MMU code can drop sptes before the pages are freed,
    avoiding all page pinning and allowing 100% reliable swapping of guest
    physical address space. Furthermore it avoids requiring the code that
    tears down the secondary MMU's mappings to implement tlb_gather-like
    logic as in zap_page_range, which would need many IPIs to flush other
    cpu tlbs for each fixed number of sptes unmapped.

    To give an example: if what happens on the primary MMU is a protection
    downgrade (from writeable to wrprotect) the secondary MMU mappings will be
    invalidated, and the next secondary-mmu page fault will call
    get_user_pages and trigger a do_wp_page through get_user_pages if it
    called get_user_pages with write=1, re-establishing an updated spte or
    secondary-tlb mapping on the copied page. Or it will set up a readonly
    spte or readonly tlb mapping for a guest read, if it calls
    get_user_pages with write=0. This is just an example.

    This allows any page pointed to by any pte (and in turn visible in the
    primary CPU MMU) to be mapped into a secondary MMU (be it a pure tlb like
    the GRU, or a full MMU with both sptes and secondary tlbs like the
    shadow-pagetable layer in kvm), or into a remote DMA in software like
    XPMEM (hence the need to schedule in XPMEM code to send the invalidate to
    the remote node, while there is no need to schedule in kvm/gru as it's an
    immediate event like invalidating a primary-mmu pte).

    At least for KVM without this patch it's impossible to swap guests
    reliably. And having this feature and removing the page pin allows
    several other optimizations that simplify life considerably.

    Dependencies:

    1) mm_take_all_locks() to register the mmu notifier when the whole VM
    isn't doing anything with "mm". This allows mmu notifier users to keep
    track if the VM is in the middle of the invalidate_range_begin/end
    critical section with an atomic counter increased in range_begin and
    decreased in range_end. No secondary MMU page fault is allowed to map
    any spte or secondary tlb reference, while the VM is in the middle of
    range_begin/end as any page returned by get_user_pages in that critical
    section could later immediately be freed without any further
    ->invalidate_page notification (invalidate_range_begin/end works on
    ranges and ->invalidate_page isn't called immediately before freeing
    the page). To stop all page freeing and pagetable overwrites the
    mmap_sem must be taken in write mode and all other anon_vma/i_mmap
    locks must be taken too.

    2) It'd be a waste to add branches in the VM if nobody could possibly
    run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be enabled if
    CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of
    mmu notifiers, but this already allows compiling a KVM external module
    against a kernel with mmu notifiers enabled and from the next pull from
    kvm.git we'll start using them. And GRU/XPMEM will also be able to
    continue the development by enabling KVM=m in their config, until they
    submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can
    also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n).
    This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM
    are all =n.

    The mmu_notifier_register call can fail because mm_take_all_locks may be
    interrupted by a signal and return -EINTR. Because mmu_notifier_register
    is used when a driver starts up, a failure can be handled gracefully.
    Here is an example of the change applied to kvm to register the mmu
    notifiers. Usually when a driver starts up other allocations are required
    anyway, and -ENOMEM failure paths already exist.

    struct kvm *kvm_arch_create_vm(void)
    {
        struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
    +   int err;

        if (!kvm)
            return ERR_PTR(-ENOMEM);

        INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);

    +   kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
    +   err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
    +   if (err) {
    +       kfree(kvm);
    +       return ERR_PTR(err);
    +   }
    +
        return kvm;
    }

    mmu_notifier_unregister returns void and it's reliable.

    The patch also adds a few needed but missing includes that would prevent
    the kernel from compiling after these changes on non-x86 archs (x86 didn't
    need them by luck).

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix mm/filemap_xip.c build]
    [akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Nick Piggin
    Signed-off-by: Christoph Lameter
    Cc: Jack Steiner
    Cc: Robin Holt
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Kanoj Sarcar
    Cc: Roland Dreier
    Cc: Steve Wise
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Cc: Rusty Russell
    Cc: Anthony Liguori
    Cc: Chris Wright
    Cc: Marcelo Tosatti
    Cc: Eric Dumazet
    Cc: "Paul E. McKenney"
    Cc: Izik Eidus
    Cc: Anthony Liguori
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • mm_take_all_locks holds off reclaim from an entire mm_struct. This allows
    mmu notifiers to register into the mm at any time with the guarantee that
    no mmu operation is in progress on the mm.

    This operation locks against the VM for all pte/vma/mm related operations
    that could ever happen on a certain mm. This includes vmtruncate,
    try_to_unmap, and all page faults.

    The caller must take the mmap_sem in write mode before calling
    mm_take_all_locks(). The caller isn't allowed to release the mmap_sem
    until mm_drop_all_locks() returns.
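
    A hedged sketch of the caller contract just described (the registration
    call is an illustrative placeholder):

        down_write(&mm->mmap_sem);
        if (mm_take_all_locks(mm) == 0) {
            /* no pte/vma/mm operation can run on this mm in this window */
            register_my_notifier(mm);           /* illustrative placeholder */
            mm_drop_all_locks(mm);
        }
        up_write(&mm->mmap_sem);    /* only after mm_drop_all_locks() */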

    mmap_sem in write mode is required in order to block all operations that
    could modify pagetables and free pages without the need to alter the vma
    layout (for example populate_range() with nonlinear vmas). It's also
    needed in write mode to avoid new anon_vmas being associated with existing
    vmas.

    A single task can't take more than one mm_take_all_locks() in a row or it
    would deadlock.

    mm_take_all_locks() and mm_drop_all_locks() are expensive operations that
    may have to take thousands of locks.

    mm_take_all_locks() can fail if it's interrupted by signals.

    When mmu_notifier_register returns, we must be sure that the driver is
    notified if some task is in the middle of a vmtruncate for the 'mm' where
    the mmu notifier was registered (mmu_notifier_invalidate_range_start/end
    is run around the vmtruncation but mmu_notifier_register can run after
    mmu_notifier_invalidate_range_start and before
    mmu_notifier_invalidate_range_end). The same problem exists for the rmap
    paths. And we have to remove page pinning to avoid replicating the
    tlb_gather logic inside KVM (and GRU doesn't work well with page pinning
    regardless of needing tlb_gather), so without mm_take_all_locks, when
    vmtruncate frees the page, kvm would have no way to notice that it had
    mapped into sptes a page that is going into the freelist without a chance
    of any further mmu_notifier notification.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Andrea Arcangeli
    Acked-by: Linus Torvalds
    Cc: Christoph Lameter
    Cc: Jack Steiner
    Cc: Robin Holt
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Kanoj Sarcar
    Cc: Roland Dreier
    Cc: Steve Wise
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Cc: Rusty Russell
    Cc: Anthony Liguori
    Cc: Chris Wright
    Cc: Marcelo Tosatti
    Cc: Eric Dumazet
    Cc: "Paul E. McKenney"
    Cc: Izik Eidus
    Cc: Anthony Liguori
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

25 Jul, 2008

3 commits

  • The goal of this patchset is to support multiple hugetlb page sizes. This
    is achieved by introducing a new struct hstate structure, which
    encapsulates the important hugetlb state and constants (eg. huge page
    size, number of huge pages currently allocated, etc).

    The hstate structure is then passed around to the code which requires
    these fields; that code will do the right thing regardless of the exact
    hstate it is operating on.

    This patch adds the hstate structure, with a single global instance of it
    (default_hstate), and does the basic work of converting hugetlb to use the
    hstate.

    Future patches will add more hstate structures to allow for different
    hugetlbfs mounts to have different page sizes.
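
    A hedged sketch of the idea, with only a few illustrative fields (the real
    hstate carries more state):

        struct hstate {
            unsigned int order;             /* huge page is PAGE_SIZE << order bytes */
            unsigned long nr_huge_pages;    /* huge pages currently allocated */
            unsigned long free_huge_pages;  /* huge pages sitting in the pool */
        };

        static struct hstate default_hstate;   /* the single global instance, for now */

        static inline unsigned long huge_page_size(struct hstate *h)
        {
            return PAGE_SIZE << h->order;
        }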

    [akpm@linux-foundation.org: coding-style fixes]
    Acked-by: Adam Litke
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • With Mel's hugetlb private reservation support patches applied, strict
    overcommit semantics are applied to both shared and private huge page
    mappings. This can be a problem if an application relied on unlimited
    overcommit semantics for private mappings. An example of this would be an
    application which maps a huge area with the intention of using it very
    sparsely. These applications would benefit from being able to opt out of
    the strict overcommit. It should be noted that prior to hugetlb
    supporting demand faulting all mappings were fully populated, and so
    applications of this type should be rare.

    This patch stack implements the MAP_NORESERVE mmap() flag for huge page
    mappings. This flag has the same meaning as for small page mappings,
    suppressing reservations for that mapping.

    Thanks to Mel Gorman for reviewing a number of early versions of these
    patches.

    This patch:

    When a small page mapping is created with mmap(), reservations are created
    by default for any memory pages required. When the region is read/write
    the reservation is increased for every page; no reservation is needed for
    read-only regions (as they implicitly share the zero page). Reservations
    are tracked via the VM_ACCOUNT vma flag, which is present when the region
    has reservation backing it. When we convert a region from read-only to
    read-write new reservations are acquired and VM_ACCOUNT is set. However,
    when a read-only map is created with MAP_NORESERVE it is indistinguishable
    from a normal mapping. When we then convert that to read/write we are
    forced to incorrectly create reservations for it as we have no record of
    the original MAP_NORESERVE.

    This patch introduces a new vma flag VM_NORESERVE which records the
    presence of the original MAP_NORESERVE flag. This allows us to
    distinguish these two circumstances and correctly account the reserve.
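
    A hedged sketch of the kind of recording described, in the mmap flag
    calculation (simplified from the real vm_flags computation):

        vm_flags = calc_vm_prot_bits(prot) | calc_vm_flag_bits(flags) |
                   mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;

        if (flags & MAP_NORESERVE)
            vm_flags |= VM_NORESERVE;   /* remember the request for later accounting */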

    As well as fixing this FIXME in the code, this makes it much easier to
    introduce MAP_NORESERVE support for huge pages, as this flag is available
    consistently for the life of the mapping. VM_ACCOUNT, on the other hand,
    is heavily used at the generic level in association with small pages.

    Signed-off-by: Andy Whitcroft
    Cc: Mel Gorman
    Cc: Adam Litke
    Cc: Johannes Weiner
    Cc: Andy Whitcroft
    Cc: William Lee Irwin III
    Cc: Hugh Dickins
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • The double indirection here is not needed anywhere and hence (at least)
    confusing.

    Signed-off-by: Jan Beulich
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "Luck, Tony"
    Cc: Paul Mundt
    Cc: "David S. Miller"
    Acked-by: Jeremy Fitzhardinge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Beulich
     

09 Jul, 2008

1 commit

  • This patch allows architectures to define functions to deal with
    additional protections bits for mmap() and mprotect().

    arch_calc_vm_prot_bits() maps additional protection bits to vm_flags
    arch_vm_get_page_prot() maps additional vm_flags to the vma's vm_page_prot
    arch_validate_prot() checks for valid values of the protection bits (the
    no-op defaults are sketched below)
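
    A hedged sketch of the fall-back definitions an architecture gets when it
    does not provide its own hooks:

        #ifndef arch_calc_vm_prot_bits
        #define arch_calc_vm_prot_bits(prot)    0
        #endif

        #ifndef arch_vm_get_page_prot
        #define arch_vm_get_page_prot(vm_flags) __pgprot(0)
        #endif

        #ifndef arch_validate_prot
        #define arch_validate_prot(prot) \
            (((prot) & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM)) == 0)
        #endif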

    Note: vm_get_page_prot() is now pretty ugly, but the generated code
    should be identical for architectures that don't define additional
    protection bits.

    Signed-off-by: Dave Kleikamp
    Acked-by: Andrew Morton
    Acked-by: Hugh Dickins
    Signed-off-by: Benjamin Herrenschmidt

    Dave Kleikamp
     

07 Jun, 2008

1 commit

  • Fix a regression introduced by

    commit 4cc6028d4040f95cdb590a87db478b42b8be0508
    Author: Jiri Kosina
    Date: Wed Feb 6 22:39:44 2008 +0100

    brk: check the lower bound properly

    The check in sys_brk() on the minimum value the brk might have must take
    the CONFIG_COMPAT_BRK setting into account. When this option is turned on
    (i.e. we support ancient legacy binaries, e.g. libc5-linked stuff), the
    lower bound on brk value is mm->end_code, otherwise the brk start is
    allowed to be arbitrarily shifted.
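
    A hedged sketch of the resulting check in sys_brk():

        unsigned long min_brk;

        #ifdef CONFIG_COMPAT_BRK
            min_brk = mm->end_code;     /* legacy binaries: brk follows the code */
        #else
            min_brk = mm->start_brk;    /* randomized brk start is allowed */
        #endif
            if (brk < min_brk)
                goto out;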

    Signed-off-by: Jiri Kosina
    Tested-by: Geert Uytterhoeven
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Kosina
     

25 May, 2008

1 commit

  • The atomic_t type is 32-bit, but a 64-bit system can have more than 2^32
    pages of virtual address space available. Without this we overflow on
    ludicrously large mappings.
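
    An illustrative sketch of the kind of change described, assuming the counter
    in question is vm_committed_space and its accessors move to the _long
    variants:

    -atomic_t      vm_committed_space = ATOMIC_INIT(0);
    +atomic_long_t vm_committed_space = ATOMIC_LONG_INIT(0);

    -    atomic_add(pages, &vm_committed_space);
    +    atomic_long_add(pages, &vm_committed_space);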

    Signed-off-by: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Cox
     

29 Apr, 2008

1 commit

  • The kernel implements readlink of /proc/pid/exe by getting the file from
    the first executable VMA. Then the path to the file is reconstructed and
    reported as the result.

    Because of the VMA walk the code is slightly different on nommu systems.
    This patch avoids separate /proc/pid/exe code on nommu systems. Instead of
    walking the VMAs to find the first executable file-backed VMA we store a
    reference to the exec'd file in the mm_struct.

    That reference would prevent the filesystem holding the executable file
    from being unmounted even after unmapping the VMAs. So we track the number
    of VM_EXECUTABLE VMAs and drop the new reference when the last one is
    unmapped. This avoids pinning the mounted filesystem.
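
    A hedged sketch of the per-mm bookkeeping described (helper names assumed):

        void added_exe_file_vma(struct mm_struct *mm)
        {
            mm->num_exe_file_vmas++;
        }

        void removed_exe_file_vma(struct mm_struct *mm)
        {
            mm->num_exe_file_vmas--;

            /* last VM_EXECUTABLE vma gone: stop pinning the filesystem */
            if (mm->num_exe_file_vmas == 0 && mm->exe_file) {
                fput(mm->exe_file);
                mm->exe_file = NULL;
            }
        }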

    [akpm@linux-foundation.org: improve comments]
    [yamamoto@valinux.co.jp: fix dup_mmap]
    Signed-off-by: Matt Helsley
    Cc: Oleg Nesterov
    Cc: David Howells
    Cc:"Eric W. Biederman"
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Hugh Dickins
    Signed-off-by: YAMAMOTO Takashi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Helsley
     

28 Apr, 2008

3 commits

  • This patch renames mpol_copy() to mpol_dup() because, well, that's what it
    does. Like, e.g., strdup() for strings, mpol_dup() takes a pointer to an
    existing mempolicy, allocates a new one and copies the contents.

    In a later patch, I want to use the name mpol_copy() to copy the contents from
    one mempolicy to another like, e.g., strcpy() does for strings.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • This is a change that was requested some time ago by Mel Gorman. Makes sense
    to me, so here it is.

    Note: I retain the name "mpol_free_shared_policy()" because it actually does
    free the shared_policy, which is NOT a reference counted object. However, ...

    The mempolicy object[s] referenced by the shared_policy are reference counted,
    so mpol_put() is used to release the reference held by the shared_policy. The
    mempolicy might not be freed at this time, because some task attached to the
    shared object associated with the shared policy may be in the process of
    allocating a page based on the mempolicy. In that case, the task performing
    the allocation will hold a reference on the mempolicy, obtained via
    mpol_shared_policy_lookup(). The mempolicy will be freed when all tasks
    holding such a reference have called mpol_put() for the mempolicy.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • It is not easy to actually understand the "if (!file || !vma_merge())"
    code, so turn it into "if (file && vma_merge())". This makes it
    immediately obvious that the subsequent "if (file)" is superfluous.

    As Hugh Dickins pointed out, we can also factor out the ->i_writecount
    corrections, and add a small comment about that.

    Signed-off-by: Oleg Nesterov
    Cc: Miklos Szeredi
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

09 Feb, 2008

1 commit

  • Convert special mapping install from nopage to fault.

    Because the "vm_file" is NULL for the special mapping, the generic VM
    code has messed up "vm_pgoff" thinking that it's an anonymous mapping
    and the offset does't matter. For that reason, we need to undo the
    vm_pgoff offset that got added into vmf->pgoff.
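
    A hedged sketch of the converted fault handler with that adjustment:

        static int special_mapping_fault(struct vm_area_struct *vma,
                                         struct vm_fault *vmf)
        {
            pgoff_t pgoff;
            struct page **pages;

            /* vm_file is NULL, so the generic code derived vm_pgoff from the
             * address; undo that so we index the page array from zero */
            pgoff = vmf->pgoff - vma->vm_pgoff;

            for (pages = vma->vm_private_data; pgoff && *pages; ++pages)
                pgoff--;

            if (*pages) {
                get_page(*pages);
                vmf->page = *pages;
                return 0;
            }

            return VM_FAULT_SIGBUS;
        }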

    [ We _really_ should clean that up - either by making this whole special
    mapping code just use a real file entry rather than that ugly array of
    "struct page" pointers, or by just making the VM code realize that
    even if vm_file is NULL it may not be a regular anonymous mmap.
    - Linus ]

    Signed-off-by: Nick Piggin
    Cc: linux-mm@kvack.org
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

07 Feb, 2008

1 commit

  • There is a check in sys_brk(), that tries to make sure that we do not
    underflow the area that is dedicated to brk heap.

    The check is however wrong, as it assumes that the brk area starts
    immediately after the end of the code (+bss), which is wrong for example
    in environments with a randomized brk start. The proper way is to check
    that the address is not below the start_brk address.

    Signed-off-by: Jiri Kosina
    Signed-off-by: Ingo Molnar

    Jiri Kosina
     

06 Feb, 2008

1 commit

  • In order to change the layout of the page tables after an mmap has crossed
    the address space limit of the current page table layout, an architecture
    hook in get_unmapped_area is needed. The arguments are the address of the
    new mapping and its length.

    Cc: Benjamin Herrenschmidt
    Signed-off-by: Martin Schwidefsky
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Schwidefsky
     

04 Feb, 2008

1 commit

  • Drivers that register a ->fault handler, but do not range-check the
    offset argument, must set VM_DONTEXPAND in the vm_flags in order to
    prevent an expanding mremap from overflowing the resource.

    I've audited the tree and attempted to fix these problems (usually by
    adding VM_DONTEXPAND where it is not obvious).
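
    A hedged sketch of the pattern (driver names hypothetical):

        static int foo_mmap(struct file *file, struct vm_area_struct *vma)
        {
            /* ->fault below does not range-check vmf->pgoff, so forbid mremap
             * from ever growing this mapping beyond what was set up here */
            vma->vm_flags |= VM_DONTEXPAND;
            vma->vm_ops = &foo_vm_ops;
            return 0;
        }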

    Signed-off-by: Nick Piggin
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

30 Jan, 2008

1 commit

  • Randomize the location of the heap (brk) for i386 and x86_64. The brk is
    placed randomly in the range starting at the current brk location and
    extending up to a 0x02000000 offset, for both architectures. This,
    together with pie-executable-randomization.patch and
    pie-executable-randomization-fix.patch, should make the address space
    randomization on i386 and x86_64 complete.
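
    A hedged sketch of what such a per-arch hook looks like (using the kernel's
    randomize_range() helper; the exact function is assumed):

        unsigned long arch_randomize_brk(struct mm_struct *mm)
        {
            unsigned long range_end = mm->brk + 0x02000000;

            /* randomize_range() returns 0 on failure; fall back to the old brk */
            return randomize_range(mm->brk, range_end, 0) ? : mm->brk;
        }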

    Arjan says:

    This is known to break older versions of some emacs variants, whose dumper
    code assumed that the last variable declared in the program is equal to the
    start of the dynamically allocated memory region.

    (The dumper is the code where emacs effectively dumps core at the end of
    its compilation stage; this coredump is then loaded as the main program
    during normal use.)

    iirc this was 5 years or so; we found this way back when I was at RH and we
    first did the security stuff there (including this brk randomization). It
    wasn't all variants of emacs, and it got fixed as a result (I vaguely remember
    that emacs already had code to deal with it for other archs/oses, just
    ifdeffed wrongly).

    It's a rare and wrong assumption as a general thing, just on x86 it mostly
    happened to be true (but to be honest, it'll break too if gcc does
    something fancy or if the linker does a non-standard order). Still its
    something we should at least document.

    Note 2: afaik it only broke the emacs *build*. I'm not 100% sure about that
    (it IS 5 years ago) though.

    [ akpm@linux-foundation.org: deuglification ]

    Signed-off-by: Jiri Kosina
    Cc: Arjan van de Ven
    Cc: Roland McGrath
    Cc: Jakub Jelinek
    Signed-off-by: Andrew Morton
    Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner

    Jiri Kosina
     

25 Jan, 2008

1 commit


05 Dec, 2007

3 commits

  • Given a specifically crafted binary, do_brk() can be used to get low
    pages available in userspace virtual memory and can thus be used to
    circumvent the mmap_min_addr low memory protection. Add security checks
    in do_brk().
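
    A hedged sketch of the kind of check added in do_brk(), assuming the usual
    mmap security hook is reused for an address-only check:

        error = security_file_mmap(NULL, 0, 0, 0, addr, 1);
        if (error)
            return error;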

    Signed-off-by: Eric Paris
    Acked-by: Alan Cox
    Signed-off-by: James Morris

    Eric Paris
     
  • If mmap_min_addr is set and a process attempts to mmap (not fixed) with a
    non-null hint address less than mmap_min_addr the mapping will fail the
    security checks. Since this is just a hint address this patch will round
    such a hint address above mmap_min_addr.

    gcj was found to try to be very frugal with vm usage and give hint addresses
    in the 8k-32k range. Without this patch all such programs failed and with
    the patch they happily get a higher address.

    This patch is wrapped in CONFIG_SECURITY since mmap_min_addr doesn't exist
    without it and there would be no security check possible no matter what.
    So we should not bother compiling in this rounding if it is just a waste
    of time.
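
    A hedged sketch of the rounding helper (name assumed):

        #ifdef CONFIG_SECURITY
        static inline unsigned long round_hint_to_min(unsigned long hint)
        {
            hint &= PAGE_MASK;
            if (hint != 0 && hint < mmap_min_addr)
                return PAGE_ALIGN(mmap_min_addr);
            return hint;
        }
        #else
        static inline unsigned long round_hint_to_min(unsigned long hint)
        {
            return hint;
        }
        #endif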

    Signed-off-by: Eric Paris
    Signed-off-by: James Morris

    Eric Paris
     
  • Add security checks to make sure we are not attempting to expand the
    stack into memory protected by mmap_min_addr.

    Signed-off-by: Eric Paris
    Signed-off-by: James Morris

    Eric Paris
     

23 Oct, 2007

1 commit

  • Fix mprotect bug in recent commit 3ed75eb8f1cd89565966599c4f77d2edb086d5b0
    (setup vma->vm_page_prot by vm_get_page_prot()): the vma_wants_writenotify
    case was setting the same prot as when not.

    Nothing wrong with the use of protection_map[] in mmap_region(),
    but use vm_get_page_prot() there too in the same ~VM_SHARED way.
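
    A hedged sketch of the resulting mmap_region() logic:

        vma->vm_page_prot = vm_get_page_prot(vm_flags);
        if (vma_wants_writenotify(vma))
            vma->vm_page_prot = vm_get_page_prot(vm_flags & ~VM_SHARED);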

    Signed-off-by: Hugh Dickins
    Cc: Coly Li
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

20 Oct, 2007

1 commit

  • This patch uses vm_get_page_prot() to setup vma->vm_page_prot.

    Though inside vm_get_page_prot() the protection flags are ANDed with
    (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED), it does not hurt correct code.
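
    For reference, a hedged sketch of what vm_get_page_prot() amounts to here:

        pgprot_t vm_get_page_prot(unsigned long vm_flags)
        {
            return protection_map[vm_flags &
                                  (VM_READ | VM_WRITE | VM_EXEC | VM_SHARED)];
        }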

    Signed-off-by: Coly Li
    Cc: Hugh Dickins
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Coly Li
     

17 Oct, 2007

2 commits

  • This patch contains the following cleanups that are now possible:
    - remove the unused security_operations->inode_xattr_getsuffix
    - remove the no longer used security_operations->unregister_security
    - remove some no longer required exit code
    - remove a bunch of no longer used exports

    Signed-off-by: Adrian Bunk
    Acked-by: James Morris
    Cc: Chris Wright
    Cc: Stephen Smalley
    Cc: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • mm.h doesn't directly use anything from mutex.h and backing-dev.h, so
    remove them and add them back to the files which need them.

    Cross-compile tested on many configs and archs.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan