04 Sep, 2008

1 commit

  • Anonymous mappings should ignore offset but shared anonymous mapping
    forgot to clear it and makes the following legit test program trigger
    SIGBUS.

    #include
    #include
    #include

    #define PAGE_SIZE 4096

    int main(void)
    {
    char *p;
    int i;

    p = mmap(NULL, 2 * PAGE_SIZE, PROT_READ|PROT_WRITE,
    MAP_SHARED|MAP_ANONYMOUS, -1, PAGE_SIZE);
    if (p == MAP_FAILED) {
    perror("mmap");
    return 1;
    }

    for (i = 0; i < 2; i++) {
    printf("page %d\n", i);
    p[i * 4096] = i;
    }
    return 0;
    }

    Fix it.

    Signed-off-by: Tejun Heo
    Acked-by: Hugh Dickins
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

12 Aug, 2008

1 commit


11 Aug, 2008

2 commits

  • Lockdep spotted:

    =======================================================
    [ INFO: possible circular locking dependency detected ]
    2.6.27-rc1 #270
    -------------------------------------------------------
    qemu-kvm/2033 is trying to acquire lock:
    (&inode->i_data.i_mmap_lock){----}, at: [] mm_take_all_locks+0xc2/0xea

    but task is already holding lock:
    (&anon_vma->lock){----}, at: [] mm_take_all_locks+0x70/0xea

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&anon_vma->lock){----}:
    [] __lock_acquire+0x11be/0x14d2
    [] lock_acquire+0x5e/0x7a
    [] _spin_lock+0x3b/0x47
    [] vma_adjust+0x200/0x444
    [] split_vma+0x12f/0x146
    [] mprotect_fixup+0x13c/0x536
    [] sys_mprotect+0x1a9/0x21e
    [] system_call_fastpath+0x16/0x1b
    [] 0xffffffffffffffff

    -> #0 (&inode->i_data.i_mmap_lock){----}:
    [] __lock_acquire+0xedb/0x14d2
    [] lock_release_non_nested+0x1c2/0x219
    [] lock_release+0x127/0x14a
    [] _spin_unlock+0x1e/0x50
    [] mm_drop_all_locks+0x7f/0xb0
    [] do_mmu_notifier_register+0xe2/0x112
    [] mmu_notifier_register+0xe/0x10
    [] kvm_dev_ioctl+0x11e/0x287 [kvm]
    [] vfs_ioctl+0x2a/0x78
    [] do_vfs_ioctl+0x257/0x274
    [] sys_ioctl+0x55/0x78
    [] system_call_fastpath+0x16/0x1b
    [] 0xffffffffffffffff

    other info that might help us debug this:

    5 locks held by qemu-kvm/2033:
    #0: (&mm->mmap_sem){----}, at: [] do_mmu_notifier_register+0x55/0x112
    #1: (mm_all_locks_mutex){--..}, at: [] mm_take_all_locks+0x34/0xea
    #2: (&anon_vma->lock){----}, at: [] mm_take_all_locks+0x70/0xea
    #3: (&anon_vma->lock){----}, at: [] mm_take_all_locks+0x70/0xea
    #4: (&anon_vma->lock){----}, at: [] mm_take_all_locks+0x70/0xea

    stack backtrace:
    Pid: 2033, comm: qemu-kvm Not tainted 2.6.27-rc1 #270

    Call Trace:
    [] print_circular_bug_tail+0xb8/0xc3
    [] __lock_acquire+0xedb/0x14d2
    [] ? add_lock_to_list+0x7e/0xad
    [] ? mm_take_all_locks+0x70/0xea
    [] ? mm_take_all_locks+0x70/0xea
    [] lock_release_non_nested+0x1c2/0x219
    [] ? mm_take_all_locks+0xc2/0xea
    [] ? mm_take_all_locks+0xc2/0xea
    [] ? trace_hardirqs_on_caller+0x4d/0x115
    [] ? mm_drop_all_locks+0x7f/0xb0
    [] lock_release+0x127/0x14a
    [] _spin_unlock+0x1e/0x50
    [] mm_drop_all_locks+0x7f/0xb0
    [] do_mmu_notifier_register+0xe2/0x112
    [] mmu_notifier_register+0xe/0x10
    [] kvm_dev_ioctl+0x11e/0x287 [kvm]
    [] ? file_has_perm+0x83/0x8e
    [] vfs_ioctl+0x2a/0x78
    [] do_vfs_ioctl+0x257/0x274
    [] sys_ioctl+0x55/0x78
    [] system_call_fastpath+0x16/0x1b

    Which the locking hierarchy in mm/rmap.c confirms as valid.

    Fix this by first taking all the mapping->i_mmap_lock instances and then
    take all anon_vma->lock instances.

    Signed-off-by: Peter Zijlstra
    Acked-by: Hugh Dickins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The nesting is correct due to holding mmap_sem, use the new annotation
    to annotate this.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

06 Aug, 2008

1 commit

  • gcc 4.3.0 correctly emits the following warnings.
    When a vma covering addr is found, find_vma_prepare indeed returns without
    setting pprev, rb_link, and rb_parent.

    mm/mmap.c: In function `insert_vm_struct':
    mm/mmap.c:2085: warning: `rb_parent' may be used uninitialized in this function
    mm/mmap.c:2085: warning: `rb_link' may be used uninitialized in this function
    mm/mmap.c:2084: warning: `prev' may be used uninitialized in this function
    mm/mmap.c: In function `copy_vma':
    mm/mmap.c:2124: warning: `rb_parent' may be used uninitialized in this function
    mm/mmap.c:2124: warning: `rb_link' may be used uninitialized in this function
    mm/mmap.c:2123: warning: `prev' may be used uninitialized in this function
    mm/mmap.c: In function `do_brk':
    mm/mmap.c:1951: warning: `rb_parent' may be used uninitialized in this function
    mm/mmap.c:1951: warning: `rb_link' may be used uninitialized in this function
    mm/mmap.c:1949: warning: `prev' may be used uninitialized in this function
    mm/mmap.c: In function `mmap_region':
    mm/mmap.c:1092: warning: `rb_parent' may be used uninitialized in this function
    mm/mmap.c:1092: warning: `rb_link' may be used uninitialized in this function
    mm/mmap.c:1089: warning: `prev' may be used uninitialized in this function

    Hugh adds: in fact, none of find_vma_prepare's callers use those values
    when a vma is found to be already covering addr, it's either an error or
    an occasion to munmap and repeat. Okay, let's quieten the compiler (but I
    would prefer it if pprev, rb_link and rb_parent were meaningful in that
    case, rather than whatever's in them from descending the tree).

    Signed-off-by: Benny Halevy
    Signed-off-by: Hugh Dickins
    Cc: "Ryan Hope"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benny Halevy
     

29 Jul, 2008

2 commits

  • With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to pages.
    There are secondary MMUs (with secondary sptes and secondary tlbs) too.
    sptes in the kvm case are shadow pagetables, but when I say spte in
    mmu-notifier context, I mean "secondary pte". In GRU case there's no
    actual secondary pte and there's only a secondary tlb because the GRU
    secondary MMU has no knowledge about sptes and every secondary tlb miss
    event in the MMU always generates a page fault that has to be resolved by
    the CPU (this is not the case of KVM where the a secondary tlb miss will
    walk sptes in hardware and it will refill the secondary tlb transparently
    to software if the corresponding spte is present). The same way
    zap_page_range has to invalidate the pte before freeing the page, the spte
    (and secondary tlb) must also be invalidated before any page is freed and
    reused.

    Currently we take a page_count pin on every page mapped by sptes, but that
    means the pages can't be swapped whenever they're mapped by any spte
    because they're part of the guest working set. Furthermore a spte unmap
    event can immediately lead to a page to be freed when the pin is released
    (so requiring the same complex and relatively slow tlb_gather smp safe
    logic we have in zap_page_range and that can be avoided completely if the
    spte unmap event doesn't require an unpin of the page previously mapped in
    the secondary MMU).

    The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
    when the VM is swapping or freeing or doing anything on the primary MMU so
    that the secondary MMU code can drop sptes before the pages are freed,
    avoiding all page pinning and allowing 100% reliable swapping of guest
    physical address space. Furthermore it avoids the code that teardown the
    mappings of the secondary MMU, to implement a logic like tlb_gather in
    zap_page_range that would require many IPI to flush other cpu tlbs, for
    each fixed number of spte unmapped.

    To make an example: if what happens on the primary MMU is a protection
    downgrade (from writeable to wrprotect) the secondary MMU mappings will be
    invalidated, and the next secondary-mmu-page-fault will call
    get_user_pages and trigger a do_wp_page through get_user_pages if it
    called get_user_pages with write=1, and it'll re-establishing an updated
    spte or secondary-tlb-mapping on the copied page. Or it will setup a
    readonly spte or readonly tlb mapping if it's a guest-read, if it calls
    get_user_pages with write=0. This is just an example.

    This allows to map any page pointed by any pte (and in turn visible in the
    primary CPU MMU), into a secondary MMU (be it a pure tlb like GRU, or an
    full MMU with both sptes and secondary-tlb like the shadow-pagetable layer
    with kvm), or a remote DMA in software like XPMEM (hence needing of
    schedule in XPMEM code to send the invalidate to the remote node, while no
    need to schedule in kvm/gru as it's an immediate event like invalidating
    primary-mmu pte).

    At least for KVM without this patch it's impossible to swap guests
    reliably. And having this feature and removing the page pin allows
    several other optimizations that simplify life considerably.

    Dependencies:

    1) mm_take_all_locks() to register the mmu notifier when the whole VM
    isn't doing anything with "mm". This allows mmu notifier users to keep
    track if the VM is in the middle of the invalidate_range_begin/end
    critical section with an atomic counter incraese in range_begin and
    decreased in range_end. No secondary MMU page fault is allowed to map
    any spte or secondary tlb reference, while the VM is in the middle of
    range_begin/end as any page returned by get_user_pages in that critical
    section could later immediately be freed without any further
    ->invalidate_page notification (invalidate_range_begin/end works on
    ranges and ->invalidate_page isn't called immediately before freeing
    the page). To stop all page freeing and pagetable overwrites the
    mmap_sem must be taken in write mode and all other anon_vma/i_mmap
    locks must be taken too.

    2) It'd be a waste to add branches in the VM if nobody could possibly
    run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only enabled if
    CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of
    mmu notifiers, but this already allows to compile a KVM external module
    against a kernel with mmu notifiers enabled and from the next pull from
    kvm.git we'll start using them. And GRU/XPMEM will also be able to
    continue the development by enabling KVM=m in their config, until they
    submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can
    also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n).
    This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM
    are all =n.

    The mmu_notifier_register call can fail because mm_take_all_locks may be
    interrupted by a signal and return -EINTR. Because mmu_notifier_reigster
    is used when a driver startup, a failure can be gracefully handled. Here
    an example of the change applied to kvm to register the mmu notifiers.
    Usually when a driver startups other allocations are required anyway and
    -ENOMEM failure paths exists already.

    struct kvm *kvm_arch_create_vm(void)
    {
    struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
    + int err;

    if (!kvm)
    return ERR_PTR(-ENOMEM);

    INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);

    + kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
    + err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
    + if (err) {
    + kfree(kvm);
    + return ERR_PTR(err);
    + }
    +
    return kvm;
    }

    mmu_notifier_unregister returns void and it's reliable.

    The patch also adds a few needed but missing includes that would prevent
    kernel to compile after these changes on non-x86 archs (x86 didn't need
    them by luck).

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix mm/filemap_xip.c build]
    [akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Nick Piggin
    Signed-off-by: Christoph Lameter
    Cc: Jack Steiner
    Cc: Robin Holt
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Kanoj Sarcar
    Cc: Roland Dreier
    Cc: Steve Wise
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Cc: Rusty Russell
    Cc: Anthony Liguori
    Cc: Chris Wright
    Cc: Marcelo Tosatti
    Cc: Eric Dumazet
    Cc: "Paul E. McKenney"
    Cc: Izik Eidus
    Cc: Anthony Liguori
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • mm_take_all_locks holds off reclaim from an entire mm_struct. This allows
    mmu notifiers to register into the mm at any time with the guarantee that
    no mmu operation is in progress on the mm.

    This operation locks against the VM for all pte/vma/mm related operations
    that could ever happen on a certain mm. This includes vmtruncate,
    try_to_unmap, and all page faults.

    The caller must take the mmap_sem in write mode before calling
    mm_take_all_locks(). The caller isn't allowed to release the mmap_sem
    until mm_drop_all_locks() returns.

    mmap_sem in write mode is required in order to block all operations that
    could modify pagetables and free pages without need of altering the vma
    layout (for example populate_range() with nonlinear vmas). It's also
    needed in write mode to avoid new anon_vmas to be associated with existing
    vmas.

    A single task can't take more than one mm_take_all_locks() in a row or it
    would deadlock.

    mm_take_all_locks() and mm_drop_all_locks are expensive operations that
    may have to take thousand of locks.

    mm_take_all_locks() can fail if it's interrupted by signals.

    When mmu_notifier_register returns, we must be sure that the driver is
    notified if some task is in the middle of a vmtruncate for the 'mm' where
    the mmu notifier was registered (mmu_notifier_invalidate_range_start/end
    is run around the vmtruncation but mmu_notifier_register can run after
    mmu_notifier_invalidate_range_start and before
    mmu_notifier_invalidate_range_end). Same problem for rmap paths. And
    we've to remove page pinning to avoid replicating the tlb_gather logic
    inside KVM (and GRU doesn't work well with page pinning regardless of
    needing tlb_gather), so without mm_take_all_locks when vmtruncate frees
    the page, kvm would have no way to notice that it mapped into sptes a page
    that is going into the freelist without a chance of any further
    mmu_notifier notification.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Andrea Arcangeli
    Acked-by: Linus Torvalds
    Cc: Christoph Lameter
    Cc: Jack Steiner
    Cc: Robin Holt
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Kanoj Sarcar
    Cc: Roland Dreier
    Cc: Steve Wise
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Cc: Rusty Russell
    Cc: Anthony Liguori
    Cc: Chris Wright
    Cc: Marcelo Tosatti
    Cc: Eric Dumazet
    Cc: "Paul E. McKenney"
    Cc: Izik Eidus
    Cc: Anthony Liguori
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

25 Jul, 2008

3 commits

  • The goal of this patchset is to support multiple hugetlb page sizes. This
    is achieved by introducing a new struct hstate structure, which
    encapsulates the important hugetlb state and constants (eg. huge page
    size, number of huge pages currently allocated, etc).

    The hstate structure is then passed around the code which requires these
    fields, they will do the right thing regardless of the exact hstate they
    are operating on.

    This patch adds the hstate structure, with a single global instance of it
    (default_hstate), and does the basic work of converting hugetlb to use the
    hstate.

    Future patches will add more hstate structures to allow for different
    hugetlbfs mounts to have different page sizes.

    [akpm@linux-foundation.org: coding-style fixes]
    Acked-by: Adam Litke
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • With Mel's hugetlb private reservation support patches applied, strict
    overcommit semantics are applied to both shared and private huge page
    mappings. This can be a problem if an application relied on unlimited
    overcommit semantics for private mappings. An example of this would be an
    application which maps a huge area with the intention of using it very
    sparsely. These application would benefit from being able to opt-out of
    the strict overcommit. It should be noted that prior to hugetlb
    supporting demand faulting all mappings were fully populated and so
    applications of this type should be rare.

    This patch stack implements the MAP_NORESERVE mmap() flag for huge page
    mappings. This flag has the same meaning as for small page mappings,
    suppressing reservations for that mapping.

    Thanks to Mel Gorman for reviewing a number of early versions of these
    patches.

    This patch:

    When a small page mapping is created with mmap() reservations are created
    by default for any memory pages required. When the region is read/write
    the reservation is increased for every page, no reservation is needed for
    read-only regions (as they implicitly share the zero page). Reservations
    are tracked via the VM_ACCOUNT vma flag which is present when the region
    has reservation backing it. When we convert a region from read-only to
    read-write new reservations are aquired and VM_ACCOUNT is set. However,
    when a read-only map is created with MAP_NORESERVE it is indistinguishable
    from a normal mapping. When we then convert that to read/write we are
    forced to incorrectly create reservations for it as we have no record of
    the original MAP_NORESERVE.

    This patch introduces a new vma flag VM_NORESERVE which records the
    presence of the original MAP_NORESERVE flag. This allows us to
    distinguish these two circumstances and correctly account the reserve.

    As well as fixing this FIXME in the code, this makes it much easier to
    introduce MAP_NORESERVE support for huge pages as this flag is available
    consistantly for the life of the mapping. VM_ACCOUNT on the other hand is
    heavily used at the generic level in association with small pages.

    Signed-off-by: Andy Whitcroft
    Cc: Mel Gorman
    Cc: Adam Litke
    Cc: Johannes Weiner
    Cc: Andy Whitcroft
    Cc: William Lee Irwin III
    Cc: Hugh Dickins
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • The double indirection here is not needed anywhere and hence (at least)
    confusing.

    Signed-off-by: Jan Beulich
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "Luck, Tony"
    Cc: Paul Mundt
    Cc: "David S. Miller"
    Acked-by: Jeremy Fitzhardinge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Beulich
     

09 Jul, 2008

1 commit

  • This patch allows architectures to define functions to deal with
    additional protections bits for mmap() and mprotect().

    arch_calc_vm_prot_bits() maps additonal protection bits to vm_flags
    arch_vm_get_page_prot() maps additional vm_flags to the vma's vm_page_prot
    arch_validate_prot() checks for valid values of the protection bits

    Note: vm_get_page_prot() is now pretty ugly, but the generated code
    should be identical for architectures that don't define additional
    protection bits.

    Signed-off-by: Dave Kleikamp
    Acked-by: Andrew Morton
    Acked-by: Hugh Dickins
    Signed-off-by: Benjamin Herrenschmidt

    Dave Kleikamp
     

07 Jun, 2008

1 commit

  • Fix a regression introduced by

    commit 4cc6028d4040f95cdb590a87db478b42b8be0508
    Author: Jiri Kosina
    Date: Wed Feb 6 22:39:44 2008 +0100

    brk: check the lower bound properly

    The check in sys_brk() on minimum value the brk might have must take
    CONFIG_COMPAT_BRK setting into account. When this option is turned on
    (i.e. we support ancient legacy binaries, e.g. libc5-linked stuff), the
    lower bound on brk value is mm->end_code, otherwise the brk start is
    allowed to be arbitrarily shifted.

    Signed-off-by: Jiri Kosina
    Tested-by: Geert Uytterhoeven
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Kosina
     

25 May, 2008

1 commit

  • The atomic_t type is 32bit but a 64bit system can have more than 2^32
    pages of virtual address space available. Without this we overflow on
    ludicrously large mappings

    Signed-off-by: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Cox
     

29 Apr, 2008

1 commit

  • The kernel implements readlink of /proc/pid/exe by getting the file from
    the first executable VMA. Then the path to the file is reconstructed and
    reported as the result.

    Because of the VMA walk the code is slightly different on nommu systems.
    This patch avoids separate /proc/pid/exe code on nommu systems. Instead of
    walking the VMAs to find the first executable file-backed VMA we store a
    reference to the exec'd file in the mm_struct.

    That reference would prevent the filesystem holding the executable file
    from being unmounted even after unmapping the VMAs. So we track the number
    of VM_EXECUTABLE VMAs and drop the new reference when the last one is
    unmapped. This avoids pinning the mounted filesystem.

    [akpm@linux-foundation.org: improve comments]
    [yamamoto@valinux.co.jp: fix dup_mmap]
    Signed-off-by: Matt Helsley
    Cc: Oleg Nesterov
    Cc: David Howells
    Cc:"Eric W. Biederman"
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Hugh Dickins
    Signed-off-by: YAMAMOTO Takashi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Helsley
     

28 Apr, 2008

3 commits

  • This patch renames mpol_copy() to mpol_dup() because, well, that's what it
    does. Like, e.g., strdup() for strings, mpol_dup() takes a pointer to an
    existing mempolicy, allocates a new one and copies the contents.

    In a later patch, I want to use the name mpol_copy() to copy the contents from
    one mempolicy to another like, e.g., strcpy() does for strings.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • This is a change that was requested some time ago by Mel Gorman. Makes sense
    to me, so here it is.

    Note: I retain the name "mpol_free_shared_policy()" because it actually does
    free the shared_policy, which is NOT a reference counted object. However, ...

    The mempolicy object[s] referenced by the shared_policy are reference counted,
    so mpol_put() is used to release the reference held by the shared_policy. The
    mempolicy might not be freed at this time, because some task attached to the
    shared object associated with the shared policy may be in the process of
    allocating a page based on the mempolicy. In that case, the task performing
    the allocation will hold a reference on the mempolicy, obtained via
    mpol_shared_policy_lookup(). The mempolicy will be freed when all tasks
    holding such a reference have called mpol_put() for the mempolicy.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • It is not easy to actually understand the "if (!file || !vma_merge())"
    code, turn it into "if (file && vma_merge())". This makes immediately
    obvious that the subsequent "if (file)" is superfluous.

    As Hugh Dickins pointed out, we can also factor out the ->i_writecount
    corrections, and add a small comment about that.

    Signed-off-by: Oleg Nesterov
    Cc: Miklos Szeredi
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

09 Feb, 2008

1 commit

  • Convert special mapping install from nopage to fault.

    Because the "vm_file" is NULL for the special mapping, the generic VM
    code has messed up "vm_pgoff" thinking that it's an anonymous mapping
    and the offset does't matter. For that reason, we need to undo the
    vm_pgoff offset that got added into vmf->pgoff.

    [ We _really_ should clean that up - either by making this whole special
    mapping code just use a real file entry rather than that ugly array of
    "struct page" pointers, or by just making the VM code realize that
    even if vm_file is NULL it may not be a regular anonymous mmap.
    - Linus ]

    Signed-off-by: Nick Piggin
    Cc: linux-mm@kvack.org
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

07 Feb, 2008

1 commit

  • There is a check in sys_brk(), that tries to make sure that we do not
    underflow the area that is dedicated to brk heap.

    The check is however wrong, as it assumes that brk area starts immediately
    after the end of the code (+bss), which is wrong for example in
    environments with randomized brk start. The proper way is to check whether
    the address is not below the start_brk address.

    Signed-off-by: Jiri Kosina
    Signed-off-by: Ingo Molnar

    Jiri Kosina
     

06 Feb, 2008

1 commit

  • In order to change the layout of the page tables after an mmap has crossed the
    adress space limit of the current page table layout a architecture hook in
    get_unmapped_area is needed. The arguments are the address of the new mapping
    and the length of it.

    Cc: Benjamin Herrenschmidt
    Signed-off-by: Martin Schwidefsky
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Schwidefsky
     

04 Feb, 2008

1 commit

  • Drivers that register a ->fault handler, but do not range-check the
    offset argument, must set VM_DONTEXPAND in the vm_flags in order to
    prevent an expanding mremap from overflowing the resource.

    I've audited the tree and attempted to fix these problems (usually by
    adding VM_DONTEXPAND where it is not obvious).

    Signed-off-by: Nick Piggin
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

30 Jan, 2008

1 commit

  • Randomize the location of the heap (brk) for i386 and x86_64. The range is
    randomized in the range starting at current brk location up to 0x02000000
    offset for both architectures. This, together with
    pie-executable-randomization.patch and
    pie-executable-randomization-fix.patch, should make the address space
    randomization on i386 and x86_64 complete.

    Arjan says:

    This is known to break older versions of some emacs variants, whose dumper
    code assumed that the last variable declared in the program is equal to the
    start of the dynamically allocated memory region.

    (The dumper is the code where emacs effectively dumps core at the end of it's
    compilation stage; this coredump is then loaded as the main program during
    normal use)

    iirc this was 5 years or so; we found this way back when I was at RH and we
    first did the security stuff there (including this brk randomization). It
    wasn't all variants of emacs, and it got fixed as a result (I vaguely remember
    that emacs already had code to deal with it for other archs/oses, just
    ifdeffed wrongly).

    It's a rare and wrong assumption as a general thing, just on x86 it mostly
    happened to be true (but to be honest, it'll break too if gcc does
    something fancy or if the linker does a non-standard order). Still its
    something we should at least document.

    Note 2: afaik it only broke the emacs *build*. I'm not 100% sure about that
    (it IS 5 years ago) though.

    [ akpm@linux-foundation.org: deuglification ]

    Signed-off-by: Jiri Kosina
    Cc: Arjan van de Ven
    Cc: Roland McGrath
    Cc: Jakub Jelinek
    Signed-off-by: Andrew Morton
    Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner

    Jiri Kosina
     

25 Jan, 2008

1 commit


05 Dec, 2007

3 commits

  • Given a specifically crafted binary do_brk() can be used to get low
    pages available in userspace virtually memory and can thus be used to
    circumvent the mmap_min_addr low memory protection. Add security checks
    in do_brk().

    Signed-off-by: Eric Paris
    Acked-by: Alan Cox
    Signed-off-by: James Morris

    Eric Paris
     
  • If mmap_min_addr is set and a process attempts to mmap (not fixed) with a
    non-null hint address less than mmap_min_addr the mapping will fail the
    security checks. Since this is just a hint address this patch will round
    such a hint address above mmap_min_addr.

    gcj was found to try to be very frugal with vm usage and give hint addresses
    in the 8k-32k range. Without this patch all such programs failed and with
    the patch they happily get a higher address.

    This patch is wrappad in CONFIG_SECURITY since mmap_min_addr doesn't exist
    without it and there would be no security check possible no matter what. So
    we should not bother compiling in this rounding if it is just a waste of
    time.

    Signed-off-by: Eric Paris
    Signed-off-by: James Morris

    Eric Paris
     
  • Add security checks to make sure we are not attempting to expand the
    stack into memory protected by mmap_min_addr

    Signed-off-by: Eric Paris
    Signed-off-by: James Morris

    Eric Paris
     

23 Oct, 2007

1 commit

  • Fix mprotect bug in recent commit 3ed75eb8f1cd89565966599c4f77d2edb086d5b0
    (setup vma->vm_page_prot by vm_get_page_prot()): the vma_wants_writenotify
    case was setting the same prot as when not.

    Nothing wrong with the use of protection_map[] in mmap_region(),
    but use vm_get_page_prot() there too in the same ~VM_SHARED way.

    Signed-off-by: Hugh Dickins
    Cc: Coly Li
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

20 Oct, 2007

1 commit

  • This patch uses vm_get_page_prot() to setup vma->vm_page_prot.

    Though inside vm_get_page_prot() the protection flags is AND with
    (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED), it does not hurt correct code.

    Signed-off-by: Coly Li
    Cc: Hugh Dickins
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Coly Li
     

17 Oct, 2007

2 commits

  • This patch contains the following cleanups that are now possible:
    - remove the unused security_operations->inode_xattr_getsuffix
    - remove the no longer used security_operations->unregister_security
    - remove some no longer required exit code
    - remove a bunch of no longer used exports

    Signed-off-by: Adrian Bunk
    Acked-by: James Morris
    Cc: Chris Wright
    Cc: Stephen Smalley
    Cc: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • mm.h doesn't use directly anything from mutex.h and backing-dev.h, so
    remove them and add them back to files which need them.

    Cross-compile tested on many configs and archs.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

23 Aug, 2007

1 commit

  • The new exec code inserts an accounted vma into an mm struct which is not
    current->mm. The existing memory check code has a hard coded assumption
    that this does not happen as does the security code.

    As the correct mm is known we pass the mm to the security method and the
    helper function. A new security test is added for the case where we need
    to pass the mm and the existing one is modified to pass current->mm to
    avoid the need to change large amounts of code.

    (Thanks to Tobias for fixing rejects and testing)

    Signed-off-by: Alan Cox
    Cc: WU Fengguang
    Cc: James Morris
    Cc: Tobias Diedrich
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Cox
     

30 Jul, 2007

1 commit

  • Remove fs.h from mm.h. For this,
    1) Uninline vma_wants_writenotify(). It's pretty huge anyway.
    2) Add back fs.h or less bloated headers (err.h) to files that need it.

    As result, on x86_64 allyesconfig, fs.h dependencies cut down from 3929 files
    rebuilt down to 3444 (-12.3%).

    Cross-compile tested without regressions on my two usual configs and (sigh):

    alpha arm-mx1ads mips-bigsur powerpc-ebony
    alpha-allnoconfig arm-neponset mips-capcella powerpc-g5
    alpha-defconfig arm-netwinder mips-cobalt powerpc-holly
    alpha-up arm-netx mips-db1000 powerpc-iseries
    arm arm-ns9xxx mips-db1100 powerpc-linkstation
    arm-assabet arm-omap_h2_1610 mips-db1200 powerpc-lite5200
    arm-at91rm9200dk arm-onearm mips-db1500 powerpc-maple
    arm-at91rm9200ek arm-picotux200 mips-db1550 powerpc-mpc7448_hpc2
    arm-at91sam9260ek arm-pleb mips-ddb5477 powerpc-mpc8272_ads
    arm-at91sam9261ek arm-pnx4008 mips-decstation powerpc-mpc8313_rdb
    arm-at91sam9263ek arm-pxa255-idp mips-e55 powerpc-mpc832x_mds
    arm-at91sam9rlek arm-realview mips-emma2rh powerpc-mpc832x_rdb
    arm-ateb9200 arm-realview-smp mips-excite powerpc-mpc834x_itx
    arm-badge4 arm-rpc mips-fulong powerpc-mpc834x_itxgp
    arm-carmeva arm-s3c2410 mips-ip22 powerpc-mpc834x_mds
    arm-cerfcube arm-shannon mips-ip27 powerpc-mpc836x_mds
    arm-clps7500 arm-shark mips-ip32 powerpc-mpc8540_ads
    arm-collie arm-simpad mips-jazz powerpc-mpc8544_ds
    arm-corgi arm-spitz mips-jmr3927 powerpc-mpc8560_ads
    arm-csb337 arm-trizeps4 mips-malta powerpc-mpc8568mds
    arm-csb637 arm-versatile mips-mipssim powerpc-mpc85xx_cds
    arm-ebsa110 i386 mips-mpc30x powerpc-mpc8641_hpcn
    arm-edb7211 i386-allnoconfig mips-msp71xx powerpc-mpc866_ads
    arm-em_x270 i386-defconfig mips-ocelot powerpc-mpc885_ads
    arm-ep93xx i386-up mips-pb1100 powerpc-pasemi
    arm-footbridge ia64 mips-pb1500 powerpc-pmac32
    arm-fortunet ia64-allnoconfig mips-pb1550 powerpc-ppc64
    arm-h3600 ia64-bigsur mips-pnx8550-jbs powerpc-prpmc2800
    arm-h7201 ia64-defconfig mips-pnx8550-stb810 powerpc-ps3
    arm-h7202 ia64-gensparse mips-qemu powerpc-pseries
    arm-hackkit ia64-sim mips-rbhma4200 powerpc-up
    arm-integrator ia64-sn2 mips-rbhma4500 s390
    arm-iop13xx ia64-tiger mips-rm200 s390-allnoconfig
    arm-iop32x ia64-up mips-sb1250-swarm s390-defconfig
    arm-iop33x ia64-zx1 mips-sead s390-up
    arm-ixp2000 m68k mips-tb0219 sparc
    arm-ixp23xx m68k-amiga mips-tb0226 sparc-allnoconfig
    arm-ixp4xx m68k-apollo mips-tb0287 sparc-defconfig
    arm-jornada720 m68k-atari mips-workpad sparc-up
    arm-kafa m68k-bvme6000 mips-wrppmc sparc64
    arm-kb9202 m68k-hp300 mips-yosemite sparc64-allnoconfig
    arm-ks8695 m68k-mac parisc sparc64-defconfig
    arm-lart m68k-mvme147 parisc-allnoconfig sparc64-up
    arm-lpd270 m68k-mvme16x parisc-defconfig um-x86_64
    arm-lpd7a400 m68k-q40 parisc-up x86_64
    arm-lpd7a404 m68k-sun3 powerpc x86_64-allnoconfig
    arm-lubbock m68k-sun3x powerpc-cell x86_64-defconfig
    arm-lusl7200 mips powerpc-celleb x86_64-up
    arm-mainstone mips-atlas powerpc-chrp32

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

20 Jul, 2007

2 commits

  • Remove the arg+env limit of MAX_ARG_PAGES by copying the strings directly from
    the old mm into the new mm.

    We create the new mm before the binfmt code runs, and place the new stack at
    the very top of the address space. Once the binfmt code runs and figures out
    where the stack should be, we move it downwards.

    It is a bit peculiar in that we have one task with two mm's, one of which is
    inactive.

    [a.p.zijlstra@chello.nl: limit stack size]
    Signed-off-by: Ollie Wild
    Signed-off-by: Peter Zijlstra
    Cc:
    Cc: Hugh Dickins
    [bunk@stusta.de: unexport bprm_mm_init]
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ollie Wild
     
  • Nonlinear mappings are (AFAIKS) simply a virtual memory concept that encodes
    the virtual address -> file offset differently from linear mappings.

    ->populate is a layering violation because the filesystem/pagecache code
    should need to know anything about the virtual memory mapping. The hitch here
    is that the ->nopage handler didn't pass down enough information (ie. pgoff).
    But it is more logical to pass pgoff rather than have the ->nopage function
    calculate it itself anyway (because that's a similar layering violation).

    Having the populate handler install the pte itself is likewise a nasty thing
    to be doing.

    This patch introduces a new fault handler that replaces ->nopage and
    ->populate and (later) ->nopfn. Most of the old mechanism is still in place
    so there is a lot of duplication and nice cleanups that can be removed if
    everyone switches over.

    The rationale for doing this in the first place is that nonlinear mappings are
    subject to the pagefault vs invalidate/truncate race too, and it seemed stupid
    to duplicate the synchronisation logic rather than just consolidate the two.

    After this patch, MAP_NONBLOCK no longer sets up ptes for pages present in
    pagecache. Seems like a fringe functionality anyway.

    NOPAGE_REFAULT is removed. This should be implemented with ->fault, and no
    users have hit mainline yet.

    [akpm@linux-foundation.org: cleanup]
    [randy.dunlap@oracle.com: doc. fixes for readahead]
    [akpm@linux-foundation.org: build fix]
    Signed-off-by: Nick Piggin
    Signed-off-by: Randy Dunlap
    Cc: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

17 Jul, 2007

1 commit

  • This is a straightforward split of do_mmap_pgoff() into two functions:

    - do_mmap_pgoff() checks the parameters, and calculates the vma
    flags. Then it calls

    - mmap_region(), which does the actual mapping

    Signed-off-by: Miklos Szeredi
    Acked-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

12 Jul, 2007

1 commit

  • Add a new security check on mmap operations to see if the user is attempting
    to mmap to low area of the address space. The amount of space protected is
    indicated by the new proc tunable /proc/sys/vm/mmap_min_addr and defaults to
    0, preserving existing behavior.

    This patch uses a new SELinux security class "memprotect." Policy already
    contains a number of allow rules like a_t self:process * (unconfined_t being
    one of them) which mean that putting this check in the process class (its
    best current fit) would make it useless as all user processes, which we also
    want to protect against, would be allowed. By taking the memprotect name of
    the new class it will also make it possible for us to move some of the other
    memory protect permissions out of 'process' and into the new class next time
    we bump the policy version number (which I also think is a good future idea)

    Acked-by: Stephen Smalley
    Acked-by: Chris Wright
    Signed-off-by: Eric Paris
    Signed-off-by: James Morris

    Eric Paris
     

22 Jun, 2007

1 commit

  • Function expand_upwards() did not guarded against wrapping
    around to address 0. This fixes the adjtimex02 testcase from
    the Linux Test Project on a 32bit PARISC kernel.

    [expand_upwards is only used on parisc and ia64; it looks like it does
    the right thing on both. --kyle]

    Signed-off-by: Helge Deller
    Cc: Tony Luck
    Signed-off-by: Kyle McMartin

    Helge Deller
     

09 May, 2007

2 commits


08 May, 2007

1 commit

  • Remove the hugetlbfs specific hacks in toplevel get_unmapped_area() now that
    all archs and hugetlbfs itself do the right thing for both cases.

    Signed-off-by: Benjamin Herrenschmidt
    Acked-by: William Irwin
    Cc: Paul Mackerras
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Russell King
    Cc: David Howells
    Cc: Andi Kleen
    Cc: "Luck, Tony"
    Cc: Kyle McMartin
    Cc: Grant Grundler
    Cc: Matthew Wilcox
    Cc: "David S. Miller"
    Cc: Adam Litke
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt