05 Aug, 2008

2 commits

  • Converting page lock to new locking bitops requires a change of page flag
    operation naming, so we might as well convert it to something nicer
    (!TestSetPageLocked_Lock => trylock_page, SetPageLocked => set_page_locked).

    This also facilitates lockdeping of page lock.
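
    As a hedged illustration (a hypothetical caller, not taken from the patch):
    TestSetPageLocked() returned the old bit value, so 0 meant the lock was
    acquired, whereas trylock_page() returns true on success.

    /* before */
    if (!TestSetPageLocked(page)) {
            /* ... we hold the page lock ... */
            unlock_page(page);
    }

    /* after */
    if (trylock_page(page)) {
            /* ... we hold the page lock ... */
            unlock_page(page);
    }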

    Signed-off-by: Nick Piggin
    Acked-by: KOSAKI Motohiro
    Acked-by: Peter Zijlstra
    Acked-by: Andrew Morton
    Acked-by: Benjamin Herrenschmidt
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Halesh says:

    Please find below a testcase provided to test mlock.

    Test Case :
    ===========================

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/mman.h>
    #include <sys/resource.h>

    int main(void)
    {
            int fd, ret, i = 0;
            char *addr, *addr1 = NULL;
            unsigned int page_size;
            struct rlimit rlim;

            if (0 != geteuid()) {
                    printf("Execute this pgm as root\n");
                    exit(1);
            }

            /* create a file */
            if ((fd = open("mmap_test.c", O_RDWR | O_CREAT, 0755)) == -1) {
                    printf("cant create test file\n");
                    exit(1);
            }

            page_size = sysconf(_SC_PAGE_SIZE);

            /* set the MEMLOCK limit */
            rlim.rlim_cur = 2000;
            rlim.rlim_max = 2000;

            if ((ret = setrlimit(RLIMIT_MEMLOCK, &rlim)) != 0) {
                    printf("Cant change limit values\n");
                    exit(1);
            }

            addr = 0;
            while (1) {
                    /* map a page into memory each time */
                    if ((addr = (char *) mmap(addr, page_size,
                                    PROT_READ | PROT_WRITE,
                                    MAP_SHARED, fd, 0)) == MAP_FAILED) {
                            printf("cant do mmap on file\n");
                            exit(1);
                    }

                    if (0 == i)
                            addr1 = addr;
                    i++;
                    errno = 0;
                    /* lock the mapped memory pagewise */
                    if ((ret = mlock((char *)addr, 1500)) == -1) {
                            printf("errno value is %d\n", errno);
                            printf("cant lock maped region\n");
                            exit(1);
                    }
                    addr = addr + page_size;
            }
    }
    ======================================================

    This testcase results in an mlock() failure with errno 14, i.e. EFAULT,
    but nowhere is it specified that mlock() will return EFAULT. When I
    tested the same on older kernels like 2.6.18, I got the correct result,
    i.e. errno 12 (ENOMEM).

    I think that in the mlock(2) source, setting errno to ENOMEM has been
    missed in do_mlock() on mlock_fixup() failure.

    SUSv3 requires the following behavior from mlock(2).

    [ENOMEM]
    Some or all of the address range specified by the addr and
    len arguments does not correspond to valid mapped pages
    in the address space of the process.

    [EAGAIN]
    Some or all of the memory identified by the operation could not
    be locked when the call was made.

    This rule isn't so nice and is slightly strange, but many people think
    POSIX/SUS compliance is important.
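
    A minimal sketch of the distinction being argued for (illustrative
    fragment; addr and len as in the testcase above):

    if (mlock(addr, len) == -1) {
            if (errno == ENOMEM)
                    printf("part of the range is not valid mapped pages\n");
            else if (errno == EAGAIN)
                    printf("the memory could not be locked at this time\n");
    }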

    Reported-by: Halesh Sadashiv
    Tested-by: Halesh Sadashiv
    Signed-off-by: KOSAKI Motohiro
    Cc: [2.6.25.x, 2.6.26.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

02 Aug, 2008

1 commit


31 Jul, 2008

2 commits


29 Jul, 2008

1 commit

  • With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to pages.
    There are secondary MMUs (with secondary sptes and secondary tlbs) too.
    sptes in the kvm case are shadow pagetables, but when I say spte in
    mmu-notifier context, I mean "secondary pte". In GRU case there's no
    actual secondary pte and there's only a secondary tlb because the GRU
    secondary MMU has no knowledge about sptes and every secondary tlb miss
    event in the MMU always generates a page fault that has to be resolved by
    the CPU (this is not the case with KVM, where a secondary tlb miss will
    walk sptes in hardware and it will refill the secondary tlb transparently
    to software if the corresponding spte is present). The same way
    zap_page_range has to invalidate the pte before freeing the page, the spte
    (and secondary tlb) must also be invalidated before any page is freed and
    reused.

    Currently we take a page_count pin on every page mapped by sptes, but that
    means the pages can't be swapped whenever they're mapped by any spte
    because they're part of the guest working set. Furthermore a spte unmap
    event can immediately lead to a page being freed when the pin is released
    (so requiring the same complex and relatively slow tlb_gather smp safe
    logic we have in zap_page_range and that can be avoided completely if the
    spte unmap event doesn't require an unpin of the page previously mapped in
    the secondary MMU).

    The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
    when the VM is swapping or freeing or doing anything on the primary MMU so
    that the secondary MMU code can drop sptes before the pages are freed,
    avoiding all page pinning and allowing 100% reliable swapping of guest
    physical address space. Furthermore it spares the code that tears down the
    secondary MMU mappings from having to implement tlb_gather-like logic as in
    zap_page_range, which would require many IPIs to flush other cpu tlbs for
    each fixed number of sptes unmapped.

    For example: if what happens on the primary MMU is a protection downgrade
    (from writeable to wrprotect), the secondary MMU mappings will be
    invalidated, and the next secondary-mmu page fault will call
    get_user_pages; if it calls get_user_pages with write=1 that triggers a
    do_wp_page, and it'll re-establish an updated spte or secondary-tlb
    mapping on the copied page. Or it will set up a readonly spte or readonly
    tlb mapping if it's a guest read and it calls get_user_pages with write=0.
    This is just an example.
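
    A hedged sketch of that re-fault path (the helper name and the spte
    installation step are hypothetical; only get_user_pages() and the locking
    around it are real kernel interfaces):

    /* resolve a secondary-MMU fault at host virtual address hva */
    static int secondary_mmu_refault(struct mm_struct *mm, unsigned long hva,
                                     int write)
    {
            struct page *page;
            int ret;

            down_read(&mm->mmap_sem);
            /* write=1 makes get_user_pages() go through do_wp_page() (COW)
             * if the primary pte is write-protected, as described above */
            ret = get_user_pages(current, mm, hva, 1, write, 0, &page, NULL);
            up_read(&mm->mmap_sem);
            if (ret != 1)
                    return -EFAULT;

            /* install_spte(hva, page);  -- hypothetical: re-establish the
             * spte or secondary-tlb mapping on the (possibly copied) page */
            put_page(page);
            return 0;
    }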

    This allows mapping any page pointed to by any pte (and in turn visible in
    the primary CPU MMU) into a secondary MMU (be it a pure tlb like GRU, or a
    full MMU with both sptes and a secondary tlb like the shadow-pagetable
    layer with kvm), or a remote DMA in software like XPMEM (hence the need to
    schedule in XPMEM code to send the invalidate to the remote node, while
    there is no need to schedule in kvm/gru as it's an immediate event like
    invalidating a primary-mmu pte).

    At least for KVM without this patch it's impossible to swap guests
    reliably. And having this feature and removing the page pin allows
    several other optimizations that simplify life considerably.

    Dependencies:

    1) mm_take_all_locks() to register the mmu notifier when the whole VM
    isn't doing anything with "mm". This allows mmu notifier users to keep
    track if the VM is in the middle of the invalidate_range_begin/end
    critical section with an atomic counter increased in range_begin and
    decreased in range_end (a sketch of this counting scheme follows after
    point 2). No secondary MMU page fault is allowed to map
    any spte or secondary tlb reference, while the VM is in the middle of
    range_begin/end as any page returned by get_user_pages in that critical
    section could later immediately be freed without any further
    ->invalidate_page notification (invalidate_range_begin/end works on
    ranges and ->invalidate_page isn't called immediately before freeing
    the page). To stop all page freeing and pagetable overwrites the
    mmap_sem must be taken in write mode and all other anon_vma/i_mmap
    locks must be taken too.

    2) It'd be a waste to add branches in the VM if nobody could possibly
    run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be enabled if
    CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of
    mmu notifiers, but this already allows compiling a KVM external module
    against a kernel with mmu notifiers enabled and from the next pull from
    kvm.git we'll start using them. And GRU/XPMEM will also be able to
    continue the development by enabling KVM=m in their config, until they
    submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can
    also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n).
    This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM
    are all =n.
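
    A hedged sketch of the counting scheme from point 1 (the callback
    signatures are simplified and hypothetical, not the real mmu notifier
    API):

    static atomic_t range_invalidates_running = ATOMIC_INIT(0);

    static void my_invalidate_range_begin(unsigned long start, unsigned long end)
    {
            atomic_inc(&range_invalidates_running);
            /* drop all sptes / secondary-tlb entries covering [start, end) */
    }

    static void my_invalidate_range_end(unsigned long start, unsigned long end)
    {
            atomic_dec(&range_invalidates_running);
    }

    static int secondary_fault_may_map(void)
    {
            /* no new spte may be established while any range invalidate runs,
             * because pages returned by get_user_pages() in that window could
             * be freed without a further ->invalidate_page notification */
            return atomic_read(&range_invalidates_running) == 0;
    }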

    The mmu_notifier_register call can fail because mm_take_all_locks may be
    interrupted by a signal and return -EINTR. Because mmu_notifier_register
    is used at driver startup, a failure can be gracefully handled. Here is an
    example of the change applied to kvm to register the mmu notifiers.
    Usually when a driver starts up other allocations are required anyway and
    -ENOMEM failure paths exist already.

    struct kvm *kvm_arch_create_vm(void)
    {
            struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
    +       int err;

            if (!kvm)
                    return ERR_PTR(-ENOMEM);

            INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);

    +       kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
    +       err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
    +       if (err) {
    +               kfree(kvm);
    +               return ERR_PTR(err);
    +       }
    +
            return kvm;
    }

    mmu_notifier_unregister returns void and it's reliable.

    The patch also adds a few needed but missing includes that would otherwise
    prevent the kernel from compiling after these changes on non-x86 archs
    (x86 didn't need them by luck).

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix mm/filemap_xip.c build]
    [akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Nick Piggin
    Signed-off-by: Christoph Lameter
    Cc: Jack Steiner
    Cc: Robin Holt
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Kanoj Sarcar
    Cc: Roland Dreier
    Cc: Steve Wise
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Cc: Rusty Russell
    Cc: Anthony Liguori
    Cc: Chris Wright
    Cc: Marcelo Tosatti
    Cc: Eric Dumazet
    Cc: "Paul E. McKenney"
    Cc: Izik Eidus
    Cc: Anthony Liguori
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

27 Jul, 2008

1 commit


25 Jul, 2008

7 commits

  • Straightforward extensions for huge pages located in the PUD instead of
    PMDs.

    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Add the ability to configure the hugetlb hstate used on a per mount basis.

    - Add a new pagesize= option to the hugetlbfs mount that allows setting
    the page size
    - This option causes the mount code to find the hstate corresponding to the
    specified size, and sets up a pointer to the hstate in the mount's
    superblock.
    - Change the hstate accessors to use this information rather than the
    global_hstate they were using (requires a slight change in mm/memory.c
    so we don't NULL deref in the error-unmap path -- see comments).
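
    A hedged example of using the new option from userspace via mount(2); the
    mount point and the exact option syntax accepted here are assumptions
    based on the description above:

    #include <stdio.h>
    #include <sys/mount.h>

    /* mount a hugetlbfs instance whose hstate is the 2MB huge page size */
    if (mount("none", "/mnt/huge-2m", "hugetlbfs", 0, "pagesize=2M") == -1)
            perror("mount");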

    [np: take hstate out of hugetlbfs inode and vma->vm_private_data]

    Acked-by: Adam Litke
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • The goal of this patchset is to support multiple hugetlb page sizes. This
    is achieved by introducing a new struct hstate structure, which
    encapsulates the important hugetlb state and constants (eg. huge page
    size, number of huge pages currently allocated, etc).

    The hstate structure is then passed around to the code which requires
    these fields; callers will do the right thing regardless of the exact
    hstate they are operating on.

    This patch adds the hstate structure, with a single global instance of it
    (default_hstate), and does the basic work of converting hugetlb to use the
    hstate.

    Future patches will add more hstate structures to allow for different
    hugetlbfs mounts to have different page sizes.
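
    A hedged sketch of the kind of state such a structure encapsulates, per
    the description above (field names are illustrative, not necessarily the
    exact ones in the patch):

    struct hstate {
            unsigned int order;             /* huge page size = PAGE_SIZE << order */
            unsigned long mask;             /* address mask for that size */
            unsigned long nr_huge_pages;    /* currently allocated */
            unsigned long free_huge_pages;  /* available in the pool */
    };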

    [akpm@linux-foundation.org: coding-style fixes]
    Acked-by: Adam Litke
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • …n hugetlbfs will succeed

    After patch 2 in this series, a process that successfully calls mmap() for
    a MAP_PRIVATE mapping will be guaranteed to successfully fault until a
    process calls fork(). At that point, the next write fault from the parent
    could fail due to COW if the child still has a reference.

    We only reserve pages for the parent but a copy must be made to avoid
    leaking data from the parent to the child after fork(). Reserves could be
    taken for both parent and child at fork time to guarantee faults but if
    the mapping is large it is highly likely we will not have sufficient pages
    for the reservation, and it is common to fork only to exec() immediately
    after. A failure here would be very undesirable.

    Note that the current behaviour of mainline with MAP_PRIVATE pages is
    pretty bad. The following situation is allowed to occur today.

    1. Process calls mmap(MAP_PRIVATE)
    2. Process calls mlock() to fault all pages and makes sure it succeeds
    3. Process forks()
    4. Process writes to MAP_PRIVATE mapping while child still exists
    5. If the COW fails at this point, the process gets SIGKILLed even though it
    had taken care to ensure the pages existed
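
    In userspace terms, a minimal sketch of that sequence (the hugetlbfs path
    and the 2MB size are assumptions; error handling omitted):

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            size_t len = 2 * 1024 * 1024;           /* one huge page */
            int fd = open("/mnt/huge/file", O_CREAT | O_RDWR, 0600);
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE, fd, 0);     /* 1. MAP_PRIVATE mapping */

            mlock(p, len);                          /* 2. fault in and lock */
            if (fork() == 0) {                      /* 3. child shares the page */
                    sleep(10);
                    _exit(0);
            }
            memset(p, 0, len);                      /* 4. parent write -> COW */
            return 0;                               /* 5. COW failure could SIGKILL */
    }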

    This patch improves the situation by guaranteeing the reliability of the
    process that successfully calls mmap(). When the parent performs COW, it
    will try to satisfy the allocation without using reserves. If that fails
    the parent will steal the page leaving any children without a page.
    Faults from the child after that point will result in failure. If the
    child COW happens first, an attempt will be made to allocate the page
    without reserves and the child will get SIGKILLed on failure.

    To summarise the new behaviour:

    1. If the original mapper performs COW on a private mapping with multiple
    references, it will attempt to allocate a hugepage from the pool or
    the buddy allocator without using the existing reserves. On fail, VMAs
    mapping the same area are traversed and the page being COW'd is unmapped
    where found. It will then steal the original page as the last mapper in
    the normal way.

    2. The VMAs the pages were unmapped from are flagged to note that pages
    with data no longer exist. Future no-page faults on those VMAs will
    terminate the process as otherwise it would appear that data was corrupted.
    A warning is printed to the console that this situation occurred.

    3. If the child performs COW first, it will attempt to satisfy the COW
    from the pool if there are enough pages or via the buddy allocator if
    overcommit is allowed and the buddy allocator can satisfy the request. If
    it fails, the child will be killed.

    If the pool is large enough, existing applications will not notice that
    the reserves were a factor. Existing applications depending on no reserves
    being set are unlikely to exist, as for much of the history of
    hugetlbfs, pages were prefaulted at mmap(), allocating the pages at that
    point or failing the mmap().

    [npiggin@suse.de: fix CONFIG_HUGETLB=n build]
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Acked-by: Adam Litke <agl@us.ibm.com>
    Cc: Andy Whitcroft <apw@shadowen.org>
    Cc: William Lee Irwin III <wli@holomorphy.com>
    Cc: Hugh Dickins <hugh@veritas.com>
    Cc: Nick Piggin <npiggin@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     
  • The double indirection here is not needed anywhere and hence (at least)
    confusing.

    Signed-off-by: Jan Beulich
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "Luck, Tony"
    Cc: Paul Mundt
    Cc: "David S. Miller"
    Acked-by: Jeremy Fitzhardinge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Beulich
     
  • In order to be able to debug things like the X server and programs using
    the PPC Cell SPUs, the debugger needs to be able to access device memory
    through ptrace and /proc/pid/mem.

    This patch:

    Add the generic_access_phys access function and put the hooks in place
    to allow access_process_vm to access device or PPC Cell SPU memory.
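
    A hedged sketch of how a driver opts in (driver names are hypothetical;
    generic_access_phys and the .access hook are what this patch adds):

    static struct vm_operations_struct mydrv_vm_ops = {
            .fault  = mydrv_fault,          /* hypothetical fault handler */
            .access = generic_access_phys,  /* lets ptrace / /proc/pid/mem
                                               reach the device mapping */
    };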

    [riel@redhat.com: Add documentation for the vm_ops->access function]
    Signed-off-by: Rik van Riel
    Signed-off-by: Benjamin Herrenschmidt
    Cc: Dave Airlie
    Cc: Hugh Dickins
    Cc: Paul Mackerras
    Cc: Arnd Bergmann
    Acked-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • There are no users of nopfn in the tree. Remove it.

    [hugh@veritas.com: fix build error]
    Signed-off-by: Nick Piggin
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

05 Jul, 2008

2 commits

  • get_user_pages() must not return the error when i != 0. When pages !=
    NULL we have i get_page()'ed pages.
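
    A hedged sketch expressing the rule (a hypothetical helper, not the
    committed code):

    static int gup_error_to_retval(int i, int err)
    {
            /* if i pages were already get_page()'ed into *pages, report the
             * partial count; only return the error when nothing was pinned */
            return i ? i : err;
    }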

    Signed-off-by: Oleg Nesterov
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Dirty page accounting accurately measures the amount of dirty pages in
    writable shared mappings by mapping the pages RO (as indicated by
    vma_wants_writenotify). We then trap on first write and call
    set_page_dirty() on the page, after which we map the page RW and
    continue execution.

    When we launder dirty pages, we call clear_page_dirty_for_io() which
    clears both the dirty flag, and maps the page RO again before we start
    writeout so that the story can repeat itself.

    vma_wants_writenotify() excludes VM_PFNMAP on the basis that we cannot
    do the regular dirty page stuff on raw PFNs and the memory isn't going
    anywhere anyway.

    The recently introduced VM_MIXEDMAP mixes both !pfn_valid() and
    pfn_valid() pages in a single mapping.

    We can't do dirty page accounting on !pfn_valid() pages as stated
    above, and mapping them RO causes them to be COW'ed on write, which
    breaks VM_SHARED semantics.

    Excluding VM_MIXEDMAP in vma_wants_writenotify() would mean we don't do
    the regular dirty page accounting for the pfn_valid() pages, which
    would bring back all the head-aches from inaccurate dirty page
    accounting.

    So instead, we let the !pfn_valid() pages get mapped RO, but fix them
    up unconditionally in the fault path.

    Signed-off-by: Peter Zijlstra
    Cc: Nick Piggin
    Acked-by: Hugh Dickins
    Cc: "Jared Hulbert"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

24 Jun, 2008

2 commits

  • There is a race in the COW logic. It contains a shortcut to avoid the
    COW and reuse the page if we have the sole reference on the page,
    however it is possible to have two racing do_wp_page()ers with one
    causing the other to mistakenly believe it is safe to take the shortcut
    when it is not. This could lead to data corruption.

    Process 1 and process2 each have a wp pte of the same anon page (ie.
    one forked the other). The page's mapcount is 2. Then they both
    attempt to write to it around the same time...

    proc1 (CPU0)                     proc2 thr1 (CPU1)               proc2 thr2 (CPU3)
    do_wp_page()                     do_wp_page()
                                      trylock_page()
                                      can_share_swap_page()
                                       load page mapcount (==2)
                                       reuse = 0
                                      pte unlock
                                      copy page to new_page
                                      pte lock
                                      page_remove_rmap(page);
     trylock_page()
     can_share_swap_page()
      load page mapcount (==1)
      reuse = 1
     ptep_set_access_flags (allow W)

                                                                     write private key into page
                                                                     read from page
                                      ptep_clear_flush()
                                      set_pte_at(pte of new_page)

    Fix this by moving the page_remove_rmap of the old page after the pte
    clear and flush. Potentially the entire branch could be moved down
    here, but in order to stay consistent, I won't (should probably move all
    the *_mm_counter stuff with one patch).

    Signed-off-by: Nick Piggin
    Acked-by: Hugh Dickins
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Commit 89f5b7da2a6bad2e84670422ab8192382a5aeb9f ("Reinstate ZERO_PAGE
    optimization in 'get_user_pages()' and fix XIP") broke vmware, as
    reported by Jeff Chua:

    "This broke vmware 6.0.4.
    Jun 22 14:53:03.845: vmx| NOT_IMPLEMENTED
    /build/mts/release/bora-93057/bora/vmx/main/vmmonPosix.c:774"

    and the reason seems to be that there's an old bug in how we handle
    FOLL_ANON on VM_SHARED areas in get_user_pages(), but since it only
    triggered if the whole page table was missing, nobody had apparently hit
    it before.

    The recent changes to 'follow_page()' made the FOLL_ANON logic trigger
    not just for whole missing page tables, but for individual pages as
    well, and exposed this problem.

    This fixes it by making the test for when FOLL_ANON is used more
    careful, and also makes the code easier to read and understand by moving
    the logic to a separate inline function.
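
    A hedged sketch of the kind of helper described (the exact name and
    conditions in the committed code may differ):

    static inline int use_zero_page(struct vm_area_struct *vma)
    {
            /* never use the zero page for locked or shared mappings ... */
            if (vma->vm_flags & (VM_LOCKED | VM_SHARED))
                    return 0;
            /* ... and only for mappings with no ->fault of their own */
            return !vma->vm_ops || !vma->vm_ops->fault;
    }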

    Reported-and-tested-by: Jeff Chua
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

21 Jun, 2008

1 commit

  • KAMEZAWA Hiroyuki and Oleg Nesterov point out that since the commit
    557ed1fa2620dc119adb86b34c614e152a629a80 ("remove ZERO_PAGE") removed
    the ZERO_PAGE from the VM mappings, any users of get_user_pages() will
    generally now populate the VM with real empty pages needlessly.

    We used to get the ZERO_PAGE when we did the "handle_mm_fault()", but
    since fault handling no longer uses ZERO_PAGE for new anonymous pages,
    we now need to handle that special case in follow_page() instead.

    In particular, the removal of ZERO_PAGE effectively removed the core
    file writing optimization where we would skip writing pages that had not
    been populated at all, and increased memory pressure a lot by allocating
    all those useless newly zeroed pages.

    This reinstates the optimization by making the unmapped PTE case the
    same as for a non-existent page table, which already did this correctly.

    While at it, this also fixes the XIP case for follow_page(), where the
    caller could not differentiate between the case of a page that simply
    could not be used (because it had no "struct page" associated with it)
    and a page that just wasn't mapped.

    We do that by simply returning an error pointer for pages that could not
    be turned into a "struct page *". The error is arbitrarily picked to be
    EFAULT, since that was what get_user_pages() already used for the
    equivalent IO-mapped page case.
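
    A hedged sketch of the resulting convention at a follow_page() call site
    (illustrative fragment, not the exact committed code):

    page = follow_page(vma, start, foll_flags);
    if (IS_ERR(page))
            return i ? i : PTR_ERR(page);   /* e.g. -EFAULT for an unusable page */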

    [ Also removed an impossible test for pte_offset_map_lock() failing:
    that's not how that function works ]

    Acked-by: Oleg Nesterov
    Acked-by: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Andrew Morton
    Cc: Ingo Molnar
    Cc: Roland McGrath
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

25 May, 2008

1 commit

  • Take out an assertion to allow ->fault handlers to service PFNMAP regions.
    This is required to reimplement .nopfn handlers with .fault handlers and
    subsequently remove nopfn.

    Signed-off-by: Nick Piggin
    Acked-by: Jes Sorensen
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

15 May, 2008

1 commit

  • There is a possible data race in the page table walking code. After the split
    ptlock patches, it actually seems to have been introduced to the core code, but
    even before that I think it would have impacted some architectures (powerpc
    and sparc64, at least, walk the page tables without taking locks eg. see
    find_linux_pte()).

    The race is as follows:
    The pte page is allocated, zeroed, and its struct page gets its spinlock
    initialized. The mm-wide ptl is then taken, and then the pte page is inserted
    into the pagetables.

    At this point, the spinlock is not guaranteed to have ordered the previous
    stores to initialize the pte page with the subsequent store to put it in the
    page tables. So another Linux page table walker might be walking down (without
    any locks, because we have split-leaf-ptls), and find that new pte we've
    inserted. It might try to take the spinlock before the store from the other
    CPU initializes it. And subsequently it might read a pte_t out before stores
    from the other CPU have cleared the memory.

    There are also similar races in higher levels of the page tables. They
    obviously don't involve the spinlock, but could see uninitialized memory.

    Arch code and hardware pagetable walkers that walk the pagetables without
    locks could see similar uninitialized memory problems, regardless of whether
    split ptes are enabled or not.

    I prefer to put the barriers in core code, because that's where the higher
    level logic happens, but the page table accessors are per-arch, and open-coding
    them everywhere I don't think is an option. I'll put the read-side barriers
    in alpha arch code for now (other architectures perform data-dependent loads
    in order).
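
    A hedged sketch of the write-side ordering described above (modelled on
    __pte_alloc(); details may differ from the committed code):

    pgtable_t new = pte_alloc_one(mm, address);     /* allocate, zero, init ptl */
    if (!new)
            return -ENOMEM;

    smp_wmb();      /* make the pte page initialization visible before any
                     * other walker can find it through the pmd */

    spin_lock(&mm->page_table_lock);
    if (!pmd_present(*pmd)) {
            mm->nr_ptes++;
            pmd_populate(mm, pmd, new);
            new = NULL;
    }
    spin_unlock(&mm->page_table_lock);
    if (new)
            pte_free(mm, new);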

    Signed-off-by: Nick Piggin
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

07 May, 2008

1 commit

  • Fix warning from pmd_bad() at bootup on a HIGHMEM64G HIGHPTE x86_32.

    That came from 9fc34113f6880b215cbea4e7017fc818700384c2 x86: debug pmd_bad();
    but we understand now that the typecasting was wrong for PAE in the previous
    version: pagetable pages above 4GB looked bad and stopped Arjan from booting.

    And revert that cded932b75ab0a5f9181ee3da34a0a488d1a14fd x86: fix pmd_bad
    and pud_bad to support huge pages. It was the wrong way round: we shouldn't
    weaken every pmd_bad and pud_bad check to let huge pages slip through - in
    part they check that we _don't_ have a huge page where it's not expected.

    Put the x86 pmd_bad() and pud_bad() definitions back to what they have long
    been: they can be improved (x86_32 should use PTE_MASK, to stop PAE thinking
    junk in the upper word is good; and x86_64 should follow x86_32's stricter
    comparison, to stop thinking any subset of required bits is good); but that
    should be a later patch.

    Fix Hans' good observation that follow_page() will never find pmd_huge()
    because that would have already failed the pmd_bad test: test pmd_huge in
    between the pmd_none and pmd_bad tests. Tighten x86's pmd_huge() check?
    No, once it's a hugepage entry, it can get quite far from a good pmd: for
    example, PROT_NONE leaves it with only ACCESSED of the KERN_PGTABLE bits.
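
    A hedged sketch of that ordering in follow_page() (illustrative; the huge
    page branch is abbreviated):

    pmd = pmd_offset(pud, address);
    if (pmd_none(*pmd))
            goto no_page_table;
    if (pmd_huge(*pmd)) {
            BUG_ON(flags & FOLL_GET);
            page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
            goto out;
    }
    if (unlikely(pmd_bad(*pmd)))
            goto no_page_table;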

    However... though follow_page() contains this and another test for huge
    pages, so it's nice to keep it working on them, where does it actually get
    called on a huge page? get_user_pages() checks is_vm_hugetlb_page(vma) to
    call alternative hugetlb processing, as does unmap_vmas() and others.

    Signed-off-by: Hugh Dickins
    Earlier-version-tested-by: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Jeff Chua
    Cc: Hans Rosenfeld
    Cc: Arjan van de Ven
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

28 Apr, 2008

4 commits

  • vm_insert_mixed will insert either a raw pfn or a refcounted struct page into
    the page tables, depending on whether vm_normal_page() will return the page or
    not. With the introduction of the new pte bit, this is now too tricky for
    drivers to be doing themselves.

    filemap_xip uses this in a subsequent patch.
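
    A hedged sketch of a driver ->fault handler using it (helper and driver
    names are hypothetical; vm_insert_mixed() and the fault interface are
    real):

    static int mydrv_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
    {
            unsigned long pfn = mydrv_pgoff_to_pfn(vma, vmf->pgoff); /* hypothetical */

            /* insert either a raw pfn or a refcounted page; vm_normal_page()
             * decides later whether this pte has a struct page behind it */
            if (vm_insert_mixed(vma, (unsigned long)vmf->virtual_address, pfn))
                    return VM_FAULT_SIGBUS;
            return VM_FAULT_NOPAGE;
    }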

    Signed-off-by: Nick Piggin
    Cc: Jared Hulbert
    Cc: Carsten Otte
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • s390 for one, cannot implement VM_MIXEDMAP with pfn_valid, due to their memory
    model (which is more dynamic than most). Instead, they had proposed to
    implement it with an additional path through vm_normal_page(), using a bit in
    the pte to determine whether or not the page should be refcounted:

    vm_normal_page()
    {
            ...
            if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
                    if (vma->vm_flags & VM_MIXEDMAP) {
    #ifdef s390
                            if (!mixedmap_refcount_pte(pte))
                                    return NULL;
    #else
                            if (!pfn_valid(pfn))
                                    return NULL;
    #endif
                            goto out;
                    }
                    ...
            }

    This is fine, however if we are allowed to use a bit in the pte to determine
    refcountedness, we can use that to _completely_ replace all the vma based
    schemes. So instead of adding more cases to the already complex vma-based
    scheme, we can have a clearly separate and simple pte-based scheme (and get
    slightly better code generation in the process):

    vm_normal_page()
    {
    #ifdef s390
            if (!mixedmap_refcount_pte(pte))
                    return NULL;
            return pte_page(pte);
    #else
            ...
    #endif
    }

    And finally, we would rather make this concept usable by any architecture
    than make it s390 only, so implement a new type of pte state for this.
    Unfortunately the old vma based code must stay, because some architectures
    may not be able to spare pte bits. This makes vm_normal_page a little bit
    more ugly than we would like, but the two cases are clearly separate.

    So introduce a pte_special pte state, and use it in mm/memory.c. It is
    currently a noop for all architectures, so this doesn't actually result in any
    compiled code changes to mm/memory.o.

    BTW:
    I haven't put vm_normal_page() into arch code as-per an earlier suggestion.
    The reason is that, regardless of where vm_normal_page is actually
    implemented, the *abstraction* is still exactly the same. Also, while it
    depends on whether the architecture has pte_special or not, those are the
    only two possible cases, and it really isn't an arch specific function --
    the role of the arch code should be to provide primitive functions and
    accessors with which to build the core code; pte_special does that. We do
    not want architectures to know or care about vm_normal_page itself, and
    we definitely don't want them being able to invent something new there
    out of sight of mm/ code. If we made vm_normal_page an arch function, then
    we have to make vm_insert_mixed (next patch) an arch function too. So I
    don't think moving it to arch code fundamentally improves any abstractions,
    while it does practically make the code more difficult to follow, for both
    mm and arch developers, and easier to misuse.

    [akpm@linux-foundation.org: build fix]
    Signed-off-by: Nick Piggin
    Acked-by: Carsten Otte
    Cc: Jared Hulbert
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • This series introduces some important infrastructure work. The overall result
    is that:

    1. We now support XIP backed filesystems using memory that has no
    struct page allocated to it. And patches 6 and 7 actually implement
    this for s390.

    This is pretty important in a number of cases. As far as I understand,
    in the case of virtualisation (eg. s390), each guest may mount a
    readonly copy of the same filesystem (eg. the distro). Currently,
    guests need to allocate struct pages for this image. So if you have
    100 guests, you already need to allocate more memory for the struct
    pages than the size of the image. I think. (Carsten?)

    For other (eg. embedded) systems, you may have a very large non-
    volatile filesystem. If you have to have struct pages for this, then
    your RAM consumption will go up proportionally to fs size. Even
    though it is just a small proportion, the RAM can be much more costly
    eg in terms of power, so every KB less that Linux uses makes it more
    attractive to a lot of these guys.

    2. VM_MIXEDMAP allows us to support mappings where you actually do want
    to refcount _some_ pages in the mapping, but not others, and support
    COW on arbitrary (non-linear) mappings. Jared needs this for his NVRAM
    filesystem in progress. Future iterations of this filesystem will
    most likely want to migrate pages between pagecache and XIP backing,
    which is where the requirement for mixed (some refcounted, some not)
    comes from.

    3. pte_special also has a peripheral usage that I need for my lockless
    get_user_pages patch. That was shown to speed up "oltp" on db2 by
    10% on a 2 socket system, which is kind of significant because they
    scrounge for months to try to find 0.1% improvement on these
    workloads. I'm hoping we might finally be faster than AIX on
    pSeries with this :). My reference to lockless get_user_pages is not
    meant to justify this patchset (which doesn't include lockless gup),
    but just to show that pte_special is not some s390 specific thing that
    should be hidden in arch code or xip code: I definitely want to use it
    on at least x86 and powerpc as well.

    This patch:

    Introduce a new type of mapping, VM_MIXEDMAP. This is unlike VM_PFNMAP in
    that it can support COW mappings of arbitrary ranges including ranges without
    struct page *and* ranges with a struct page that we actually want to refcount
    (PFNMAP can only support COW in those cases where the un-COW-ed translations
    are mapped linearly in the virtual address, and can only support non
    refcounted ranges).

    VM_MIXEDMAP achieves this by refcounting all pfn_valid pages, and not
    refcounting !pfn_valid pages (which is not an option for VM_PFNMAP, because it
    needs to avoid refcounting pfn_valid pages eg. for /dev/mem mappings).

    Signed-off-by: Jared Hulbert
    Signed-off-by: Nick Piggin
    Acked-by: Carsten Otte
    Cc: Jared Hulbert
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jared Hulbert
     
  • Nothing in the tree uses nopage any more. Remove support for it in the
    core mm code and documentation (and a few stray references to it in
    comments).

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

05 Mar, 2008

2 commits

  • Don't uncharge when do_swap_page's call to do_wp_page fails: the page which
    was charged for is there in the pagetable, and will be correctly uncharged
    when that area is unmapped - it was only its COWing which failed.

    And while we're here, remove earlier XXX comment: yes, OR in do_wp_page's
    return value (maybe VM_FAULT_WRITE) with do_swap_page's there; but if it
    fails, mask out success bits, which might confuse some arches e.g. sparc.

    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • There's nothing wrong with mem_cgroup_charge failure in do_wp_page and
    do_anonymous page using __free_page, but it does look odd when nearby code
    uses page_cache_release: use that instead (while turning a blind eye to
    ancient inconsistencies of page_cache_release versus put_page).

    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

15 Feb, 2008

3 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86:
    x86: cpa, fix out of date comment
    KVM is not seen under X86 config with latest git (32 bit compile)
    x86: cpa: ensure page alignment
    x86: include proper prototypes for rodata_test
    x86: fix gart_iommu_init()
    x86: EFI set_memory_x()/set_memory_uc() fixes
    x86: make dump_pagetable() static
    x86: fix "BUG: sleeping function called from invalid context" in print_vma_addr()

    Linus Torvalds
     
  • d_path() is used on a <dentry, vfsmount> pair. Let's use a struct path to
    reflect this.
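
    As a hedged illustration of the interface change (the exact prototypes
    may differ slightly from the committed ones):

    /* before */
    char *d_path(struct dentry *dentry, struct vfsmount *vfsmnt,
                 char *buf, int buflen);

    /* after */
    char *d_path(const struct path *path, char *buf, int buflen);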

    [akpm@linux-foundation.org: fix build in mm/memory.c]
    Signed-off-by: Jan Blunck
    Acked-by: Bryan Wu
    Acked-by: Christoph Hellwig
    Cc: Al Viro
    Cc: "J. Bruce Fields"
    Cc: Neil Brown
    Cc: Michael Halcrow
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Blunck
     
  • Jiri Kosina reported the following deadlock scenario with
    show_unhandled_signals enabled:

    [ 68.379022] gnome-settings-[2941] trap int3 ip:3d2c840f34
    sp:7fff36f5d100 error:0BUG: sleeping function called from invalid
    context at kernel/rwsem.c:21
    [ 68.379039] in_atomic():1, irqs_disabled():0
    [ 68.379044] no locks held by gnome-settings-/2941.
    [ 68.379050] Pid: 2941, comm: gnome-settings- Not tainted 2.6.25-rc1 #30
    [ 68.379054]
    [ 68.379056] Call Trace:
    [ 68.379061] [] ? __debug_show_held_locks+0x13/0x30
    [ 68.379109] [] __might_sleep+0xe5/0x110
    [ 68.379123] [] down_read+0x20/0x70
    [ 68.379137] [] print_vma_addr+0x3a/0x110
    [ 68.379152] [] do_trap+0xf5/0x170
    [ 68.379168] [] do_int3+0x7b/0xe0
    [ 68.379180] [] int3+0x9f/0xd0
    [ 68.379203] <>
    [ 68.379229] in libglib-2.0.so.0.1505.0[3d2c800000+dc000]

    and tracked it down to:

    commit 03252919b79891063cf99145612360efbdf9500b
    Author: Andi Kleen
    Date: Wed Jan 30 13:33:18 2008 +0100

    x86: print which shared library/executable faulted in segfault etc. messages

    the problem is that we call down_read() from an atomic context.

    Solve this by returning from print_vma_addr() if the preempt count is
    elevated. Update preempt_conditional_sti / preempt_conditional_cli to
    unconditionally lift the preempt count even on !CONFIG_PREEMPT.

    Reported-by: Jiri Kosina
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

12 Feb, 2008

1 commit

  • So I spent a while pounding my head against my monitor trying to figure
    out the vmsplice() vulnerability - how could a failure to check for
    *read* access turn into a root exploit? It turns out that it's a buffer
    overflow problem which is made easy by the way get_user_pages() is
    coded.

    In particular, "len" is a signed int, and it is only checked at the
    *end* of a do {} while() loop. So, if it is passed in as zero, the loop
    will execute once and decrement len to -1. At that point, the loop will
    proceed until the next invalid address is found; in the process, it will
    likely overflow the pages array passed in to get_user_pages().

    I think that, if get_user_pages() has been asked to grab zero pages,
    that's what it should do. Thus this patch; it is, among other things,
    enough to block the (already fixed) root exploit and any others which
    might be lurking in similar code. I also think that the number of pages
    should be unsigned, but changing the prototype of this function probably
    requires some more careful review.
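
    A hedged sketch of the kind of check described, placed before the loop in
    get_user_pages() (illustrative):

    if (len <= 0)
            return 0;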

    Signed-off-by: Jonathan Corbet
    Signed-off-by: Linus Torvalds

    Jonathan Corbet
     

09 Feb, 2008

1 commit

  • Background: I've implemented 1K/2K page tables for s390. These sub-page
    page tables are required to properly support the s390 virtualization
    instruction with KVM. The SIE instruction requires that the page tables
    have 256 page table entries (pte) followed by 256 page status table entries
    (pgste). The pgstes are only required if the process is using the SIE
    instruction. The pgstes are updated by the hardware and by the hypervisor
    for a number of reasons, one of them is dirty and reference bit tracking.
    To avoid wasting memory the standard pte table allocation should return
    1K/2K (31/64 bit) and 2K/4K if the process is using SIE.

    Problem: Page size on s390 is 4K, page table size is 1K or 2K. That means
    the s390 version for pte_alloc_one cannot return a pointer to a struct
    page. Trouble is that with the CONFIG_HIGHPTE feature on x86 pte_alloc_one
    cannot return a pointer to a pte either, since that would require more than
    32 bit for the return value of pte_alloc_one (and the pte * would not be
    accessible since it's not kmapped).

    Solution: The only solution I found to this dilemma is a new typedef: a
    pgtable_t. For s390 pgtable_t will be a (pte *) - to be introduced with a
    later patch. For everybody else it will be a (struct page *). The
    additional problem with the initialization of the ptl lock and the
    NR_PAGETABLE accounting is solved with a constructor pgtable_page_ctor and
    a destructor pgtable_page_dtor. The page table allocation and free
    functions need to call these two whenever a page table page is allocated or
    freed. pmd_populate will get a pgtable_t instead of a struct page pointer.
    To get the pgtable_t back from a pmd entry that has been installed with
    pmd_populate a new function pmd_pgtable is added. It replaces the pmd_page
    call in free_pte_range and apply_to_pte_range.
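
    A hedged sketch of the ctor/dtor pairing for a struct-page-based
    pgtable_t (a generic arch; real per-arch implementations vary):

    pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long address)
    {
            struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);

            if (page)
                    pgtable_page_ctor(page);        /* init split ptl, count NR_PAGETABLE */
            return page;
    }

    void pte_free(struct mm_struct *mm, pgtable_t page)
    {
            pgtable_page_dtor(page);
            __free_page(page);
    }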

    Signed-off-by: Martin Schwidefsky
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Schwidefsky
     

08 Feb, 2008

2 commits

  • Nick Piggin pointed out that swap cache and page cache addition routines
    could be called from non GFP_KERNEL contexts. This patch makes the
    charging routine aware of the gfp context. Charging might fail if the
    cgroup is over its limit, in which case a suitable error is returned.

    This patch was tested on a Powerpc box. I am still looking at being able
    to test the path, through which allocations happen in non GFP_KERNEL
    contexts.

    [kamezawa.hiroyu@jp.fujitsu.com: problem with ZONE_MOVABLE]
    Signed-off-by: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Add the accounting hooks. The accounting is carried out for RSS and Page
    Cache (unmapped) pages. There is now a common limit and accounting for both.
    The RSS accounting is accounted at page_add_*_rmap() and page_remove_rmap()
    time. Page cache is accounted at add_to_page_cache(),
    __delete_from_page_cache(). Swap cache is also accounted for.

    Each page's page_cgroup is protected with the last bit of the
    page_cgroup pointer, this makes handling of race conditions involving
    simultaneous mappings of a page easier. A reference count is kept in the
    page_cgroup to deal with cases where a page might be unmapped from the RSS
    of all tasks, but still lives in the page cache.

    Credits go to Vaidyanathan Srinivasan for helping with reference counting work
    of the page cgroup. Almost all of the page cache accounting code has help
    from Vaidyanathan Srinivasan.

    [hugh@veritas.com: fix swapoff breakage]
    [akpm@linux-foundation.org: fix locking]
    Signed-off-by: Vaidyanathan Srinivasan
    Signed-off-by: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc:
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     

07 Feb, 2008

1 commit

  • based on similar patch from: Pavel Machek

    Introduce CONFIG_COMPAT_BRK. If disabled then the kernel is free
    (but not obliged to) randomize the brk area.

    Heap randomization breaks ancient binaries, so we keep COMPAT_BRK
    enabled by default.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

06 Feb, 2008

4 commits

  • After running SetPageUptodate, preceding stores to the page contents to
    actually bring it uptodate may not be ordered with the store to set the
    page uptodate.

    Therefore, another CPU which checks PageUptodate is true, then reads the
    page contents can get stale data.

    Fix this by having an smp_wmb before SetPageUptodate, and smp_rmb after
    PageUptodate.

    Many places that test PageUptodate, do so with the page locked, and this
    would be enough to ensure memory ordering in those places if
    SetPageUptodate were only called while the page is locked. Unfortunately
    that is not always the case for some filesystems, but it could be an idea
    for the future.

    Also bring the handling of anonymous page uptodateness in line with that of
    file backed page management, by marking anon pages as uptodate when they
    _are_ uptodate, rather than when our implementation requires that they be
    marked as such. Doing so allows us to get rid of the smp_wmb's in the page
    copying functions, which were especially added for anonymous pages for an
    analogous memory ordering problem. Both file and anonymous pages are
    handled with the same barriers.
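
    A hedged sketch of where the barriers go (lower-case illustrative helpers,
    not the exact page-flags macros):

    static inline void set_page_uptodate(struct page *page)
    {
            smp_wmb();      /* order the stores that filled the page contents ... */
            set_bit(PG_uptodate, &page->flags);     /* ... before publishing the flag */
    }

    static inline int page_uptodate(struct page *page)
    {
            int ret = test_bit(PG_uptodate, &page->flags);

            if (ret)
                    smp_rmb();      /* order the flag test before reads of the contents */
            return ret;
    }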

    FAQ:
    Q. Why not do this in flush_dcache_page?
    A. Firstly, flush_dcache_page handles only one side (the smp side) of the
    ordering protocol; we'd still need smp_rmb somewhere. Secondly, hiding away
    memory barriers in a completely unrelated function is nasty; at least in the
    PageUptodate macros, they are located together with (half) the operations
    involved in the ordering. Thirdly, the smp_wmb is only required when first
    bringing the page uptodate, whereas flush_dcache_page should be called each time
    it is written to through the kernel mapping. It is logically the wrong place to
    put it.

    Q. Why does this increase my text size / reduce my performance / etc.
    A. Because it is adding the necessary instructions to eliminate the data-race.

    Q. Can it be improved?
    A. Yes, eg. if you were to create a rule that all SetPageUptodate operations
    run under the page lock, we could avoid the smp_rmb places where PageUptodate
    is queried under the page lock. Requires audit of all filesystems and at least
    some would need reworking. That's great you're interested, I'm eagerly awaiting
    your patches.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • fastcall is always defined to be empty, remove it

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Harvey Harrison
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harvey Harrison
     
  • (with Martin Schwidefsky )

    The pgd/pud/pmd/pte page table allocation functions get a mm_struct pointer as
    first argument. The free functions do not get the mm_struct argument. This
    is 1) asymmetrical and 2) the mm argument is needed on the free functions
    as well in order to do mm related page table allocations.

    [kamalesh@linux.vnet.ibm.com: i386 fix]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Martin Schwidefsky
    Cc:
    Signed-off-by: Kamalesh Babulal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     
  • vmtruncate is a twisted maze of gotos; this patch cleans it up to have a
    proper if/else for the two major cases of extending and truncating, and
    thus makes it a lot more readable while keeping exactly the same
    functionality.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig