13 May, 2007

1 commit

  • SLUB cannot run on i386 at this point because i386 uses the page->private and
    page->index fields of slab pages for the pgd cache.

    Make SLUB run on i386 by replacing the pgd slab cache with a quicklist.
    Limit the changes as much as possible: leave the improvised linked list in
    place, and so on. This has been working here for a couple of weeks now.

    Acked-by: William Lee Irwin III
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
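
    To illustrate the mechanism, here is a minimal userspace sketch of the
    quicklist idea, assuming nothing beyond libc: freed page-sized buffers are
    kept on a simple free list, threaded through their own first word, so no
    struct page fields are consumed. The names and types are local stand-ins,
    not the kernel's actual quicklist API.

        #include <stdlib.h>
        #include <string.h>

        #define PAGE_SIZE 4096

        struct quicklist { void *head; };    /* free pages, linked through word 0 */

        static void *quicklist_alloc(struct quicklist *ql)
        {
            void *page = ql->head;
            if (page) {
                ql->head = *(void **)page;   /* pop a recycled page */
                memset(page, 0, PAGE_SIZE);
            } else {
                page = calloc(1, PAGE_SIZE); /* fall back to a fresh page */
            }
            return page;
        }

        static void quicklist_free(struct quicklist *ql, void *page)
        {
            *(void **)page = ql->head;       /* the page itself holds the link */
            ql->head = page;
        }

    A pgd_alloc built on this would simply call quicklist_alloc on a per-CPU
    list, leaving page->private and page->index free for SLUB.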
     

09 May, 2007

2 commits

  • Most architectures defined three macros, MK_IOSPACE_PFN(), GET_IOSPACE()
    and GET_PFN() in pgtable.h. However, the only callers of any of these
    macros are in Sparc specific code, either in arch/sparc, arch/sparc64 or
    drivers/sbus.

    This patch removes the redundant macros from all architectures except
    sparc and sparc64.

    Signed-off-by: David Gibson
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
     
  • This patch moves the die notifier handling to common code. Previously,
    various architectures had exactly the same code for it. Note that the new
    code is compiled unconditionally; this should be understood as an appeal to
    the other architecture maintainers to implement support for it as well
    (i.e., by sprinkling a notify_die or two in the proper places).

    arm had a notify_die that did something totally different; I renamed it to
    arm_notify_die as part of the patch and made it static to the file in which
    it is declared and used. avr32 used to pass slightly less information through
    this interface, and I brought it into line with the other architectures.

    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: fix vmalloc_sync_all bustage]
    [bryan.wu@analog.com: fix vmalloc_sync_all in nommu]
    Signed-off-by: Christoph Hellwig
    Cc:
    Cc: Russell King
    Signed-off-by: Bryan Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
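
    As a rough sketch of the pattern being consolidated (the real kernel uses
    an atomic notifier chain; the _model suffix marks these as illustrative
    stand-ins), notify_die walks a chain of registered callbacks:

        enum die_val_model { DIE_OOPS_MODEL = 1 };

        struct notifier_block_model {
            int (*notifier_call)(struct notifier_block_model *, unsigned long, void *);
            struct notifier_block_model *next;
        };

        static struct notifier_block_model *die_chain_model;

        static void register_die_notifier_model(struct notifier_block_model *nb)
        {
            nb->next = die_chain_model;      /* push onto the chain */
            die_chain_model = nb;
        }

        static int notify_die_model(enum die_val_model val, void *args)
        {
            struct notifier_block_model *nb;
            for (nb = die_chain_model; nb; nb = nb->next)
                nb->notifier_call(nb, val, args);   /* fan the event out */
            return 0;
        }

    An architecture then only needs a notify_die call at its fault and trap
    sites instead of carrying its own copy of this machinery.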
     

08 May, 2007

2 commits

  • If you actually clear the bit, you need to:

    + pte_update_defer(vma->vm_mm, addr, ptep);

    The reason is, when updating PTEs, the hypervisor must be notified. Using
    atomic operations to do this is fine for all hypervisors I am aware of.
    However, for hypervisors which shadow page tables, if these PTE
    modifications are not trapped, you need a post-modification call to fulfill
    the update of the shadow page table.

    Acked-by: Zachary Amsden
    Cc: Hugh Dickins
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
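
    The pattern the commit asks for, as a hedged userspace model (the stubs
    below stand in for the kernel primitives; the bit position is illustrative):

        typedef unsigned long pteval_t;

        #define _PAGE_BIT_RW_MODEL 1   /* illustrative bit position */

        static void pte_update_defer_stub(void *mm, unsigned long addr, pteval_t *ptep)
        {
            /* paravirt hook: tell a shadow-pagetable hypervisor that the
             * pte at addr was modified without being trapped */
        }

        static void ptep_clear_bit_model(void *mm, unsigned long addr, pteval_t *ptep)
        {
            __sync_fetch_and_and(ptep, ~(1UL << _PAGE_BIT_RW_MODEL)); /* atomic clear */
            pte_update_defer_stub(mm, addr, ptep);  /* mandatory follow-up call */
        }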
     
  • Add ptep_test_and_clear_{dirty,young} to i386, and advertise that the
    architecture provides them. There is at least one place where they need to
    be called without the page table lock: clearing the accessed bit on writes
    to /proc/pid/clear_refs.

    ptep_clear_flush_{dirty,young} are updated to use the new functions. The
    overall net effect to current users of ptep_clear_flush_{dirty,young} is
    that we introduce an additional branch.

    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Signed-off-by: David Rientjes
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
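
    A minimal sketch of what such an accessor looks like, assuming x86's
    accessed bit at position 5 and using a gcc builtin to model the atomic
    read-modify-write (the kernel version also issues a pte_update
    notification):

        #include <stdbool.h>

        typedef unsigned long pteval_t;
        #define _PAGE_ACCESSED_MODEL (1UL << 5)

        static bool ptep_test_and_clear_young_model(pteval_t *ptep)
        {
            /* atomically clear the accessed bit and report its old value;
             * safe without the page table lock */
            pteval_t old = __sync_fetch_and_and(ptep, ~_PAGE_ACCESSED_MODEL);
            return old & _PAGE_ACCESSED_MODEL;
        }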
     

03 May, 2007

9 commits

  • Add a comment and condense code to make use of the
    native_local_ptep_get_and_clear function. Also, it turns out the 2-level and
    3-level paging definitions were identical, so move the common definition into
    pgtable.h.

    Signed-off-by: Zachary Amsden
    Signed-off-by: Andrew Morton
    Signed-off-by: Andi Kleen

    Zachary Amsden
     
  • When exiting from an address space, no special hypervisor notification of page
    table updates needs to occur; direct page table hypervisors, such as Xen,
    switch to another address space first (init_mm) and unprotect the page tables
    to avoid the cost of trapping to the hypervisor for each pte_clear. Shadow
    mode hypervisors, such as VMI and lhype, don't need to do the extra work of
    calling through paravirt-ops, and can just directly clear the page table
    entries without notifying the hypervisor, since all the page tables are about
    to be freed.

    So introduce native_pte_clear functions which bypass any paravirt-ops
    notification. This results in a significant performance win for VMI and
    removes some indirect calls from zap_pte_range.

    Note the 3-level paging already had a native_pte_clear function, thus
    demanding argument conformance and extra args for the 2-level definition.

    Signed-off-by: Zachary Amsden
    Signed-off-by: Andrew Morton
    Signed-off-by: Andi Kleen

    Zachary Amsden
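
    A sketch of the resulting structure, with PAE-style two-word entries and
    simplified names (the _model suffix marks stand-ins); the point is the
    dispatch on the teardown case, not the exact kernel signatures:

        typedef struct { unsigned long low, high; } pte_pae_model;

        static void native_pte_clear_model(pte_pae_model *ptep)
        {
            ptep->low = 0;    /* clear the word carrying the present bit first */
            ptep->high = 0;
        }

        static void pte_clear_full_model(pte_pae_model *ptep, int full)
        {
            native_pte_clear_model(ptep);
            if (!full) {
                /* live mapping: a paravirt pte_update(...) notification
                 * would go here */
            }
            /* full teardown: skip the hypervisor call entirely */
        }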
     
  • In shadow mode hypervisors, ptep_get_and_clear achieves the desired
    purpose of keeping the shadows in sync by issuing a native_get_and_clear,
    followed by a call to pte_update, which indicates the PTE has been
    modified.

    Direct mode hypervisors (Xen) have no need for this anyway, and will trap
    the update using writable pagetables.

    This means no hypervisor makes use of ptep_get_and_clear; there is no
    reason to have it in the paravirt-ops structure. Change confusing
    terminology about raw vs. native functions into consistent use of
    native_pte_xxx for operations which do not invoke paravirt-ops.

    Signed-off-by: Zachary Amsden
    Signed-off-by: Andi Kleen

    Jeremy Fitzhardinge
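
    The composition described above, as a small userspace model (the xchg is
    modeled with a gcc builtin; pte_update_stub stands in for the paravirt
    hook):

        typedef unsigned long pteval_t;

        static pteval_t native_ptep_get_and_clear_model(pteval_t *ptep)
        {
            return __sync_lock_test_and_set(ptep, 0);  /* atomic exchange with 0 */
        }

        static void pte_update_stub(void *mm, unsigned long addr)
        {
            /* shadow-mode hypervisors resynchronize their shadow here;
             * native and direct-mode builds leave this empty */
        }

        static pteval_t ptep_get_and_clear_model(void *mm, unsigned long addr,
                                                 pteval_t *ptep)
        {
            pteval_t pte = native_ptep_get_and_clear_model(ptep);
            pte_update_stub(mm, addr);
            return pte;
        }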
     
  • Xen and VMI both have special requirements when mapping a highmem pte
    page into the kernel address space. These can be dealt with by adding
    a new kmap_atomic_pte() function for mapping highptes, and hooking it
    into the paravirt_ops infrastructure.

    Xen specifically wants to map the pte page RO, so this patch exposes a
    helper function, kmap_atomic_prot, which maps the page with the
    specified page protections.

    This also adds a kmap_flush_unused() function to clear out the cached
    kmap mappings. Xen needs this to clear out any potential stray RW
    mappings of pages which will become part of a pagetable.

    [ Zach - vmi.c will need some attention after this patch. It wasn't
    immediately obvious to me what needs to be done. ]

    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: Andi Kleen
    Cc: Zachary Amsden

    Jeremy Fitzhardinge
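
    The shape of the new hook, sketched with simplified types (the protection
    values and struct are illustrative, not the kernel's):

        #define PROT_KERNEL_MODEL     0x3   /* present | writable */
        #define PROT_KERNEL_RO_MODEL  0x1   /* present, read-only */

        struct page_model { void *vaddr; };

        static void *kmap_atomic_prot_model(struct page_model *page, unsigned prot)
        {
            /* the real function installs a temporary kernel mapping with
             * the given protection; the model just hands back the address */
            (void)prot;
            return page->vaddr;
        }

        static void *kmap_atomic_pte_model(struct page_model *page, int want_ro)
        {
            /* Xen maps pte pages read-only so the hypervisor can keep
             * validating them; others use the normal kernel protection */
            return kmap_atomic_prot_model(page,
                    want_ro ? PROT_KERNEL_RO_MODEL : PROT_KERNEL_MODEL);
        }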
     
  • Back out the map_pt_hook to clear the way for kmap_atomic_pte.

    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: Andi Kleen
    Cc: Zachary Amsden

    Jeremy Fitzhardinge
     
  • Normally when running in PAE mode, the 4th PMD maps the kernel address space,
    which can be shared among all processes (since they all need the same kernel
    mappings).

    Xen, however, does not allow guests to have the kernel pmd shared between page
    tables, so parameterize pgtable.c to allow both modes of operation.

    There are several side-effects of this. One is that vmalloc will update the
    kernel address space mappings, and those updates need to be propagated into
    all processes if the kernel mappings are not intrinsically shared. In the
    non-PAE case, this is done by maintaining a pgd_list of all processes; this
    list is used when all process pagetables must be updated. pgd_list is
    threaded via otherwise unused entries in the page structure for the pgd, which
    means that the pgd must be page-sized for this to work.

    Normally the PAE pgd is only four 8-byte entries (32 bytes) large, but Xen
    requires the PAE pgd to be page-aligned anyway, so this patch forces the pgd
    to be page-aligned and page-sized when the kernel pmd is unshared, to
    accommodate both these requirements.

    Also, since there may be several distinct kernel pmds (if the user/kernel
    split is below 3G), there's no point in allocating them from a slab cache;
    they're just allocated with get_free_page and initialized appropriately. (Of
    course they could be cached if there is just a single kernel pmd, which is
    the default with a 3G user/kernel split, but it doesn't seem worthwhile to
    add yet another case into this code.)

    [ Many thanks to wli for review comments. ]

    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: William Lee Irwin III
    Signed-off-by: Andi Kleen
    Cc: Zachary Amsden
    Cc: Christoph Lameter
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton

    Jeremy Fitzhardinge
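
    A hedged sketch of the allocation policy this introduces (userspace
    stand-ins throughout; the real code also installs the pmd into the pgd and
    copies the kernel mappings, elided here):

        #include <stdlib.h>
        #include <string.h>

        #define PAGE_SIZE 4096
        static int shared_kernel_pmd_model = 1;   /* 0 under a Xen-like hypervisor */

        static void *pgd_alloc_model(void)
        {
            /* an unshared kernel pmd forces a page-sized, page-aligned pgd */
            void *pgd = aligned_alloc(PAGE_SIZE, PAGE_SIZE);
            memset(pgd, 0, PAGE_SIZE);
            if (!shared_kernel_pmd_model) {
                /* per-pgd kernel pmd: plain page allocation, no slab cache */
                void *kpmd = aligned_alloc(PAGE_SIZE, PAGE_SIZE);
                memset(kpmd, 0, PAGE_SIZE);
                /* ... install kpmd in the pgd and copy kernel mappings ... */
            }
            /* shared case: the pgd's kernel slot points at the one shared pmd */
            return pgd;
        }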
     
  • This patch introduces paravirt_ops hooks to control how the kernel's
    initial pagetable is set up.

    In the case of a native boot, the very early bootstrap code creates a
    simple non-PAE pagetable to map the kernel and physical memory. When
    the VM subsystem is initialized, it creates a proper pagetable which
    respects the PAE mode, large pages, etc.

    When booting under a hypervisor, there are many possibilities for what
    paging environment the hypervisor establishes for the guest kernel, so
    the construction of the kernel's pagetable depends on the hypervisor.

    In the case of Xen, the hypervisor boots the kernel with a fully
    constructed pagetable, which is already using PAE if necessary. Also,
    Xen requires particular care when constructing pagetables to make sure
    all pagetables are always mapped read-only.

    In order to make this easier, the kernel's initial pagetable construction
    has been changed to only allocate and initialize a pagetable page if
    there's no page already present in the pagetable. This allows the Xen
    paravirt backend to make a copy of the hypervisor-provided pagetable,
    allowing the kernel to establish any more mappings it needs while
    keeping the existing ones.

    A slightly subtle point which is worth highlighting here is that Xen
    requires all kernel mappings to share the same pte_t pages between all
    pagetables, so that updating a kernel page's mapping in one pagetable
    is reflected in all other pagetables. This makes it possible to
    allocate a page and attach it to a pagetable without having to
    explicitly enumerate that page's mapping in all pagetables.

    And:

    +From: "Eric W. Biederman"

    If we don't set the leaf page table entries, it is quite possible that we
    will inherit an incorrect page table entry from the initial boot page table
    setup in head.S. So we need to redo the effort here, so that we pick up
    PSE, PGE and the like.

    Hypervisors like Xen require that their page tables be read-only, which is
    slightly incompatible with our low identity mappings; however, I discussed
    this with Jeremy, and he has modified the Xen early set_pte function to
    avoid problems in this area.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: Andi Kleen
    Acked-by: William Irwin
    Cc: Ingo Molnar

    Jeremy Fitzhardinge
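
    The allocate-only-if-absent rule, as a compilable model (a single present
    bit, with calloc standing in for the boot allocator):

        #include <stdlib.h>

        #define PRESENT_MODEL 1UL

        /* only allocate a pagetable page when the pmd slot is empty, so a
         * hypervisor-provided table is reused rather than clobbered */
        static unsigned long *one_page_table_init_model(unsigned long *pmd)
        {
            if (!(*pmd & PRESENT_MODEL)) {
                unsigned long *pt = calloc(512, sizeof(unsigned long));
                *pmd = (unsigned long)pt | PRESENT_MODEL;
                return pt;
            }
            return (unsigned long *)(*pmd & ~PRESENT_MODEL);
        }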
     
  • Add a set of accessors to pack, unpack and modify page table entries
    (at all levels). This allows a paravirt implementation to control the
    contents of pgd/pmd/pte entries. For example, Xen uses this to
    convert the (pseudo-)physical address into a machine address when
    populating a pagetable entry, and to convert back to a pseudo-physical
    (pphys) address when an entry is read.

    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: Andi Kleen
    Acked-by: Ingo Molnar

    Jeremy Fitzhardinge
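
    A sketch of the pack/unpack idea with function-pointer hooks (identity by
    default, as on native hardware; a Xen-like backend would install
    translating hooks). All names here are illustrative:

        typedef struct { unsigned long pte; } pte_model;

        static unsigned long identity_model(unsigned long v) { return v; }

        /* a paravirt backend may repoint these at translation routines */
        static unsigned long (*phys_to_machine_model)(unsigned long) = identity_model;
        static unsigned long (*machine_to_phys_model)(unsigned long) = identity_model;

        static pte_model mk_pte_model(unsigned long val)
        {
            /* pack: pseudo-physical -> machine at entry-creation time */
            return (pte_model){ phys_to_machine_model(val) };
        }

        static unsigned long pte_val_model(pte_model pte)
        {
            /* unpack: machine -> pseudo-physical when the entry is read */
            return machine_to_phys_model(pte.pte);
        }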
     
  • Fix various broken corner cases in i386 and x86-64 change_page_attr.

    AK: split off from tighten kernel image access rights

    Signed-off-by: Jan Beulich
    Signed-off-by: Andi Kleen

    Jan Beulich
     

05 Mar, 2007

1 commit

  • Provide a PT map hook for HIGHPTE kernels to designate where they are mapping
    page tables. This information is required so the physical address of PTE
    updates can be determined; otherwise, the mm layer would have to carry the
    physical address all the way to each PTE modification callsite, which is even
    more hideous than the macros required to provide the proper hooks.

    So let's not mess up arch-neutral code to achieve this, but keep the horror in
    an #ifdef HIGHPTE in include/asm-i386/pgtable.h. I had to use macros here
    because some types are not yet defined in all the include paths for this
    header.

    This patch is absolutely required for HIGHPTE kernels to operate properly with
    VMI.

    Signed-off-by: Zachary Amsden
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     

08 Dec, 2006

2 commits

  • * 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6: (156 commits)
    [PATCH] x86-64: Export smp_call_function_single
    [PATCH] i386: Clean up smp_tune_scheduling()
    [PATCH] unwinder: move .eh_frame to RODATA
    [PATCH] unwinder: fully support linker generated .eh_frame_hdr section
    [PATCH] x86-64: don't use set_irq_regs()
    [PATCH] x86-64: check vector in setup_ioapic_dest to verify if need setup_IO_APIC_irq
    [PATCH] x86-64: Make ix86 default to HIGHMEM4G instead of NOHIGHMEM
    [PATCH] i386: replace kmalloc+memset with kzalloc
    [PATCH] x86-64: remove remaining pc98 code
    [PATCH] x86-64: remove unused variable
    [PATCH] x86-64: Fix constraints in atomic_add_return()
    [PATCH] x86-64: fix asm constraints in i386 atomic_add_return
    [PATCH] x86-64: Correct documentation for bzImage protocol v2.05
    [PATCH] x86-64: replace kmalloc+memset with kzalloc in MTRR code
    [PATCH] x86-64: Fix numaq build error
    [PATCH] x86-64: include/asm-x86_64/cpufeature.h isn't a userspace header
    [PATCH] unwinder: Add debugging output to the Dwarf2 unwinder
    [PATCH] x86-64: Clarify error message in GART code
    [PATCH] x86-64: Fix interrupt race in idle callback (3rd try)
    [PATCH] x86-64: Remove unwind stack pointer alignment forcing again
    ...

    Fixed conflict in include/linux/uaccess.h manually

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Replace all uses of kmem_cache_t with struct kmem_cache.

    The patch was generated using the following script:

    #!/bin/sh
    #
    # Replace one string by another in all the kernel sources.
    #

    set -e

    for file in `find * -name "*.c" -o -name "*.h" | xargs grep -l $1`; do
        quilt add $file
        sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
        mv /tmp/$$ $file
        quilt refresh
    done

    The script was run like this

    sh replace kmem_cache_t "struct kmem_cache"

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

07 Dec, 2006

3 commits

  • The function ptep_get_and_clear uses an atomic instruction sequence to get and
    clear an active pte. Rather than add such an atomic operator to all virtual
    machine implementations in paravirt-ops, it is easier to support the raw
    atomic sequence and use either a trapping writable pagetable approach, or a
    post-update notification. For the post update notification, we require the
    pte_update function to be called after the access. Combine the 2-level and
    3-level paging operators into one common function which does the post-update
    notification, and rename the actual atomic sequences to raw_ptep_xxx
    operators.

    Signed-off-by: Zachary Amsden
    Signed-off-by: Andi Kleen
    Cc: Andi Kleen
    Cc: Jeremy Fitzhardinge
    Cc: Chris Wright
    Signed-off-by: Andrew Morton

    Zachary Amsden
     
  • Make parameter names match function argument names for the yet to be defined
    pte_update_defer accessor.

    Signed-off-by: Zachary Amsden
    Signed-off-by: Andi Kleen
    Cc: Andi Kleen
    Cc: Jeremy Fitzhardinge
    Cc: Chris Wright
    Signed-off-by: Andrew Morton

    Zachary Amsden
     
  • Add the three bare TLB accessor functions to paravirt-ops. Most amusingly,
    flush_tlb is redefined on SMP, so I can't call the paravirt op flush_tlb.
    Instead, I chose to indicate the actual flush type, kernel (global) vs. user
    (non-global). Global in this sense means using the global bit in the page
    table entry, which makes TLB entries persistent across CR3 reloads, not
    global as in the SMP sense of invoking remote shootdowns, so the term is
    confusingly overloaded.

    AK: folded in fix from Zach for PAE compilation

    Signed-off-by: Zachary Amsden
    Signed-off-by: Chris Wright
    Signed-off-by: Andi Kleen
    Cc: Rusty Russell
    Cc: Jeremy Fitzhardinge
    Signed-off-by: Andrew Morton

    Rusty Russell
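
    The naming distinction, sketched as a paravirt-ops-style structure (the
    comments note what native implementations do; the bodies here are stubs):

        struct tlb_ops_model {
            void (*flush_tlb_user)(void);    /* drop non-global entries only */
            void (*flush_tlb_kernel)(void);  /* drop global entries as well */
        };

        static void native_flush_tlb_user_model(void)
        {
            /* natively: reload CR3, which leaves global-bit entries intact */
        }

        static void native_flush_tlb_kernel_model(void)
        {
            /* natively: toggle CR4.PGE (or equivalent) so that even
             * global-bit entries are discarded */
        }

        static struct tlb_ops_model tlb_ops_model = {
            .flush_tlb_user   = native_flush_tlb_user_model,
            .flush_tlb_kernel = native_flush_tlb_kernel_model,
        };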
     

01 Oct, 2006

4 commits

  • Add a pte_update_hook which notifies about pte changes that have been made
    without using the set_pte / clear_pte interfaces. This allows shadow mode
    hypervisors which do not trap on page table access to maintain synchronized
    shadows.

    It also turns out that there was one pte update in PAE mode that wasn't using
    any accessor interface at all for setting NX protection. Considering it is PAE
    specific, and the accessor is i386 specific, I didn't want to add a generic
    encapsulation of this behavior yet.

    Signed-off-by: Zachary Amsden
    Cc: Rusty Russell
    Cc: Jeremy Fitzhardinge
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
  • The ptep_establish macro is only used on user-level PTEs, for present->present
    mapping changes. Since these always happen under protection of the pagetable
    lock, the strong synchronization of a 64-bit cmpxchg is not needed; in fact,
    not even a lock prefix needs to be used. We can instead simply clear the
    P-bit, followed by a normal set. The write ordering is still important to
    avoid the possibility of the TLB snooping a partially written PTE and getting
    a bad mapping installed.

    Signed-off-by: Zachary Amsden
    Cc: Rusty Russell
    Cc: Jeremy Fitzhardinge
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
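
    The write sequence being described, modeled in userspace on a two-word
    PAE-style entry (__sync_synchronize stands in for smp_wmb):

        typedef struct { unsigned long low, high; } pte_pae_model;

        /* caller holds the page table lock, so no cmpxchg8b or lock prefix */
        static void ptep_establish_model(pte_pae_model *ptep, pte_pae_model pte)
        {
            ptep->low = 0;            /* clear P first: TLB can't load the entry */
            __sync_synchronize();
            ptep->high = pte.high;    /* high word while entry is not present */
            __sync_synchronize();
            ptep->low = pte.low;      /* P bit becomes visible last */
        }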
     
  • Create a new PTE function which combines clearing a kernel PTE with the
    subsequent flush. This allows the two to be easily combined into a single
    hypercall or paravirt-op. More subtly, reverse the order of the flush for
    kmap_atomic. Instead of flushing on establishing a mapping, flush on clearing
    a mapping. This eliminates the possibility of leaving stale kmap entries
    which may still have valid TLB mappings. This is required for direct mode
    hypervisors, which need to reprotect all mappings of a given page when
    changing the page type from a normal page to a protected page (such as a page
    table or descriptor table page). But it also provides some nicer semantics
    for real hardware, by providing extra debug-proofing against using stale
    mappings, as well as ensuring that no stale mappings exist when changing the
    cacheability attributes of a page, which could lead to cache conflicts when
    two different types of mappings exist for the same page.

    Signed-off-by: Zachary Amsden
    Cc: Rusty Russell
    Cc: Jeremy Fitzhardinge
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
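
    A minimal model of the combined operation (flush_one_model stands in for
    a single-page TLB invalidation such as invlpg):

        typedef unsigned long pteval_t;

        static void flush_one_model(unsigned long vaddr)
        {
            /* natively a single-page TLB invalidation; under a direct-mode
             * hypervisor this pairs with the clear in one hypercall */
        }

        static void kpte_clear_flush_model(pteval_t *kpte, unsigned long vaddr)
        {
            *kpte = 0;               /* clear the kernel pte ... */
            flush_one_model(vaddr);  /* ... and flush immediately, so no stale
                                        kmap mapping can survive */
        }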
     
  • Remove ptep_test_and_clear_{dirty|young} from i386, and instead use the
    dominating functions, ptep_clear_flush_{dirty|young}. This allows the TLB
    page flush to be contained in the same macro, and allows for an eager
    optimization - if reading the PTE initially returned dirty/accessed, we can
    assume that no subsequent update to the PTE cleared the accessed/dirty bit,
    as the only way A/D bits can change without holding the
    page table lock is if a remote processor clears them.
    extra branch which came from the generic version of the code, as we know that
    no other CPU could have cleared the A/D bit, so the flush will always be
    needed.

    We still export these two defines, even though we do not actually define
    the macros in the i386 code:

    #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
    #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY

    The reason for this is that the only use of these functions is within the
    generic clear_flush functions, and we want a strong guarantee that there
    are no other users of these functions, so we want to prevent the generic
    code from defining them for us.

    Signed-off-by: Zachary Amsden
    Cc: Rusty Russell
    Cc: Jeremy Fitzhardinge
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
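
    The eager variant, as a sketch (the bit position is illustrative, and the
    flush is left as a comment since it is architecture-specific):

        #include <stdbool.h>

        typedef unsigned long pteval_t;
        #define _PAGE_DIRTY_MODEL (1UL << 6)

        static bool ptep_clear_flush_dirty_model(pteval_t *ptep)
        {
            bool dirty = *ptep & _PAGE_DIRTY_MODEL;  /* one read decides all */
            if (dirty) {
                __sync_fetch_and_and(ptep, ~_PAGE_DIRTY_MODEL);
                /* flush_tlb_page(...) here, unconditionally: no second
                 * test of the pte as in the generic version */
            }
            return dirty;
        }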
     

27 Sep, 2006

1 commit

  • * 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6: (225 commits)
    [PATCH] Don't set calgary iommu as default y
    [PATCH] i386/x86-64: New Intel feature flags
    [PATCH] x86: Add a cumulative thermal throttle event counter.
    [PATCH] i386: Make the jiffies compares use the 64bit safe macros.
    [PATCH] x86: Refactor thermal throttle processing
    [PATCH] Add 64bit jiffies compares (for use with get_jiffies_64)
    [PATCH] Fix unwinder warning in traps.c
    [PATCH] x86: Allow disabling early pci scans with pci=noearly or disallowing conf1
    [PATCH] x86: Move direct PCI scanning functions out of line
    [PATCH] i386/x86-64: Make all early PCI scans dependent on CONFIG_PCI
    [PATCH] Don't leak NT bit into next task
    [PATCH] i386/x86-64: Work around gcc bug with noreturn functions in unwinder
    [PATCH] Fix some broken white space in ia32_signal.c
    [PATCH] Initialize argument registers for 32bit signal handlers.
    [PATCH] Remove all traces of signal number conversion
    [PATCH] Don't synchronize time reading on single core AMD systems
    [PATCH] Remove outdated comment in x86-64 mmconfig code
    [PATCH] Use string instructions for Core2 copy/clear
    [PATCH] x86: - restore i8259A eoi status on resume
    [PATCH] i386: Split multi-line printk in oops output.
    ...

    Linus Torvalds
     

26 Sep, 2006

4 commits

  • Move ptep_set_access_flags to be closer to the other ptep accessors, and make
    the indentation standard.

    Signed-off-by: Zachary Amsden
    Signed-off-by: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rusty Russell
     
  • Move the __HAVE_ARCH_PTEP defines to accompany the function definitions.
    Anything else is just a complete nightmare to track through the 2/3-level
    paging code, and this caused duplicate definitions to be needed (pte_same),
    which could have easily been taken care of with the asm-generic pgtable
    functions.

    Signed-off-by: Zachary Amsden
    Signed-off-by: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rusty Russell
     
  • One of the changes necessary for shared page tables is to standardize the
    pxx_page macros. pte_page and pmd_page have always returned the struct
    page associated with their entry, while pte_page_kernel and pmd_page_kernel
    have returned the kernel virtual address. pud_page and pgd_page, on the
    other hand, return the kernel virtual address.

    Shared page tables needs pud_page and pgd_page to return the actual page
    structures. There are very few actual users of these functions, so it is
    simple to standardize their usage.

    Since this is basic cleanup, I am submitting these changes as a standalone
    patch. Per Hugh Dickins' comments about it, I am also changing the
    pxx_page_kernel macros to pxx_page_vaddr to clarify their meaning.

    Signed-off-by: Dave McCracken
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave McCracken
     
  • This patch replaces the open-coded early commandline parsing
    throughout the i386 boot code with the generic mechanism (already used
    by ppc, powerpc, ia64 and s390). The code was inconsistent about
    whether it deleted the option from the cmdline or not, meaning some of
    these options would get passed through the environment into init.

    This transformation is mainly mechanical, but there are some notable
    parts:

    1) Grammar: s/linux never set's it up/linux never sets it up/

    2) Remove hacked-in earlyprintk= option scanning. When someone
    actually implements CONFIG_EARLY_PRINTK, then they can use
    early_param().
    [AK: actually it is implemented, but I'm adding the early_param for it in the
    next x86-64 patch]

    3) Move declaration of generic_apic_probe() from setup.c into asm/apic.h

    4) Various parameters now moved into their appropriate files (thanks Andi).

    5) All parse functions which examine arg need to check for NULL,
    except one where it has subtle humor value.

    AK: readded acpi_sci handling which was completely dropped
    AK: moved some more variables into acpi/boot.c

    Cc: len.brown@intel.com

    Signed-off-by: Rusty Russell
    Signed-off-by: Andi Kleen

    Rusty Russell
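
    The registration pattern being adopted, modeled with a plain table (the
    real early_param macro places entries in a dedicated linker section; the
    "noexec" handler is a made-up example):

        #include <string.h>

        struct early_param_model {
            const char *name;
            int (*setup)(char *arg);
        };

        static int noexec_on_model;

        static int parse_noexec_model(char *arg)
        {
            if (!arg)             /* point 5 above: handlers must tolerate NULL */
                return -1;
            noexec_on_model = (strcmp(arg, "on") == 0);
            return 0;
        }

        /* stands in for: early_param("noexec", parse_noexec_model); */
        static struct early_param_model early_params_model[] = {
            { "noexec", parse_noexec_model },
        };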
     

28 Apr, 2006

1 commit

  • Proposed fix for ptep_get_and_clear_full PAE bug. Pte_clear had the same bug,
    so use the same fix for both. Turns out pmd_clear had it as well, but pgds
    are not affected.

    The problem is rather intricate. Page table entries in PAE mode are 64-bits
    wide, but the only atomic 8-byte write operation available in 32-bit mode is
    cmpxchg8b, which is expensive (at least on P4), and thus avoided. But it can
    happen that the processor may prefetch entries into the TLB in the middle of an
    operation which clears a page table entry. So one must always clear the P-bit
    in the low word of the page table entry first when clearing it.

    Since the sequence *ptep = __pte(0) leaves the order of the write dependent on
    the compiler, it must be coded explicitly as a clear of the low word followed
    by a clear of the high word. Further, there must be a write memory barrier
    here to enforce proper ordering by the compiler (and, in the future, by the
    processor as well).

    On > 4GB memory machines, the implementation of pte_clear for PAE was clearly
    deficient, as it could leave virtual mappings of physical memory above 4GB
    aliased to memory below 4GB in the TLB. The implementation of
    ptep_get_and_clear_full has a similar bug, although not nearly as likely to
    occur, since the mappings being cleared are in the process of being destroyed,
    and should never be dereferenced again.

    But, as luck would have it, it is possible to trigger bugs even without ever
    dereferencing these bogus TLB mappings, even if the clear is followed fairly
    soon after with a TLB flush or invalidation. The problem is that memory above
    4GB may now be aliased into the first 4GB of memory, and in fact, may hit a
    region of memory with non-memory semantics. These regions include AGP and PCI
    space. As such, these memory regions are not cached by the processor. This
    introduces the bug.

    The processor can speculate memory operations, including memory writes, as long
    as they are committed with the proper ordering. Speculating a memory write to
    a linear address that has a bogus TLB mapping is possible. Normally, the
    speculation is harmless. But for cached memory, it does leave the falsely
    speculated cacheline unmodified, but in a dirty state. This cache line will be
    eventually written back. If this cacheline happens to intersect a region of
    memory that is not protected by the cache coherency protocol, it can corrupt
    data in I/O memory, which is generally a very bad thing to do, and can cause
    total system failure or just plain undefined behavior.

    These bugs are extremely unlikely, but the severity is of such magnitude, and
    the fix so simple that I think fixing them immediately is justified. Also,
    they are nearly impossible to debug.

    Signed-off-by: Zachary Amsden
    Signed-off-by: Linus Torvalds

    Zachary Amsden
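
    The fixed ordering, modeled on a two-word entry (the builtin stands in
    for the write memory barrier the commit calls for):

        typedef struct { unsigned long low, high; } pte_pae_model;

        static void pte_clear_model(pte_pae_model *ptep)
        {
            ptep->low = 0;            /* P bit goes away first: a concurrent TLB
                                         walk can no longer load this entry */
            __sync_synchronize();     /* smp_wmb() stand-in: order the stores */
            ptep->high = 0;           /* now safe to clear the high half */
        }

    Written as *ptep = __pte(0), the compiler would be free to clear the high
    word first, briefly exposing an entry with the old low bits (P still set)
    and a zeroed high half, which is exactly the below-4GB aliasing hazard
    described above.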
     

22 Mar, 2006

1 commit

  • 2.6.16-rc3 uses hugetlb on-demand paging, but it doesn't support hugetlb
    mprotect.

    From: David Gibson

    Remove a test from the mprotect() path which checks that the mprotect()ed
    range on a hugepage VMA is hugepage aligned (yes, really, the sense of
    is_aligned_hugepage_range() is the opposite of what you'd guess :-/).

    In fact, we don't need this test. If the given addresses match the
    beginning/end of a hugepage VMA they must already be suitably aligned. If
    they don't, then mprotect_fixup() will attempt to split the VMA. The very
    first test in split_vma() will check for a badly aligned address on a
    hugepage VMA and return -EINVAL if necessary.

    From: "Chen, Kenneth W"

    On i386 and x86-64, the pte flag _PAGE_PSE collides with _PAGE_PROTNONE. The
    identity of a hugetlb pte is lost when changing page protection via mprotect,
    and a page fault that occurs later will trigger a bug check in
    huge_pte_alloc().

    The fix is to always make the new pte a hugetlb pte, and also to clean up
    legacy code where _PAGE_PRESENT is forced on, left over from the pre-faulting
    days.

    Signed-off-by: Zhang Yanmin
    Cc: David Gibson
    Cc: "David S. Miller"
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: William Lee Irwin III
    Signed-off-by: Ken Chen
    Signed-off-by: Nishanth Aravamudan
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang, Yanmin
     

07 Nov, 2005

1 commit

  • Fix more include file problems that surfaced since I submitted the previous
    fix-missing-includes.patch. This should now allow not to include sched.h
    from module.h, which is done by a followup patch.

    Signed-off-by: Tim Schmielau
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tim Schmielau
     

31 Oct, 2005

2 commits

  • This patch removes page_pte_prot and page_pte macros from all
    architectures. Some architectures define both, some only page_pte (broken)
    and others none. These macros are not used anywhere.

    page_pte_prot(page, prot) is identical to mk_pte(page, prot) and
    page_pte(page) is identical to page_pte_prot(page, __pgprot(0)).

    * The following architectures define both page_pte_prot and page_pte

    arm, arm26, ia64, sh64, sparc, sparc64

    * The following architectures define only page_pte (broken)

    frv, i386, m32r, mips, sh, x86-64

    * All other architectures define neither

    Signed-off-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • Join together some common functions (pmd_page{,_kernel}) over 2level and
    3level pages.

    Signed-off-by: Paolo 'Blaisorblade' Giarrusso
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paolo 'Blaisorblade' Giarrusso
     

30 Oct, 2005

1 commit

  • Convert those common loops using page_table_lock on the outside and
    pte_offset_map within to use just pte_offset_map_lock within instead.

    These all hold mmap_sem (some exclusively, some not), so at no level can a
    page table be whipped away from beneath them. But whereas pte_alloc loops
    tested with the "atomic" pmd_present, these loops are testing with pmd_none,
    which on i386 PAE tests both lower and upper halves.

    That's now unsafe, so add a cast into pmd_none to test only the vital lower
    half: we lose a little sensitivity to a corrupt middle directory, but not
    enough to worry about. It appears that i386 and UML were the only
    architectures vulnerable in this way, with pgd and pud presenting no problem.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
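
    The converted loop shape, modeled with a pthread mutex in place of the
    pte lock (the index math and names are illustrative):

        #include <pthread.h>

        #define PAGE_SIZE_MODEL 4096UL
        typedef unsigned long pte_model;

        static pthread_mutex_t pte_lock_model = PTHREAD_MUTEX_INITIALIZER;

        static void walk_range_model(pte_model *table,
                                     unsigned long addr, unsigned long end)
        {
            /* pte_offset_map_lock: take the lock and the mapping together */
            pthread_mutex_lock(&pte_lock_model);
            pte_model *pte = &table[(addr / PAGE_SIZE_MODEL) & 511];
            do {
                /* ... examine or update *pte ... */
            } while (pte++, addr += PAGE_SIZE_MODEL, addr != end);
            pthread_mutex_unlock(&pte_lock_model);  /* pte_unmap_unlock */
        }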
     

05 Sep, 2005

2 commits

  • Add a clone operation for pgd updates.

    This helps complete the encapsulation of updates to page tables (or pages
    about to become page tables) into accessor functions rather than using
    memcpy() to duplicate them. This is both generally good for consistency
    and also necessary for running in a hypervisor which requires explicit
    updates to page table entries.

    The new function is:

    clone_pgd_range(pgd_t *dst, pgd_t *src, int count);

    dst - pointer to a pgd range anywhere on a pgd page
    src - ""
    count - the number of pgds to copy

    dst and src can be on the same page, but the range must not overlap
    and must not cross a page boundary.

    Note that I omitted using this call to copy pgd entries into the
    software suspend page root, since this is not technically a live paging
    structure; rather, it is used on resume from suspend. CC'ing Pavel in case
    he has any feedback on this.

    Thanks to Chris Wright for noticing that this could be more optimal in
    PAE compiles by eliminating the memset.

    Signed-off-by: Zachary Amsden
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
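
    A sketch matching the signature quoted above; on bare hardware the body
    really is just a memcpy, and routing it through one named accessor is what
    lets a hypervisor port substitute explicit, validated updates:

        #include <string.h>

        typedef struct { unsigned long pgd; } pgd_model;

        static void clone_pgd_range_model(pgd_model *dst, const pgd_model *src,
                                          int count)
        {
            /* caller guarantees the ranges don't overlap or cross a page */
            memcpy(dst, src, count * sizeof(pgd_model));
        }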
     
  • Add a new accessor for PTEs, which passes the full hint from the mmu_gather
    struct; this allows architectures with hardware pagetables to optimize away
    atomic PTE operations when destroying an address space. Removing the
    locked operation should allow better pipelining of memory access in this
    loop. I measured an average savings of 30-35 cycles per zap_pte_range on
    the first 500 destructions on Pentium-M, but I believe the optimization
    would win more on older processors which still assert the bus lock on xchg
    for an exclusive cacheline.

    Update: I made some new measurements, and this saves exactly 26 cycles over
    ptep_get_and_clear on Pentium M. On P4, with a PAE kernel, this saves 180
    cycles per ptep_get_and_clear, for a whopping 92160 cycles savings for a
    full address space destruction.

    pte_clear_full is not yet used, but is provided for future optimizations
    (in particular, when running inside of a hypervisor that queues page table
    updates, the full hint allows us to avoid queueing unnecessary page table
    updates for an address space in the process of being destroyed).

    This is not a huge win, but it does help a bit, and sets the stage for
    further hypervisor optimization of the mm layer on all architectures.

    Signed-off-by: Zachary Amsden
    Cc: Christoph Lameter
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
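
    The full-hint dispatch, modeled with a gcc builtin for the atomic
    exchange (all names are stand-ins):

        typedef unsigned long pteval_t;

        static pteval_t ptep_get_and_clear_full_model(pteval_t *ptep, int full)
        {
            if (full) {
                /* whole mm being torn down: nobody else can touch the pte,
                 * so skip the bus-locked exchange */
                pteval_t pte = *ptep;
                *ptep = 0;
                return pte;
            }
            return __sync_lock_test_and_set(ptep, 0);  /* atomic xchg stand-in */
        }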