29 Apr, 2008

1 commit


28 Apr, 2008

9 commits

  • If there's no VSA2 (ie, if we're using tinybios or OpenFirmware), use the
    GLIU's P2D Range Offset Descriptor to determine how much memory we have
    available for the framebuffer.

    Originally based on a patch by Jordan Crouse. Tested with OpenFirmware;
    Pascal informs me that tinybios has a stub that fills in P2D_RO0.

    Signed-off-by: Andres Salomon
    Cc: Jordan Crouse
    Cc: "Antonino A. Daplas"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andres Salomon
     
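    A minimal sketch of the computation described above. The MSR name and bit
    layout follow the GX/LX data sheets as I read them (PMAX in bits 39:20,
    PMIN in bits 19:0, both 4K page numbers); treat the details as assumptions
    rather than a quote from the patch:

        static unsigned int fb_size_from_p2d_ro0(void)
        {
            u32 lo, hi;
            unsigned int pages;

            /* P2D Range Offset Descriptor 0 describes the pages routed to
             * the framebuffer; the page count is (PMAX - PMIN) + 1 */
            rdmsr(MSR_GLIU_P2D_RO0, lo, hi);

            pages = ((hi & 0xff) << 12) | (lo >> 20);   /* PMAX */
            pages -= lo & 0x000fffff;                   /* PMIN */

            return (pages + 1) << 12;                   /* 4K pages -> bytes */
        }
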
  • ..Rather than using magic constants.

    Signed-off-by: Andres Salomon
    Cc: Jordan Crouse
    Cc: "Antonino A. Daplas"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andres Salomon
     
  • This is generic VSA2 detection. It's used by OLPC to determine whether or not
    the BIOS contains VSA2, but since other BIOSes are coming out that don't use
    the VSA (ie, tinybios), it might end up being useful for others.

    Signed-off-by: Andres Salomon
    Acked-by: Alan Cox
    Cc: Jordan Crouse
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andres Salomon
     
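    Roughly what the detection looks like: the VSA exposes virtual registers
    through an I/O-port index/data pair, and querying the signature register
    tells us whether VSA2 is present. The port numbers and signature value
    below are from memory and should be treated as assumptions:

        #define VSA_VRC_INDEX     0xAC1C
        #define VSA_VRC_DATA      0xAC1E
        #define VSA_VR_UNLOCK     0xFC53   /* unlock the virtual register index */
        #define VSA_VR_SIGNATURE  0x0003
        #define VSA_SIG           0x4132   /* ASCII '2A' */

        static int geode_has_vsa2(void)
        {
            u16 val;

            /* select and read the VSA signature virtual register */
            outw(VSA_VR_UNLOCK, VSA_VRC_INDEX);
            outw(VSA_VR_SIGNATURE, VSA_VRC_INDEX);
            val = inw(VSA_VRC_DATA);

            return val == VSA_SIG;
        }
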
  • This cleans up a few MSR-using drivers in the following manner:
      - Ensures MSRs are all defined in asm/geode.h, rather than in misc
        places
      - Makes the naming consistent: cs553[56] ones begin with MSR_,
        GX-specific ones start with MSR_GX_, and LX-specific ones start
        with MSR_LX_. Also, the names now match the data sheet.
      - Uses MSR names rather than numbers in source code
      - Documents the fact that the LX's MSR_PADSEL has the wrong value
        in the data sheet. That's, uh, good to note.

    Signed-off-by: Andres Salomon
    Acked-by: Jordan Crouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andres Salomon
     
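    The shape of the change, with a hypothetical register (the name and value
    below are illustrative only, not quoted from the patch):

        /* before: a magic number in driver code */
        rdmsrl(0x48002001, val);                    /* which register is this? */

        /* after: a data-sheet name, defined once in asm/geode.h */
        #define MSR_GX_GLD_MSR_CONFIG  0x48002001   /* hypothetical value */
        rdmsrl(MSR_GX_GLD_MSR_CONFIG, val);
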
  • Huge ptes have a special type on s390 and cannot be handled with the standard
    pte functions in certain cases, e.g. because of a different location of the
    invalid bit. This patch adds some new architecture-specific functions to
    hugetlb common code, as a prerequisite for the s390 large page support.

    This won't affect other architectures in functionality, but I need to add some
    new dummy inline functions to the headers.

    Acked-by: Martin Schwidefsky
    Signed-off-by: Gerald Schaefer
    Cc: Paul Mundt
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     
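    For architectures that need nothing special, the new hooks reduce to the
    generic pte operations. A sketch of the dummy inlines (signatures inferred
    from the description, not copied from the patch):

        /* include/asm-foo/hugetlb.h */
        static inline void set_huge_pte_at(struct mm_struct *mm,
                                           unsigned long addr,
                                           pte_t *ptep, pte_t pte)
        {
            set_pte_at(mm, addr, ptep, pte);
        }

        static inline pte_t huge_ptep_get(pte_t *ptep)
        {
            return *ptep;
        }

        static inline pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
                                                    unsigned long addr,
                                                    pte_t *ptep)
        {
            return ptep_get_and_clear(mm, addr, ptep);
        }
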
  • A COW break on a hugetlbfs page with page_count > 1 will set a new pte with
    set_huge_pte_at(), without any TLB flush operation. The old pte will remain
    in the TLB, and subsequent write access to the page will result in a page
    fault loop for as long as it may take until the TLB is flushed from
    somewhere else. This patch introduces an architecture-specific
    huge_ptep_clear_flush() function, which is called before the
    set_huge_pte_at() in hugetlb_cow().

    ATTENTION: This is just a nop on all architectures for now, the s390
    implementation will come with our large page patch later. Other architectures
    should define their own huge_ptep_clear_flush() if needed.

    Acked-by: Martin Schwidefsky
    Signed-off-by: Gerald Schaefer
    Cc: Paul Mundt
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     
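    A sketch of the no-op hook and its call site, per the text above (the
    hugetlb_cow() excerpt is abridged):

        /* include/asm-foo/hugetlb.h: no-op until an arch needs a real flush */
        static inline void huge_ptep_clear_flush(struct vm_area_struct *vma,
                                                 unsigned long addr, pte_t *ptep)
        {
        }

        /* mm/hugetlb.c, in hugetlb_cow(): flush the old pte before the new
         * one becomes visible */
        huge_ptep_clear_flush(vma, address, ptep);
        set_huge_pte_at(mm, address, ptep, make_huge_pte(vma, new_page, 1));
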
  • This patch moves all architecture functions for hugetlb to architecture header
    files (include/asm-foo/hugetlb.h) and converts all macros to inline functions.
    It also removes (!) ARCH_HAS_HUGEPAGE_ONLY_RANGE,
    ARCH_HAS_HUGETLB_FREE_PGD_RANGE, ARCH_HAS_PREPARE_HUGEPAGE_RANGE,
    ARCH_HAS_SETCLEAR_HUGE_PTE and ARCH_HAS_HUGETLB_PREFAULT_HOOK.

    Getting rid of the ARCH_HAS_xxx #ifdef and macro fugliness should increase
    readability and maintainability, at the price of some code duplication. An
    asm-generic common part would have reduced the loc, but we would end up with
    new ARCH_HAS_xxx defines eventually.

    Acked-by: Martin Schwidefsky
    Signed-off-by: Gerald Schaefer
    Cc: Paul Mundt
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     
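    The flavor of the conversion, using the prefault hook as an example (this
    is illustrative; the exact before/after text varies per architecture):

        /* before: opt-in define plus a macro */
        #define ARCH_HAS_HUGETLB_PREFAULT_HOOK
        #define hugetlb_prefault_arch_hook(mm)  do { } while (0)

        /* after: a typed inline in include/asm-foo/hugetlb.h, no ARCH_HAS_xxx */
        static inline void hugetlb_prefault_arch_hook(struct mm_struct *mm)
        {
        }
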
  • s390, for one, cannot implement VM_MIXEDMAP with pfn_valid, due to its memory
    model (which is more dynamic than most). Instead, the s390 people had
    proposed to implement it with an additional path through vm_normal_page(),
    using a bit in the pte to determine whether or not the page should be
    refcounted:

    vm_normal_page()
    {
        ...
        if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
            if (vma->vm_flags & VM_MIXEDMAP) {
    #ifdef s390
                if (!mixedmap_refcount_pte(pte))
                    return NULL;
    #else
                if (!pfn_valid(pfn))
                    return NULL;
    #endif
                goto out;
            }
            ...
        }
    }

    This is fine; however, if we are allowed to use a bit in the pte to determine
    refcountedness, we can use that to _completely_ replace all the vma-based
    schemes. So instead of adding more cases to the already complex vma-based
    scheme, we can have a clearly separate and simple pte-based scheme (and get
    slightly better code generation in the process):

    vm_normal_page()
    {
    #ifdef s390
        if (!mixedmap_refcount_pte(pte))
            return NULL;
        return pte_page(pte);
    #else
        ...
    #endif
    }

    And finally, we would rather make this concept usable by any architecture
    than make it s390-only, so implement a new type of pte state for this.
    Unfortunately the old vma-based code must stay, because some architectures
    may not be able to spare pte bits. This makes vm_normal_page a little bit
    more ugly than we would like, but the two cases are clearly separate.

    So introduce a pte_special pte state, and use it in mm/memory.c. It is
    currently a noop for all architectures, so this doesn't actually result in any
    compiled code changes to mm/memory.o.

    BTW:
    I haven't put vm_normal_page() into arch code as-per an earlier suggestion.
    The reason is that, regardless of where vm_normal_page is actually
    implemented, the *abstraction* is still exactly the same. Also, while it
    depends on whether the architecture has pte_special or not, those are the
    only two possible cases, and it really isn't an arch-specific function --
    the role of the arch code should be to provide primitive functions and
    accessors with which to build the core code; pte_special does that. We do
    not want architectures to know or care about vm_normal_page itself, and
    we definitely don't want them being able to invent something new there
    out of sight of mm/ code. If we made vm_normal_page an arch function, then
    we have to make vm_insert_mixed (next patch) an arch function too. So I
    don't think moving it to arch code fundamentally improves any abstractions,
    while it does practically make the code more difficult to follow, for both
    mm and arch developers, and easier to misuse.

    [akpm@linux-foundation.org: build fix]
    Signed-off-by: Nick Piggin
    Acked-by: Carsten Otte
    Cc: Jared Hulbert
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
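    The noop state amounts to two accessors that every architecture now
    provides in its pgtable.h; a sketch of the default versions:

        /* no pte bit to spare: nothing is ever special */
        static inline int pte_special(pte_t pte)
        {
            return 0;
        }

        static inline pte_t pte_mkspecial(pte_t pte)
        {
            return pte;
        }

    An architecture that can spare a pte bit implements these for real and
    thereby opts in to the pte-based scheme.
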
  • * 'kvm-updates-2.6.26' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm: (147 commits)
    KVM: kill file->f_count abuse in kvm
    KVM: MMU: kvm_pv_mmu_op should not take mmap_sem
    KVM: SVM: remove selective CR0 comment
    KVM: SVM: remove now obsolete FIXME comment
    KVM: SVM: disable CR8 intercept when tpr is not masking interrupts
    KVM: SVM: sync V_TPR with LAPIC.TPR if CR8 write intercept is disabled
    KVM: export kvm_lapic_set_tpr() to modules
    KVM: SVM: sync TPR value to V_TPR field in the VMCB
    KVM: ppc: PowerPC 440 KVM implementation
    KVM: Add MAINTAINERS entry for PowerPC KVM
    KVM: ppc: Add DCR access information to struct kvm_run
    ppc: Export tlb_44x_hwater for KVM
    KVM: Rename debugfs_dir to kvm_debugfs_dir
    KVM: x86 emulator: fix lea to really get the effective address
    KVM: x86 emulator: fix smsw and lmsw with a memory operand
    KVM: x86 emulator: initialize src.val and dst.val for register operands
    KVM: SVM: force a new asid when initializing the vmcb
    KVM: fix kvm_vcpu_kick vs __vcpu_run race
    KVM: add ioctls to save/store mpstate
    KVM: Rename VCPU_MP_STATE_* to KVM_MP_STATE_*
    ...

    Linus Torvalds
     

27 Apr, 2008

30 commits

  • So userspace can save/restore the mpstate during migration.

    [avi: export the #define constants describing the value]
    [christian: add s390 stubs]
    [avi: ditto for ia64]

    Signed-off-by: Marcelo Tosatti
    Signed-off-by: Christian Borntraeger
    Signed-off-by: Carsten Otte
    Signed-off-by: Avi Kivity

    Marcelo Tosatti
     
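    Userspace usage, roughly (the ioctl names are from the series; the error
    handling is illustrative):

        struct kvm_mp_state mp_state;

        /* save on the source host ... */
        if (ioctl(vcpu_fd, KVM_GET_MP_STATE, &mp_state) < 0)
            err(1, "KVM_GET_MP_STATE");

        /* ... restore on the destination after migration */
        if (ioctl(vcpu_fd, KVM_SET_MP_STATE, &mp_state) < 0)
            err(1, "KVM_SET_MP_STATE");
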
  • We wish to export it to userspace, so move it into the kvm namespace.

    Signed-off-by: Avi Kivity

    Avi Kivity
     
  • Trace markers allow userspace to trace execution of a virtual machine
    in order to monitor its performance.

    Signed-off-by: Feng (Eric) Liu
    Signed-off-by: Avi Kivity

    Feng (Eric) Liu
     
  • To properly forward an MCE that occurred while the guest is running to the
    host, we have to intercept this exception and call the host handler by hand.
    This patch implements that.

    Signed-off-by: Joerg Roedel
    Signed-off-by: Avi Kivity

    Joerg Roedel
     
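    The "by hand" part is small; a sketch of the SVM intercept handler
    (vector 18 is #MC; the function and struct names are assumptions based on
    the SVM code of the time):

        static int mc_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
        {
            /* the host's #MC handler is not invoked automatically on an
             * intercepted machine check, so raise the vector ourselves */
            asm volatile("int $0x12");
            return 1;
        }
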
  • This patch introduces a gfn_to_pfn() function and corresponding functions like
    kvm_release_pfn_dirty(). Using these new functions, we can modify the x86
    MMU to no longer assume that it can always get a struct page for any given gfn.

    We don't want to eliminate gfn_to_page() entirely because a number of places
    assume they can do gfn_to_page() and then kmap() the results. When we support
    IO memory, gfn_to_page() will fail for IO pages although gfn_to_pfn() will
    succeed.

    This does not implement support for avoiding reference counting for reserved
    RAM or for IO memory. However, it should make those things pretty
    straightforward.

    Since we're only introducing new common symbols, I don't think it will break
    the non-x86 architectures but I haven't tested those. I've tested Intel,
    AMD, NPT, and hugetlbfs with Windows and Linux guests.

    [avi: fix overflow when shifting left pfns by adding casts]

    Signed-off-by: Anthony Liguori
    Signed-off-by: Avi Kivity

    Anthony Liguori
     
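    The shape of the new API, with gfn_to_page() reduced to a wrapper over it
    (a sketch; the IO-page fallback is abridged):

        pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
        void kvm_release_pfn_clean(pfn_t pfn);
        void kvm_release_pfn_dirty(pfn_t pfn);

        /* kept for callers that go on to kmap() the result */
        struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
        {
            pfn_t pfn = gfn_to_pfn(kvm, gfn);

            if (pfn_valid(pfn))
                return pfn_to_page(pfn);

            /* IO memory: there is no struct page to hand out */
            return bad_page;
        }
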
  • The kvm_host.h file for x86 declares the functions kvm_set_cr[0348]. In the
    header file, their second parameter is named cr0 in all cases. This patch
    renames the parameters so that they match the function names.

    Signed-off-by: Joerg Roedel
    Signed-off-by: Avi Kivity

    Joerg Roedel
     
  • Unify slots_lock acquisition around vcpu_run(). This is simpler and less
    error-prone.

    Also fix some callsites that were not grabbing the lock properly.

    [avi: drop slots_lock while in guest mode to avoid holding the lock
    for indefinite periods]

    Signed-off-by: Marcelo Tosatti
    Signed-off-by: Avi Kivity

    Marcelo Tosatti
     
  • This emulates the x86 hardware task-switch mechanism in software, as it is
    not supported by either vmx or svm. It allows operating systems which use
    it, like FreeDOS, to run as kvm guests.

    Signed-off-by: Izik Eidus
    Signed-off-by: Avi Kivity

    Izik Eidus
     
  • Signed-off-by: Izik Eidus
    Signed-off-by: Avi Kivity

    Izik Eidus
     
  • Signed-off-by: Avi Kivity

    Avi Kivity
     
  • It will allow external users to call it. It is mainly useful for routines
    that override the machine_ops field for their own special purposes but
    want to call the normal shutdown routine after they're done (see the
    sketch below).

    Signed-off-by: Glauber Costa
    Signed-off-by: Avi Kivity

    Glauber Costa
     
  • This patch allows machine_crash_shutdown to be replaced, just like any of
    the other functions in machine_ops (a sketch of such an override follows
    below).

    Signed-off-by: Glauber Costa
    Signed-off-by: Avi Kivity

    Glauber Costa
     
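    How the two pieces fit together: a client can now install its own hook in
    machine_ops and still chain to the exported native routine. A sketch
    (kvm_disable_clock() is hypothetical cleanup, not from these patches):

        static void kvm_crash_shutdown(struct pt_regs *regs)
        {
            kvm_disable_clock();                    /* hypothetical cleanup */
            native_machine_crash_shutdown(regs);    /* exported native routine */
        }

        /* at init time */
        machine_ops.crash_shutdown = kvm_crash_shutdown;
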
  • Hypercall based pte updates are faster than faults, and also allow use
    of the lazy MMU mode to batch operations.

    Don't report the feature if two dimensional paging is enabled.

    [avi:
    - one mmu_op hypercall instead of one per op
    - allow 64-bit gpa on hypercall
    - don't pass host errors (-ENOMEM) to guest]

    [akpm: warning fix on i386]

    Signed-off-by: Marcelo Tosatti
    Signed-off-by: Andrew Morton
    Signed-off-by: Avi Kivity

    Marcelo Tosatti
     
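    The ABI, roughly as merged: a single hypercall carries a buffer of
    operations, each led by a header (struct layout per the series; treat the
    exact values as assumptions):

        #define KVM_HC_MMU_OP            2

        #define KVM_MMU_OP_WRITE_PTE     1
        #define KVM_MMU_OP_FLUSH_TLB     2
        #define KVM_MMU_OP_RELEASE_PT    3

        struct kvm_mmu_op_header {
            __u32 op;
            __u32 pad;
        };

        struct kvm_mmu_op_write_pte {
            struct kvm_mmu_op_header header;
            __u64 pte_phys;
            __u64 pte_val;
        };
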
  • Signed-off-by: Avi Kivity

    Avi Kivity
     
  • Add basic KVM paravirt support. Avoid vm-exits on IO delays.

    Signed-off-by: Marcelo Tosatti
    Signed-off-by: Avi Kivity

    Marcelo Tosatti
     
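    The IO-delay part is tiny; a sketch of the guest side (function names as
    in the guest series):

        /* port-0x80 writes become pure no-ops under KVM */
        static void kvm_io_delay(void)
        {
        }

        static void __init paravirt_ops_setup(void)
        {
            pv_info.name = "KVM";

            if (kvm_para_has_feature(KVM_FEATURE_NOP_IO_DELAY))
                pv_cpu_ops.io_delay = kvm_io_delay;
        }
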
  • Signed-off-by: Sheng Yang
    Signed-off-by: Avi Kivity

    Sheng Yang
     
  • The patch moves the PIT model from userspace to kernel, and increases
    the timer accuracy greatly.

    [marcelo: make last_injected_time per-guest]

    Signed-off-by: Sheng Yang
    Signed-off-by: Marcelo Tosatti
    Tested-and-Acked-by: Alex Davis
    Signed-off-by: Avi Kivity

    Sheng Yang
     
  • Names like 'set_cr3()' look dangerously close to affecting the host.

    Signed-off-by: Avi Kivity

    Avi Kivity
     
  • Create large page mappings if the guest PTEs are marked as such and
    the underlying memory is hugetlbfs backed. If the largepage contains
    write-protected pages, a large pte is not used.

    Gives a consistent 2% improvement for data copies on ram mounted
    filesystem, without NPT/EPT.

    Anthony measures a 4% improvement on 4-way kernbench, with NPT.

    Signed-off-by: Marcelo Tosatti
    Signed-off-by: Avi Kivity

    Marcelo Tosatti
     
  • Mark zapped root pagetables as invalid and ignore such pages during lookup.

    This is a problem with the cr3-target feature, where a zapped root table fools
    the faulting code into creating a read-only mapping. The result is a lockup
    if the instruction can't be emulated.

    Signed-off-by: Marcelo Tosatti
    Cc: Anthony Liguori
    Signed-off-by: Avi Kivity

    Marcelo Tosatti
     
  • Signed-off-by: Amit Shah
    Signed-off-by: Avi Kivity

    Amit Shah
     
  • This is the host part of kvm clocksource implementation. As it does
    not include clockevents, it is a fairly simple implementation. We
    only have to register a per-vcpu area, and start writing to it periodically.

    The area is binary compatible with xen, as we use the same shadow_info
    structure.

    [marcelo: fix bad_page on MSR_KVM_SYSTEM_TIME]
    [avi: save full value of the msr, even if enable bit is clear]
    [avi: clear previous value of time_page]

    Signed-off-by: Glauber de Oliveira Costa
    Signed-off-by: Marcelo Tosatti
    Signed-off-by: Avi Kivity

    Glauber de Oliveira Costa
     
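    For context, what a guest does with this: it hands the host the physical
    address of a per-vcpu area via an MSR, with the low bit set to enable
    periodic updates. A sketch (MSR and struct names per the companion guest
    series; the details here are assumptions):

        static DEFINE_PER_CPU(struct kvm_vcpu_time_info, hv_clock);

        static void kvm_register_clock(void)
        {
            u64 pa = __pa(&per_cpu(hv_clock, smp_processor_id()));

            /* bit 0 asks the host to start writing the area periodically */
            wrmsrl(MSR_KVM_SYSTEM_TIME, pa | 1);
        }
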
  • The load_pdptrs() function is required in the SVM module for NPT support.

    Signed-off-by: Joerg Roedel
    Signed-off-by: Avi Kivity

    Joerg Roedel
     
  • The generic x86 code has to know if the specific implementation uses Nested
    Paging. In the generic code Nested Paging is called Two Dimensional Paging
    (TDP) to avoid confusion with (future) TDP implementations of other vendors.
    This patch exports the availability of TDP to the generic x86 code.

    Signed-off-by: Joerg Roedel
    Signed-off-by: Avi Kivity

    Joerg Roedel
     
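    The export is deliberately minimal; a sketch of the generic side and the
    SVM caller:

        /* arch/x86/kvm/mmu.c */
        bool tdp_enabled = false;

        void kvm_enable_tdp(void)
        {
            tdp_enabled = true;
        }
        EXPORT_SYMBOL_GPL(kvm_enable_tdp);

        /* svm.c, at hardware setup time */
        if (npt_enabled)
            kvm_enable_tdp();

    The generic MMU then keys its behavior off tdp_enabled instead of asking
    the vendor module.
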
  • This patch gives the SVM and VMX implementations the ability to add some
    bits the guest can set in its EFER register.

    Signed-off-by: Joerg Roedel
    Signed-off-by: Avi Kivity

    Joerg Roedel
     
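    A sketch of the mechanism: generic code keeps a reserved-bits mask for
    EFER, and vendor modules clear bits out of it (the initial mask value is
    abridged here):

        static u64 __read_mostly efer_reserved_bits = ~0ULL;   /* abridged */

        void kvm_enable_efer_bits(u64 mask)
        {
            /* the guest may now set these bits in EFER */
            efer_reserved_bits &= ~mask;
        }
        EXPORT_SYMBOL_GPL(kvm_enable_efer_bits);
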
  • To allow TLB entries to be retained across VM entry and VM exit, the VMM
    can now identify distinct address spaces through a new virtual-processor ID
    (VPID) field of the VMCS.

    [avi: drop vpid_sync_all()]
    [avi: add "cc" to asm constraints]

    Signed-off-by: Sheng Yang
    Signed-off-by: Avi Kivity

    Sheng Yang
     
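    Each vcpu gets a VPID out of a global bitmap, with 0 reserved to mean "no
    VPID"; a sketch (names per the vmx code of the time, treated as
    assumptions):

        static DEFINE_SPINLOCK(vmx_vpid_lock);
        static DECLARE_BITMAP(vmx_vpid_bitmap, VMX_NR_VPIDS);

        static void allocate_vpid(struct vcpu_vmx *vmx)
        {
            int vpid;

            vmx->vpid = 0;    /* 0 = share the host's TLB tag */
            spin_lock(&vmx_vpid_lock);
            vpid = find_first_zero_bit(vmx_vpid_bitmap, VMX_NR_VPIDS);
            if (vpid < VMX_NR_VPIDS) {
                vmx->vpid = vpid;
                __set_bit(vpid, vmx_vpid_bitmap);
            }
            spin_unlock(&vmx_vpid_lock);
        }
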
  • Signed-off-by: Yaozu (Eddie) Dong
    Signed-off-by: Avi Kivity

    Dong, Eddie
     
  • OK, so 25-mm1 gave a lockdep error which made me look into this.

    The first thing that I noticed was the horrible mess; the second thing I
    saw was hacks like: 71e93d15612c61c2e26a169567becf088e71b8ff

    The problem is that arch idle routines are somewhat inconsistent with
    their IRQ state handling and, instead of fixing _that_, we go and paper
    over the problem.

    So the thing I've tried to do is set a standard for idle routines and
    fix them all up to adhere to it. The rules are:

    idle routines are entered with IRQs disabled
    idle routines will exit with IRQs enabled

    Nearly all already did this in one form or another.

    Merge the 32 and 64 bit bits so they no longer have different bugs.

    As for the actual lockdep warning; __sti_mwait() did a plainly un-annotated
    irq-enable.

    Signed-off-by: Peter Zijlstra
    Tested-by: Bob Copeland
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
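    The contract in code form, as a sketch of what each idle routine now
    looks like:

        static void some_idle(void)
        {
            /* entered with IRQs disabled */
            if (!need_resched())
                safe_halt();        /* sti; hlt -- wakes up with IRQs enabled */
            else
                local_irq_enable();
            /* exits with IRQs enabled */
        }
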
  • …nux-2.6-x86-bigbox-bootmem-v3

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86-bigbox-bootmem-v3:
    x86_64/mm: check and print vmemmap allocation continuous
    x86_64: fix setup_node_bootmem to support big mem excluding with memmap
    x86_64: make reserve_bootmem_generic() use new reserve_bootmem()
    mm: allow reserve_bootmem() cross nodes
    mm: offset align in alloc_bootmem()
    mm: fix alloc_bootmem_core to use fast searching for all nodes
    mm: make mem_map allocation continuous

    Linus Torvalds
     
  • Typical case: a four-socket system, every node has 4G of RAM, and we are
    using:

    memmap=10g$4g

    to mask out memory on node1 and node2.

    When NUMA is enabled, early_node_mem is used to get node_data and
    node_bootmap.

    If it cannot get memory from the same node with find_e820_area(), it will
    use alloc_bootmem to get a buffer from previous nodes.

    So check for that case and print out some info about it.

    We need to move early_res_to_bootmem into every setup_node_bootmem call,
    so that it takes the range that the node has; otherwise alloc_bootmem
    could return an address that was reserved early.

    Depends on "mm: make reserve_bootmem can crossed the nodes".

    Signed-off-by: Yinghai Lu
    Signed-off-by: Ingo Molnar

    Yinghai Lu