31 Dec, 2008

40 commits

  • 1. Increase the size of the data area to 64M.
    2. Support more vcpus and memory: guests can now use up to 128 vcpus
    and 256G of memory.
    3. Add boundary checks for memory and vcpu allocation.

    With this patch, the kvm guest's data area looks as follows:
    *
    *            +----------------------+ ------- KVM_VM_DATA_SIZE
    *            |    vcpu[n]'s data    |   |     ___________________KVM_STK_OFFSET
    *            |                      |   |    /                   |
    *            |      ..........      |   |   /vcpu's struct&stack |
    *            |      ..........      |   |  /---------------------|---- 0
    *            |    vcpu[5]'s data    |   | /         vpd          |
    *            |    vcpu[4]'s data    |   |/-----------------------|
    *            |    vcpu[3]'s data    |   /          vtlb          |
    *            |    vcpu[2]'s data    |  /|------------------------|
    *            |    vcpu[1]'s data    |/  |          vhpt          |
    *            |    vcpu[0]'s data    |____________________________|
    *            +----------------------+   |
    *            |   memory dirty log   |   |
    *            +----------------------+   |
    *            |   vm's data struct   |   |
    *            +----------------------+   |
    *            |                      |   |
    *            |                      |   |
    *            |                      |   |
    *            |                      |   |
    *            |                      |   |
    *            |                      |   |
    *            |                      |   |
    *            |    vm's p2m table    |   |
    *            |                      |   |
    *            |                      |   |
    *            |                      |   |  |
    * vm's data->|                      |   |  |
    *            +----------------------+ ------- 0
    * To support large memory, the size of the p2m table must increase.
    * To support more vcpus, the data area must have enough space to
    * hold the vcpus' data.
    */

    Signed-off-by: Xiantao Zhang
    Signed-off-by: Avi Kivity

    Xiantao Zhang
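
    The boundary check from item 3 can be sketched as follows. This is a
    minimal illustration, not the patch's code: only the 64M data area size
    comes from the commit message. The region sizes, the 64K guest page size,
    the 8-byte p2m entry, and every name below are assumptions picked so the
    arithmetic is easy to follow.

```c
#include <stdbool.h>
#include <stdint.h>

/* 64M data area, as stated in the commit message. */
#define VM_DATA_SIZE     (UINT64_C(64) << 20)

/* Hypothetical region sizes, chosen only for illustration. */
#define P2M_SIZE         (UINT64_C(32) << 20)   /* p2m table at the bottom */
#define VM_STRUCT_SIZE   (UINT64_C(64) << 10)   /* vm's data struct */
#define DIRTY_LOG_SIZE   (UINT64_C(512) << 10)  /* memory dirty log */
#define VCPU_DATA_SIZE   (UINT64_C(128) << 10)  /* vhpt + vtlb + vpd + struct&stack */
#define P2M_ENTRY_SIZE   8u                     /* assumed bytes per p2m entry */
#define GUEST_PAGE_SHIFT 16u                    /* assumed 64K guest pages */

/* Offset of vcpu[n]'s data, counted from the base of the data area. */
uint64_t vcpu_data_offset(unsigned int n)
{
    return P2M_SIZE + VM_STRUCT_SIZE + DIRTY_LOG_SIZE
           + (uint64_t)n * VCPU_DATA_SIZE;
}

/* vcpu[n] fits only if its whole slot ends at or below VM_DATA_SIZE. */
bool vcpu_fits(unsigned int n)
{
    return vcpu_data_offset(n) + VCPU_DATA_SIZE <= VM_DATA_SIZE;
}

/* Largest guest memory the p2m table can describe, one entry per page. */
uint64_t max_guest_memory(void)
{
    return (P2M_SIZE / P2M_ENTRY_SIZE) << GUEST_PAGE_SHIFT;
}
```

    With these assumed sizes, vcpu_fits(127) holds and max_guest_memory()
    comes out at 256G; growing either limit means growing the p2m table or
    the per-vcpu area, which is why the data area itself had to grow.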
     
  • If emulate_invalid_guest_state is enabled, the emulator is called
    when guest state is invalid. Until now, we reported an mmio failure
    when emulate_instruction() returned EMULATE_DO_MMIO. This patch adds
    the case where emulate_instruction() failed and an MMIO emulation
    is needed.

    Signed-off-by: Guillaume Thouvenin
    Signed-off-by: Avi Kivity

    Guillaume Thouvenin
     
  • If we call the emulator we shouldn't call skip_emulated_instruction()
    in the first place, since the emulator already computes the next rip
    for us. Thus we move ->skip_emulated_instruction() out of
    kvm_emulate_pio() and into handle_io() (and the svm equivalent). We
    also replaced "return 0" by "break" in the "do_io:" case because now
    the shadow register state needs to be committed. Otherwise eip will never
    be updated.

    Signed-off-by: Guillaume Thouvenin
    Signed-off-by: Avi Kivity

    Guillaume Thouvenin
     
  • The busy flag of the TR selector is not set by the hardware. This breaks
    migration from amd hosts to intel hosts.

    Signed-off-by: Amit Shah
    Signed-off-by: Avi Kivity

    Amit Shah
     
  • The hardware does not set the 'g' bit of the cs selector and this breaks
    migration from amd hosts to intel hosts. Set this bit if the segment
    limit is beyond 1 MB.

    Signed-off-by: Amit Shah
    Signed-off-by: Avi Kivity

    Amit Shah
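
    The rule described above (set 'g' when the limit is beyond 1 MB) can be
    sketched like this. The struct and function names are hypothetical and
    much simpler than KVM's own segment representation; the only fact relied
    on is that a 20-bit, byte-granular limit cannot express more than 0xfffff.

```c
#include <stdint.h>

/* Hypothetical, simplified view of a segment register; not KVM's struct. */
struct seg {
    uint32_t limit;  /* segment limit in bytes */
    uint8_t  g;      /* granularity bit: 1 = limit counted in 4K pages */
};

/* A 20-bit byte-granular limit tops out at 0xfffff (1 MB - 1). */
#define SEG_LIMIT_1MB 0xfffffu

/* Set the granularity bit when the limit cannot be byte-granular. */
void fixup_granularity(struct seg *s)
{
    if (s->limit > SEG_LIMIT_1MB)
        s->g = 1;
}
```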
     
  • get_segment_descritptor_dtable() contains an obvious typo.

    Signed-off-by: Amit Shah
    Signed-off-by: Avi Kivity

    Amit Shah
     
  • Also remove unnecessary parameter of unregister irq ack notifier.

    Signed-off-by: Sheng Yang
    Signed-off-by: Avi Kivity

    Sheng Yang
     
  • As suggested by Avi, this patch introduces a counter of VCPUs that have
    LVT0 set to NMI mode. Only if the counter is greater than 0 do we push
    the PIT ticks via all LAPIC LVT0 lines to enable NMI watchdog support.

    Signed-off-by: Jan Kiszka
    Acked-by: Sheng Yang
    Signed-off-by: Avi Kivity

    Jan Kiszka
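
    A rough sketch of the counter idea, assuming a per-VM counter that is
    adjusted whenever a VCPU rewrites its LVT0 entry. The struct, the function
    names, and the way the delivery mode is passed in are all hypothetical;
    0x4 is the APIC delivery-mode encoding for NMI.

```c
#include <stdbool.h>

#define DELIVERY_MODE_FIXED 0x0u
#define DELIVERY_MODE_NMI   0x4u  /* APIC delivery mode 100b */

/* Hypothetical per-VM state; not KVM's struct. */
struct vm_state {
    int vcpus_in_nmi_mode;  /* VCPUs whose LVT0 is set to NMI mode */
};

/* Call whenever a VCPU changes the delivery mode of its LVT0 entry. */
void lvt0_mode_changed(struct vm_state *vm, unsigned int old_mode,
                       unsigned int new_mode)
{
    bool was_nmi = (old_mode == DELIVERY_MODE_NMI);
    bool is_nmi = (new_mode == DELIVERY_MODE_NMI);

    if (!was_nmi && is_nmi)
        vm->vcpus_in_nmi_mode++;
    else if (was_nmi && !is_nmi)
        vm->vcpus_in_nmi_mode--;
}

/* PIT ticks are pushed to the LVT0 lines only while someone listens. */
bool should_forward_pit_tick(const struct vm_state *vm)
{
    return vm->vcpus_in_nmi_mode > 0;
}
```

    Keeping a counter means the common case (no guest uses the NMI watchdog)
    costs a single comparison per PIT tick.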
     
  • This patch refactors the NMI watchdog delivery patch, consolidating
    tests and providing a proper API for delivering watchdog events.

    An included micro-optimization is to check only for apic_hw_enabled in
    kvm_apic_local_deliver (the test for LVT mask is covering the
    soft-disabled case already).

    Signed-off-by: Jan Kiszka
    Acked-by: Sheng Yang
    Signed-off-by: Avi Kivity

    Jan Kiszka
     
  • Add decode entries for the 0x04 and 0x05 (ADD) opcodes; execution is
    already implemented.

    Signed-off-by: Guillaume Thouvenin
    Signed-off-by: Avi Kivity

    Guillaume Thouvenin
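
    To illustrate what a "decode entry" is, here is a toy opcode table in
    which an opcode can only be emulated if its slot is non-zero. The flag
    names and the table are hypothetical and are not the emulator's real
    definitions; the opcode meanings (0x04 is ADD AL, imm8 and 0x05 is
    ADD eAX, imm16/32) are standard x86.

```c
#include <stdint.h>

/* Hypothetical operand-description flags. */
#define BYTE_OP      (1u << 0)  /* 8-bit operands */
#define DST_ACC      (1u << 1)  /* destination is the accumulator */
#define SRC_IMM      (1u << 2)  /* source is a full-size immediate */
#define SRC_IMM_BYTE (1u << 3)  /* source is an 8-bit immediate */

/* One entry per one-byte opcode; 0 means "cannot decode". */
const uint8_t decode_table[256] = {
    [0x04] = BYTE_OP | DST_ACC | SRC_IMM_BYTE,  /* ADD AL, imm8 */
    [0x05] = DST_ACC | SRC_IMM,                 /* ADD eAX, imm16/32 */
};

/* The decoder rejects any opcode without a table entry. */
int is_decodable(uint8_t opcode)
{
    return decode_table[opcode] != 0;
}
```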
     
  • PCI device assignment maps guest MMIO spaces as separate slots, so a
    device with more than 2 MMIO spaces could overwrite the current
    private memslot.

    This patch moves the private memory slots to the top of the
    userspace-visible memory slots.

    Signed-off-by: Sheng Yang
    Signed-off-by: Avi Kivity

    Sheng Yang
     
  • Otherwise set_bit() for a private memory slot (above KVM_MEMORY_SLOTS)
    would corrupt memory on a 32-bit host.

    Signed-off-by: Sheng Yang
    Signed-off-by: Avi Kivity

    Sheng Yang
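
    The failure mode is easiest to see with a bitmap indexed by slot number:
    if the bitmap is a single unsigned long, any slot index at or above 32
    lands outside it on a 32-bit host. The sketch below sizes the bitmap from
    the total slot count instead. The slot counts and all names are
    assumptions made for illustration.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define USER_SLOTS    32u  /* assumed number of userspace-visible slots */
#define PRIVATE_SLOTS 4u   /* assumed number of private slots above them */
#define TOTAL_SLOTS   (USER_SLOTS + PRIVATE_SLOTS)

#define BITS_PER_WORD   (sizeof(unsigned long) * CHAR_BIT)
#define WORDS_FOR(bits) (((bits) + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Sized to cover private slots too, whatever the word width is. */
struct slot_bitmap {
    unsigned long words[WORDS_FOR(TOTAL_SLOTS)];
};

/* Mark a slot; refuses indices that would write past the bitmap. */
bool slot_set(struct slot_bitmap *bm, unsigned int slot)
{
    if (slot >= TOTAL_SLOTS)
        return false;
    bm->words[slot / BITS_PER_WORD] |= 1UL << (slot % BITS_PER_WORD);
    return true;
}

bool slot_test(const struct slot_bitmap *bm, unsigned int slot)
{
    if (slot >= TOTAL_SLOTS)
        return false;
    return (bm->words[slot / BITS_PER_WORD] >> (slot % BITS_PER_WORD)) & 1UL;
}
```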
     
  • Remove one leftover improper comment about the removed CR2.

    Signed-off-by: Sheng Yang
    Signed-off-by: Avi Kivity

    Sheng Yang
     
  • The effective memory type under EPT is a combination of MSR_IA32_CR_PAT
    and the memory type field of the EPT entry.

    Signed-off-by: Sheng Yang
    Signed-off-by: Avi Kivity

    Sheng Yang
     
  • For EPT memory type support.

    Signed-off-by: Sheng Yang
    Signed-off-by: Avi Kivity

    Sheng Yang
     
  • GUEST_PAT support is a new feature introduced by the Intel Core i7
    architecture. With it, the cpu saves and loads the guest and host PAT
    automatically, because the EPT memory type in the guest depends on
    MSR_IA32_CR_PAT.

    Also add save/restore for MSR_IA32_CR_PAT.

    Signed-off-by: Sheng Yang
    Signed-off-by: Avi Kivity

    Sheng Yang
     
  • Also reset the mmu context when setting MTRR.

    Signed-off-by: Sheng Yang
    Signed-off-by: Avi Kivity

    Sheng Yang
     
  • So that KVM can reuse the type definitions, which it needs to support
    shadow MTRR.

    Signed-off-by: Sheng Yang
    Signed-off-by: Avi Kivity

    Sheng Yang
     
  • Prepare for exporting them.

    Signed-off-by: Sheng Yang
    Signed-off-by: Avi Kivity

    Sheng Yang
     
  • Call kvm_arch_vcpu_reset() instead of directly using arch callback.
    The function does additional things.

    Signed-off-by: Gleb Natapov
    Signed-off-by: Avi Kivity

    Gleb Natapov
     
  • Older VMX supporting CPUs do not provide the "Virtual NMI" feature for
    tracking the NMI-blocked state after injecting such events. For now
    KVM is unable to inject NMIs on those CPUs.

    Derived from Sheng Yang's suggestion to use the IRQ window notification
    for detecting the end of NMI handlers, this patch implements virtual
    NMI support without impact on the host's ability to receive real NMIs.
    The downside is that the given approach requires some heuristics that
    can cause NMI nesting in very rare corner cases.

    The approach works as follows:
    - inject NMI and set a software-based NMI-blocked flag
    - arm the IRQ window start notification whenever an NMI window is
    requested
    - if the guest exits due to an opening IRQ window, clear the emulated
    NMI-blocked flag
    - if the guest net execution time with NMI-blocked but without an IRQ
    window exceeds 1 second, force NMI-blocked reset and inject anyway

    This approach covers most practical scenarios:
    - succeeding NMIs are separated by at least one open IRQ window
    - the guest may spin with IRQs disabled (e.g. due to a bug), but
    leaving the NMI handler takes much less time than one second
    - the guest does not rely on strict ordering or timing of NMIs
    (would be problematic in virtualized environments anyway)

    Successfully tested with the 'nmi n' monitor command, the kgdbts
    testsuite on smp guests (additional patches required to add debug
    register support to kvm) + the kernel's nmi_watchdog=1, and a Siemens-
    specific board emulation (+ guest) that comes with its own NMI
    watchdog mechanism.

    Signed-off-by: Jan Kiszka
    Signed-off-by: Avi Kivity

    Jan Kiszka
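
    The four steps listed above can be sketched as a small state machine.
    This is only an illustration of the heuristic: the struct, the function
    names, and the nanosecond accounting are hypothetical, and only the
    1 second limit is taken from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* 1 second, the limit named in the commit message. */
#define NMI_BLOCK_TIMEOUT_NS UINT64_C(1000000000)

/* Hypothetical software state replacing the missing hardware tracking. */
struct soft_vnmi {
    bool nmi_blocked;       /* software-based NMI-blocked flag */
    bool irq_window_armed;  /* IRQ window start notification requested */
    uint64_t blocked_ns;    /* guest net execution time spent blocked */
};

/* Returns true if the NMI is injected now, false if it has to wait. */
bool try_inject_nmi(struct soft_vnmi *s)
{
    if (s->nmi_blocked) {
        /* An NMI window is requested: arm the IRQ window notification. */
        s->irq_window_armed = true;
        return false;
    }
    s->nmi_blocked = true;
    s->blocked_ns = 0;
    return true;
}

/* The guest exited because an IRQ window opened: assume handler is done. */
void on_irq_window_exit(struct soft_vnmi *s)
{
    s->nmi_blocked = false;
    s->irq_window_armed = false;
    s->blocked_ns = 0;
}

/* Account guest run time; force the flag clear once the limit is passed. */
void account_guest_time(struct soft_vnmi *s, uint64_t ran_ns)
{
    if (!s->nmi_blocked)
        return;
    s->blocked_ns += ran_ns;
    if (s->blocked_ns > NMI_BLOCK_TIMEOUT_NS) {
        s->nmi_blocked = false;
        s->blocked_ns = 0;
    }
}
```

    The timeout path is where the rare NMI nesting mentioned above can come
    from: the flag is cleared on a guess, not on proof that the handler
    returned.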
     
  • This patch adds the required bits to the VMX side for user space
    injected NMIs. As with the preexisting in-kernel irqchip support, the
    CPU must provide the "virtual NMI" feature for proper tracking of the
    NMI blocking state.

    Based on the original patch by Sheng Yang.

    Signed-off-by: Jan Kiszka
    Signed-off-by: Sheng Yang
    Signed-off-by: Avi Kivity

    Jan Kiszka
     
  • Introduces the KVM_NMI IOCTL to the generic x86 part of KVM for
    injecting NMIs from user space and also extends the statistic report
    accordingly.

    Based on the original patch by Sheng Yang.

    Signed-off-by: Jan Kiszka
    Signed-off-by: Sheng Yang
    Signed-off-by: Avi Kivity

    Jan Kiszka
     
  • Kick the NMI receiving VCPU in case the triggering caller runs in a
    different context.

    Signed-off-by: Jan Kiszka
    Signed-off-by: Avi Kivity

    Jan Kiszka
     
  • Ensure that a VCPU with pending NMIs is considered runnable.

    Signed-off-by: Jan Kiszka
    Signed-off-by: Avi Kivity

    Jan Kiszka
     
  • LINT0 of the LAPIC can be used to route PIT events as NMI watchdog ticks
    into the guest. This patch aligns the in-kernel irqchip emulation with
    the user space irqchip, which already supports this feature. The trick is
    to route PIT interrupts to all LAPIC's LVT0 lines.

    Rebased and slightly polished patch originally posted by Sheng Yang.

    Signed-off-by: Jan Kiszka
    Signed-off-by: Sheng Yang
    Signed-off-by: Avi Kivity

    Jan Kiszka
     
  • Fix NMI injection in real-mode with the same pattern we perform IRQ
    injection.

    Signed-off-by: Jan Kiszka
    Signed-off-by: Avi Kivity

    Jan Kiszka
     
  • do_interrupt_requests and vmx_intr_assist go different ways to
    achieve the same thing: enabling the nmi/irq window start notification.
    Unify their code over enable_{irq|nmi}_window, get rid of a redundant
    call to enable_intr_window instead of direct enable_nmi_window
    invocation and unroll enable_intr_window for both in-kernel and user
    space irq injection accordingly.

    Signed-off-by: Jan Kiszka
    Signed-off-by: Avi Kivity

    Jan Kiszka
     
  • There are currently two ways in VMX to check if an IRQ or NMI can be
    injected:
    - vmx_{nmi|irq}_enabled and
    - vcpu.arch.{nmi|interrupt}_window_open.
    Even worse, one test (at the end of vmx_vcpu_run) uses an inconsistent,
    likely incorrect logic.

    This patch consolidates and unifies the tests over
    {nmi|interrupt}_window_open as cache + vmx_update_window_states
    for updating the cache content.

    Signed-off-by: Jan Kiszka
    Signed-off-by: Avi Kivity

    Jan Kiszka
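
    One way to picture the "cache plus update function" arrangement is shown
    below. It is a sketch under assumptions: the struct and function names
    are made up, and the conditions are a plausible reading of the VMX
    interruptibility state (bit 0 blocking by STI, bit 1 blocking by MOV SS,
    bit 3 blocking by NMI), not a copy of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Interruptibility-state bits, assumed to follow the VMX layout. */
#define INTR_STATE_STI    (1u << 0)
#define INTR_STATE_MOV_SS (1u << 1)
#define INTR_STATE_NMI    (1u << 3)

/* Hypothetical cache of the two "may I inject?" answers. */
struct window_cache {
    bool nmi_window_open;
    bool interrupt_window_open;
};

/* Single place where both cached answers are recomputed. */
void update_window_states(struct window_cache *c, uint32_t intr_state,
                          bool rflags_if)
{
    bool shadow = intr_state & (INTR_STATE_STI | INTR_STATE_MOV_SS);

    c->nmi_window_open = !shadow && !(intr_state & INTR_STATE_NMI);
    c->interrupt_window_open = rflags_if && !shadow;
}
```

    Every injection path then reads the cached flags, so there is only one
    definition of "window open" to get right.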
     
  • CPU reset invalidates pending or already injected NMIs, therefore reset
    the related state variables.

    Based on original patch by Gleb Natapov.

    Signed-off-by: Gleb Natapov
    Signed-off-by: Jan Kiszka
    Signed-off-by: Avi Kivity

    Jan Kiszka
     
  • Properly set GUEST_INTR_STATE_NMI and reset nmi_injected when a
    task-switch vmexit happened due to a task gate being used for handling
    NMIs. Also avoid the false warning about valid vectoring info in
    kvm_handle_exit.

    Based on original patch by Gleb Natapov.

    Signed-off-by: Gleb Natapov
    Signed-off-by: Jan Kiszka
    Signed-off-by: Avi Kivity

    Jan Kiszka
     
  • Signed-off-by: Jan Kiszka
    Signed-off-by: Avi Kivity

    Jan Kiszka
     
  • irq_window_exits only tracks IRQ window exits due to user space
    requests, whereas nmi_window_exits includes all exits. The latter makes
    more sense, so let's adjust irq_window_exits accounting.

    Signed-off-by: Jan Kiszka
    Signed-off-by: Avi Kivity

    Jan Kiszka
     
  • This patch consolidates the emulation of the push reg instruction.

    Signed-off-by: Guillaume Thouvenin
    Signed-off-by: Laurent Vivier
    Signed-off-by: Avi Kivity

    Guillaume Thouvenin
     
  • * 'for-linus' of git://oss.sgi.com/xfs/xfs: (184 commits)
    [XFS] Fix race in xfs_write() between direct and buffered I/O with DMAPI
    [XFS] handle unaligned data in xfs_bmbt_disk_get_all
    [XFS] avoid memory allocations in xfs_fs_vcmn_err
    [XFS] Fix speculative allocation beyond eof
    [XFS] Remove XFS_BUF_SHUT() and friends
    [XFS] Use the incore inode size in xfs_file_readdir()
    [XFS] set b_error from bio error in xfs_buf_bio_end_io
    [XFS] use inode_change_ok for setattr permission checking
    [XFS] add a FMODE flag to make XFS invisible I/O less hacky
    [XFS] resync headers with libxfs
    [XFS] simplify projid check in xfs_rename
    [XFS] replace b_fspriv with b_mount
    [XFS] Remove unused tracing code
    [XFS] Remove unnecessary assertion
    [XFS] Remove unused variable in ktrace_free()
    [XFS] Check return value of xfs_buf_get_noaddr()
    [XFS] Fix hang after disallowed rename across directory quota domains
    [XFS] Fix compile with CONFIG_COMPAT enabled
    move inode tracing out of xfs_vnode.
    move vn_iowait / vn_iowake into xfs_aops.c
    ...

    Linus Torvalds
     
  • * git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (70 commits)
    fs/nfs/nfs4proc.c: make nfs4_map_errors() static
    rpc: add service field to new upcall
    rpc: add target field to new upcall
    nfsd: support callbacks with gss flavors
    rpc: allow gss callbacks to client
    rpc: pass target name down to rpc level on callbacks
    nfsd: pass client principal name in rsc downcall
    rpc: implement new upcall
    rpc: store pointer to pipe inode in gss upcall message
    rpc: use count of pipe openers to wait for first open
    rpc: track number of users of the gss upcall pipe
    rpc: call release_pipe only on last close
    rpc: add an rpc_pipe_open method
    rpc: minor gss_alloc_msg cleanup
    rpc: factor out warning code from gss_pipe_destroy_msg
    rpc: remove unnecessary assignment
    NFS: remove unused status from encode routines
    NFS: increment number of operations in each encode routine
    NFS: fix comment placement in nfs4xdr.c
    NFS: fix tabs in nfs4xdr.c
    ...

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
    IB/mlx4: Fix reading SL field out of cqe->sl_vid
    RDMA/addr: Fix build breakage when IPv6 is disabled

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (104 commits)
    [SCSI] fcoe: fix configuration problems
    [SCSI] cxgb3i: fix select/depend problem
    [SCSI] fcoe: fix incorrect use of struct module
    [SCSI] cxgb3i: remove use of skb->sp
    [SCSI] cxgb3i: Add cxgb3i iSCSI driver.
    [SCSI] zfcp: Remove unnecessary warning message
    [SCSI] zfcp: Add support for unchained FSF requests
    [SCSI] zfcp: Remove busid macro
    [SCSI] zfcp: remove DID_DID flag
    [SCSI] zfcp: Simplify mask lookups for incoming RSCNs
    [SCSI] zfcp: Remove initial device data from zfcp_data
    [SCSI] zfcp: fix compile warning
    [SCSI] zfcp: Remove adapter list
    [SCSI] zfcp: Simplify SBAL allocation to fix sparse warnings
    [SCSI] zfcp: register with SCSI layer on ccw registration
    [SCSI] zfcp: Fix message line break
    [SCSI] qla2xxx: changes in multiq code
    [SCSI] eata: fix the data buffer accessors conversion regression
    [SCSI] ibmvfc: Improve async event handling
    [SCSI] lpfc : correct printk types on PPC compiles
    ...

    Linus Torvalds
     
  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6: (583 commits)
    V4L/DVB (10130): use USB API functions rather than constants
    V4L/DVB (10129): dvb: remove deprecated use of RW_LOCK_UNLOCKED in frontends
    V4L/DVB (10128): modify V4L documentation to be a valid XHTML
    V4L/DVB (10127): stv06xx: Avoid having y unitialized
    V4L/DVB (10125): em28xx: Don't do AC97 vendor detection for i2s audio devices
    V4L/DVB (10124): em28xx: expand output formats available
    V4L/DVB (10123): em28xx: fix reversed definitions of I2S audio modes
    V4L/DVB (10122): em28xx: don't load em28xx-alsa for em2870 based devices
    V4L/DVB (10121): em28xx: remove worthless Pinnacle PCTV HD Mini 80e device profile
    V4L/DVB (10120): em28xx: remove redundant Pinnacle Dazzle DVC 100 profile
    V4L/DVB (10119): em28xx: fix corrupted XCLK value
    V4L/DVB (10118): zoran: fix warning for a variable not used
    V4L/DVB (10116): af9013: Fix gcc false warnings
    V4L/DVB (10111a): usbvideo.h: remove an useless blank line
    V4L/DVB (10111): quickcam_messenger.c: fix a warning
    V4L/DVB (10110): v4l2-ioctl: Fix warnings when using .unlocked_ioctl = __video_ioctl2
    V4L/DVB (10109): anysee: Fix usage of an unitialized function
    V4L/DVB (10104): uvcvideo: Add support for video output devices
    V4L/DVB (10102): uvcvideo: Ignore interrupt endpoint for built-in iSight webcams.
    V4L/DVB (10101): uvcvideo: Fix bulk URB processing when the header is erroneous
    ...

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
    net: Fix percpu counters deadlock
    cpumask: prepare for iterators to only go to nr_cpu_ids/nr_cpumask_bits: net
    drivers/net/usb: use USB API functions rather than constants
    cls_cgroup: clean up Kconfig
    cls_cgroup: clean up for cgroup part
    cls_cgroup: fix an oops when removing a cgroup
    EtherExpress16: fix printing timed out status
    mlx4_en: Added "set_ringparam" Ethtool interface implementation
    mlx4_en: Always allocate RX ring for each interrupt vector
    mlx4_en: Verify number of RX rings doesn't exceed MAX_RX_RINGS
    IPVS: Make "no destination available" message more consistent between schedulers
    net: KS8695: removed duplicated #include
    tun: Fix SIOCSIFHWADDR error.
    smsc911x: compile fix re netif_rx signature changes
    netns: foreach_netdev_safe is insufficient in default_device_exit
    net: make xfrm_statistics_seq_show use generic snmp_fold_field
    net: Fix more NAPI interface netdev argument drop fallout.
    net: Fix unused variable warnings in pasemi_mac.c and spider_net.c

    Linus Torvalds