26 Sep, 2011

1 commit

  • Currently the method of dealing with an IO operation on a bus (PIO/MMIO)
    is to call the read or write callback for each device registered
    on the bus until we find a device which handles it.

    Since the number of devices on a bus can be significant due to ioeventfds
    and coalesced MMIO zones, this leads to a lot of overhead on each IO
    operation.

    Instead of registering devices, we now register ranges which point to
    a device. Lookup is done using an efficient bsearch instead of a linear
    search.
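
    The range lookup described above can be sketched in userspace C with
    the standard qsort() and bsearch(). This is only an illustration:
    struct io_range, bus_sort() and bus_find_dev() are hypothetical names,
    not the kernel's, and the sketch assumes the ranges do not overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a range registered on a bus. */
struct io_range {
    uint64_t addr;   /* first address covered by the range */
    uint64_t len;    /* number of bytes covered */
    int dev_id;      /* illustrative handle for the owning device */
};

/* Sort order: ascending start address. */
static int range_sort_cmp(const void *a, const void *b)
{
    const struct io_range *ra = a, *rb = b;

    if (ra->addr < rb->addr)
        return -1;
    if (ra->addr > rb->addr)
        return 1;
    return 0;
}

/* bsearch() comparator: the key is an address, the element is a range. */
static int range_lookup_cmp(const void *key, const void *elem)
{
    uint64_t addr = *(const uint64_t *)key;
    const struct io_range *r = elem;

    if (addr < r->addr)
        return -1;
    if (addr >= r->addr + r->len)
        return 1;
    return 0;   /* addr falls inside this range */
}

/* Keep the table sorted so that bsearch() can be used on it. */
static void bus_sort(struct io_range *ranges, size_t n)
{
    qsort(ranges, n, sizeof(*ranges), range_sort_cmp);
}

/* Return the device id owning addr, or -1 if no range covers it. */
static int bus_find_dev(const struct io_range *ranges, size_t n, uint64_t addr)
{
    const struct io_range *r =
        bsearch(&addr, ranges, n, sizeof(*ranges), range_lookup_cmp);

    return r ? r->dev_id : -1;
}
```

    With N registered ranges a lookup costs O(log N) comparisons rather
    than up to N callback invocations.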

    A performance test compared the exit count per second with 200
    ioeventfds created on one byte while the guest continuously accessed a
    different byte (triggering usermode exits).
    Before the patch the guest achieved 259k exits per second; after the
    patch the guest achieves 274k exits per second.

    Cc: Avi Kivity
    Cc: Marcelo Tosatti
    Signed-off-by: Sasha Levin
    Signed-off-by: Avi Kivity

    Sasha Levin
     

08 Apr, 2011

1 commit


06 Apr, 2011

1 commit

  • irqfd in kvm used flush_work incorrectly: it assumed that work scheduled
    previously can't run after flush_work, but since kvm uses a non-reentrant
    workqueue (by means of schedule_work) we need flush_work_sync to get that
    guarantee.

    Signed-off-by: Michael S. Tsirkin
    Reported-by: Jean-Philippe Menil
    Tested-by: Jean-Philippe Menil
    Signed-off-by: Avi Kivity

    Michael S. Tsirkin
     

31 Mar, 2011

1 commit


18 Mar, 2011

1 commit


12 Jan, 2011

1 commit

  • Store irq routing table pointer in the irqfd object,
    and use that to inject MSI directly without bouncing out to
    a kernel thread.

    While we touch this structure, rearrange irqfd fields to make fastpath
    better packed for better cache utilization.

    This also adds some comments about locking rules and rcu usage in code.

    Some notes on the design:
    - Use pointer into the rt instead of copying an entry,
    to make it possible to use rcu, thus side-stepping
    locking complexities. We also save some memory this way.
    - Old workqueue code is still used for level irqs.
    I don't think we DTRT with level anyway, however,
    it seems easier to keep the code around as
    it has been thought through and debugged, and fix level later than
    rip out and re-instate it later.

    Signed-off-by: Michael S. Tsirkin
    Acked-by: Marcelo Tosatti
    Acked-by: Gregory Haskins
    Signed-off-by: Avi Kivity

    Michael S. Tsirkin
     

23 Sep, 2010

1 commit

  • I think I see the following (theoretical) race:

    During irqfd assign, we drop irqfds lock before we
    schedule inject work. Therefore, deassign running
    on another CPU could cause shutdown and flush to run
    before inject, causing use after free in inject.

    A simple fix is to schedule inject under the lock.

    Signed-off-by: Michael S. Tsirkin
    Acked-by: Gregory Haskins
    Signed-off-by: Marcelo Tosatti

    Michael S. Tsirkin
     

01 Aug, 2010

1 commit


30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include
    those headers directly instead of assuming availability. As this
    conversion needs to touch a large number of source files, the following
    script is used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.
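
    The first rule above (gfp.h if only gfp is used, slab.h if slab is
    used) can be sketched as below. This is not the logic of
    slabh-sweep.py: the token list and the names classify_source() and
    enum needed_include are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Which header a source file would need under this simplified rule. */
enum needed_include { NEED_NONE, NEED_GFP_H, NEED_SLAB_H };

/*
 * Crude substring scan. slab.h takes priority because it pulls in gfp.h.
 * The tokens below are an assumed, incomplete sample of slab users.
 */
static enum needed_include classify_source(const char *src)
{
    static const char *const slab_tokens[] = {
        "kmalloc(", "kzalloc(", "kfree(", "kmem_cache_",
    };
    size_t i;

    for (i = 0; i < sizeof(slab_tokens) / sizeof(slab_tokens[0]); i++)
        if (strstr(src, slab_tokens[i]))
            return NEED_SLAB_H;

    if (strstr(src, "GFP_"))
        return NEED_GFP_H;

    return NEED_NONE;
}
```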

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

01 Mar, 2010

3 commits


25 Jan, 2010

2 commits

  • kvm didn't clear the irqfd counter on deassign; as a result we could get
    a spurious interrupt when the irqfd is assigned back. This leads to poor
    performance and, in theory, guest crash.
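
    A userspace analogy of the problem, using the real eventfd(2) interface
    on Linux: events signalled before a deassign stay in the counter unless
    it is drained, so a later consumer sees them immediately. The helper
    names are hypothetical, and the kernel fix itself works on the eventfd
    context inside KVM rather than through read(2).

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Add n to the eventfd counter. Returns 0 on success, -1 on failure. */
static int signal_events(int fd, uint64_t n)
{
    return write(fd, &n, sizeof(n)) == (ssize_t)sizeof(n) ? 0 : -1;
}

/* Nonzero if the counter is non-zero, i.e. a consumer would fire now. */
static int has_pending(int fd)
{
    struct pollfd p = { .fd = fd, .events = POLLIN };

    return poll(&p, 1, 0) == 1 && (p.revents & POLLIN);
}

/*
 * Drain the counter, as one would on "deassign". A successful read on a
 * non-semaphore eventfd returns the counter and resets it to zero.
 * Returns the number of stale events discarded (0 if there were none).
 */
static uint64_t drain_counter(int fd)
{
    uint64_t cnt = 0;

    if (read(fd, &cnt, sizeof(cnt)) != (ssize_t)sizeof(cnt))
        return 0;   /* nothing pending on a non-blocking fd */
    return cnt;
}
```

    The fd is assumed to have been created with EFD_NONBLOCK so that
    draining an empty counter does not block.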

    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: Avi Kivity

    Michael S. Tsirkin
     
  • Looks like repeatedly binding the same fd to multiple gsi's with irqfd
    can use up a ton of kernel memory for irqfd structures.

    A simple fix is to allow each fd to only trigger one gsi: triggering a
    storm of interrupts in guest is likely useless anyway, and we can do it
    by binding a single gsi to many interrupts if we really want to.
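
    The rule "one fd triggers at most one gsi" can be sketched as a check
    at assign time. The table layout, the MAX_IRQFDS limit and the error
    codes chosen here are assumptions for illustration, not the kernel
    implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_IRQFDS 16   /* arbitrary limit for this sketch */

struct irqfd_entry {
    int fd;    /* eventfd file descriptor */
    int gsi;   /* interrupt it triggers */
};

struct irqfd_table {
    struct irqfd_entry items[MAX_IRQFDS];
    size_t count;
};

/*
 * Bind fd to gsi. An fd that is already bound is rejected, so repeated
 * binds of the same fd cannot grow the table without bound.
 */
static int irqfd_assign(struct irqfd_table *t, int fd, int gsi)
{
    size_t i;

    for (i = 0; i < t->count; i++)
        if (t->items[i].fd == fd)
            return -EBUSY;   /* fd already triggers some gsi */

    if (t->count == MAX_IRQFDS)
        return -ENOMEM;

    t->items[t->count].fd = fd;
    t->items[t->count].gsi = gsi;
    t->count++;
    return 0;
}
```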

    Cc: stable@kernel.org
    Signed-off-by: Michael S. Tsirkin
    Acked-by: Gregory Haskins
    Signed-off-by: Avi Kivity

    Michael S. Tsirkin
     

03 Dec, 2009

1 commit


10 Sep, 2009

4 commits

  • This code is not executed before file has been initialized to the result of
    calling eventfd_fget. This function returns an ERR_PTR value in an error
    case instead of NULL. Thus the test that file is not NULL is always true.

    A simplified version of the semantic match that finds this problem is as
    follows: (http://coccinelle.lip6.fr/)

    //
    @match exists@
    expression x, E;
    statement S1, S2;
    @@

    x = eventfd_fget(...)
    ... when != x = E
    (
    * if (x == NULL || ...) S1 else S2
    |
    * if (x == NULL && ...) S1 else S2
    )
    //
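
    Why the NULL test is always true can be seen from a userspace imitation
    of the kernel's ERR_PTR convention. The helpers below only mimic the
    kernel macros of the same names, and fake_fget() is a hypothetical
    stand-in for eventfd_fget(); treat this as a sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static inline void *ERR_PTR(long error)
{
    return (void *)(intptr_t)error;
}

/* Recover the errno value from an error pointer. */
static inline long PTR_ERR(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* Error pointers live in the top MAX_ERRNO addresses. */
static inline int IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static int dummy_file;

/* Hypothetical lookup: returns an error pointer, never NULL, on failure. */
static void *fake_fget(int fd)
{
    if (fd < 0)
        return ERR_PTR(-EBADF);
    return &dummy_file;
}

/* Correct caller: test with IS_ERR(), not against NULL. */
static long open_checked(int fd)
{
    void *file = fake_fget(fd);

    if (IS_ERR(file))
        return PTR_ERR(file);
    return 0;
}
```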

    Signed-off-by: Julia Lawall
    Signed-off-by: Avi Kivity

    Julia Lawall
     
  • ioeventfd is a mechanism to register PIO/MMIO regions to trigger an eventfd
    signal when written to by a guest. Host userspace can register any
    arbitrary IO address with a corresponding eventfd and then pass the eventfd
    to a specific end-point of interest for handling.

    Normal IO requires a blocking round-trip since the operation may cause
    side-effects in the emulated model or may return data to the caller.
    Therefore, an IO in KVM traps from the guest to the host, causes a VMX/SVM
    "heavy-weight" exit back to userspace, and is ultimately serviced by qemu's
    device model synchronously before returning control back to the vcpu.

    However, there is a subclass of IO which acts purely as a trigger for
    other IO (such as to kick off an out-of-band DMA request, etc). For these
    patterns, the synchronous call is particularly expensive since we really
    only want to get our notification transmitted asynchronously and
    return as quickly as possible. All the synchronous infrastructure that
    ensures proper data-dependencies are met in the normal IO case is just
    unnecessary overhead for signalling. This adds additional computational
    load on the system, as well as latency to the signalling path.

    Therefore, we provide a mechanism for registration of an in-kernel trigger
    point that allows the VCPU to only require a very brief, lightweight
    exit just long enough to signal an eventfd. This also means that any
    clients compatible with the eventfd interface (which includes userspace
    and kernelspace equally well) can now register to be notified. The end
    result should be a more flexible and higher performance notification API
    for the backend KVM hypervisor and peripheral components.
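
    The trigger-point idea can be sketched in userspace with a real
    eventfd(2): a guest write that matches a registered address and length
    just bumps the eventfd counter, anything else falls back to the slow
    path. struct ioeventfd_reg and io_write() are hypothetical names and
    this is not the in-kernel code.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical registration record: one IO address tied to one eventfd. */
struct ioeventfd_reg {
    uint64_t addr;   /* guest PIO/MMIO address to match */
    uint32_t len;    /* access width in bytes */
    int fd;          /* eventfd to signal on a matching write */
};

/*
 * Returns 1 if the write was consumed by signalling the eventfd (the
 * lightweight path), 0 if it does not match and would need the normal
 * synchronous exit to userspace.
 */
static int io_write(const struct ioeventfd_reg *r, uint64_t addr, uint32_t len)
{
    uint64_t one = 1;

    if (addr != r->addr || len != r->len)
        return 0;

    return write(r->fd, &one, sizeof(one)) == (ssize_t)sizeof(one);
}
```

    Any consumer that can wait on the eventfd then sees the accumulated
    count with a single read.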

    To test this theory, we built a test-harness called "doorbell". This
    module has a function called "doorbell_ring()" which simply increments a
    counter for each time the doorbell is signaled. It supports signalling
    from either an eventfd, or an ioctl().

    We then wired up two paths to the doorbell: One goes through QEMU, via a
    registered io region and the doorbell ioctl(). The other is direct via
    ioeventfd.

    You can download this test harness here:

    ftp://ftp.novell.com/dev/ghaskins/doorbell.tar.bz2

    The measured results are as follows:

    qemu-mmio: 110000 iops, 9.09us rtt
    ioeventfd-mmio: 200100 iops, 5.00us rtt
    ioeventfd-pio: 367300 iops, 2.72us rtt

    I didn't measure qemu-pio, because I have to figure out how to register a
    PIO region with qemu's device model, and I got lazy. However, for now we
    can extrapolate from the NULLIO runs, which showed +2.56us for MMIO and
    -350ns for HC, to get:

    qemu-pio: 153139 iops, 6.53us rtt
    ioeventfd-hc: 412585 iops, 2.37us rtt

    These are just for fun, for now, until I can gather more data.
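
    The qemu-pio extrapolation can be reproduced with the arithmetic below,
    assuming that iops is simply the reciprocal of the round-trip time and
    that the MMIO penalty is subtracted from the measured qemu-mmio rtt.
    The function names are made up for this sketch.

```c
#include <assert.h>

/* Operations per second for a given round-trip time in microseconds,
 * rounded to the nearest integer. */
static long iops_from_rtt_us(double rtt_us)
{
    return (long)(1e6 / rtt_us + 0.5);
}

/* Estimated rtt once a known per-operation penalty is removed. */
static double rtt_without_penalty_us(double measured_us, double penalty_us)
{
    return measured_us - penalty_us;
}
```

    For example, rtt_without_penalty_us(9.09, 2.56) is about 6.53us, and
    iops_from_rtt_us(6.53) gives 153139.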

    Here is a graph for your convenience:

    http://developer.novell.com/wiki/images/7/76/Iofd-chart.png

    The conclusion to draw is that we save about 4us by skipping the userspace
    hop.

    --------------------

    Signed-off-by: Gregory Haskins
    Acked-by: Michael S. Tsirkin
    Signed-off-by: Avi Kivity

    Gregory Haskins
     
  • Protect irq injection/acking data structures with a separate irq_lock
    mutex. This fixes the following deadlock:

    CPU A                                 CPU B
    kvm_vm_ioctl_deassign_dev_irq()
      mutex_lock(&kvm->lock);             worker_thread()
      -> kvm_deassign_irq()                 -> kvm_assigned_dev_interrupt_work_handler()
        -> deassign_host_irq()                mutex_lock(&kvm->lock);
          -> cancel_work_sync() [blocked]

    [gleb: fix ia64 path]

    Reported-by: Alex Williamson
    Signed-off-by: Marcelo Tosatti
    Signed-off-by: Gleb Natapov
    Signed-off-by: Avi Kivity

    Marcelo Tosatti
     
  • KVM provides a complete virtual system environment for guests, including
    support for injecting interrupts modeled after the real exception/interrupt
    facilities present on the native platform (such as the IDT on x86).
    Virtual interrupts can come from a variety of sources (emulated devices,
    pass-through devices, etc) but all must be injected to the guest via
    the KVM infrastructure. This patch adds a new mechanism to inject a specific
    interrupt to a guest using a decoupled eventfd mechanism: Any legal signal
    on the irqfd (using eventfd semantics from either userspace or kernel) will
    translate into an injected interrupt in the guest at the next available
    interrupt window.

    Signed-off-by: Gregory Haskins
    Signed-off-by: Avi Kivity

    Gregory Haskins