27 Aug, 2010

1 commit

  • These days the headers we use are in glibc. If those are too old, you can
    add the -I lines to get the kernel headers.

    In file included from ../../include/linux/if_tun.h:19,
    from lguest.c:33:
    ../../include/linux/types.h:13:2: warning: #warning "Attempt to use kernel headers from user space, see http://kernelnewbies.org/KernelHeaders"
    lguest.c: In function ‘setup_tun_net’:
    lguest.c:1456: warning: dereferencing pointer ‘sin’ does break strict-aliasing rules
    lguest.c:1457: warning: dereferencing pointer ‘sin’ does break strict-aliasing rules
    lguest.c:1450: note: initialized from here

    Signed-off-by: Rusty Russell

    Rusty Russell
     

23 Apr, 2010

1 commit


24 Feb, 2010

1 commit


04 Dec, 2009

1 commit

  • That is "success", "unknown", "through", "performance", "[re|un]mapping"
    , "access", "default", "reasonable", "[con]currently", "temperature"
    , "channel", "[un]used", "application", "example","hierarchy", "therefore"
    , "[over|under]flow", "contiguous", "threshold", "enough" and others.

    Signed-off-by: André Goddard Rosa
    Signed-off-by: Jiri Kosina

    André Goddard Rosa
     

22 Oct, 2009

1 commit

  • Rusty,

    commit 3ca4f5ca73057a617f9444a91022d7127041970a
    virtio: add virtio IDs file
    moved all device IDs into a single file. While the change itself is
    a very good one, it can break userspace applications. For example
    if a userspace tool wanted to get the ID of virtio_net it used to
    include virtio_net.h. This no longer works, since virtio_net.h
    does not include virtio_ids.h.
    This patch moves all "#include " from the C
    files into the header files, making the header files compatible with
    the old ones.

    In addition, this patch exports virtio_ids.h to userspace.

    CC: Fernando Luis Vazquez Cao
    Signed-off-by: Christian Borntraeger
    Signed-off-by: Rusty Russell

    Christian Borntraeger
     

23 Sep, 2009

2 commits


30 Jul, 2009

4 commits

  • I've been doing this for years, and akpm picked me up on it about 12
    months ago. lguest partly serves as example code, so let's do it Right.

    Also, remove two unused fields in struct vblk_info in the example launcher.

    Signed-off-by: Rusty Russell
    Cc: Ingo Molnar

    Rusty Russell
     
  • Every so often, after code shuffles, I need to go through and unbitrot
    the Lguest Journey (see drivers/lguest/README). Since we now use RCU in
    a simple form in one place I took the opportunity to expand that explanation.

    Signed-off-by: Rusty Russell
    Cc: Ingo Molnar
    Cc: Paul McKenney

    Rusty Russell
     
  • I don't really notice it (except to begrudge the extra vertical
    space), but Ingo does. And he pointed out that since one excuse for
    lguest is as a teaching tool, it should set a good example.

    Signed-off-by: Rusty Russell
    Cc: Ingo Molnar

    Rusty Russell
     
  • 1d589bb16b825b3a7b4edd34d997f1f1f953033d "Add serial number support
    for virtio_blk, V4a" extended 'struct virtio_blk_config' to 536 bytes.
    Lguest and S/390 both use an 8 bit value for the feature length, and
    this change broke them (if the code is naive).

    Signed-off-by: Rusty Russell
    Cc: John Cooper
    Cc: Christian Borntraeger

    Rusty Russell
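
The truncation described above can be illustrated with a small sketch. The helper names below are hypothetical and are not taken from lguest; only the 536-byte size and the 8-bit field width come from the commit message.

```c
#include <stdint.h>
#include <stddef.h>

/* What a naive assignment does: only the low 8 bits of the length survive. */
static uint8_t naive_config_len(size_t len)
{
	return (uint8_t)len;
}

/* Hypothetical checked variant: refuse lengths that do not fit in 8 bits.
 * Returns 0 on success, -1 if the value would be truncated. */
static int set_config_len(uint8_t *field, size_t len)
{
	if (len > UINT8_MAX)
		return -1;
	*field = (uint8_t)len;
	return 0;
}
```

With a 536-byte config, the naive version stores 536 % 256 = 24, which is how a length field can silently shrink.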
     

12 Jun, 2009

12 commits

  • Support the VIRTIO_RING_F_INDIRECT_DESC feature.

    This is a simple matter of changing the descriptor walking
    code to operate on a struct vring_desc* and supplying it
    with an indirect table if detected.

    Signed-off-by: Mark McLoughlin
    Signed-off-by: Rusty Russell

    Mark McLoughlin
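
A minimal sketch of the idea, assuming a simplified descriptor type (a plain pointer instead of a guest-physical address) and a hypothetical chain_bytes() helper. It is not the lguest code: it only shows a walker that takes a descriptor table pointer and switches to the indirect table when the head descriptor carries the indirect flag.

```c
#include <stdint.h>
#include <stddef.h>

/* Flag values as used by the virtio ring descriptor layout. */
#define DESC_F_NEXT     1
#define DESC_F_INDIRECT 4

/* Simplified stand-in for struct vring_desc (assumption: addr is a host
 * pointer here, whereas the real structure holds a 64-bit guest address). */
struct desc {
	void *addr;
	uint32_t len;
	uint16_t flags;
	uint16_t next;
};

/* Sum the buffer lengths of the chain starting at 'head'.
 * 'max' is the number of entries in 'table'. */
static size_t chain_bytes(const struct desc *table, unsigned int max,
			  unsigned int head)
{
	size_t total = 0;
	unsigned int i = head, seen = 0;

	if (i >= max)
		return 0;

	/* An indirect head points at a separate table: walk that instead. */
	if (table[i].flags & DESC_F_INDIRECT) {
		max = table[i].len / sizeof(struct desc);
		table = table[i].addr;
		i = 0;
	}

	/* 'seen' bounds the walk so a looping chain cannot spin forever. */
	while (seen++ < max) {
		total += table[i].len;
		if (!(table[i].flags & DESC_F_NEXT))
			break;
		i = table[i].next;
		if (i >= max)
			break;
	}
	return total;
}
```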
     
  • The Guest only really needs to tell us about activity when we're going
    to listen to the eventfd: normally, we don't want to know.

    So if there are no available buffers, turn on notifications, re-check,
    then wait for the Guest to notify us via the eventfd, then turn
    notifications off again.

    There's enough else going on that the differences are in the noise.

    Before:                   Secs  RxKicks  TxKicks
    1G TCP Guest->Host:       3.94     4686    32815
    1M normal pings:           104   142862  1000010
    1M 1k pings (-l 120):       57   142026  1000007

    After:
    1G TCP Guest->Host:       3.76     4691    32811
    1M normal pings:           111   142859   997467
    1M 1k pings (-l 120):       55    19648   501549

    Signed-off-by: Rusty Russell

    Rusty Russell
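
The enable, re-check, wait, disable sequence described above might look roughly like the following. This is a sketch under stated assumptions: struct vq_sketch and wait_for_buffer() are made-up names, the ring is reduced to two indices and a flags word, and the barrier is the GCC __sync_synchronize() builtin rather than whatever lguest uses.

```c
#include <stdint.h>
#include <stdbool.h>
#include <unistd.h>
#include <sys/eventfd.h>

#define USED_F_NO_NOTIFY 1	/* same value as VRING_USED_F_NO_NOTIFY */

/* Hypothetical, simplified view of one virtqueue as the Host sees it. */
struct vq_sketch {
	volatile uint16_t avail_idx;	/* advanced by the Guest */
	uint16_t last_avail;		/* next entry the Host will consume */
	volatile uint16_t used_flags;	/* written by the Host, read by the Guest */
	int eventfd;			/* signalled when the Guest notifies */
};

static bool has_work(const struct vq_sketch *vq)
{
	return vq->last_avail != vq->avail_idx;
}

/* Returns true once a buffer is available, false if the eventfd read fails. */
static bool wait_for_buffer(struct vq_sketch *vq)
{
	bool ok = true;

	while (!has_work(vq)) {
		uint64_t count;

		/* Ask the Guest to notify us, then make sure it can see that. */
		vq->used_flags &= ~USED_F_NO_NOTIFY;
		__sync_synchronize();

		/* Re-check: the Guest may have added a buffer in the meantime. */
		if (has_work(vq))
			break;

		/* Sleep until the Guest pokes the eventfd. */
		if (read(vq->eventfd, &count, sizeof(count)) != sizeof(count)) {
			ok = false;
			break;
		}
	}

	/* We are awake and processing: no notifications needed for now. */
	vq->used_flags |= USED_F_NO_NOTIFY;
	return ok;
}
```

The re-check after the barrier is what closes the window in which the Guest adds a buffer just before notifications are switched back on.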
     
  • Rather than triggering an interrupt every time, we only trigger an
    interrupt when there are no more incoming packets (or the recv queue
    is full).

    However, the overhead of doing the select to figure this out is
    measurable: 1M pings goes from 98 to 104 seconds, and 1G Guest->Host
    TCP goes from 3.69 to 3.94 seconds. It's close to the noise though.

    I tested various timeouts, including reducing it as the number of
    pending packets increased, timing a 1 gigabyte TCP send from Guest ->
    Host and Host -> Guest (GSO disabled, to increase packet rate).

    // time tcpblast -o -s 65536 -c 16k 192.168.2.1:9999 > /dev/null

    Timeout     Guest->Host  Pkts/irq  Host->Guest  Pkts/irq
    Before            11.3s       1.0         6.3s       1.0
    0                 11.7s       1.0         6.6s      23.5
    1                 17.1s       8.8         8.6s      26.0
    1/pending         13.4s       1.9         6.6s      23.8
    2/pending         13.6s       2.8         6.6s      24.1
    5/pending         14.1s       5.0         6.6s      24.4

    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • If we track how many buffers we've used, we can tell whether we really
    need to interrupt the Guest. This happens as a side effect of
    spurious notifications.

    Spurious notifications happen because it can take a while before the
    Host thread wakes up and sets the VRING_USED_F_NO_NOTIFY flag, and
    meanwhile the Guest can send more notifications.

    A real fix would be to use wake counts, rather than a suppression
    flag, but the practical difference is generally in the noise: the
    interrupt is usually coalesced into a pending one anyway so we just
    save a system call which isn't clearly measurable.

                              Secs  Spurious IRQS
    1G TCP Guest->Host:       3.93             58
    1M normal pings:           100             72
    1M 1k pings (-l 120):       57         492904

    Signed-off-by: Rusty Russell

    Rusty Russell
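
One way to picture the bookkeeping is a per-queue counter that is bumped when a buffer is returned and consulted before interrupting. The names here are illustrative assumptions, not the fields lguest uses.

```c
#include <stdbool.h>

/* Hypothetical per-virtqueue state: buffers used since the last interrupt. */
struct irq_sketch {
	unsigned int pending_used;
};

/* Call whenever a buffer is handed back to the Guest. */
static void add_used(struct irq_sketch *vq)
{
	vq->pending_used++;
}

/* Returns true if the Guest has something new to look at; a wakeup that
 * produced no used buffers (a spurious notification) yields false. */
static bool need_irq(struct irq_sketch *vq)
{
	if (vq->pending_used == 0)
		return false;
	vq->pending_used = 0;
	return true;
}
```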
     
  • Rather than sending an interrupt on every buffer, we only send an interrupt
    when we're about to wait for the Guest to send us a new one. The console
    input and network input still send interrupts manually, but the block device,
    network and console output queues can simply rely on this logic to send
    interrupts to the Guest at the right time.

    The patch is cluttered by moving trigger_irq() higher in the code.

    In practice, two factors make this optimization less interesting:
    (1) we often only get one input at a time, even for networking,
    (2) triggering an interrupt rapidly tends to get coalesced anyway.

    Before:                       Secs   RxIRQS   TxIRQs
    1G TCP Guest->Host:           3.72    32784    32771
    1M normal pings:                99  1000004   995541
    100,000 1k pings (-l 120):       5    49510    49058

    After:
    1G TCP Guest->Host:           3.69    32809    32769
    1M normal pings:                99  1000004   996196
    100,000 1k pings (-l 120):       5    52435    52361

    (Note the interrupt count on 100k pings goes *up*: see next patch).

    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • Currently lguest has three threads: the main Launcher thread, a Waker
    thread, and a thread for the block device (because synchronous block
    was simply too painful to bear).

    The Waker selects() on all the input file descriptors (eg. stdin, net
    devices, pipe to the block thread) and when one becomes readable it calls
    into the kernel to kick the Launcher thread out into userspace, which
    repeats the poll, services the device(s), and then tells the kernel to
    release the Waker before re-entering the kernel to run the Guest.

    Also, to make a slightly-decent network transmit routine, the Launcher
    would suppress further network interrupts while it set a timer: that
    signal handler would write to a pipe, which would rouse the Waker
    which would prod the Launcher out of the kernel to check the network
    device again.

    Now we can convert all our virtqueues to separate threads: each one has
    a separate eventfd for when the Guest pokes the device, and can trigger
    interrupts in the Guest directly.

    The linecount shows how much this simplifies, but to really bring it
    home, here's an strace analysis of single Guest->Host ping before:

    * Guest sends packet, notifies xmit vq, return control to Launcher
    * Launcher clears notification flag on xmit ring
    * Launcher writes packet to TUN device
    writev(4, [{"\0\0\0\0\0\0\0\0\0\0", 10}, {"\366\r\224`\2058\272m\224vf\274\10\0E\0\0T\0\0@\0@\1\265"..., 98}], 2) = 108
    * Launcher sets up interrupt for Guest (xmit ring is empty)
    write(10, "\2\0\0\0\3\0\0\0", 8) = 0
    * Launcher sets up timer for interrupt mitigation
    setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 505}}, NULL) = 0
    * Launcher re-runs guest
    pread64(10, 0xbfa5f4d4, 4, 0) ...
    * Waker notices reply packet in tun device (it was in select)
    select(12, [0 3 4 6 11], NULL, NULL, NULL) = 1 (in [4])
    * Waker kicks Launcher out of guest:
    pwrite64(10, "\3\0\0\0\1\0\0\0", 8, 0) = 0
    * Launcher returns from running guest:
    ... = -1 EAGAIN (Resource temporarily unavailable)
    * Launcher looks at input fds:
    select(7, [0 3 4 6], NULL, NULL, {0, 0}) = 1 (in [4], left {0, 0})
    * Launcher reads pong from tun device:
    readv(4, [{"\0\0\0\0\0\0\0\0\0\0", 10}, {"\272m\224vf\274\366\r\224`\2058\10\0E\0\0T\364\26\0\0@"..., 1518}], 2) = 108
    * Launcher injects guest notification:
    write(10, "\2\0\0\0\2\0\0\0", 8) = 0
    * Launcher rechecks fds:
    select(7, [0 3 4 6], NULL, NULL, {0, 0}) = 0 (Timeout)
    * Launcher clears Waker:
    pwrite64(10, "\3\0\0\0\0\0\0\0", 8, 0) = 0
    * Launcher reruns Guest:
    pread64(10, 0xbfa5f4d4, 4, 0) = ? ERESTARTSYS (To be restarted)
    * Signal comes in, uses pipe to wake up Launcher:
    --- SIGALRM (Alarm clock) @ 0 (0) ---
    write(8, "\0", 1) = 1
    sigreturn() = ? (mask now [])
    * Waker sees write on pipe:
    select(12, [0 3 4 6 11], NULL, NULL, NULL) = 1 (in [6])
    * Waker kicks Launcher out of Guest:
    pwrite64(10, "\3\0\0\0\1\0\0\0", 8, 0) = 0
    * Launcher exits from kernel:
    pread64(10, 0xbfa5f4d4, 4, 0) = -1 EAGAIN (Resource temporarily unavailable)
    * Launcher looks to see what fd woke it:
    select(7, [0 3 4 6], NULL, NULL, {0, 0}) = 1 (in [6], left {0, 0})
    * Launcher reads timeout fd, sets notification flag on xmit ring
    read(6, "\0", 32) = 1
    * Launcher rechecks fds:
    select(7, [0 3 4 6], NULL, NULL, {0, 0}) = 0 (Timeout)
    * Launcher clears Waker:
    pwrite64(10, "\3\0\0\0\0\0\0\0", 8, 0) = 0
    * Launcher resumes Guest:
    pread64(10, "\0p\0\4", 4, 0) ....

    strace analysis of single Guest->Host ping after:

    * Guest sends packet, notifies xmit vq, creates event on eventfd.
    * Network xmit thread wakes from read on eventfd:
    read(7, "\1\0\0\0\0\0\0\0", 8) = 8
    * Network xmit thread writes packet to TUN device
    writev(4, [{"\0\0\0\0\0\0\0\0\0\0", 10}, {"J\217\232FI\37j\27\375\276\0\304\10\0E\0\0T\0\0@\0@\1\265"..., 98}], 2) = 108
    * Network recv thread wakes up from read on tunfd:
    readv(4, [{"\0\0\0\0\0\0\0\0\0\0", 10}, {"j\27\375\276\0\304J\217\232FI\37\10\0E\0\0TiO\0\0@\1\214"..., 1518}], 2) = 108
    * Network recv thread sets up interrupt for the Guest
    write(6, "\2\0\0\0\2\0\0\0", 8) = 0
    * Network recv thread goes back to reading tunfd
    13:39:42.460285 readv(4,
    * Network xmit thread sets up interrupt for Guest (xmit ring is empty)
    write(6, "\2\0\0\0\3\0\0\0", 8) = 0
    * Network xmit thread goes back to reading from eventfd
    read(7,

    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • I've never seen it here, but I can't find anywhere that says writev
    will write everything.

    Signed-off-by: Rusty Russell

    Rusty Russell
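
A defensive loop around writev() could be sketched as follows. writev_all() is a hypothetical name and this is not the code from the patch; it simply retries after a short write by advancing the iovec array past whatever was already written.

```c
#include <sys/types.h>
#include <sys/uio.h>
#include <errno.h>
#include <stdbool.h>

/* Write every byte described by iov[0..num), retrying on short writes.
 * The iov array is modified as data is consumed. Returns true on success. */
static bool writev_all(int fd, struct iovec *iov, int num)
{
	while (num > 0) {
		ssize_t done = writev(fd, iov, num);

		if (done < 0) {
			if (errno == EINTR)
				continue;
			return false;
		}

		/* Skip over elements that were written completely. */
		while (num > 0 && (size_t)done >= iov->iov_len) {
			done -= (ssize_t)iov->iov_len;
			iov++;
			num--;
		}

		/* Trim the partially written element, if any. */
		if (num > 0) {
			iov->iov_base = (char *)iov->iov_base + done;
			iov->iov_len -= (size_t)done;
		}
	}
	return true;
}
```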
     
  • The "len" field in the used ring for virtio indicates the number of
    bytes *written* to the buffer. This means the guest doesn't have to
    zero the buffers in advance as it always knows the used length.

    Erroneously, the console and network example code puts the length
    *read* into that field. The guest ignores it, but it's wrong.

    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • 20887611523e749d99cc7d64ff6c97d27529fbae (lguest: notify on empty) introduced
    lguest support for the VIRTIO_F_NOTIFY_ON_EMPTY flag, but in fact it turned on
    interrupts all the time.

    Because we always process one buffer at a time, the inflight count is always 0
    when we call trigger_irq and so we always ignore VRING_AVAIL_F_NO_INTERRUPT from
    the Guest.

    It should be looking to see if there are more buffers in the Guest's queue:
    if it's empty, then we force an interrupt.

    This makes little difference, since we usually have an empty queue; but
    that's the subject of another patch.

    Signed-off-by: Rusty Russell

    Rusty Russell
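
The intended decision can be sketched as a pure function. The parameter names and the should_interrupt() helper are assumptions made for illustration; the flag value matches VRING_AVAIL_F_NO_INTERRUPT.

```c
#include <stdint.h>
#include <stdbool.h>

#define AVAIL_F_NO_INTERRUPT 1	/* same value as VRING_AVAIL_F_NO_INTERRUPT */

/* Decide whether to interrupt the Guest after using a buffer.
 * avail_flags/avail_idx come from the Guest's available ring,
 * last_avail is how far the Host has consumed it. */
static bool should_interrupt(uint16_t avail_flags, uint16_t avail_idx,
			     uint16_t last_avail, bool notify_on_empty)
{
	/* The Guest has not asked for suppression: always interrupt. */
	if (!(avail_flags & AVAIL_F_NO_INTERRUPT))
		return true;

	/* Suppressed, unless notify-on-empty was negotiated and the
	 * Guest's queue really has nothing left in it. */
	return notify_on_empty && last_avail == avail_idx;
}
```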
     
  • Since the Launcher process runs the Guest, it doesn't have to be very
    serious about its barriers: the Guest isn't running while we are (Guest
    is UP).

    Before we change to use threads to service devices, we need to fix this.

    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • We hand the /dev/lguest fd everywhere; it's far neater to just make it
    a global (it already is, in fact, hidden in the waker_fds struct).

    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • We can't trust the values in the device descriptor table once the
    guest has booted, so keep local copies. They could set them to
    strange values then cause us to segv (they're 8 bit values, so they
    can't make our pointers go too wild).

    This becomes more important with the following patches which read them.

    Signed-off-by: Rusty Russell

    Rusty Russell
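
A rough sketch of the "copy once, trust the copy" pattern follows. The structure layouts and names are illustrative assumptions, not the real lguest descriptor definitions.

```c
#include <stdint.h>
#include <stddef.h>

/* Simplified stand-in for a device descriptor living in memory that the
 * Guest can overwrite at any time (field names are illustrative). */
struct desc_sketch {
	uint8_t type;
	uint8_t num_vq;
	uint8_t feature_len;
	uint8_t config_len;
};

/* Host-private bookkeeping: the lengths are snapshotted at setup time. */
struct device_sketch {
	volatile struct desc_sketch *desc;
	uint8_t num_vq;
	uint8_t feature_len;
	uint8_t config_len;
};

static void device_setup(struct device_sketch *dev,
			 volatile struct desc_sketch *desc)
{
	dev->desc = desc;
	dev->num_vq = desc->num_vq;
	dev->feature_len = desc->feature_len;
	dev->config_len = desc->config_len;
}

/* Size of the variable-length tail, computed only from the trusted copies
 * (assumed layout for this sketch: two feature bitmaps, then config). */
static size_t device_tail_len(const struct device_sketch *dev)
{
	return 2 * (size_t)dev->feature_len + dev->config_len;
}
```

Whatever the Guest later writes into the shared descriptor, offsets derived from the local copies stay the same.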
     

30 Mar, 2009

1 commit


30 Dec, 2008

2 commits


31 Oct, 2008

1 commit


28 Oct, 2008

1 commit

  • The Documentation/i386 and Documentation/x86_64 directories and their
    contents have been moved into Documentation/x86. Fix references to
    those files accordingly.

    Signed-off-by: Uwe Hermann
    Signed-off-by: Randy Dunlap
    Signed-off-by: Ingo Molnar

    Uwe Hermann
     

25 Aug, 2008

1 commit


12 Aug, 2008

1 commit


29 Jul, 2008

10 commits

  • lguest uses a Waker process to break it out of the kernel (ie.
    actually running the guest) when a file descriptor needs attention.

    Changing this from a process to a thread somewhat simplifies things:
    it can directly access the fd_set of things to watch. More
    importantly, it means that the Waker can see Guest memory correctly,
    so /dev/vring file descriptors will work as anticipated (the
    alternative is to actually mmap MAP_SHARED, but you can't do that with
    /dev/zero).

    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • With big packets, 128 entries is a little small.

    Guest -> Host 1GB TCP:
    Before: 8.43625 seconds xmit 95640 recv 198266 timeout 49771 usec 1252
    After: 8.01099 seconds xmit 49200 recv 102263 timeout 26014 usec 2118

    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • Guest -> Host 1GB TCP:
    Before 20.1974 seconds xmit 214510 recv 5 timeout 214491 usec 278
    After 8.43625 seconds xmit 95640 recv 198266 timeout 49771 usec 1252

    Host -> Guest 1GB TCP:
    Before: Seconds 9.98854 xmit 172166 recv 5344 timeout 172157 usec 251
    After: Seconds 5.72803 xmit 244322 recv 9919 timeout 244302 usec 156

    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • This warning can happen a lot under load, and it should be warnx not
    warn anyway.

    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • Since the correct timeout value varies, use a heuristic which adjusts
    the timeout depending on how many packets we've seen. This gives
    slightly worse results, but doesn't need tweaking when GSO is
    introduced.

    500 usec 19.1887 xmit 561141 recv 1 timeout 559657
    Dynamic (278) 20.1974 xmit 214510 recv 5 timeout 214491 usec 278

    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • virtio_ring has the ability to suppress notifications. This prevents
    a guest exit for every packet, but we need to set a timer on packet
    receipt to re-check if there were any remaining packets.

    Here are the times for 1G TCP Guest->Host with different timeout
    settings (it matters because the TCP window doesn't grow big enough to
    fill the entire buffer):

    Timeout value    Seconds   Xmit/Recv/Timeout
    None (before)    25.3784   xmit 7750233 recv 1
    2500 usec        62.5119   xmit 207020 recv 2 timeout 207020
    1000 usec        34.5379   xmit 207003 recv 2 timeout 207003
    750 usec         29.2305   xmit 207002 recv 1 timeout 207002
    500 usec         19.1887   xmit 561141 recv 1 timeout 559657
    250 usec         20.0465   xmit 214128 recv 2 timeout 214110
    100 usec         19.2583   xmit 561621 recv 1 timeout 560153

    (Note that these values are sensitive to the GSO patches which come
    later, and probably other traffic-related variables, so take with a
    large grain of salt).

    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • Number of exits transmitting 10GB Guest->Host before:
    network xmit 7858610 recv 118136

    After:
    network xmit 7750233 recv 1

    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • To simplify the transition to when we publish indices in the ring
    (and make shuffling my patch queue easier), wrap them in a lg_last_avail()
    macro.

    Signed-off-by: Rusty Russell

    Rusty Russell
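
The wrapper might be as small as the sketch below. The structure and the last_avail_idx field name are assumptions for illustration; the point is that callers go through the macro, so the storage location of the index can change later without touching them.

```c
#include <stdint.h>

/* Hypothetical, pared-down virtqueue: only the index we care about. */
struct virtqueue_sketch {
	uint16_t last_avail_idx;
};

/* All accesses go through this macro, so only this line has to change
 * if the index later moves into the ring itself. */
#define lg_last_avail(vq)	((vq)->last_avail_idx)

/* Example caller: return the ring slot to read next and advance the index.
 * 'num' is the ring size. */
static unsigned int next_slot(struct virtqueue_sketch *vq, unsigned int num)
{
	return lg_last_avail(vq)++ % num;
}
```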
     
  • This is a simple patch to add support for the virtio "hardware random
    generator" to lguest. It gets about 1.2 MB/sec reading from /dev/hwrng
    in the guest.

    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • If you've got a nice DHCP configuration which maps MAC
    addresses to specific IP addresses, then you're going to
    want to start your guest with one of those MAC addresses.

    Also, in Fedora, we have persistent network interface naming
    based on the MAC address, so with randomly assigned
    addresses you're soon going to hit eth13. Who knows what
    will happen then!

    Allow assigning a MAC address to the network interface with
    e.g.

    --tunnet=bridge:eth0:00:FF:95:6B:DA:3D

    or:

    --tunnet=192.168.121.1:00:FF:95:6B:DA:3D

    which is pretty unintelligible, but ...

    (includes Rusty's minor rework)

    Signed-off-by: Mark McLoughlin
    Signed-off-by: Rusty Russell

    Mark McLoughlin
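
Parsing the MAC portion of such an option could be sketched like this. parse_mac() is a hypothetical helper, not the function added by the patch, and it only handles the six colon-separated hex bytes, not the bridge or IP prefix.

```c
#include <stdio.h>
#include <stdbool.h>

/* Parse "xx:xx:xx:xx:xx:xx" into six bytes. Returns false on malformed
 * input, including trailing characters after the last byte. */
static bool parse_mac(const char *str, unsigned char mac[6])
{
	unsigned int b[6];
	char extra;
	int i;

	/* The trailing %c only matches if junk follows the sixth byte,
	 * in which case sscanf reports 7 conversions instead of 6. */
	if (sscanf(str, "%2x:%2x:%2x:%2x:%2x:%2x%c",
		   &b[0], &b[1], &b[2], &b[3], &b[4], &b[5], &extra) != 6)
		return false;

	for (i = 0; i < 6; i++)
		mac[i] = (unsigned char)b[i];
	return true;
}
```

For example, the address from the commit message, 00:FF:95:6B:DA:3D, parses to the bytes 0x00, 0xFF, 0x95, 0x6B, 0xDA, 0x3D.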