30 May, 2008

9 commits

  • Hello Rusty,

    it seems that we still have a problem with virtio_net and the enable_cb
    callback. During a long-running network stress test with virtio, I got the
    following oops:

    ------------[ cut here ]------------
    kernel BUG at drivers/virtio/virtio_ring.c:230!
    illegal operation: 0001 [#1] SMP
    Modules linked in:
    CPU: 0 Not tainted 2.6.26-rc2-kvm-00436-gc94c08b-dirty #34
    Process netserver (pid: 2582, task: 000000000fbc4c68, ksp: 000000000f42b990)
    Krnl PSW : 0704c00180000000 00000000002d0ec8 (vring_enable_cb+0x1c/0x60)
    R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 EA:3
    Krnl GPRS: 0000000000000000 0000000000000000 000000000ef3d000 0000000010009800
    0000000000000000 0000000000419ce0 0000000000000080 000000000000007b
    000000000adb5538 000000000ef40900 000000000ef40000 000000000ef40920
    0000000000000000 0000000000000005 000000000029c1b0 000000000fea7d18
    Krnl Code: 00000000002d0ebc: a7110001 tmll %r1,1
    00000000002d0ec0: a7740004 brc 7,2d0ec8
    00000000002d0ec4: a7f40001 brc 15,2d0ec6
    >00000000002d0ec8: a517fffe nill %r1,65534
    00000000002d0ecc: 40103000 sth %r1,0(%r3)
    00000000002d0ed0: 07f0 bcr 15,%r0
    00000000002d0ed2: e31020380004 lg %r1,56(%r2)
    00000000002d0ed8: a7480000 lhi %r4,0
    Call Trace:
    ([] virtnet_poll+0x290/0x3b8)
    [] net_rx_action+0x9c/0x1b8
    [] __do_softirq+0x74/0x108
    [] do_softirq+0x92/0xac
    [] irq_exit+0x72/0xc8
    [] do_extint+0xe2/0x104
    [] ext_no_vtime+0x16/0x1a
    Last Breaking-Event-Address:
    [] vring_enable_cb+0x18/0x60

    I looked into the virtio_net code for some time and I think the following
    scenario happened. Please look at virtnet_poll:
    [...]
            /* Out of packets? */
            if (received < budget) {
                    netif_rx_complete(vi->dev, napi);
                    if (unlikely(!vi->rvq->vq_ops->enable_cb(vi->rvq))
                        && napi_schedule_prep(napi)) {
                            vi->rvq->vq_ops->disable_cb(vi->rvq);
                            __netif_rx_schedule(vi->dev, napi);
                            goto again;
                    }
            }

    If an interrupt arrives after netif_rx_complete, a second poll routine can run
    on a different cpu. The second check for napi_schedule_prep would prevent any
    harm in the network stack, but we have called enable_cb possibly after the
    disable_cb in skb_recv_done.

    static void skb_recv_done(struct virtqueue *rvq)
    {
            struct virtnet_info *vi = rvq->vdev->priv;
            /* Schedule NAPI, Suppress further interrupts if successful. */
            if (netif_rx_schedule_prep(vi->dev, &vi->napi)) {
                    rvq->vq_ops->disable_cb(rvq);
                    __netif_rx_schedule(vi->dev, &vi->napi);
            }
    }

    That means that the second poll routine runs with interrupts enabled, which is
    ok, since we can handle additional interrupts. The problem is that the
    second poll routine might also call enable_cb, triggering the BUG.

    The only solution I can come up with is to remove the BUG statement in
    enable_cb, similar to disable_cb. Opinions, or better ideas about where the
    oops could come from?

    Signed-off-by: Christian Borntraeger
    Signed-off-by: Rusty Russell

    Christian Borntraeger
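
The proposal above (dropping the BUG check from enable_cb) can be illustrated with a small user-space model. This is only a sketch: the struct, the flag name and the toy_ functions are hypothetical stand-ins, not the virtio_ring code, and memory barriers are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the "suppress interrupts" bit in the ring flags. */
#define TOY_NO_INTERRUPT 0x1u

/* Toy model of a ring: only the fields this sketch needs. */
struct toy_ring {
	uint16_t avail_flags;   /* bit 0 set => callbacks suppressed */
	uint16_t used_idx;      /* advanced by the "host" */
	uint16_t last_used_idx; /* last entry the "guest" consumed */
};

/* Suppress callbacks. Safe to call when already suppressed. */
void toy_disable_cb(struct toy_ring *r)
{
	r->avail_flags |= TOY_NO_INTERRUPT;
}

/*
 * Re-enable callbacks. This version does not assert that callbacks were
 * disabled beforehand, so two racing pollers may both call it.
 * Returns false if more work is already pending.
 */
bool toy_enable_cb(struct toy_ring *r)
{
	r->avail_flags &= (uint16_t)~TOY_NO_INTERRUPT;
	return r->used_idx == r->last_used_idx;
}
```

With this shape, a second toy_enable_cb() call is a harmless no-op instead of a crash, which mirrors how disable_cb is already described as behaving.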
     
  • Note that by itself, having a "hardware" random generator does very
    little: you should probably run "rngd" in your guest to feed this into
    the kernel entropy pool.

    Included:
    virtio_rng: don't use vmalloced addresses for virtio

    If virtio_rng is built as a module, random_data is an address
    in vmalloc space. As virtio expects guest real addresses, this
    can cause all kinds of funny behaviour, so let's allocate
    random_data dynamically with kmalloc.

    Signed-off-by: Christian Borntraeger

    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • Hello Rusty,

    sometimes it is useful to share a disk (e.g. usr). To avoid file system
    corruption, the disk should be mounted read-only in that case. This patch
    adds a new feature flag that allows the host to specify whether the disk
    should be considered read-only.

    Signed-off-by: Christian Borntraeger
    Signed-off-by: Rusty Russell

    Christian Borntraeger
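
A feature flag of this kind usually boils down to testing one bit in a host-provided bitmap. The following is a minimal sketch under that assumption; the bit number and the toy_ names are hypothetical, since the commit text does not name the flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit number for the read-only feature. */
#define TOY_BLK_F_RO 5

/* Test a single feature bit in the host-provided bitmap. */
bool toy_has_feature(uint32_t host_features, unsigned int bit)
{
	return (host_features >> bit) & 1u;
}

/* Decide whether the guest should treat the disk as read-only. */
bool toy_disk_is_read_only(uint32_t host_features)
{
	return toy_has_feature(host_features, TOY_BLK_F_RO);
}
```

A guest driver following this pattern would check the bit once at probe time and mark the disk read-only before exposing it.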
     
  • Before:
    root@ubuntu:~# cat /proc/interrupts
    CPU0
    1: 1672 lguest- virtio0
    2: 1 lguest- virtio1
    ...
    After:
    root@ubuntu:~# cat /proc/interrupts
    CPU0
    1: 2889 lguest-level virtio0
    2: 9 lguest-level virtio1

    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • Anthony Liguori points out that three different transports use the virtio code,
    but each one keeps its own counter to set the virtio_device's index field. In
    theory (though not in current practice) this means that names could be
    duplicated, and that risk grows as more transports are created.

    So we move the selection of the unique virtio_device.index into the common code
    in virtio.c, which has the side-benefit of removing duplicate code.

    The only complexity is that lguest and S/390 use the index to uniquely identify
    the device in case of catastrophic failure before register_virtio_device() is
    called: now we use the offset within the descriptor page as a unique identifier
    for the printks.

    Signed-off-by: Rusty Russell
    Cc: Christian Borntraeger
    Cc: Martin Schwidefsky
    Cc: Carsten Otte
    Cc: Heiko Carstens
    Cc: Chris Lalancette
    Cc: Anthony Liguori

    Rusty Russell
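
The idea of one shared counter in the common code, rather than one per transport, can be sketched as follows. The struct and function names are hypothetical, and the model ignores locking.

```c
#include <assert.h>

/* Hypothetical device record: only the index matters here. */
struct toy_device {
	unsigned int index;
};

/* One counter owned by the common code, shared by every transport. */
static unsigned int toy_next_index;

/* Common registration path: hands out the next unused index. */
void toy_register_device(struct toy_device *dev)
{
	dev->index = toy_next_index++;
}
```

Because every transport goes through the same function, two devices cannot end up with the same index no matter which transport created them.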
     
  • The common virtio code sets the bus_id, overriding anything virtio_pci
    sets anyway.

    Signed-off-by: Rusty Russell
    Cc: Christian Borntraeger
    Cc: Martin Schwidefsky
    Cc: Carsten Otte
    Cc: Heiko Carstens
    Cc: Chris Lalancette
    Cc: Anthony Liguori

    Rusty Russell
     
  • Chris Lalancette points out that virtio.c sets all device
    names to '0', '1', etc, which looks silly in /proc/interrupts. We change this
    from '%d' to 'virtio%d'.

    Signed-off-by: Rusty Russell
    Cc: Christian Borntraeger
    Cc: Martin Schwidefsky
    Cc: Carsten Otte
    Cc: Heiko Carstens
    Cc: Chris Lalancette
    Cc: Anthony Liguori

    Rusty Russell
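
The naming change amounts to a different format string. A minimal user-space sketch, with a hypothetical helper name:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format a device name as "virtio<index>"; returns snprintf's result. */
int toy_format_name(char *buf, size_t len, int index)
{
	return snprintf(buf, len, "virtio%d", index);
}
```

Index 0 then shows up as "virtio0" rather than a bare "0", matching the "After" output shown for /proc/interrupts in the earlier commit.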
     
  • Fix a modprobe virtio_blk ; rmmod virtio_blk ; modprobe virtio_blk crash; this
    was basically because we weren't doing "del_gendisk()" in the remove path.

    Signed-off-by: Chris Lalancette
    Signed-off-by: Rusty Russell (moved del_gendisk up)

    Chris Lalancette
     
  • Thanks to Jon Corbet & LWN. Only took me a day to join the dots.

    Host->Guest netcat before (with unnecessarily large receive buffers):
    1073741824 bytes (1.1 GB) copied, 24.7528 seconds, 43.4 MB/s

    After:
    1073741824 bytes (1.1 GB) copied, 17.6369 seconds, 60.9 MB/s

    Signed-off-by: Rusty Russell

    Rusty Russell
     

29 May, 2008

1 commit

  • > +#define ARCH_KMALLOC_MINALIGN (sizeof(long) * 2)
    > +#define ARCH_SLAB_MINALIGN (sizeof(long) * 2)

    This doesn't work if SLAB is selected and slab debugging is enabled as
    these are passed to the preprocessor, and the preprocessor doesn't
    understand sizeof.

    Signed-off-by: Linus Torvalds

    David Howells
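
The underlying C rule is that #if directives are evaluated before sizeof has any meaning, so a macro used in #if must expand to a plain integer constant expression. One way to sidestep the problem, shown here as a sketch with hypothetical names (the commit text does not show the actual replacement), is a literal value:

```c
#include <assert.h>

/*
 * A literal works both in C expressions and in #if directives.
 * (sizeof(long) * 2) would only work in the former.
 */
#define TOY_MINALIGN 8

#if TOY_MINALIGN < 8
#error "minimum alignment too small"
#endif

/* Round size up to the next multiple of TOY_MINALIGN. */
unsigned long toy_align_up(unsigned long size)
{
	return (size + TOY_MINALIGN - 1) & ~(unsigned long)(TOY_MINALIGN - 1);
}
```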
     

28 May, 2008

19 commits

  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block:
    cfq-iosched: fix RCU problem in cfq_cic_lookup()
    block: make blktrace use per-cpu buffers for message notes
    Added in elevator switch message to blktrace stream
    Added in MESSAGE notes for blktraces
    block: reorder cfq_queue to save space on 64bit builds
    block: Move the second call to get_request to the end of the loop
    splice: handle try_to_release_page() failure
    splice: fix sendfile() issue with relay

    Linus Torvalds
     
  • Specify the minimum slab/kmalloc alignment to be 8 bytes. This fixes a
    crash when SLOB is selected as the memory allocator. The FRV arch needs
    this so that it can use the load- and store-double instructions without
    faulting. By default SLOB sets the minimum to be 4 bytes.

    Signed-off-by: David Howells
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Fix a typo in the header guard of asm/ipc.h.

    Signed-off-by: Vegard Nossum
    Signed-off-by: David Howells
    Signed-off-by: Linus Torvalds

    Vegard Nossum
     
    cfq_cic_lookup() needs to properly protect ioc->ioc_data before
    dereferencing it and also exclude updaters of ioc->ioc_data.

    Also add a number of comments documenting why the existing RCU usage
    is OK.

    Thanks a lot to "Paul E. McKenney" for
    review and comments!

    Signed-off-by: Jens Axboe

    Jens Axboe
     
    Currently blktrace uses a single static char array for message notes, but
    that risks being corrupted when multiple users issue message notes at the
    same time. Make the buffers dynamically allocated when the trace
    is set up and make them per-cpu instead.

    The default max message size of 1k is also very large, since the
    interface is mainly for small text notes. So shrink it to 128 bytes.

    Signed-off-by: Jens Axboe

    Jens Axboe
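
The per-cpu idea can be modelled in user space with one buffer per CPU slot. This is a simplified sketch: the names and the CPU count are hypothetical, and it uses static arrays where the commit describes buffers allocated at trace setup.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_NR_CPUS 4   /* hypothetical CPU count for the model */
#define TOY_MSG_MAX 128 /* the commit shrinks the limit to 128 bytes */

/* One scratch buffer per CPU, so writers on different CPUs never share one. */
static char toy_msg_buf[TOY_NR_CPUS][TOY_MSG_MAX];

/* Copy a note into the buffer for the given CPU; output is truncated to fit. */
const char *toy_note(unsigned int cpu, const char *text)
{
	char *buf = toy_msg_buf[cpu % TOY_NR_CPUS];

	snprintf(buf, TOY_MSG_MAX, "%s", text);
	return buf;
}
```

Two notes issued on different CPU slots land in different buffers, which is the property the single static array lacked.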
     
  • Signed-off-by: Alan D. Brunelle
    Signed-off-by: Jens Axboe

    Alan D. Brunelle
     
  • Allows messages to be inserted into blktrace streams.

    Signed-off-by: Alan D. Brunelle
    Signed-off-by: Jens Axboe

    Alan D. Brunelle
     
  • saves 8 bytes of padding & increases objects/slab from 30 to 32 on my
    AMD64 config

    Signed-off-by: Richard Kennedy
    Signed-off-by: Jens Axboe

    Richard Kennedy
     
    In function get_request_wait, the second call to get_request could be
    moved to the end of the while loop, because if the first call to
    get_request fails, the second call will also fail unless we sleep first.

    Signed-off-by: Zhang Yanmin
    Signed-off-by: Jens Axboe

    Zhang, Yanmin
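
The loop shape described above (try once, then retry only after each sleep) can be sketched with a toy allocator. Everything here is hypothetical and only models the control flow, not the block layer.

```c
#include <assert.h>
#include <stddef.h>

/* Toy queue: a request becomes available after `waits_needed` sleeps. */
struct toy_queue {
	int waits_needed;
	int waits_done;
	int attempts;
};

static int toy_request; /* dummy object standing in for a request */

/* Returns a request if one is available, NULL otherwise. */
int *toy_get_request(struct toy_queue *q)
{
	q->attempts++;
	return q->waits_done >= q->waits_needed ? &toy_request : NULL;
}

/* Stand-in for sleeping until a request may have been freed. */
void toy_wait(struct toy_queue *q)
{
	q->waits_done++;
}

/* Try once, then retry only after each sleep, never twice back to back. */
int *toy_get_request_wait(struct toy_queue *q)
{
	int *rq = toy_get_request(q);

	while (!rq) {
		toy_wait(q);
		rq = toy_get_request(q);
	}
	return rq;
}
```

With the retry at the end of the loop body, the number of allocation attempts is exactly one more than the number of sleeps.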
     
    splice currently assumes that try_to_release_page() always succeeds,
    but it can return failure. If it does, we cannot steal the page.

    Acked-by: Mingming Cao

    Jens Axboe
     
  • Splice isn't always incrementing the ppos correctly, which broke
    relay splice.

    Signed-off-by: Tom Zanussi
    Tested-by: Dan Williams
    Signed-off-by: Jens Axboe

    Tom Zanussi
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6:
    pciehp: add message about pciehp_slot_with_bus option
    pci hotplug core: add check of duplicate slot name
    pciehp: move msleep after power off
    pciehp: poll cmd completion if hotplug interrupt is disabled
    pciehp: fix slow probing
    pciehp: fix NULL dereference in interrupt handler
    shpchp: add message about shpchp_slot_with_bus option
    PCI: don't enable ASPM on devices with mixed PCIe/PCI functions

    Linus Torvalds
     
    Some (broken?) platforms assign the same slot name to multiple hotplug
    slots. On such systems, slot initialization fails because of the name
    collision. The pciehp driver already has a "slot_with_bus" module
    option which adds the bus number to the slot name. This patch adds
    a message about this module option that is displayed when a slot
    name collision is detected.

    Signed-off-by: Kenji Kaneshige
    Signed-off-by: Kristen Carlson Accardi
    Signed-off-by: Jesse Barnes

    Kenji Kaneshige
     
  • Fix the following errors reported by Jan C. Nordholz in
    http://bugzilla.kernel.org/show_bug.cgi?id=10751.

    kobject_add_internal failed for 2 with -EEXIST, don't try to register things with the same name in the same directory.
    Pid: 1, comm: swapper Tainted: G W 2.6.26-rc3 #1
    [] kobject_add_internal+0x140/0x190
    [] kobject_init_and_add+0x2d/0x40
    [] pci_hp_register+0x81/0x2f0
    [] pciehp_probe+0x1a7/0x470
    [] sysfs_add_one+0x44/0xa0
    [] sysfs_addrm_start+0x3f/0xb0
    [] sysfs_create_link+0x8a/0xf0
    [] pcie_port_probe_service+0x50/0x80
    [] driver_sysfs_add+0x55/0x70
    [] driver_probe_device+0x82/0x180
    [] __driver_attach+0x6c/0x70
    [] bus_for_each_dev+0x3a/0x60
    [] pcied_init+0x0/0x80
    [] driver_attach+0x16/0x20
    [] __driver_attach+0x0/0x70
    [] bus_add_driver+0x1a1/0x220
    [] pcied_init+0x0/0x80
    [] driver_register+0x4d/0x120
    [] ibm_acpiphp_init+0x0/0x190
    [] printk+0x1b/0x20
    [] pcied_init+0x0/0x80
    [] pcied_init+0xe/0x80
    [] kernel_init+0x10a/0x300
    [] schedule_tail+0x18/0x50
    [] ret_from_fork+0x6/0x1c
    [] kernel_init+0x0/0x300
    [] kernel_init+0x0/0x300
    [] kernel_thread_helper+0x7/0x1c
    =======================
    pci_hotplug: Unable to register kobject '2'pciehp: pci_hp_register failed with error -22

    A slot with the same name can be registered multiple times if the shpchp
    or pciehp driver is loaded after acpiphp, because the ACPI-based
    hotplug driver and the native OS hotplug driver both try to handle the
    same physical slot. In this case, the current pci_hotplug core calls
    kobject_init_and_add() multiple times with the same name, which is the
    cause of this problem. To fix it, this patch adds a check
    to pci_hp_register() to see whether a slot with the same name is already
    registered.

    Signed-off-by: Kenji Kaneshige
    Signed-off-by: Kristen Carlson Accardi
    Signed-off-by: Jesse Barnes

    Kenji Kaneshige
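
A duplicate-name check before registration can be sketched like this. The registry, its limits and the choice of -EEXIST as the error are assumptions made for illustration; the commit text does not say which error the real check returns.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define TOY_MAX_SLOTS 8
#define TOY_NAME_LEN 16

static char toy_slots[TOY_MAX_SLOTS][TOY_NAME_LEN];
static int toy_slot_count;

/* Register a slot name; refuse duplicates before touching the registry. */
int toy_register_slot(const char *name)
{
	int i;

	if (strlen(name) >= TOY_NAME_LEN)
		return -EINVAL;
	for (i = 0; i < toy_slot_count; i++)
		if (strcmp(toy_slots[i], name) == 0)
			return -EEXIST;
	if (toy_slot_count == TOY_MAX_SLOTS)
		return -ENOMEM;
	strcpy(toy_slots[toy_slot_count++], name);
	return 0;
}
```

Checking first means the second driver gets a clean error for slot "2" instead of a failed lower-level registration and a stack trace.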
     
    According to the PCI Express specification, we must wait for at least
    1 second after turning power off before taking any action that relies
    on power having been removed from the slot/adapter. For this, pciehp
    currently waits for 1 second after issuing the power off command in
    the hpc_power_off_slot() function. But waiting for 1 second in
    hpc_power_off_slot() can slow down pciehp probing, because the pciehp
    probe code calls hpc_power_off_slot() just in case when the slot is not
    occupied. We don't need to wait for 1 second at pciehp probe time
    because no action is taken on that empty slot. So move the 1 second wait
    from hpc_power_off_slot() to the caller of hpc_power_off_slot().

    Signed-off-by: Kenji Kaneshige
    Signed-off-by: Kristen Carlson Accardi
    Signed-off-by: Jesse Barnes

    Kenji Kaneshige
     
    Fix improperly long wait for command completion in pciehp probing.

    As described in the PCI Express specification, software notification is
    not generated for a command that results from a write to the
    Slot Control register that disables software notification of command
    completed events. Since the pciehp driver doesn't take this into account,
    issuing such a command during pciehp probing causes an improperly long
    wait for command completion.

    This patch changes the pciehp driver to take such commands into
    account.

    Signed-off-by: Kenji Kaneshige
    Signed-off-by: Kristen Carlson Accardi
    Signed-off-by: Jesse Barnes

    Kenji Kaneshige
     
    Fix the "pciehp probing slow" problem reported by Jan C. Nordholz in
    http://bugzilla.kernel.org/show_bug.cgi?id=10751.

    The command completed bit in the Slot Status register applies only to
    commands issued to control the attention indicator, power indicator,
    power controller, or electromechanical interlock. However, writes to
    other parts of the Slot Control register would end up writing to the
    control fields. Hence, any write to the Slot Control register is
    considered a command. However, if the controller doesn't support
    any of the attention indicator, power indicator, power controller and
    electromechanical interlock, the command completed bit is not set by
    a write to the Slot Control register. In this case, we should not wait for
    the command completed bit to be set, otherwise every command would be
    considered incomplete until the timeout (1 sec.) expires.

    The cause of the problem is that the pciehp driver didn't take this
    situation into account. This patch changes pciehp to take it into
    account. It also adds a check for the "No Command Completed Support" bit
    in the Slot Capability register. If it is set, we should not wait for the
    command completed bit to be set either.

    This problem seems to be revealed by the commit
    c27fb883dffe11aa4cb35ecea1fa1832ba45d4da that fixed the bug that
    pciehp did not wait for command completed properly (pciehp just
    ignored the command completion event).

    Signed-off-by: Kenji Kaneshige
    Signed-off-by: Kristen Carlson Accardi
    Signed-off-by: Jesse Barnes

    Kenji Kaneshige
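
The decision described above (wait for command completion only when the controller can actually report it) can be sketched as a predicate on the slot capabilities. The bit positions and names below are invented for this sketch and are not the real Slot Capability register layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capability bits; NOT the real register layout. */
#define TOY_CAP_ATTN_INDICATOR   (1u << 0)
#define TOY_CAP_POWER_INDICATOR  (1u << 1)
#define TOY_CAP_POWER_CONTROLLER (1u << 2)
#define TOY_CAP_EM_INTERLOCK     (1u << 3)
#define TOY_CAP_NO_CMD_COMPLETED (1u << 4)

/* Should the driver wait for the command completed bit after a write? */
bool toy_should_wait_for_completion(uint32_t slot_cap)
{
	const uint32_t controls = TOY_CAP_ATTN_INDICATOR |
				  TOY_CAP_POWER_INDICATOR |
				  TOY_CAP_POWER_CONTROLLER |
				  TOY_CAP_EM_INTERLOCK;

	/* Controller declares it never reports command completion. */
	if (slot_cap & TOY_CAP_NO_CMD_COMPLETED)
		return false;
	/* Without any of these controls the bit would never be set. */
	return (slot_cap & controls) != 0;
}
```

Skipping the wait in both cases avoids burning the full timeout on every write during probing.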
     
    Fix the following NULL dereference problem reported by Pierre Ossman
    and Ingo Molnar.

    pciehp: HPC vendor_id 8086 device_id 27d0 ss_vid 0 ss_did 0
    pciehp: pciehp_find_slot: slot (device=0x0) not found
    BUG: unable to handle kernel NULL pointer dereference at 0000000000000070
    IP: [] pciehp_handle_presence_change+0x7e/0x113
    PGD 0
    Oops: 0000 [1]
    CPU 0
    Modules linked in:
    Pid: 1, comm: swapper Tainted: G W 2.6.26-rc3-sched-devel.git-00001-g2b99b26-dirty #170
    RIP: 0010:[] [] pciehp_handle_presence_change+0x7e/0x113
    RSP: 0000:ffff81003f83fbb0 EFLAGS: 00010046
    RAX: 0000000000000039 RBX: 0000000000000000 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000046
    RBP: ffff81003f83fbd0 R08: 0000000000000001 R09: ffffffff80245103
    R10: 0000000000000020 R11: 0000000000000000 R12: ffff81003ea53a30
    R13: 0000000000000000 R14: 0000000000000011 R15: ffffffff80495926
    FS: 0000000000000000(0000) GS:ffffffff80be7400(0000) knlGS:0000000000000000
    CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
    CR2: 0000000000000070 CR3: 0000000000201000 CR4: 00000000000006a0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process swapper (pid: 1, threadinfo ffff81003f83e000, task ffff81003f840000)
    Stack: 0000000000000008 ffff81003f83fbf6 ffff81003ea53a30 0000000000000008
    ffff81003f83fc10 ffffffff80495ab4 0000000000000011 0000000000000002
    0000000000000202 0000000000000202 00000000fffffff4 ffff81003ea53a30
    Call Trace:
    [] pcie_isr+0x18e/0x1bc
    [] request_irq+0x106/0x12f
    [] pcie_init+0x15e/0x6cc
    [] pciehp_probe+0x64/0x541
    [] pcie_port_probe_service+0x4c/0x76
    [] driver_probe_device+0xd4/0x1f0
    [] __driver_attach+0x7c/0x7e
    [] ? __driver_attach+0x0/0x7e
    [] bus_for_each_dev+0x53/0x7d
    [] driver_attach+0x1c/0x1e
    [] bus_add_driver+0xdd/0x25b
    [] ? pcied_init+0x0/0x8b
    [] driver_register+0x5f/0x13e
    [] ? pcied_init+0x0/0x8b
    [] pcie_port_service_register+0x47/0x49
    [] pcied_init+0x15/0x8b
    [] kernel_init+0x75/0x243
    [] ? _spin_unlock_irq+0x2b/0x3a
    [] ? finish_task_switch+0x57/0x9a
    [] child_rip+0xa/0x12
    [] ? restore_args+0x0/0x30
    [] ? kernel_init+0x0/0x243
    [] ? child_rip+0x0/0x12

    Code: 83 80 00 00 00 48 39 f0 75 e1 0f b6 c9 48 c7 c2 00 0e 8d 80 48 c7 c6 8a 60 a6 80 48 c7 c7 10 db a8 80 31 c0 e8 3f 8d d9 ff 31 db 8b 43 70 48 8d 75 ef 48 89 df ff 50 30 80 7d ef 00 74 37 48
    RIP [] pciehp_handle_presence_change+0x7e/0x113
    RSP
    CR2: 0000000000000070
    Kernel panic - not syncing: Fatal exception

    The situation under which it occurs is hw and timing related: it appears
    to happen on a system that has PCI hotplug hardware but with no active
    hotplug cards, and another interrupt in the same (shared) IRQ line
    arrives too early, before the hotplug-slot entry has been set up, as
    triggered by CONFIG_DEBUG_SHIRQ=y.

    This patch contains the following two fixes.

    (1) Clear all event bits in the Slot Status register to prevent the pciehp
    driver from detecting spurious events that occurred
    before pciehp was loaded.

    (2) Add a check for whether slot initialization has already been done.

    This is a short-term fix. We need more structural fixes to install the
    interrupt handler after slot initialization is done.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Kenji Kaneshige
    Signed-off-by: Kristen Carlson Accardi
    Signed-off-by: Jesse Barnes

    Kenji Kaneshige
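
Fix (2) above, checking that the slot has been initialized before an event handler touches it, can be sketched as follows. The types and names are hypothetical and only model the guard, not the pciehp interrupt path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical slot state touched by the event handler. */
struct toy_slot {
	int presence_changes;
};

/* Hypothetical controller; slot stays NULL until initialization completes. */
struct toy_ctrl {
	struct toy_slot *slot;
};

/*
 * Handle a presence-change event. Returns false, without dereferencing
 * anything, when the event arrives before the slot has been set up.
 */
bool toy_handle_presence_change(struct toy_ctrl *ctrl)
{
	if (!ctrl->slot)
		return false;
	ctrl->slot->presence_changes++;
	return true;
}
```

An early interrupt on a shared line is then dropped quietly instead of dereferencing a NULL slot pointer.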
     
    Some (broken?) platforms assign the same slot name to multiple hotplug
    slots. On such systems, slot initialization fails because of the name
    collision. The shpchp driver already has a "slot_with_bus" module
    option which adds the bus number to the slot name. This patch adds
    a message about this module option that is displayed when a slot
    name collision is detected.

    Signed-off-by: Kenji Kaneshige
    Signed-off-by: Kristen Carlson Accardi
    Signed-off-by: Jesse Barnes

    Kenji Kaneshige
     

27 May, 2008

11 commits