11 Aug, 2008

9 commits

  • Solve this by marking the classes as unused and not printing information
    about the unused classes.

    Reported-by: Eric Sesterhenn
    Signed-off-by: Rabin Vincent
    Acked-by: Huang Ying
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Rabin Vincent
     
  • Lockdep spotted:

    =======================================================
    [ INFO: possible circular locking dependency detected ]
    2.6.27-rc1 #270
    -------------------------------------------------------
    qemu-kvm/2033 is trying to acquire lock:
    (&inode->i_data.i_mmap_lock){----}, at: [] mm_take_all_locks+0xc2/0xea

    but task is already holding lock:
    (&anon_vma->lock){----}, at: [] mm_take_all_locks+0x70/0xea

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&anon_vma->lock){----}:
    [] __lock_acquire+0x11be/0x14d2
    [] lock_acquire+0x5e/0x7a
    [] _spin_lock+0x3b/0x47
    [] vma_adjust+0x200/0x444
    [] split_vma+0x12f/0x146
    [] mprotect_fixup+0x13c/0x536
    [] sys_mprotect+0x1a9/0x21e
    [] system_call_fastpath+0x16/0x1b
    [] 0xffffffffffffffff

    -> #0 (&inode->i_data.i_mmap_lock){----}:
    [] __lock_acquire+0xedb/0x14d2
    [] lock_release_non_nested+0x1c2/0x219
    [] lock_release+0x127/0x14a
    [] _spin_unlock+0x1e/0x50
    [] mm_drop_all_locks+0x7f/0xb0
    [] do_mmu_notifier_register+0xe2/0x112
    [] mmu_notifier_register+0xe/0x10
    [] kvm_dev_ioctl+0x11e/0x287 [kvm]
    [] vfs_ioctl+0x2a/0x78
    [] do_vfs_ioctl+0x257/0x274
    [] sys_ioctl+0x55/0x78
    [] system_call_fastpath+0x16/0x1b
    [] 0xffffffffffffffff

    other info that might help us debug this:

    5 locks held by qemu-kvm/2033:
    #0: (&mm->mmap_sem){----}, at: [] do_mmu_notifier_register+0x55/0x112
    #1: (mm_all_locks_mutex){--..}, at: [] mm_take_all_locks+0x34/0xea
    #2: (&anon_vma->lock){----}, at: [] mm_take_all_locks+0x70/0xea
    #3: (&anon_vma->lock){----}, at: [] mm_take_all_locks+0x70/0xea
    #4: (&anon_vma->lock){----}, at: [] mm_take_all_locks+0x70/0xea

    stack backtrace:
    Pid: 2033, comm: qemu-kvm Not tainted 2.6.27-rc1 #270

    Call Trace:
    [] print_circular_bug_tail+0xb8/0xc3
    [] __lock_acquire+0xedb/0x14d2
    [] ? add_lock_to_list+0x7e/0xad
    [] ? mm_take_all_locks+0x70/0xea
    [] ? mm_take_all_locks+0x70/0xea
    [] lock_release_non_nested+0x1c2/0x219
    [] ? mm_take_all_locks+0xc2/0xea
    [] ? mm_take_all_locks+0xc2/0xea
    [] ? trace_hardirqs_on_caller+0x4d/0x115
    [] ? mm_drop_all_locks+0x7f/0xb0
    [] lock_release+0x127/0x14a
    [] _spin_unlock+0x1e/0x50
    [] mm_drop_all_locks+0x7f/0xb0
    [] do_mmu_notifier_register+0xe2/0x112
    [] mmu_notifier_register+0xe/0x10
    [] kvm_dev_ioctl+0x11e/0x287 [kvm]
    [] ? file_has_perm+0x83/0x8e
    [] vfs_ioctl+0x2a/0x78
    [] do_vfs_ioctl+0x257/0x274
    [] sys_ioctl+0x55/0x78
    [] system_call_fastpath+0x16/0x1b

    The locking hierarchy in mm/rmap.c confirms this dependency as valid.

    Fix this by first taking all the mapping->i_mmap_lock instances and then
    taking all anon_vma->lock instances.

    Signed-off-by: Peter Zijlstra
    Acked-by: Hugh Dickins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The nesting is correct due to holding mmap_sem; use the new annotation
    to express this.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Expose the new lock protection lock.

    This can be used to annotate places where we take multiple locks of the
    same class and avoid deadlocks by always taking another (top-level) lock
    first.

    NOTE: we're still bound to the MAX_LOCK_DEPTH (48) limit.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • On Fri, 2008-08-01 at 16:26 -0700, Linus Torvalds wrote:

    > On Fri, 1 Aug 2008, David Miller wrote:
    > >
    > > Taking more than a few locks of the same class at once is bad
    > > news and it's better to find an alternative method.
    >
    > It's not always wrong.
    >
    > If you can guarantee that anybody that takes more than one lock of a
    > particular class will always take a single top-level lock _first_, then
    > that's all good. You can obviously screw up and take the same lock _twice_
    > (which will deadlock), but at least you cannot get into ABBA situations.
    >
    > So maybe the right thing to do is to just teach lockdep about "lock
    > protection locks". That would have solved the multi-queue issues for
    > networking too - all the actual network drivers would still have taken
    > just their single queue lock, but the one case that needs to take all of
    > them would have taken a separate top-level lock first.
    >
    > Never mind that the multi-queue locks were always taken in the same order:
    > it's never wrong to just have some top-level serialization, and anybody
    > who needs to take locks might as well do so, because they sure as
    > hell aren't going to be on _any_ fastpaths.
    >
    > So the simplest solution really sounds like just teaching lockdep about
    > that one special case. It's not "nesting" exactly, although it's obviously
    > related to it.

    Do as Linus suggested. The lock protection lock is called nest_lock.

    Note that we still have the MAX_LOCK_DEPTH (48) limit to consider, so anything
    that spills over that is still up shit creek.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Most of the free-standing lock_acquire() usages look remarkably similar;
    sweep them into a new helper.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • struct held_lock {
    u64 prev_chain_key; /* 0 8 */
    struct lock_class * class; /* 8 8 */
    long unsigned int acquire_ip; /* 16 8 */
    struct lockdep_map * instance; /* 24 8 */
    int irq_context; /* 32 4 */
    int trylock; /* 36 4 */
    int read; /* 40 4 */
    int check; /* 44 4 */
    int hardirqs_off; /* 48 4 */

    /* size: 56, cachelines: 1 */
    /* padding: 4 */
    /* last cacheline: 56 bytes */
    };

    struct held_lock {
    u64 prev_chain_key; /* 0 8 */
    long unsigned int acquire_ip; /* 8 8 */
    struct lockdep_map * instance; /* 16 8 */
    unsigned int class_idx:11; /* 24:21 4 */
    unsigned int irq_context:2; /* 24:19 4 */
    unsigned int trylock:1; /* 24:18 4 */
    unsigned int read:2; /* 24:16 4 */
    unsigned int check:2; /* 24:14 4 */
    unsigned int hardirqs_off:1; /* 24:13 4 */

    /* size: 32, cachelines: 1 */
    /* padding: 4 */
    /* bit_padding: 13 bits */
    /* last cacheline: 32 bytes */
    };

    [mingo@elte.hu: shrunk hlock->class too]
    [peterz@infradead.org: fixup bit sizes]
    Signed-off-by: Dave Jones
    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra

    Dave Jones
     
  • Instead of using a per-rq lock class, use the regular nesting operations.

    However, take extra care with double_lock_balance() as it can release the
    already held rq->lock (and therefore change its nesting class).

    So what can happen is:

    spin_lock(rq->lock); // this rq subclass 0

    double_lock_balance(rq, other_rq);
    // release rq
    // acquire other_rq->lock subclass 0
    // acquire rq->lock subclass 1

    spin_unlock(other_rq->lock);

    leaving you with rq->lock in subclass 1

    So a subsequent double_lock_balance() call can try to nest a subclass 1
    lock while already holding a subclass 1 lock.

    Fix this by introducing double_unlock_balance() which releases the other
    rq's lock, but also re-sets the subclass for this rq's lock to 0.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • This can be used to reset a held lock's subclass, for arbitrary-depth
    iterated data structures such as trees or lists which have per-node
    locks.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

01 Aug, 2008

3 commits

  • While thinking about David's graph walk lockdep patch it _finally_
    dawned on me that there is no reason we have a lock class per cpu ...

    Sorry for being dense :-/

    The below changes the annotation from a lock class per cpu, to a single
    nested lock, as the scheduler never holds more than 2 rq locks at a time
    anyway.

    If there were code requiring holding all rq locks, this would not work
    and the original annotation would be the only option; but since that is
    not the case, this is a much lighter one.

    Compiles and boots on a 2-way x86_64.

    Signed-off-by: Peter Zijlstra
    Cc: David Miller
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Otherwise lock debugging messages on runqueue locks can deadlock the
    system due to the wakeups performed by printk().

    Signed-off-by: David S. Miller
    Signed-off-by: Ingo Molnar

    David Miller
     
  • When we traverse the graph, either forwards or backwards, we
    are interested in whether a certain property exists somewhere
    in a node reachable in the graph.

    Therefore it is never necessary to traverse through a node more
    than once to get a correct answer to the given query.

    Take advantage of this property using a global ID counter so that we
    need not clear all the markers in all the lock_class entries before
    doing a traversal. A new ID is chosen when we start to traverse, and
    we continue through a lock_class only if its ID hasn't been marked
    with the new value yet.

    This short-circuiting is essential especially for high CPU count
    systems. The scheduler has a runqueue per cpu, and needs to take
    two runqueue locks at a time, which leads to long chains of
    backwards and forwards subgraphs from these runqueue lock nodes.
    Without the short-circuit implemented here, a graph traversal on
    a runqueue lock can take up to (1 << (N - 1)) checks on a system
    with N cpus.

    For anything more than 16 cpus or so, lockdep will eventually bring
    the machine to a complete standstill.

    Signed-off-by: David S. Miller
    Acked-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    David Miller
     

29 Jul, 2008

28 commits

  • Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
    lguest: turn Waker into a thread, not a process
    lguest: Enlarge virtio rings
    lguest: Use GSO/IFF_VNET_HDR extensions on tun/tap
    lguest: Remove 'network: no dma buffer!' warning
    lguest: Adaptive timeout
    lguest: Tell Guest net not to notify us on every packet xmit
    lguest: net block unneeded receive queue update notifications
    lguest: wrap last_avail accesses.
    lguest: use cpu capability accessors
    lguest: virtio-rng support
    lguest: Support assigning a MAC address
    lguest: Don't leak /dev/zero fd
    lguest: fix verbose printing of device features.
    lguest: fix switcher_page leak on unload
    lguest: Guest int3 fix
    lguest: set max_pfn_mapped, growl loudly at Yinghai Lu

    Linus Torvalds
     
  • * 'for-linus' of git://git.o-hand.com/linux-mfd:
    mfd: accept pure device as a parent, not only platform_device
    mfd: add platform_data to mfd_cell
    mfd: Coding style fixes
    mfd: Use to_platform_device instead of container_of

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: (21 commits)
    x86/PCI: use dev_printk when possible
    PCI: add D3 power state avoidance quirk
    PCI: fix bogus "'device' may be used uninitialized" warning in pci_slot
    PCI: add an option to allow ASPM enabled forcibly
    PCI: disable ASPM on pre-1.1 PCIe devices
    PCI: disable ASPM per ACPI FADT setting
    PCI MSI: Don't disable MSIs if the mask bit isn't supported
    PCI: handle 64-bit resources better on 32-bit machines
    PCI: rewrite PCI BAR reading code
    PCI: document pci_target_state
    PCI hotplug: fix typo in pcie hotplug output
    x86 gart: replace to_pages macro with iommu_num_pages
    x86, AMD IOMMU: replace to_pages macro with iommu_num_pages
    iommu: add iommu_num_pages helper function
    dma-coherent: add documentation to new interfaces
    Cris: convert to using generic dma-coherent mem allocator
    Sh: use generic per-device coherent dma allocator
    ARM: support generic per-device coherent dma mem
    Generic dma-coherent: fix DMA_MEMORY_EXCLUSIVE
    x86: use generic per-device dma coherent allocator
    ...

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6:
    [SCSI] qla2xxx: fix msleep compile error

    Linus Torvalds
     
  • Alexey Dobriyan reported trouble with LTP with the new fast-gup code,
    and Johannes Weiner debugged it to non-page-aligned addresses, where the
    new get_user_pages_fast() code would do all the wrong things, including
    just traversing past the end of the requested area due to 'addr' never
    matching 'end' exactly.

    This is not a pretty fix, and we may actually want to move the alignment
    into generic code, leaving just the core code per-arch, but Alexey
    verified that the vmsplice01 LTP test doesn't crash with this.

    Reported-and-tested-by: Alexey Dobriyan
    Debugged-by: Johannes Weiner
    Cc: Nick Piggin
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • lguest uses a Waker process to break it out of the kernel (ie.
    actually running the guest) when a file descriptor needs attention.

    Changing this from a process to a thread somewhat simplifies things:
    it can directly access the fd_set of things to watch. More
    importantly, it means that the Waker can see Guest memory correctly,
    so /dev/vring file descriptors will work as anticipated (the
    alternative is to actually mmap MAP_SHARED, but you can't do that with
    /dev/zero).

    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • With big packets, 128 entries is a little small.

    Guest -> Host 1GB TCP:
    Before: 8.43625 seconds xmit 95640 recv 198266 timeout 49771 usec 1252
    After: 8.01099 seconds xmit 49200 recv 102263 timeout 26014 usec 2118

    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • Guest -> Host 1GB TCP:
    Before 20.1974 seconds xmit 214510 recv 5 timeout 214491 usec 278
    After 8.43625 seconds xmit 95640 recv 198266 timeout 49771 usec 1252

    Host -> Guest 1GB TCP:
    Before: Seconds 9.98854 xmit 172166 recv 5344 timeout 172157 usec 251
    After: Seconds 5.72803 xmit 244322 recv 9919 timeout 244302 usec 156

    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • This warning can happen a lot under load, and it should be warnx not
    warn anyway.

    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • Since the correct timeout value varies, use a heuristic which adjusts
    the timeout depending on how many packets we've seen. This gives
    slightly worse results, but doesn't need tweaking when GSO is
    introduced.

    500 usec 19.1887 xmit 561141 recv 1 timeout 559657
    Dynamic (278) 20.1974 xmit 214510 recv 5 timeout 214491 usec 278

    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • virtio_ring has the ability to suppress notifications. This prevents
    a guest exit for every packet, but we need to set a timer on packet
    receipt to re-check if there were any remaining packets.

    Here are the times for 1G TCP Guest->Host with different timeout
    settings (it matters because the TCP window doesn't grow big enough to
    fill the entire buffer):

    Timeout value Seconds Xmit/Recv/Timeout
    None (before) 25.3784 xmit 7750233 recv 1
    2500 usec 62.5119 xmit 207020 recv 2 timeout 207020
    1000 usec 34.5379 xmit 207003 recv 2 timeout 207003
    750 usec 29.2305 xmit 207002 recv 1 timeout 207002
    500 usec 19.1887 xmit 561141 recv 1 timeout 559657
    250 usec 20.0465 xmit 214128 recv 2 timeout 214110
    100 usec 19.2583 xmit 561621 recv 1 timeout 560153

    (Note that these values are sensitive to the GSO patches which come
    later, and probably other traffic-related variables, so take with a
    large grain of salt).

    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • Number of exits transmitting 10GB Guest->Host before:
    network xmit 7858610 recv 118136

    After:
    network xmit 7750233 recv 1

    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • To simplify the transition to when we publish indices in the ring
    (and make shuffling my patch queue easier), wrap them in a lg_last_avail()
    macro.

    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • To support my little make-x86-bitops-use-proper-typechecking projectlet.

    Cc: Thomas Gleixner
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Acked-by: Ingo Molnar
    Signed-off-by: Rusty Russell

    Andrew Morton
     
  • This is a simple patch to add support for the virtio "hardware random
    generator" to lguest. It gets about 1.2 MB/sec reading from /dev/hwrng
    in the guest.

    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • If you've got a nice DHCP configuration which maps MAC
    addresses to specific IP addresses, then you're going to
    want to start your guest with one of those MAC addresses.

    Also, in Fedora, we have persistent network interface naming
    based on the MAC address, so with randomly assigned
    addresses you're soon going to hit eth13. Who knows what
    will happen then!

    Allow assigning a MAC address to the network interface with
    e.g.

    --tunnet=bridge:eth0:00:FF:95:6B:DA:3D

    or:

    --tunnet=192.168.121.1:00:FF:95:6B:DA:3D

    which is pretty unintelligible, but ...

    (includes Rusty's minor rework)

    Signed-off-by: Mark McLoughlin
    Signed-off-by: Rusty Russell

    Mark McLoughlin
     
  • Signed-off-by: Mark McLoughlin
    Signed-off-by: Rusty Russell

    Mark McLoughlin
     
  • %02x is more appropriate for bytes than %08x.

    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • map_switcher allocates the array, unmap_switcher has to free it
    accordingly.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Rusty Russell

    Johannes Weiner
     
  • Ron Minnich noticed that guest userspace gets a GPF when it tries to int3:
    we need to copy the privilege level from the guest-supplied IDT to the real
    IDT. int3 is the only common case where guest userspace expects to invoke
    an interrupt, so that's the symptom of failing to do this.

    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • 6af61a7614a306fe882a0c2b4ddc63b65aa66efc 'x86: clean up max_pfn_mapped
    usage - 32-bit' makes the following comment:

    XEN PV and lguest may need to assign max_pfn_mapped too.

    But no CC. Yinghai, wasting fellow developers' time is a VERY bad
    habit. If you do it again, I will hunt you down and try to extract
    the three hours of my life I just lost :)

    Signed-off-by: Rusty Russell
    Cc: Yinghai Lu

    Rusty Russell
     
  • Signed-off-by: Dmitry Baryshkov
    Signed-off-by: Samuel Ortiz

    Dmitry Baryshkov
     
  • arch/x86/mm/pgtable.c: In function 'pgd_mop_up_pmds':
    arch/x86/mm/pgtable.c:194: warning: unused variable 'pmd'

    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The computed color value is never actually written to the hardware
    colormap register.

    Signed-off-by: Manuel Lauss
    Cc: Nobuhiro Iwamatsu
    Cc: Munakata Hisao
    Cc: Paul Mundt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manuel Lauss
     
  • With SLUB debugging turned on in 2.6.26, I was getting memory corruption
    when testing eCryptfs. The root cause turned out to be that eCryptfs was
    doing kmalloc(PAGE_CACHE_SIZE); virt_to_page() and treating that as a nice
    page-aligned chunk of memory. But at least with SLUB debugging on, this
    is not always true, and the page we get from virt_to_page does not
    necessarily match the PAGE_CACHE_SIZE worth of memory we got from kmalloc.

    My simple testcase was 2 loops doing "rm -f fileX; cp /tmp/fileX ." for 2
    different multi-megabyte files. With this change I no longer see the
    corruption.

    Signed-off-by: Eric Sandeen
    Acked-by: Michael Halcrow
    Acked-by: Rik van Riel
    Cc: [2.6.25.x, 2.6.26.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Sandeen
     
  • If CONFIG_GENERIC_GPIO=y && CONFIG_GPIO_SYSFS=n, gpio_export() in
    asm-generic/gpio.h references -ENOSYS and causes a build error.

    Signed-off-by: Atsushi Nemoto
    Acked-by: David Brownell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Atsushi Nemoto
     
  • I got a section mismatch message about bio_integrity_init_slab().

    WARNING: fs/built-in.o(__ksymtab+0xb60): Section mismatch in reference from the variable __ksymtab_bio_integrity_init_slab to the function .init.text:bio_integrity_init_slab()

    The symbol bio_integrity_init_slab is exported and annotated __init. Fix
    this by removing the __init annotation of bio_integrity_init_slab or
    dropping the export.

    It is only called from init_bio(), so the EXPORT_SYMBOL() can be removed.

    Signed-off-by: Yoichi Yuasa
    Cc: "Martin K. Petersen"
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yoichi Yuasa