11 Aug, 2008
9 commits
-
Solve this by marking the classes as unused and not printing information
about the unused classes.

Reported-by: Eric Sesterhenn
Signed-off-by: Rabin Vincent
Acked-by: Huang Ying
Signed-off-by: Peter Zijlstra
Signed-off-by: Ingo Molnar -
Lockdep spotted:
=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.27-rc1 #270
-------------------------------------------------------
qemu-kvm/2033 is trying to acquire lock:
(&inode->i_data.i_mmap_lock){----}, at: [] mm_take_all_locks+0xc2/0xea

but task is already holding lock:
(&anon_vma->lock){----}, at: [] mm_take_all_locks+0x70/0xea

which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&anon_vma->lock){----}:
[] __lock_acquire+0x11be/0x14d2
[] lock_acquire+0x5e/0x7a
[] _spin_lock+0x3b/0x47
[] vma_adjust+0x200/0x444
[] split_vma+0x12f/0x146
[] mprotect_fixup+0x13c/0x536
[] sys_mprotect+0x1a9/0x21e
[] system_call_fastpath+0x16/0x1b
[] 0xffffffffffffffff

-> #0 (&inode->i_data.i_mmap_lock){----}:
[] __lock_acquire+0xedb/0x14d2
[] lock_release_non_nested+0x1c2/0x219
[] lock_release+0x127/0x14a
[] _spin_unlock+0x1e/0x50
[] mm_drop_all_locks+0x7f/0xb0
[] do_mmu_notifier_register+0xe2/0x112
[] mmu_notifier_register+0xe/0x10
[] kvm_dev_ioctl+0x11e/0x287 [kvm]
[] vfs_ioctl+0x2a/0x78
[] do_vfs_ioctl+0x257/0x274
[] sys_ioctl+0x55/0x78
[] system_call_fastpath+0x16/0x1b
[] 0xffffffffffffffff

other info that might help us debug this:
5 locks held by qemu-kvm/2033:
#0: (&mm->mmap_sem){----}, at: [] do_mmu_notifier_register+0x55/0x112
#1: (mm_all_locks_mutex){--..}, at: [] mm_take_all_locks+0x34/0xea
#2: (&anon_vma->lock){----}, at: [] mm_take_all_locks+0x70/0xea
#3: (&anon_vma->lock){----}, at: [] mm_take_all_locks+0x70/0xea
#4: (&anon_vma->lock){----}, at: [] mm_take_all_locks+0x70/0xea

stack backtrace:
Pid: 2033, comm: qemu-kvm Not tainted 2.6.27-rc1 #270

Call Trace:
[] print_circular_bug_tail+0xb8/0xc3
[] __lock_acquire+0xedb/0x14d2
[] ? add_lock_to_list+0x7e/0xad
[] ? mm_take_all_locks+0x70/0xea
[] ? mm_take_all_locks+0x70/0xea
[] lock_release_non_nested+0x1c2/0x219
[] ? mm_take_all_locks+0xc2/0xea
[] ? mm_take_all_locks+0xc2/0xea
[] ? trace_hardirqs_on_caller+0x4d/0x115
[] ? mm_drop_all_locks+0x7f/0xb0
[] lock_release+0x127/0x14a
[] _spin_unlock+0x1e/0x50
[] mm_drop_all_locks+0x7f/0xb0
[] do_mmu_notifier_register+0xe2/0x112
[] mmu_notifier_register+0xe/0x10
[] kvm_dev_ioctl+0x11e/0x287 [kvm]
[] ? file_has_perm+0x83/0x8e
[] vfs_ioctl+0x2a/0x78
[] do_vfs_ioctl+0x257/0x274
[] sys_ioctl+0x55/0x78
[] system_call_fastpath+0x16/0x1b

Which the locking hierarchy in mm/rmap.c confirms as valid.

Fix this by first taking all the mapping->i_mmap_lock instances and then
taking all anon_vma->lock instances.

Signed-off-by: Peter Zijlstra
Acked-by: Hugh Dickins
Signed-off-by: Ingo Molnar -
The nesting is correct due to holding mmap_sem, use the new annotation
to annotate this.

Signed-off-by: Peter Zijlstra
Signed-off-by: Ingo Molnar -
Expose the new lock protection lock.
This can be used to annotate places where we take multiple locks of the
same class and avoid deadlocks by always taking another (top-level) lock
first.

NOTE: we're still bound to the MAX_LOCK_DEPTH (48) limit.
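The discipline this annotates can be sketched in userspace with pthreads (an analogy only, not the kernel API; all names here are illustrative): anyone who wants more than one lock of the same class must take a single top-level guard first, so two take-all callers serialize and can never reach an ABBA deadlock on the class locks themselves.

```c
#include <assert.h>
#include <pthread.h>

#define NLOCKS 4

/* The top-level "lock protection lock": taken before grabbing more
 * than one lock of the node_lock class. */
static pthread_mutex_t all_locks_guard = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t node_lock[NLOCKS] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};

static int take_all_locks(void)
{
    int taken = 0;

    /* Serialize multi-lock takers on the guard first; the per-node
     * acquisition order below then cannot ABBA with another taker. */
    pthread_mutex_lock(&all_locks_guard);
    for (int i = 0; i < NLOCKS; i++) {
        pthread_mutex_lock(&node_lock[i]);
        taken++;
    }
    return taken;
}

static void drop_all_locks(void)
{
    for (int i = NLOCKS - 1; i >= 0; i--)
        pthread_mutex_unlock(&node_lock[i]);
    pthread_mutex_unlock(&all_locks_guard);
}
```

Lockdep's nest_lock annotation lets it validate exactly this pattern instead of complaining about many held locks of one class.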
Signed-off-by: Peter Zijlstra
Signed-off-by: Ingo Molnar -
On Fri, 2008-08-01 at 16:26 -0700, Linus Torvalds wrote:
> On Fri, 1 Aug 2008, David Miller wrote:
> >
> > Taking more than a few locks of the same class at once is bad
> > news and it's better to find an alternative method.
>
> It's not always wrong.
>
> If you can guarantee that anybody that takes more than one lock of a
> particular class will always take a single top-level lock _first_, then
> that's all good. You can obviously screw up and take the same lock _twice_
> (which will deadlock), but at least you cannot get into ABBA situations.
>
> So maybe the right thing to do is to just teach lockdep about "lock
> protection locks". That would have solved the multi-queue issues for
> networking too - all the actual network drivers would still have taken
> just their single queue lock, but the one case that needs to take all of
> them would have taken a separate top-level lock first.
>
> Never mind that the multi-queue locks were always taken in the same order:
> it's never wrong to just have some top-level serialization, and anybody
> who needs to take locks might as well do so, because they sure as
> hell aren't going to be on _any_ fastpaths.
>
> So the simplest solution really sounds like just teaching lockdep about
> that one special case. It's not "nesting" exactly, although it's obviously
> related to it.

Do as Linus suggested. The lock protection lock is called nest_lock.
Note that we still have the MAX_LOCK_DEPTH (48) limit to consider, so anything
that spills that is still up shit creek.

Signed-off-by: Peter Zijlstra
Signed-off-by: Ingo Molnar -
Most of the free-standing lock_acquire() usages look remarkably similar; sweep
them into a new helper.

Signed-off-by: Peter Zijlstra
Signed-off-by: Ingo Molnar -
struct held_lock {
u64 prev_chain_key; /* 0 8 */
struct lock_class * class; /* 8 8 */
long unsigned int acquire_ip; /* 16 8 */
struct lockdep_map * instance; /* 24 8 */
int irq_context; /* 32 4 */
int trylock; /* 36 4 */
int read; /* 40 4 */
int check; /* 44 4 */
int hardirqs_off; /* 48 4 */

/* size: 56, cachelines: 1 */
/* padding: 4 */
/* last cacheline: 56 bytes */
};

struct held_lock {
u64 prev_chain_key; /* 0 8 */
long unsigned int acquire_ip; /* 8 8 */
struct lockdep_map * instance; /* 16 8 */
unsigned int class_idx:11; /* 24:21 4 */
unsigned int irq_context:2; /* 24:19 4 */
unsigned int trylock:1; /* 24:18 4 */
unsigned int read:2; /* 24:16 4 */
unsigned int check:2; /* 24:14 4 */
unsigned int hardirqs_off:1; /* 24:13 4 */

/* size: 32, cachelines: 1 */
/* padding: 4 */
/* bit_padding: 13 bits */
/* last cacheline: 32 bytes */
};

[mingo@elte.hu: shrunk hlock->class too]
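The shrink itself is plain C bitfield packing. A standalone sketch mirroring the two pahole dumps above (field names copied from the dumps; the concrete sizes 56 and 32 are for x86_64 and are ABI-dependent, so only the relative saving is asserted):

```c
#include <assert.h>
#include <stdint.h>

/* "Before": the class pointer and five word-sized flag ints. */
struct held_lock_before {
    uint64_t prev_chain_key;
    void *class_ptr;              /* was: struct lock_class *class */
    unsigned long acquire_ip;
    void *instance;
    int irq_context;
    int trylock;
    int read;
    int check;
    int hardirqs_off;
};

/* "After": the pointer becomes an 11-bit index into the class table,
 * and all the flags share a single 32-bit word as bitfields. */
struct held_lock_after {
    uint64_t prev_chain_key;
    unsigned long acquire_ip;
    void *instance;
    unsigned int class_idx:11;
    unsigned int irq_context:2;
    unsigned int trylock:1;
    unsigned int read:2;
    unsigned int check:2;
    unsigned int hardirqs_off:1;
};
```

On x86_64 this is the 56-byte-to-32-byte reduction shown above, which matters because held_lock is embedded MAX_LOCK_DEPTH times in every task_struct.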
[peterz@infradead.org: fixup bit sizes]
Signed-off-by: Dave Jones
Signed-off-by: Ingo Molnar
Signed-off-by: Peter Zijlstra -
Instead of using a per-rq lock class, use the regular nesting operations.
However, take extra care with double_lock_balance() as it can release the
already held rq->lock (and therefore change its nesting class).

So what can happen is:
spin_lock(rq->lock); // this rq subclass 0
double_lock_balance(rq, other_rq);
// release rq
// acquire other_rq->lock subclass 0
// acquire rq->lock subclass 1

spin_unlock(other_rq->lock);
leaving you with rq->lock in subclass 1
So a subsequent double_lock_balance() call can try to nest a subclass 1
lock while already holding a subclass 1 lock.

Fix this by introducing double_unlock_balance() which releases the other
rq's lock, but also re-sets the subclass for this rq's lock to 0.

Signed-off-by: Peter Zijlstra
Signed-off-by: Ingo Molnar -
This can be used to reset a held lock's subclass, for arbitrary-depth
iterated data structures such as trees or lists which have per-node
locks.

Signed-off-by: Peter Zijlstra
Signed-off-by: Ingo Molnar
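The subclass leak described two commits up can be modeled without any real locking. The toy below tracks only the subclass bookkeeping (struct and function names merely echo the scheduler code being described; this is not the actual kernel implementation): double_lock_balance() may retake this rq's lock as subclass 1, and double_unlock_balance() is the fix that puts it back in subclass 0.

```c
#include <assert.h>

/* Toy model: each "rq" records which lockdep subclass its lock was
 * last taken with.  No real locking happens here. */
struct rq { int locked; int subclass; };

static struct rq rqs[2];          /* rqs[0] has the lower address */

static void rq_lock(struct rq *rq, int subclass)
{
    rq->locked = 1;
    rq->subclass = subclass;
}

static void rq_unlock(struct rq *rq)
{
    rq->locked = 0;
}

/* Take both locks in address order.  When this rq is the
 * higher-addressed one, its lock is dropped and retaken as subclass 1
 * -- the "leaving you with rq->lock in subclass 1" case above. */
static void double_lock_balance(struct rq *this_rq, struct rq *busiest)
{
    if (busiest < this_rq) {
        rq_unlock(this_rq);
        rq_lock(busiest, 0);
        rq_lock(this_rq, 1);
    } else {
        rq_lock(busiest, 1);
    }
}

/* The fix: release busiest's lock AND reset this rq's lock to
 * subclass 0 so a later double_lock_balance() nests correctly. */
static void double_unlock_balance(struct rq *this_rq, struct rq *busiest)
{
    rq_unlock(busiest);
    this_rq->subclass = 0;
}
```

Without the reset in double_unlock_balance(), a second balance pass would try to nest a subclass 1 lock while already holding one.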
01 Aug, 2008
3 commits
-
While thinking about David's graph walk lockdep patch it _finally_
dawned on me that there is no reason we have a lock class per cpu ...

Sorry for being dense :-/
The below changes the annotation from a lock class per cpu, to a single
nested lock, as the scheduler never holds more than 2 rq locks at a time
anyway.

If there was code requiring holding all rq locks this would not work and
the original annotation would be the only option, but that not being the
case, this is a much lighter one.

Compiles and boots on a 2-way x86_64.
Signed-off-by: Peter Zijlstra
Cc: David Miller
Signed-off-by: Ingo Molnar -
Otherwise lock debugging messages on runqueue locks can deadlock the
system due to the wakeups performed by printk().

Signed-off-by: David S. Miller
Signed-off-by: Ingo Molnar -
When we traverse the graph, either forwards or backwards, we
are interested in whether a certain property exists somewhere
in a node reachable in the graph.

Therefore it is never necessary to traverse through a node more
than once to get a correct answer to the given query.

Take advantage of this property using a global ID counter so that we
need not clear all the markers in all the lock_class entries before
doing a traversal. A new ID is chosen when we start to traverse, and
we continue through a lock_class only if its ID hasn't been marked
with the new value yet.

This short-circuiting is essential especially for high CPU count
systems. The scheduler has a runqueue per cpu, and needs to take
two runqueue locks at a time, which leads to long chains of
backwards and forwards subgraphs from these runqueue lock nodes.
Without the short-circuit implemented here, a graph traversal on
a runqueue lock can take up to (1 << (N - 1)) checks on a system
with N cpus.

For anything more than 16 cpus or so, lockdep will eventually bring
the machine to a complete standstill.

Signed-off-by: David S. Miller
Acked-by: Peter Zijlstra
Signed-off-by: Ingo Molnar
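The generation-counter trick is easy to show standalone. A sketch with illustrative names (not lockdep's actual data structures): each node remembers the ID of the last query that expanded it, so a fresh query just bumps the global counter instead of clearing every marker, and a node shared by many paths is walked only once.

```c
#include <assert.h>

#define MAX_EDGES 4

/* Toy dependency-graph node; gen plays the role of the per-class
 * marker, traversal_gen the global ID counter. */
struct node {
    int gen;
    int nedges;
    struct node *edge[MAX_EDGES];
};

static int traversal_gen;   /* bumped once per query */
static int visits;          /* instrumentation: nodes expanded */

static int reaches(struct node *n, struct node *target)
{
    if (n == target)
        return 1;
    if (n->gen == traversal_gen)
        return 0;           /* already walked during this query */
    n->gen = traversal_gen;
    visits++;
    for (int i = 0; i < n->nedges; i++)
        if (reaches(n->edge[i], target))
            return 1;
    return 0;
}

static int query(struct node *src, struct node *target)
{
    traversal_gen++;        /* fresh ID: no mark-clearing pass needed */
    visits = 0;
    return reaches(src, target);
}

static void add_edge(struct node *from, struct node *to)
{
    from->edge[from->nedges++] = to;
}

/* Diamond 0 -> {1,2} -> 3 -> {4,5}; ask for unreachable n[6].  With
 * the marking, node 3's subtree is walked once, not once per path. */
static int demo(void)
{
    static struct node n[7];

    add_edge(&n[0], &n[1]); add_edge(&n[0], &n[2]);
    add_edge(&n[1], &n[3]); add_edge(&n[2], &n[3]);
    add_edge(&n[3], &n[4]); add_edge(&n[3], &n[5]);
    if (query(&n[0], &n[6]))
        return -1;
    return visits;          /* 6: every node expanded exactly once */
}
```

Stack N such diamonds and the unmarked walk costs roughly 2^N expansions while the marked walk stays linear, which is the (1 << (N - 1)) blowup the commit describes for per-cpu runqueue locks.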
29 Jul, 2008
28 commits
-
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
lguest: turn Waker into a thread, not a process
lguest: Enlarge virtio rings
lguest: Use GSO/IFF_VNET_HDR extensions on tun/tap
lguest: Remove 'network: no dma buffer!' warning
lguest: Adaptive timeout
lguest: Tell Guest net not to notify us on every packet xmit
lguest: net block unneeded receive queue update notifications
lguest: wrap last_avail accesses.
lguest: use cpu capability accessors
lguest: virtio-rng support
lguest: Support assigning a MAC address
lguest: Don't leak /dev/zero fd
lguest: fix verbose printing of device features.
lguest: fix switcher_page leak on unload
lguest: Guest int3 fix
lguest: set max_pfn_mapped, growl loudly at Yinghai Lu -
* 'for-linus' of git://git.o-hand.com/linux-mfd:
mfd: accept pure device as a parent, not only platform_device
mfd: add platform_data to mfd_cell
mfd: Coding style fixes
mfd: Use to_platform_device instead of container_of -
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: (21 commits)
x86/PCI: use dev_printk when possible
PCI: add D3 power state avoidance quirk
PCI: fix bogus "'device' may be used uninitialized" warning in pci_slot
PCI: add an option to allow ASPM enabled forcibly
PCI: disable ASPM on pre-1.1 PCIe devices
PCI: disable ASPM per ACPI FADT setting
PCI MSI: Don't disable MSIs if the mask bit isn't supported
PCI: handle 64-bit resources better on 32-bit machines
PCI: rewrite PCI BAR reading code
PCI: document pci_target_state
PCI hotplug: fix typo in pcie hotplug output
x86 gart: replace to_pages macro with iommu_num_pages
x86, AMD IOMMU: replace to_pages macro with iommu_num_pages
iommu: add iommu_num_pages helper function
dma-coherent: add documentation to new interfaces
Cris: convert to using generic dma-coherent mem allocator
Sh: use generic per-device coherent dma allocator
ARM: support generic per-device coherent dma mem
Generic dma-coherent: fix DMA_MEMORY_EXCLUSIVE
x86: use generic per-device dma coherent allocator
... -
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6:
[SCSI] qla2xxx: fix msleep compile error -
Alexey Dobriyan reported trouble with LTP with the new fast-gup code,
and Johannes Weiner debugged it to non-page-aligned addresses, where the
new get_user_pages_fast() code would do all the wrong things, including
just traversing past the end of the requested area due to 'addr' never
matching 'end' exactly.

This is not a pretty fix, and we may actually want to move the alignment
into generic code, leaving just the core code per-arch, but Alexey
verified that the vmsplice01 LTP test doesn't crash with this.

Reported-and-tested-by: Alexey Dobriyan
Debugged-by: Johannes Weiner
Cc: Nick Piggin
Cc: Andrew Morton
Signed-off-by: Linus Torvalds -
lguest uses a Waker process to break it out of the kernel (ie.
actually running the guest) when a file descriptor needs attention.

Changing this from a process to a thread somewhat simplifies things:
it can directly access the fd_set of things to watch. More
importantly, it means that the Waker can see Guest memory correctly,
so /dev/vring file descriptors will work as anticipated (the
alternative is to actually mmap MAP_SHARED, but you can't do that with
/dev/zero).

Signed-off-by: Rusty Russell
-
With big packets, 128 entries is a little small.
Guest -> Host 1GB TCP:
Before: 8.43625 seconds xmit 95640 recv 198266 timeout 49771 usec 1252
After: 8.01099 seconds xmit 49200 recv 102263 timeout 26014 usec 2118

Signed-off-by: Rusty Russell
-
Guest -> Host 1GB TCP:
Before 20.1974 seconds xmit 214510 recv 5 timeout 214491 usec 278
After 8.43625 seconds xmit 95640 recv 198266 timeout 49771 usec 1252Host -> Guest 1GB TCP:
Before: Seconds 9.98854 xmit 172166 recv 5344 timeout 172157 usec 251
After: Seconds 5.72803 xmit 244322 recv 9919 timeout 244302 usec 156

Signed-off-by: Rusty Russell
-
This warning can happen a lot under load, and it should be warnx not
warn anyway.

Signed-off-by: Rusty Russell
-
Since the correct timeout value varies, use a heuristic which adjusts
the timeout depending on how many packets we've seen. This gives
slightly worse results, but doesn't need tweaking when GSO is
introduced.

500 usec 19.1887 xmit 561141 recv 1 timeout 559657
Dynamic (278) 20.1974 xmit 214510 recv 5 timeout 214491 usec 278

Signed-off-by: Rusty Russell
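One possible shape of such a heuristic, purely illustrative (the actual lguest code is not quoted in this log, and the constants here are invented): widen the timeout when it keeps firing with nothing to do, and narrow it when packets arrive before it expires.

```c
#include <assert.h>

/* Hypothetical adaptive-timeout sketch.  All bounds and step sizes
 * are made-up illustration values, not lguest's real ones. */
static unsigned int timeout_usec = 500;

/* Timer expired but no packets were pending: polling too often. */
static void timeout_fired_idle(void)
{
    if (timeout_usec < 2000)
        timeout_usec += timeout_usec / 10;   /* back off ~10% */
}

/* Packets arrived before the timer fired: poll sooner next time. */
static void packets_beat_timeout(void)
{
    if (timeout_usec > 50)
        timeout_usec -= timeout_usec / 10;   /* tighten ~10% */
}
```

Anything of this shape self-tunes toward the traffic pattern, which is why the commit accepts slightly worse fixed-value numbers in exchange for not re-tweaking when GSO lands.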
-
virtio_ring has the ability to suppress notifications. This prevents
a guest exit for every packet, but we need to set a timer on packet
receipt to re-check if there were any remaining packets.

Here are the times for 1G TCP Guest->Host with different timeout
settings (it matters because the TCP window doesn't grow big enough to
fill the entire buffer):

Timeout value Seconds Xmit/Recv/Timeout
None (before) 25.3784 xmit 7750233 recv 1
2500 usec 62.5119 xmit 207020 recv 2 timeout 207020
1000 usec 34.5379 xmit 207003 recv 2 timeout 207003
750 usec 29.2305 xmit 207002 recv 1 timeout 207002
500 usec 19.1887 xmit 561141 recv 1 timeout 559657
250 usec 20.0465 xmit 214128 recv 2 timeout 214110
100 usec 19.2583 xmit 561621 recv 1 timeout 560153

(Note that these values are sensitive to the GSO patches which come
later, and probably other traffic-related variables, so take with a
large grain of salt).

Signed-off-by: Rusty Russell
-
Number of exits transmitting 10GB Guest->Host before:
network xmit 7858610 recv 118136

After:
network xmit 7750233 recv 1

Signed-off-by: Rusty Russell
-
To simplify the transition to when we publish indices in the ring
(and make shuffling my patch queue easier), wrap them in a lg_last_avail()
macro.

Signed-off-by: Rusty Russell
-
To support my little make-x86-bitops-use-proper-typechecking projectlet.
Cc: Thomas Gleixner
Cc: Andrea Arcangeli
Signed-off-by: Andrew Morton
Acked-by: Ingo Molnar
Signed-off-by: Rusty Russell -
This is a simple patch to add support for the virtio "hardware random
generator" to lguest. It gets about 1.2 MB/sec reading from /dev/hwrng
in the guest.Signed-off-by: Rusty Russell
-
If you've got a nice DHCP configuration which maps MAC
addresses to specific IP addresses, then you're going to
want to start your guest with one of those MAC addresses.

Also, in Fedora, we have persistent network interface naming
based on the MAC address, so with randomly assigned
addresses you're soon going to hit eth13. Who knows what
will happen then!

Allow assigning a MAC address to the network interface with e.g.:
--tunnet=bridge:eth0:00:FF:95:6B:DA:3D
or:
--tunnet=192.168.121.1:00:FF:95:6B:DA:3D
which is pretty unintelligible, but ...
(includes Rusty's minor rework)
Signed-off-by: Mark McLoughlin
Signed-off-by: Rusty Russell -
Signed-off-by: Mark McLoughlin
Signed-off-by: Rusty Russell -
%02x is more appropriate for bytes than %08x.
Signed-off-by: Rusty Russell
-
map_switcher allocates the array, unmap_switcher has to free it
accordingly.

Signed-off-by: Johannes Weiner
Signed-off-by: Rusty Russell -
Ron Minnich noticed that guest userspace gets a GPF when it tries to int3:
we need to copy the privilege level from the guest-supplied IDT to the real
IDT. int3 is the only common case where guest userspace expects to invoke
an interrupt, so that's the symptom of failing to do this.

Signed-off-by: Rusty Russell
-
6af61a7614a306fe882a0c2b4ddc63b65aa66efc 'x86: clean up max_pfn_mapped
usage - 32-bit' makes the following comment:

XEN PV and lguest may need to assign max_pfn_mapped too.
But no CC. Yinghai, wasting fellow developers' time is a VERY bad
habit. If you do it again, I will hunt you down and try to extract
the three hours of my life I just lost :)

Signed-off-by: Rusty Russell
Cc: Yinghai Lu -
Signed-off-by: Dmitry Baryshkov
Signed-off-by: Samuel Ortiz -
arch/x86/mm/pgtable.c: In function 'pgd_mop_up_pmds':
arch/x86/mm/pgtable.c:194: warning: unused variable 'pmd'

Cc: Ingo Molnar
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The computed color value is never actually written to the hardware
colormap register.

Signed-off-by: Manuel Lauss
Cc: Nobuhiro Iwamatsu
Cc: Munakata Hisao
Cc: Paul Mundt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
With SLUB debugging turned on in 2.6.26, I was getting memory corruption
when testing eCryptfs. The root cause turned out to be that eCryptfs was
doing kmalloc(PAGE_CACHE_SIZE); virt_to_page() and treating that as a nice
page-aligned chunk of memory. But at least with SLUB debugging on, this
is not always true, and the page we get from virt_to_page does not
necessarily match the PAGE_CACHE_SIZE worth of memory we got from kmalloc.

My simple testcase was 2 loops doing "rm -f fileX; cp /tmp/fileX ." for 2
different multi-megabyte files. With this change I no longer see the
corruption.

Signed-off-by: Eric Sandeen
Acked-by: Michael Halcrow
Acked-by: Rik van Riel
Cc: [2.6.25.x, 2.6.26.x]
Signed-off-by: Andrew Morton
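The pitfall generalizes beyond the kernel. As a userspace analogy only (the kernel fix itself switches to a real page allocation, and PAGE_SIZE here is obtained via sysconf): ask the allocator for alignment explicitly rather than assuming a page-sized request comes back page-aligned.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Userspace analogy of the bug: plain malloc(page_size) -- like
 * kmalloc(PAGE_CACHE_SIZE) with SLUB debugging on -- makes no
 * page-alignment promise.  If a page-aligned page is needed, request
 * the alignment explicitly. */
static void *get_page_sized_buffer(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    void *p = NULL;

    if (posix_memalign(&p, page, page) != 0)
        return NULL;
    return p;
}

static int is_page_aligned(const void *p)
{
    return ((uintptr_t)p % (uintptr_t)sysconf(_SC_PAGESIZE)) == 0;
}
```

Treating an unaligned allocation as a page (via virt_to_page in the eCryptfs case) silently points at neighbouring memory, which is exactly the corruption described.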
Signed-off-by: Linus Torvalds -
If CONFIG_GENERIC_GPIO=y && CONFIG_GPIO_SYSFS=n, gpio_export() in
asm-generic/gpio.h references -ENOSYS and causes a build error.

Signed-off-by: Atsushi Nemoto
Acked-by: David Brownell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
I got section mismatch message about bio_integrity_init_slab().
WARNING: fs/built-in.o(__ksymtab+0xb60): Section mismatch in reference from the variable __ksymtab_bio_integrity_init_slab to the function .init.text:bio_integrity_init_slab()

The symbol bio_integrity_init_slab is exported and annotated __init. Fix
this by removing the __init annotation of bio_integrity_init_slab or drop
the export.

It is only called from init_bio(), so the EXPORT_SYMBOL() can be removed.
Signed-off-by: Yoichi Yuasa
Cc: "Martin K. Petersen"
Cc: Jens Axboe
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds