07 May, 2009
1 commit
-
There is what we believe to be a false positive reported by lockdep.
inotify_inode_queue_event() => take inotify_mutex => kernel_event() =>
kmalloc() => SLOB => alloc_pages_node() => page reclaim => slab reclaim =>
dcache reclaim => inotify_inode_is_dead => take inotify_mutex => deadlockThe plan is to fix this via lockdep annotation, but that is proving to be
quite involved.The patch flips the allocation over to GFP_NFS to shut the warning up, for
the 2.6.30 release.Hopefully we will fix this for real in 2.6.31. I'll queue a patch in -mm
to switch it back to GFP_KERNEL so we don't forget.=================================
[ INFO: inconsistent lock state ]
2.6.30-rc2-next-20090417 #203
---------------------------------
inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
kswapd0/380 [HC0[0]:SC0[0]:HE1:SE1] takes:
(&inode->inotify_mutex){+.+.?.}, at: [] inotify_inode_is_dead+0x35/0xb0
{RECLAIM_FS-ON-W} state was registered at:
[] mark_held_locks+0x68/0x90
[] lockdep_trace_alloc+0xf5/0x100
[] __kmalloc_node+0x31/0x1e0
[] kernel_event+0xe2/0x190
[] inotify_dev_queue_event+0x126/0x230
[] inotify_inode_queue_event+0xc6/0x110
[] vfs_create+0xcd/0x140
[] do_filp_open+0x88d/0xa20
[] do_sys_open+0x98/0x140
[] sys_open+0x20/0x30
[] system_call_fastpath+0x16/0x1b
[] 0xffffffffffffffff
irq event stamp: 690455
hardirqs last enabled at (690455): [] _spin_unlock_irqrestore+0x44/0x80
hardirqs last disabled at (690454): [] _spin_lock_irqsave+0x32/0xa0
softirqs last enabled at (690178): [] __do_softirq+0x202/0x220
softirqs last disabled at (690157): [] call_softirq+0x1c/0x50other info that might help us debug this:
2 locks held by kswapd0/380:
#0: (shrinker_rwsem){++++..}, at: [] shrink_slab+0x37/0x180
#1: (&type->s_umount_key#17){++++..}, at: [] shrink_dcache_memory+0x11f/0x1e0stack backtrace:
Pid: 380, comm: kswapd0 Not tainted 2.6.30-rc2-next-20090417 #203
Call Trace:
[] print_usage_bug+0x19f/0x200
[] ? save_stack_trace+0x2f/0x50
[] mark_lock+0x4bb/0x6d0
[] ? check_usage_forwards+0x0/0xc0
[] __lock_acquire+0xc62/0x1ae0
[] ? slob_free+0x10c/0x370
[] lock_acquire+0xe1/0x120
[] ? inotify_inode_is_dead+0x35/0xb0
[] mutex_lock_nested+0x63/0x420
[] ? inotify_inode_is_dead+0x35/0xb0
[] ? inotify_inode_is_dead+0x35/0xb0
[] ? sched_clock+0x9/0x10
[] ? lock_release_holdtime+0x35/0x1c0
[] inotify_inode_is_dead+0x35/0xb0
[] dentry_iput+0xbc/0xe0
[] d_kill+0x33/0x60
[] __shrink_dcache_sb+0x2d3/0x350
[] shrink_dcache_memory+0x15a/0x1e0
[] shrink_slab+0x125/0x180
[] kswapd+0x560/0x7a0
[] ? isolate_pages_global+0x0/0x2c0
[] ? autoremove_wake_function+0x0/0x40
[] ? trace_hardirqs_on+0xd/0x10
[] ? kswapd+0x0/0x7a0
[] kthread+0x5b/0xa0
[] child_rip+0xa/0x20
[] ? restore_args+0x0/0x30
[] ? kthread+0x0/0xa0
[] ? child_rip+0x0/0x20[eparis@redhat.com: fix audit too]
Cc: Al Viro
Cc: Matt Mackall
Cc: Christoph Lameter
Signed-off-by: Wu Fengguang
Signed-off-by: Eric Paris
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
28 Mar, 2009
1 commit
-
To be on the safe side, it should be less fragile to exclude I_NEW inodes
from inode list scans by default (unless there is an important reason to
have them).Normally they will get excluded (eg. by zero refcount or writecount etc),
however it is a bit fragile for list walkers to know exactly what parts of
the inode state is set up and valid to test when in I_NEW. So along these
lines, move I_NEW checks upward as well (sometimes taking I_FREEING etc
checks with them too -- this shouldn't be a problem should it?)Signed-off-by: Nick Piggin
Acked-by: Jan Kara
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Al Viro
19 Feb, 2009
1 commit
-
Enhanced lockdep coverage of __GFP_NOFS turned up this new lockdep
assert:[ 1093.677775]
[ 1093.677781] =================================
[ 1093.680031] [ INFO: inconsistent lock state ]
[ 1093.680031] 2.6.29-rc5-tip-01504-gb49eca1-dirty #1
[ 1093.680031] ---------------------------------
[ 1093.680031] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
[ 1093.680031] kswapd0/308 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ 1093.680031] (&inode->inotify_mutex){+.+.?.}, at: [] inotify_inode_is_dead+0x20/0x80
[ 1093.680031] {RECLAIM_FS-ON-W} state was registered at:
[ 1093.680031] [] mark_held_locks+0x43/0x5b
[ 1093.680031] [] lockdep_trace_alloc+0x6c/0x6e
[ 1093.680031] [] kmem_cache_alloc+0x20/0x150
[ 1093.680031] [] idr_pre_get+0x27/0x6c
[ 1093.680031] [] inotify_handle_get_wd+0x25/0xad
[ 1093.680031] [] inotify_add_watch+0x7a/0x129
[ 1093.680031] [] sys_inotify_add_watch+0x20f/0x250
[ 1093.680031] [] sysenter_do_call+0x12/0x35
[ 1093.680031] [] 0xffffffff
[ 1093.680031] irq event stamp: 60417
[ 1093.680031] hardirqs last enabled at (60417): [] call_rcu+0x53/0x59
[ 1093.680031] hardirqs last disabled at (60416): [] call_rcu+0x17/0x59
[ 1093.680031] softirqs last enabled at (59656): [] __do_softirq+0x157/0x16b
[ 1093.680031] softirqs last disabled at (59651): [] do_softirq+0x74/0x15d
[ 1093.680031]
[ 1093.680031] other info that might help us debug this:
[ 1093.680031] 2 locks held by kswapd0/308:
[ 1093.680031] #0: (shrinker_rwsem){++++..}, at: [] shrink_slab+0x36/0x189
[ 1093.680031] #1: (&type->s_umount_key#4){+++++.}, at: [] shrink_dcache_memory+0x110/0x1fb
[ 1093.680031]
[ 1093.680031] stack backtrace:
[ 1093.680031] Pid: 308, comm: kswapd0 Not tainted 2.6.29-rc5-tip-01504-gb49eca1-dirty #1
[ 1093.680031] Call Trace:
[ 1093.680031] [] valid_state+0x12a/0x13d
[ 1093.680031] [] mark_lock+0xc1/0x1e9
[ 1093.680031] [] ? check_usage_forwards+0x0/0x3f
[ 1093.680031] [] __lock_acquire+0x2c6/0xac8
[ 1093.680031] [] ? register_lock_class+0x17/0x228
[ 1093.680031] [] lock_acquire+0x5d/0x7a
[ 1093.680031] [] ? inotify_inode_is_dead+0x20/0x80
[ 1093.680031] [] __mutex_lock_common+0x3a/0x4cb
[ 1093.680031] [] ? inotify_inode_is_dead+0x20/0x80
[ 1093.680031] [] mutex_lock_nested+0x2e/0x36
[ 1093.680031] [] ? inotify_inode_is_dead+0x20/0x80
[ 1093.680031] [] inotify_inode_is_dead+0x20/0x80
[ 1093.680031] [] dentry_iput+0x90/0xc2
[ 1093.680031] [] d_kill+0x21/0x45
[ 1093.680031] [] __shrink_dcache_sb+0x27f/0x355
[ 1093.680031] [] shrink_dcache_memory+0x15e/0x1fb
[ 1093.680031] [] shrink_slab+0x121/0x189
[ 1093.680031] [] kswapd+0x39f/0x561
[ 1093.680031] [] ? isolate_pages_global+0x0/0x233
[ 1093.680031] [] ? autoremove_wake_function+0x0/0x43
[ 1093.680031] [] ? kswapd+0x0/0x561
[ 1093.680031] [] kthread+0x41/0x82
[ 1093.680031] [] ? kthread+0x0/0x82
[ 1093.680031] [] kernel_thread_helper+0x7/0x10inotify_handle_get_wd() does idr_pre_get() which does a
kmem_cache_alloc() without __GFP_FS - and is hence deadlockable under
extreme MM pressure.Signed-off-by: Ingo Molnar
Acked-by: Peter Zijlstra
Cc: MinChan Kim
Cc: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
27 Jan, 2009
1 commit
-
If userspace supplies an invalid pointer to a read() of an inotify
instance, the inotify device's event list mutex is unlocked twice.
This causes an unbalance which effectively leaves the data structure
unprotected, and we can trigger oopses by accessing the inotify
instance from different tasks concurrently.The best fix (contributed largely by Linus) is a total rewrite
of the function in question:On Thu, Jan 22, 2009 at 7:05 AM, Linus Torvalds wrote:
> The thing to notice is that:
>
> - locking is done in just one place, and there is no question about it
> not having an unlock.
>
> - that whole double-while(1)-loop thing is gone.
>
> - use multiple functions to make nesting and error handling sane
>
> - do error testing after doing the things you always need to do, ie do
> this:
>
> mutex_lock(..)
> ret = function_call();
> mutex_unlock(..)
>
> .. test ret here ..
>
> instead of doing conditional exits with unlocking or freeing.
>
> So if the code is written in this way, it may still be buggy, but at least
> it's not buggy because of subtle "forgot to unlock" or "forgot to free"
> issues.
>
> This _always_ unlocks if it locked, and it always frees if it got a
> non-error kevent.Cc: John McCutchan
Cc: Robert Love
Cc:
Signed-off-by: Vegard Nossum
Signed-off-by: Linus Torvalds
14 Jan, 2009
2 commits
-
Signed-off-by: Heiko Carstens
-
Signed-off-by: Heiko Carstens
06 Jan, 2009
1 commit
-
The problems lie in the types used for some inotify interfaces, both at the kernel level and at the glibc level. This mail addresses the kernel problem. I will follow up with some suggestions for glibc changes.
For the sys_inotify_rm_watch() interface, the type of the 'wd' argument is
currently 'u32', it should be '__s32' . That is Robert's suggestion, and
is consistent with the other declarations of watch descriptors in the
kernel source, in particular, the inotify_event structure in
include/linux/inotify.h:struct inotify_event {
__s32 wd; /* watch descriptor */
__u32 mask; /* watch mask */
__u32 cookie; /* cookie to synchronize two events */
__u32 len; /* length (including nulls) of name */
char name[0]; /* stub for possible name */
};The patch makes the changes needed for inotify_rm_watch().
Signed-off-by: Michael Kerrisk
Cc: Robert Love
Cc: Vegard Nossum
Cc: Ulrich Drepper
Signed-off-by: Andrew Morton
Signed-off-by: Al Viro
01 Jan, 2009
1 commit
-
Creating a generic filesystem notification interface, fsnotify, which will be
used by inotify, dnotify, and eventually fanotify is really starting to
clutter the fs directory. This patch simply moves inotify and dnotify into
fs/notify/inotify and fs/notify/dnotify respectively to make both current fs/
and future notification tidier.Signed-off-by: Eric Paris
Signed-off-by: Al Viro