07 May, 2009

1 commit

  • There is what we believe to be a false positive reported by lockdep.

    inotify_inode_queue_event() => take inotify_mutex => kernel_event() =>
    kmalloc() => SLOB => alloc_pages_node() => page reclaim => slab reclaim =>
    dcache reclaim => inotify_inode_is_dead => take inotify_mutex => deadlock

    The plan is to fix this via lockdep annotation, but that is proving to be
    quite involved.

    The patch flips the allocation over to GFP_NFS to shut the warning up, for
    the 2.6.30 release.

    Hopefully we will fix this for real in 2.6.31. I'll queue a patch in -mm
    to switch it back to GFP_KERNEL so we don't forget.

    =================================
    [ INFO: inconsistent lock state ]
    2.6.30-rc2-next-20090417 #203
    ---------------------------------
    inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
    kswapd0/380 [HC0[0]:SC0[0]:HE1:SE1] takes:
    (&inode->inotify_mutex){+.+.?.}, at: [] inotify_inode_is_dead+0x35/0xb0
    {RECLAIM_FS-ON-W} state was registered at:
    [] mark_held_locks+0x68/0x90
    [] lockdep_trace_alloc+0xf5/0x100
    [] __kmalloc_node+0x31/0x1e0
    [] kernel_event+0xe2/0x190
    [] inotify_dev_queue_event+0x126/0x230
    [] inotify_inode_queue_event+0xc6/0x110
    [] vfs_create+0xcd/0x140
    [] do_filp_open+0x88d/0xa20
    [] do_sys_open+0x98/0x140
    [] sys_open+0x20/0x30
    [] system_call_fastpath+0x16/0x1b
    [] 0xffffffffffffffff
    irq event stamp: 690455
    hardirqs last enabled at (690455): [] _spin_unlock_irqrestore+0x44/0x80
    hardirqs last disabled at (690454): [] _spin_lock_irqsave+0x32/0xa0
    softirqs last enabled at (690178): [] __do_softirq+0x202/0x220
    softirqs last disabled at (690157): [] call_softirq+0x1c/0x50

    other info that might help us debug this:
    2 locks held by kswapd0/380:
    #0: (shrinker_rwsem){++++..}, at: [] shrink_slab+0x37/0x180
    #1: (&type->s_umount_key#17){++++..}, at: [] shrink_dcache_memory+0x11f/0x1e0

    stack backtrace:
    Pid: 380, comm: kswapd0 Not tainted 2.6.30-rc2-next-20090417 #203
    Call Trace:
    [] print_usage_bug+0x19f/0x200
    [] ? save_stack_trace+0x2f/0x50
    [] mark_lock+0x4bb/0x6d0
    [] ? check_usage_forwards+0x0/0xc0
    [] __lock_acquire+0xc62/0x1ae0
    [] ? slob_free+0x10c/0x370
    [] lock_acquire+0xe1/0x120
    [] ? inotify_inode_is_dead+0x35/0xb0
    [] mutex_lock_nested+0x63/0x420
    [] ? inotify_inode_is_dead+0x35/0xb0
    [] ? inotify_inode_is_dead+0x35/0xb0
    [] ? sched_clock+0x9/0x10
    [] ? lock_release_holdtime+0x35/0x1c0
    [] inotify_inode_is_dead+0x35/0xb0
    [] dentry_iput+0xbc/0xe0
    [] d_kill+0x33/0x60
    [] __shrink_dcache_sb+0x2d3/0x350
    [] shrink_dcache_memory+0x15a/0x1e0
    [] shrink_slab+0x125/0x180
    [] kswapd+0x560/0x7a0
    [] ? isolate_pages_global+0x0/0x2c0
    [] ? autoremove_wake_function+0x0/0x40
    [] ? trace_hardirqs_on+0xd/0x10
    [] ? kswapd+0x0/0x7a0
    [] kthread+0x5b/0xa0
    [] child_rip+0xa/0x20
    [] ? restore_args+0x0/0x30
    [] ? kthread+0x0/0xa0
    [] ? child_rip+0x0/0x20

    [eparis@redhat.com: fix audit too]
    Cc: Al Viro
    Cc: Matt Mackall
    Cc: Christoph Lameter
    Signed-off-by: Wu Fengguang
    Signed-off-by: Eric Paris
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     

28 Mar, 2009

1 commit

  • To be on the safe side, it should be less fragile to exclude I_NEW inodes
    from inode list scans by default (unless there is an important reason to
    have them).

    Normally they will get excluded (eg. by zero refcount or writecount etc),
    however it is a bit fragile for list walkers to know exactly what parts of
    the inode state is set up and valid to test when in I_NEW. So along these
    lines, move I_NEW checks upward as well (sometimes taking I_FREEING etc
    checks with them too -- this shouldn't be a problem should it?)

    Signed-off-by: Nick Piggin
    Acked-by: Jan Kara
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Nick Piggin
     

19 Feb, 2009

1 commit

  • Enhanced lockdep coverage of __GFP_NOFS turned up this new lockdep
    assert:

    [ 1093.677775]
    [ 1093.677781] =================================
    [ 1093.680031] [ INFO: inconsistent lock state ]
    [ 1093.680031] 2.6.29-rc5-tip-01504-gb49eca1-dirty #1
    [ 1093.680031] ---------------------------------
    [ 1093.680031] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
    [ 1093.680031] kswapd0/308 [HC0[0]:SC0[0]:HE1:SE1] takes:
    [ 1093.680031] (&inode->inotify_mutex){+.+.?.}, at: [] inotify_inode_is_dead+0x20/0x80
    [ 1093.680031] {RECLAIM_FS-ON-W} state was registered at:
    [ 1093.680031] [] mark_held_locks+0x43/0x5b
    [ 1093.680031] [] lockdep_trace_alloc+0x6c/0x6e
    [ 1093.680031] [] kmem_cache_alloc+0x20/0x150
    [ 1093.680031] [] idr_pre_get+0x27/0x6c
    [ 1093.680031] [] inotify_handle_get_wd+0x25/0xad
    [ 1093.680031] [] inotify_add_watch+0x7a/0x129
    [ 1093.680031] [] sys_inotify_add_watch+0x20f/0x250
    [ 1093.680031] [] sysenter_do_call+0x12/0x35
    [ 1093.680031] [] 0xffffffff
    [ 1093.680031] irq event stamp: 60417
    [ 1093.680031] hardirqs last enabled at (60417): [] call_rcu+0x53/0x59
    [ 1093.680031] hardirqs last disabled at (60416): [] call_rcu+0x17/0x59
    [ 1093.680031] softirqs last enabled at (59656): [] __do_softirq+0x157/0x16b
    [ 1093.680031] softirqs last disabled at (59651): [] do_softirq+0x74/0x15d
    [ 1093.680031]
    [ 1093.680031] other info that might help us debug this:
    [ 1093.680031] 2 locks held by kswapd0/308:
    [ 1093.680031] #0: (shrinker_rwsem){++++..}, at: [] shrink_slab+0x36/0x189
    [ 1093.680031] #1: (&type->s_umount_key#4){+++++.}, at: [] shrink_dcache_memory+0x110/0x1fb
    [ 1093.680031]
    [ 1093.680031] stack backtrace:
    [ 1093.680031] Pid: 308, comm: kswapd0 Not tainted 2.6.29-rc5-tip-01504-gb49eca1-dirty #1
    [ 1093.680031] Call Trace:
    [ 1093.680031] [] valid_state+0x12a/0x13d
    [ 1093.680031] [] mark_lock+0xc1/0x1e9
    [ 1093.680031] [] ? check_usage_forwards+0x0/0x3f
    [ 1093.680031] [] __lock_acquire+0x2c6/0xac8
    [ 1093.680031] [] ? register_lock_class+0x17/0x228
    [ 1093.680031] [] lock_acquire+0x5d/0x7a
    [ 1093.680031] [] ? inotify_inode_is_dead+0x20/0x80
    [ 1093.680031] [] __mutex_lock_common+0x3a/0x4cb
    [ 1093.680031] [] ? inotify_inode_is_dead+0x20/0x80
    [ 1093.680031] [] mutex_lock_nested+0x2e/0x36
    [ 1093.680031] [] ? inotify_inode_is_dead+0x20/0x80
    [ 1093.680031] [] inotify_inode_is_dead+0x20/0x80
    [ 1093.680031] [] dentry_iput+0x90/0xc2
    [ 1093.680031] [] d_kill+0x21/0x45
    [ 1093.680031] [] __shrink_dcache_sb+0x27f/0x355
    [ 1093.680031] [] shrink_dcache_memory+0x15e/0x1fb
    [ 1093.680031] [] shrink_slab+0x121/0x189
    [ 1093.680031] [] kswapd+0x39f/0x561
    [ 1093.680031] [] ? isolate_pages_global+0x0/0x233
    [ 1093.680031] [] ? autoremove_wake_function+0x0/0x43
    [ 1093.680031] [] ? kswapd+0x0/0x561
    [ 1093.680031] [] kthread+0x41/0x82
    [ 1093.680031] [] ? kthread+0x0/0x82
    [ 1093.680031] [] kernel_thread_helper+0x7/0x10

    inotify_handle_get_wd() does idr_pre_get() which does a
    kmem_cache_alloc() without __GFP_FS - and is hence deadlockable under
    extreme MM pressure.

    Signed-off-by: Ingo Molnar
    Acked-by: Peter Zijlstra
    Cc: MinChan Kim
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

27 Jan, 2009

1 commit

  • If userspace supplies an invalid pointer to a read() of an inotify
    instance, the inotify device's event list mutex is unlocked twice.
    This causes an unbalance which effectively leaves the data structure
    unprotected, and we can trigger oopses by accessing the inotify
    instance from different tasks concurrently.

    The best fix (contributed largely by Linus) is a total rewrite
    of the function in question:

    On Thu, Jan 22, 2009 at 7:05 AM, Linus Torvalds wrote:
    > The thing to notice is that:
    >
    > - locking is done in just one place, and there is no question about it
    > not having an unlock.
    >
    > - that whole double-while(1)-loop thing is gone.
    >
    > - use multiple functions to make nesting and error handling sane
    >
    > - do error testing after doing the things you always need to do, ie do
    > this:
    >
    > mutex_lock(..)
    > ret = function_call();
    > mutex_unlock(..)
    >
    > .. test ret here ..
    >
    > instead of doing conditional exits with unlocking or freeing.
    >
    > So if the code is written in this way, it may still be buggy, but at least
    > it's not buggy because of subtle "forgot to unlock" or "forgot to free"
    > issues.
    >
    > This _always_ unlocks if it locked, and it always frees if it got a
    > non-error kevent.

    Cc: John McCutchan
    Cc: Robert Love
    Cc:
    Signed-off-by: Vegard Nossum
    Signed-off-by: Linus Torvalds

    Vegard Nossum
     

14 Jan, 2009

2 commits


06 Jan, 2009

1 commit

  • The problems lie in the types used for some inotify interfaces, both at the kernel level and at the glibc level. This mail addresses the kernel problem. I will follow up with some suggestions for glibc changes.

    For the sys_inotify_rm_watch() interface, the type of the 'wd' argument is
    currently 'u32', it should be '__s32' . That is Robert's suggestion, and
    is consistent with the other declarations of watch descriptors in the
    kernel source, in particular, the inotify_event structure in
    include/linux/inotify.h:

    struct inotify_event {
    __s32 wd; /* watch descriptor */
    __u32 mask; /* watch mask */
    __u32 cookie; /* cookie to synchronize two events */
    __u32 len; /* length (including nulls) of name */
    char name[0]; /* stub for possible name */
    };

    The patch makes the changes needed for inotify_rm_watch().

    Signed-off-by: Michael Kerrisk
    Cc: Robert Love
    Cc: Vegard Nossum
    Cc: Ulrich Drepper
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Michael Kerrisk
     

01 Jan, 2009

1 commit

  • Creating a generic filesystem notification interface, fsnotify, which will be
    used by inotify, dnotify, and eventually fanotify is really starting to
    clutter the fs directory. This patch simply moves inotify and dnotify into
    fs/notify/inotify and fs/notify/dnotify respectively to make both current fs/
    and future notification tidier.

    Signed-off-by: Eric Paris
    Signed-off-by: Al Viro

    Eric Paris