06 Sep, 2015

1 commit

  • Pull vfs updates from Al Viro:
    "In this one:

    - d_move fixes (Eric Biederman)

    - UFS fixes (me; locking is mostly sane now, a bunch of bugs in error
    handling ought to be fixed)

    - switch of sb_writers to percpu rwsem (Oleg Nesterov)

    - superblock scalability (Josef Bacik and Dave Chinner)

    - swapon(2) race fix (Hugh Dickins)"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (65 commits)
    vfs: Test for and handle paths that are unreachable from their mnt_root
    dcache: Reduce the scope of i_lock in d_splice_alias
    dcache: Handle escaped paths in prepend_path
    mm: fix potential data race in SyS_swapon
    inode: don't softlockup when evicting inodes
    inode: rename i_wb_list to i_io_list
    sync: serialise per-superblock sync operations
    inode: convert inode_sb_list_lock to per-sb
    inode: add hlist_fake to avoid the inode hash lock in evict
    writeback: plug writeback at a high level
    change sb_writers to use percpu_rw_semaphore
    shift percpu_counter_destroy() into destroy_super_work()
    percpu-rwsem: kill CONFIG_PERCPU_RWSEM
    percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire()
    percpu-rwsem: introduce percpu_down_read_trylock()
    document rwsem_release() in sb_wait_write()
    fix the broken lockdep logic in __sb_start_write()
    introduce __sb_writers_{acquired,release}() helpers
    ufs_inode_get{frag,block}(): get rid of 'phys' argument
    ufs_getfrag_block(): tidy up a bit
    ...

    Linus Torvalds
     

05 Sep, 2015

3 commits

  • fsnotify_destroy_mark_locked() is subtle to use because it temporarily
    releases group->mark_mutex. To avoid future problems with this
    function, split it into two.

    fsnotify_detach_mark() is the part that needs group->mark_mutex and
    fsnotify_free_mark() is the part that must be called outside of
    group->mark_mutex. This way it's much clearer what's going on and we
    also avoid some pointless acquisitions of group->mark_mutex.

    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
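
The split described in this entry can be illustrated with a small userspace sketch. Every name here (toy_lock, struct mark, struct group, detach_mark(), free_mark(), destroy_mark()) is a hypothetical stand-in and not the kernel's fsnotify code; the lock only records whether it is held, so the two locking rules can be checked with assertions.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy lock: only records whether it is held, so the rules can be checked. */
struct toy_lock { int held; };

static void toy_lock_acquire(struct toy_lock *l) { assert(!l->held); l->held = 1; }
static void toy_lock_release(struct toy_lock *l) { assert(l->held); l->held = 0; }

/* Hypothetical stand-ins for a mark and the group that owns it. */
struct mark {
    struct mark *next;
    int attached;
};

struct group {
    struct toy_lock mark_mutex;
    struct mark *marks;          /* singly linked list of attached marks */
};

/* Allocate a mark and link it at the head of the group's list. */
static struct mark *add_mark(struct group *g)
{
    struct mark *m = calloc(1, sizeof(*m));

    if (!m)
        return NULL;
    toy_lock_acquire(&g->mark_mutex);
    m->next = g->marks;
    m->attached = 1;
    g->marks = m;
    toy_lock_release(&g->mark_mutex);
    return m;
}

/* Part 1: needs the group lock; unlinks the mark from the group. */
static void detach_mark(struct group *g, struct mark *m)
{
    struct mark **pp = &g->marks;

    assert(g->mark_mutex.held);
    while (*pp && *pp != m)
        pp = &(*pp)->next;
    if (*pp) {
        *pp = m->next;
        m->next = NULL;
        m->attached = 0;
    }
}

/* Part 2: must run without the group lock; releases the mark. */
static void free_mark(struct group *g, struct mark *m)
{
    assert(!g->mark_mutex.held);
    assert(!m->attached);
    free(m);
}

/* The lock is taken and dropped once, in plain sight of the caller. */
static void destroy_mark(struct group *g, struct mark *m)
{
    toy_lock_acquire(&g->mark_mutex);
    detach_mark(g, m);
    toy_lock_release(&g->mark_mutex);
    free_mark(g, m);
}
```

A caller that already holds the lock can call detach_mark() for several marks in a row and free them all after unlocking, which is where the saved lock acquisitions would come from in this model.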
     
  • The free list is used when all marks on a given inode / mount must be
    destroyed because the inode / mount is going away. However, with some
    care we can free all of the marks without using a special list.

    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

18 Aug, 2015

1 commit

  • The process of reducing contention on per-superblock inode lists
    starts with moving the locking to match the per-superblock inode
    list. This takes the global lock out of the picture and reduces the
    contention problems to within a single filesystem. This doesn't get
    rid of contention as the locks still have global CPU scope, but it
    does isolate operations on different superblocks from each other.

    Signed-off-by: Dave Chinner
    Signed-off-by: Josef Bacik
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Tested-by: Dave Chinner

    Dave Chinner
     

25 Jun, 2015

1 commit


14 Dec, 2014

2 commits

  • destroy_list is used to track marks which must wait for the end of an
    SRCU grace period before they can be freed. However, by the time a mark
    is added to destroy_list it is no longer in the group's list of marks,
    and thus we can reuse fsnotify_mark->g_list for queueing into
    destroy_list. This saves two pointers for each fsnotify_mark.

    Signed-off-by: Jan Kara
    Cc: Eric Paris
    Cc: Heinrich Schuchardt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
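
The saving described in this entry can be sketched in userspace C. The structures below are simplified, hypothetical stand-ins (struct list_node, struct mark_before, struct mark_after); only the idea of reusing one list linkage for two lists that a mark is never on at the same time is taken from the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, in the style of a kernel list_head. */
struct list_node { struct list_node *next, *prev; };

static void list_init(struct list_node *h) { h->next = h->prev = h; }

static int list_empty(const struct list_node *h) { return h->next == h; }

static void list_add(struct list_node *n, struct list_node *h)
{
    n->next = h->next;
    n->prev = h;
    h->next->prev = n;
    h->next = n;
}

static void list_del(struct list_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->next = n->prev = n;
}

/* Before: one linkage per list, although both are never used together. */
struct mark_before {
    struct list_node g_list;
    struct list_node destroy_list;
    int mask;
};

/* After: g_list doubles as the destroy-list linkage. */
struct mark_after {
    struct list_node g_list;
    int mask;
};

/* The mark leaves the group's list first, so its linkage is free for reuse. */
static void queue_for_destroy(struct mark_after *m, struct list_node *destroy_head)
{
    list_del(&m->g_list);
    list_add(&m->g_list, destroy_head);
}
```

In this model the difference between sizeof(struct mark_before) and sizeof(struct mark_after) is exactly two pointers, matching the saving quoted above.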
     
  • There's a lot of common code in inode and mount marks handling. Factor it
    out to a common helper function.

    Signed-off-by: Jan Kara
    Cc: Eric Paris
    Cc: Heinrich Schuchardt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

07 Aug, 2014

2 commits

  • Commit 85816794240b ("fanotify: Fix use after free for permission
    events") introduced a double free issue for permission events which are
    pending in the group's notification queue while the group is being
    destroyed. These events are freed from fanotify_handle_event() but they
    are not removed from the group's notification queue and thus they get
    freed again from fsnotify_flush_notify().

    Fix the problem by removing permission events from the notification
    queue before freeing them if we skip processing the access response.
    Also expand comments in fanotify_release() to explain group shutdown in
    detail.

    Fixes: 85816794240b9659e66e4d9b0df7c6e814e5f603
    Signed-off-by: Jan Kara
    Reported-by: Douglas Leeder
    Tested-by: Douglas Leeder
    Reported-by: Heinrich Schuchardt
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Rename fsnotify_add_notify_event() to fsnotify_add_event() since the
    "notify" part is redundant. Rename fsnotify_remove_notify_event() and
    fsnotify_peek_notify_event() to fsnotify_remove_first_event() and
    fsnotify_peek_first_event() respectively since the "notify" part is
    redundant and they really look at the first event in the queue.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Jan Kara
    Cc: Eric Paris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

04 Apr, 2014

1 commit

  • access_mutex is used only to guard operations on access_list. There's
    no need for sleeping within this lock so just make a spinlock out of it.

    Signed-off-by: Jan Kara
    Cc: Eric Paris
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

25 Feb, 2014

1 commit

  • Commit 7053aee26a35 "fsnotify: do not share events between notification
    groups" used an overflow event statically allocated in a group with the
    size of the generic notification event. This causes problems because
    some code looks at type-specific parts of the event structure, gets
    confused by the random data it sees there, and crashes.

    Fix the problem by allocating the overflow event with the type
    corresponding to the group type so code cannot get confused.

    Signed-off-by: Jan Kara

    Jan Kara
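
A minimal userspace sketch of the failure mode and the fix, assuming a generic event embedded as the first member of a type-specific event. The names (struct generic_event, struct typed_event, alloc_overflow_event(), EV_Q_OVERFLOW and its value) are illustrative assumptions, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define EV_Q_OVERFLOW 0x4000u   /* illustrative flag value */

/* Generic part shared by all notification backends in this sketch. */
struct generic_event {
    unsigned int mask;
};

/* Type-specific event: backend code expects these fields to exist. */
struct typed_event {
    struct generic_event base;
    int wd;
};

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/*
 * Allocate the overflow event with the full type-specific size, so that
 * backend code reaching past the generic part reads initialized fields.
 * Allocating only sizeof(struct generic_event) would make event_wd()
 * below read memory that does not belong to the event.
 */
static struct generic_event *alloc_overflow_event(void)
{
    struct typed_event *ev = calloc(1, sizeof(*ev));

    if (!ev)
        return NULL;
    ev->base.mask = EV_Q_OVERFLOW;
    ev->wd = -1;
    return &ev->base;
}

/* Backend code that looks at the type-specific part of an event. */
static int event_wd(struct generic_event *e)
{
    return container_of(e, struct typed_event, base)->wd;
}

static void free_event(struct generic_event *e)
{
    free(container_of(e, struct typed_event, base));
}
```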
     

18 Feb, 2014

1 commit

  • My rework of handling of notification events (namely commit 7053aee26a35
    "fsnotify: do not share events between notification groups") broke
    sending of cookies with inotify events. We didn't propagate the value
    passed to fsnotify() properly and passed 4 uninitialized bytes to
    userspace instead (so it is also an information leak). Sadly I didn't
    notice this during my testing because inotify cookies aren't used very
    much and LTP inotify tests ignore them.

    Fix the problem by passing the cookie value properly.

    Fixes: 7053aee26a3548ebaba046ae2e52396ccf56ac6c
    Reported-by: Vegard Nossum
    Signed-off-by: Jan Kara

    Jan Kara
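
For readers unfamiliar with inotify cookies: userspace uses them to pair the IN_MOVED_FROM and IN_MOVED_TO events of a single rename. The sketch below uses the standard inotify(7) interface; the helper name rename_cookies_match() and the file names "a" and "b" are made up for illustration, and the code assumes a Linux system where inotify is available and the directory is writable.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/inotify.h>
#include <unistd.h>

/*
 * Create dir/a, rename it to dir/b while watching dir, and report whether
 * the two move events carry the same nonzero cookie.
 * Returns 1 if they match, 0 if not, -1 on setup failure.
 */
static int rename_cookies_match(const char *dir)
{
    char from[512], to[512];
    char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
    uint32_t c_from = 0, c_to = 0;
    ssize_t len;
    int fd, tmp;

    snprintf(from, sizeof(from), "%s/a", dir);
    snprintf(to, sizeof(to), "%s/b", dir);

    tmp = open(from, O_CREAT | O_WRONLY, 0600);
    if (tmp < 0)
        return -1;
    close(tmp);

    fd = inotify_init1(IN_CLOEXEC);
    if (fd < 0)
        return -1;
    if (inotify_add_watch(fd, dir, IN_MOVED_FROM | IN_MOVED_TO) < 0 ||
        rename(from, to) != 0) {
        close(fd);
        return -1;
    }

    /* Both events are queued by the time rename() has returned. */
    len = read(fd, buf, sizeof(buf));
    for (char *p = buf; len > 0 && p < buf + len; ) {
        struct inotify_event *ev = (struct inotify_event *)p;

        if (ev->mask & IN_MOVED_FROM)
            c_from = ev->cookie;
        if (ev->mask & IN_MOVED_TO)
            c_to = ev->cookie;
        p += sizeof(*ev) + ev->len;
    }

    close(fd);
    unlink(to);
    return c_from != 0 && c_from == c_to;
}
```

With the bug described above, the cookie field would hold whatever bytes happened to be in memory, so a check like this one could not be relied on to pair the two events.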
     

29 Jan, 2014

1 commit

  • The event returned from fsnotify_add_notify_event() cannot ever be used
    safely as the event may be freed by the time the function returns (after
    dropping notification_mutex). So change the prototype to just return
    whether the event was added or merged into some existing event.

    Reported-and-tested-by: Jiri Kosina
    Reported-and-tested-by: Dave Jones
    Signed-off-by: Jan Kara

    Jan Kara
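
The interface change can be sketched as follows. This is a simplified userspace model, not the kernel function: the enum values, struct event, struct event_queue and queue_add() are hypothetical, and the queue stores copies so the caller is never handed a pointer whose lifetime it cannot control.

```c
#include <assert.h>
#include <stddef.h>

/* Status only: the caller never receives a pointer into the queue. */
enum add_result {
    EVENT_ADDED = 0,    /* a new entry was queued */
    EVENT_MERGED = 1    /* an identical pending entry already covers it */
};

struct event {
    int object_id;
    unsigned int mask;
};

#define QUEUE_MAX 16

struct event_queue {
    struct event slots[QUEUE_MAX];
    size_t len;
};

/* Queue a copy of *ev, or merge it with the most recent identical entry. */
static enum add_result queue_add(struct event_queue *q, const struct event *ev)
{
    if (q->len > 0) {
        const struct event *last = &q->slots[q->len - 1];

        if (last->object_id == ev->object_id && last->mask == ev->mask)
            return EVENT_MERGED;
    }
    assert(q->len < QUEUE_MAX);
    q->slots[q->len++] = *ev;
    return EVENT_ADDED;
}
```

Returning a status instead of a pointer means nothing the caller holds can be invalidated once the queue lock is dropped, which is the property the commit message is after.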
     

22 Jan, 2014

2 commits

  • After removing event structure creation from the generic layer there is
    no reason for separate .should_send_event and .handle_event callbacks.
    So just remove the first one.

    Signed-off-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Cc: Eric Paris
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Currently fsnotify framework creates one event structure for each
    notification event and links this event into all interested notification
    groups. This is done so that we save memory when several notification
    groups are interested in the event. However the need for event
    structure shared between inotify & fanotify bloats the event structure
    so the result is often higher memory consumption.

    Another problem is that fsnotify framework keeps path references with
    outstanding events so that fanotify can return open file descriptors
    with its events. This has the undesirable effect that filesystem cannot
    be unmounted while there are outstanding events - a regression for
    inotify compared to a situation before it was converted to fsnotify
    framework. For fanotify this problem is hard to avoid and users of
    fanotify should kind of expect this behavior when they ask for file
    descriptors from notified files.

    This patch changes fsnotify and its users to create separate event
    structure for each group. This allows for much simpler code (~400 lines
    removed by this patch) and also smaller event structures. For example
    on 64-bit system original struct fsnotify_event consumes 120 bytes, plus
    additional space for file name, additional 24 bytes for second and each
    subsequent group linking the event, and additional 32 bytes for each
    inotify group for private data. After the conversion inotify event
    consumes 48 bytes plus space for file name which is considerably less
    memory unless file names are long and there are several groups
    interested in the events (both of which are uncommon). Fanotify event
    fits in 56 bytes after the conversion (fanotify doesn't care about file
    names so its events don't have to have it allocated). A win unless
    there are four or more fanotify groups interested in the event.

    The conversion also solves the problem with unmount when only inotify is
    used as we don't have to grab path references for inotify events.

    [hughd@google.com: fanotify: fix corruption preventing startup]
    Signed-off-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Cc: Eric Paris
    Cc: Al Viro
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

30 Apr, 2013

1 commit


12 Dec, 2012

8 commits

  • inotify is supposed to support async signal notification when information
    is available on the inotify fd. This patch moves that support to generic
    fsnotify functions so it can be used by all notification mechanisms.

    Signed-off-by: Eric Paris

    Eric Paris
     
  • On Mon, Aug 01, 2011 at 04:38:22PM -0400, Eric Paris wrote:
    >
    > I finally built and tested a v3.0 kernel with these patches (I know I'm
    > SOOOOOO far behind). Not what I hoped for:
    >
    > > [ 150.937798] VFS: Busy inodes after unmount of tmpfs. Self-destruct in 5 seconds. Have a nice day...
    > > [ 150.945290] BUG: unable to handle kernel NULL pointer dereference at 0000000000000070
    > > [ 150.946012] IP: [] shmem_free_inode+0x18/0x50
    > > [ 150.946012] PGD 2bf9e067 PUD 2bf9f067 PMD 0
    > > [ 150.946012] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    > > [ 150.946012] CPU 0
    > > [ 150.946012] Modules linked in: nfs lockd fscache auth_rpcgss nfs_acl sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ext4 jbd2 crc16 joydev ata_piix i2c_piix4 pcspkr uinput ipv6 autofs4 usbhid [last unloaded: scsi_wait_scan]
    > > [ 150.946012]
    > > [ 150.946012] Pid: 2764, comm: syscall_thrash Not tainted 3.0.0+ #1 Red Hat KVM
    > > [ 150.946012] RIP: 0010:[] [] shmem_free_inode+0x18/0x50
    > > [ 150.946012] RSP: 0018:ffff88002c2e5df8 EFLAGS: 00010282
    > > [ 150.946012] RAX: 000000004e370d9f RBX: 0000000000000000 RCX: ffff88003a029438
    > > [ 150.946012] RDX: 0000000033630a5f RSI: 0000000000000000 RDI: ffff88003491c240
    > > [ 150.946012] RBP: ffff88002c2e5e08 R08: 0000000000000000 R09: 0000000000000000
    > > [ 150.946012] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88003a029428
    > > [ 150.946012] R13: ffff88003a029428 R14: ffff88003a029428 R15: ffff88003499a610
    > > [ 150.946012] FS: 00007f5a05420700(0000) GS:ffff88003f600000(0000) knlGS:0000000000000000
    > > [ 150.946012] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    > > [ 150.946012] CR2: 0000000000000070 CR3: 000000002a662000 CR4: 00000000000006f0
    > > [ 150.946012] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    > > [ 150.946012] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    > > [ 150.946012] Process syscall_thrash (pid: 2764, threadinfo ffff88002c2e4000, task ffff88002bfbc760)
    > > [ 150.946012] Stack:
    > > [ 150.946012] ffff88003a029438 ffff88003a029428 ffff88002c2e5e38 ffffffff81102f76
    > > [ 150.946012] ffff88003a029438 ffff88003a029598 ffffffff8160f9c0 ffff88002c221250
    > > [ 150.946012] ffff88002c2e5e68 ffffffff8115e9be ffff88002c2e5e68 ffff88003a029438
    > > [ 150.946012] Call Trace:
    > > [ 150.946012] [] shmem_evict_inode+0x76/0x130
    > > [ 150.946012] [] evict+0x7e/0x170
    > > [ 150.946012] [] iput_final+0xd0/0x190
    > > [ 150.946012] [] iput+0x33/0x40
    > > [ 150.946012] [] fsnotify_destroy_mark_locked+0x145/0x160
    > > [ 150.946012] [] fsnotify_destroy_mark+0x36/0x50
    > > [ 150.946012] [] sys_inotify_rm_watch+0x77/0xd0
    > > [ 150.946012] [] system_call_fastpath+0x16/0x1b
    > > [ 150.946012] Code: 67 4a 00 b8 e4 ff ff ff eb aa 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 48 8b 9f 40 05 00 00
    > > [ 150.946012] 83 7b 70 00 74 1c 4c 8d a3 80 00 00 00 4c 89 e7 e8 d2 5d 4a
    > > [ 150.946012] RIP [] shmem_free_inode+0x18/0x50
    > > [ 150.946012] RSP
    > > [ 150.946012] CR2: 0000000000000070
    >
    > Looks an awful lot like the problem from:
    > http://www.spinics.net/lists/linux-fsdevel/msg46101.html
    >

    I tried to reproduce this bug with your test program, but without success.
    However, if I understand correctly, this occurs because we don't hold any
    locks when we call iput() in mark_destroy(), right?
    With the patches you tested, iput() is also not called under any lock, since
    the group's mark_mutex is released temporarily before iput() is called. This
    is because the original code behaves similarly.
    However, since we now have a mutex as the biggest lock, we can do what you
    suggested (http://www.spinics.net/lists/linux-fsdevel/msg46107.html) and
    call iput() with the mutex held to avoid the race.
    The patch below implements this. It uses nested locking to avoid deadlock in
    case we do the final iput() on an inode which still holds marks and thus would
    take the mutex again when calling fsnotify_inode_delete() in destroy_inode().

    Signed-off-by: Lino Sanfilippo
    Signed-off-by: Eric Paris

    Lino Sanfilippo
     
  • In clear_marks_by_group_flags() the mark list of a group is iterated and the
    marks are put on a temporary list.
    Since we introduced fsnotify_destroy_mark_locked() we don't need the temp list
    any more and are able to remove the marks while the mark list is iterated and
    the mark list mutex is held.

    Signed-off-by: Lino Sanfilippo
    Signed-off-by: Eric Paris

    Lino Sanfilippo
     
  • This patch introduces fsnotify_add_mark_locked() and fsnotify_remove_mark_locked()
    which are essentially the same as fsnotify_add_mark() and fsnotify_remove_mark() but
    assume that the caller has already taken the group's mark mutex.

    Signed-off-by: Lino Sanfilippo
    Signed-off-by: Eric Paris

    Lino Sanfilippo
     
  • In fsnotify_destroy_mark() don't get the group from the passed mark anymore,
    but pass the group itself as an additional parameter to the function.

    Signed-off-by: Lino Sanfilippo
    Signed-off-by: Eric Paris

    Lino Sanfilippo
     
  • Replace the group's mark_lock spinlock with a mutex. Using a mutex instead
    of a spinlock results in more flexibility (i.e. it allows sleeping while the
    lock is held).

    Signed-off-by: Lino Sanfilippo
    Signed-off-by: Eric Paris

    Lino Sanfilippo
     
  • Introduce fsnotify_get_group() which increments the reference counter of a group.

    Signed-off-by: Lino Sanfilippo
    Signed-off-by: Eric Paris

    Lino Sanfilippo
     
  • Currently in fsnotify_put_group() the ref count of a group is decremented and if
    it becomes 0 fsnotify_destroy_group() is called. Since a group's ref count is set
    to 1 only at group creation and never increased after that, a call to
    fsnotify_put_group() always results in a call to fsnotify_destroy_group().
    With this patch fsnotify_destroy_group() is called directly.

    Signed-off-by: Lino Sanfilippo
    Signed-off-by: Eric Paris

    Lino Sanfilippo
     

31 May, 2012

1 commit

  • While working on fanotify recently I found the following strange
    behavior.

    I wrote a program to set fanotify_mark on "/tmp/block" and FAN_DENY
    all events notified.

    fanotify_mask = FAN_ALL_EVENTS | FAN_ALL_PERM_EVENTS | FAN_EVENT_ON_CHILD:
    $ cd /tmp/block; cat foo
    cat: foo: Operation not permitted

    Operation on the file is blocked as expected.

    But,

    fanotify_mask = FAN_ALL_PERM_EVENTS | FAN_EVENT_ON_CHILD:
    $ cd /tmp/block; cat foo
    aaa

    It's not blocked anymore. This is confusing behavior. Also reading
    commit "fsnotify: call fsnotify_parent in perm events", it seems like
    fsnotify should handle subfiles' perm events as well as the other notify
    events.

    With this patch, regardless of FAN_ALL_EVENTS set or not:
    $ cd /tmp/block; cat foo
    cat: foo: Operation not permitted

    Operation on the file is now blocked properly.

    FS_OPEN_PERM and FS_ACCESS_PERM are not listed in FS_EVENTS_POSS_ON_CHILD.
    Due to the fsnotify_inode_watches_children() check, if you specify only
    these events in fsnotify_mask, you don't get subfiles' perm events
    notified.

    This patch adds the events to FS_EVENTS_POSS_ON_CHILD so they are notified
    even if only these events are specified in fsnotify_mask.

    Signed-off-by: Naohiro Aota
    Cc: Eric Paris
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Naohiro Aota
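
The mask check behind this behavior can be modeled in a few lines. The bit values and the EV_/POSS_ON_CHILD_ names below are illustrative assumptions rather than the kernel's definitions; watches_children() follows the check described above: the "on child" flag must be set, and at least one requested event must be in the set of events that can occur on a child.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative event bits; values are assumptions for this sketch. */
#define EV_ACCESS       0x00000001u
#define EV_OPEN         0x00000020u
#define EV_OPEN_PERM    0x00010000u
#define EV_ACCESS_PERM  0x00020000u
#define EV_ON_CHILD     0x08000000u

/* Events that may be reported for children: before and after the patch. */
#define POSS_ON_CHILD_OLD (EV_ACCESS | EV_OPEN)
#define POSS_ON_CHILD_NEW (POSS_ON_CHILD_OLD | EV_OPEN_PERM | EV_ACCESS_PERM)

/* Does a watch with this mask care about events on child files? */
static bool watches_children(uint32_t mask, uint32_t poss_on_child)
{
    if (!(mask & EV_ON_CHILD))
        return false;
    return (mask & poss_on_child) != 0;
}
```

With a mask of only EV_OPEN_PERM | EV_ACCESS_PERM | EV_ON_CHILD, the old set yields false (children are ignored, so nothing is blocked) while the new set yields true. Adding any ordinary event such as EV_OPEN made the old check pass as well, which matches the FAN_ALL_EVENTS observation in the report.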
     

27 Jul, 2011

1 commit

  • This allows us to move duplicated code in <asm/atomic.h>
    (atomic_inc_not_zero() for now) to <linux/atomic.h>

    Signed-off-by: Arun Sharma
    Reviewed-by: Eric Dumazet
    Cc: Ingo Molnar
    Cc: David Miller
    Cc: Eric Dumazet
    Acked-by: Mike Frysinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun Sharma
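
For context, the semantics of an "increment unless zero" operation can be sketched with C11 atomics. This is a userspace illustration under the assumption of a plain atomic_int counter; inc_not_zero() is a hypothetical name and this is not the kernel's implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Increment *v unless it is zero. Returns true if the increment happened.
 * Typical use: take a reference only if the object is still alive.
 */
static bool inc_not_zero(atomic_int *v)
{
    int old = atomic_load(v);

    while (old != 0) {
        /* On failure, old is reloaded with the current value of *v. */
        if (atomic_compare_exchange_weak(v, &old, old + 1))
            return true;
    }
    return false;
}
```

The compare-and-exchange loop is what makes the check and the increment a single atomic step; a separate load followed by an unconditional add could resurrect a counter that another thread has just dropped to zero.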
     

07 Jan, 2011

1 commit


08 Dec, 2010

1 commit

  • When fanotify_release() is called, there may still be processes waiting for
    access permission. Currently only processes for which an event has already been
    queued into the group's access list will be woken up. Processes for which no
    event has been queued will continue to sleep and thus cause a deadlock when
    fsnotify_put_group() is called.
    Furthermore there is a race allowing further processes to be waiting on the
    access wait queue after wake_up (if they arrive before clear_marks_by_group()
    is called).
    This patch corrects this by setting a flag to inform processes that the group
    is about to be destroyed and thus not to wait for access permission.

    [additional changelog from eparis]
    Let's think about the 4 relevant code paths from the PoV of the
    'operator' 'listener' 'responder' and 'closer'. Where operator is the
    process doing an action (like open/read) which could require permission.
    Listener is the task (or in this case thread) slated with reading from
    the fanotify file descriptor. The 'responder' is the thread responsible
    for responding to access requests. 'Closer' is the thread attempting to
    close the fanotify file descriptor.

    The 'operator' is going to end up in:
    fanotify_handle_event()
      get_response_from_access()
        (THIS BLOCKS WAITING ON USERSPACE)

    The 'listener' interesting code path
    fanotify_read()
      copy_event_to_user()
        prepare_for_access_response()
          (THIS CREATES AN fanotify_response_event)

    The 'responder' code path:
    fanotify_write()
      process_access_response()
        (REMOVE A fanotify_response_event, SET RESPONSE, WAKE UP 'operator')

    The 'closer':
    fanotify_release()
      (SUPPOSED TO CLEAN UP THE REST OF THIS MESS)

    What we have today is that in the closer we remove all of the
    fanotify_response_events and set a bit so no more response events are
    ever created in prepare_for_access_response().

    The bug is that we never wake all of the operators up and tell them to
    move along. You fix that in fanotify_get_response_from_access(). You
    also fix other operators which haven't gotten there yet. So I agree
    that's a good fix.
    [/additional changelog from eparis]

    [remove additional changes to minimize patch size]
    [move initialization so it was inside CONFIG_FANOTIFY_PERMISSION]

    Signed-off-by: Lino Sanfilippo
    Signed-off-by: Eric Paris

    Lino Sanfilippo
     

29 Oct, 2010

7 commits


23 Aug, 2010

1 commit

  • When an fanotify listener is closing it may cause a deadlock between the
    listener and the original task doing an fs operation. If the original task
    is waiting for a permissions response it will be holding the srcu lock. The
    listener cannot clean up and exit until after that srcu lock is synchronized.
    Thus deadlock. The fix introduced here is to stop accepting new permissions
    events when a listener is shutting down and to grant permission for all
    outstanding events. Thus the original task will eventually release the srcu
    lock and the listener can complete shutdown.

    Reported-by: Andreas Gruenbacher
    Cc: Andreas Gruenbacher
    Signed-off-by: Eric Paris

    Eric Paris
     

13 Aug, 2010

1 commit

  • This reverts commit 3bcf3860a4ff9bbc522820b4b765e65e4deceb3e (and the
    accompanying commit c1e5c954020e "vfs/fsnotify: fsnotify_close can delay
    the final work in fput" that was a horribly ugly hack to make it work at
    all).

    The 'struct file' approach not only causes that disgusting hack, it
    somehow breaks pulseaudio, probably due to some other subtlety with
    f_count handling.

    Fix up various conflicts due to later fsnotify work.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

28 Jul, 2010

2 commits