10 Jan, 2011

1 commit

  • Fix new kernel-doc notation warnings in fs/namei.c and spell
    ECHILD correctly.

    Warning(fs/namei.c:218): No description found for parameter 'flags'
    Warning(fs/namei.c:425): Excess function parameter 'Returns' description in 'nameidata_drop_rcu'
    Warning(fs/namei.c:478): Excess function parameter 'Returns' description in 'nameidata_dentry_drop_rcu'
    Warning(fs/namei.c:540): Excess function parameter 'Returns' description in 'nameidata_drop_rcu_last'

    Signed-off-by: Randy Dunlap
    Cc: Nick Piggin
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

07 Jan, 2011

9 commits

  • The problem that this patch aims to fix is vfsmount refcounting scalability.
    We need to take a reference on the vfsmount for every successful path lookup,
    which often go to the same mount point.

    The fundamental difficulty is that a "simple" reference count can never be made
    scalable, because any time a reference is dropped, we must check whether that
    was the last reference. To do that requires communication with all other CPUs
    that may have taken a reference count.

    We can make refcounts more scalable in a couple of ways, involving keeping
    distributed counters, and checking for the global-zero condition less
    frequently.

    - check the global sum once every interval (this will delay zero detection
    for some interval, so it's probably a showstopper for vfsmounts).

    - keep a local count and only taking the global sum when local reaches 0 (this
    is difficult for vfsmounts, because we can't hold preempt off for the life of
    a reference, so a counter would need to be per-thread or tied strongly to a
    particular CPU which requires more locking).

    - keep a local difference of increments and decrements, which allows us to sum
    the total difference and hence find the refcount when summing all CPUs. Then,
    keep a single integer "long" refcount for slow and long lasting references,
    and only take the global sum of local counters when the long refcount is 0.

    This last scheme is what I implemented here. Attached mounts and process root
    and working directory references are "long" references, and everything else is
    a short reference.

    This allows scalable vfsmount references during path walking over mounted
    subtrees and unattached (lazy umounted) mounts with processes still running
    in them.

    This results in one fewer atomic op in the fastpath: mntget is now just a
    per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock
    and non-atomic decrement in the common case. However code is otherwise bigger
    and heavier, so single threaded performance is basically a wash.

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • Signed-off-by: Nick Piggin

    Nick Piggin
     
  • Require filesystems be aware of .d_revalidate being called in rcu-walk
    mode (nd->flags & LOOKUP_RCU). For now do a simple push down, returning
    -ECHILD from all implementations.

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • Reduce some branches and memory accesses in dcache lookup by adding dentry
    flags to indicate common d_ops are set, rather than having to check them.
    This saves a pointer memory access (dentry->d_op) in common path lookup
    situations, and saves another pointer load and branch in cases where we
    have d_op but not the particular operation.

    Patched with:

    git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • Use a seqlock in the fs_struct to enable us to take an atomic copy of the
    complete cwd and root paths. Use this in the RCU lookup path to avoid a
    thread-shared spinlock in RCU lookup operations.

    Multi-threaded apps may now perform path lookups with scalability matching
    multi-process apps. Operations such as stat(2) become very scalable for
    multi-threaded workload.

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • Perform common cases of path lookups without any stores or locking in the
    ancestor dentry elements. This is called rcu-walk, as opposed to the current
    algorithm which is a refcount based walk, or ref-walk.

    This results in far fewer atomic operations on every path element,
    significantly improving path lookup performance. It also avoids cacheline
    bouncing on common dentries, significantly improving scalability.

    The overall design is like this:
    * LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk.
    * Take the RCU lock for the entire path walk, starting with the acquiring
    of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are
    not required for dentry persistence.
    * synchronize_rcu is called when unregistering a filesystem, so we can
    access d_ops and i_ops during rcu-walk.
    * Similarly take the vfsmount lock for the entire path walk. So now mnt
    refcounts are not required for persistence. Also we are free to perform mount
    lookups, and to assume dentry mount points and mount roots are stable up and
    down the path.
    * Have a per-dentry seqlock to protect the dentry name, parent, and inode,
    so we can load this tuple atomically, and also check whether any of its
    members have changed.
    * Dentry lookups (based on parent, candidate string tuple) recheck the parent
    sequence after the child is found in case anything changed in the parent
    during the path walk.
    * inode is also RCU protected so we can load d_inode and use the inode for
    limited things.
    * i_mode, i_uid, i_gid can be tested for exec permissions during path walk.
    * i_op can be loaded.

    When we reach the destination dentry, we lock it, recheck lookup sequence,
    and increment its refcount and mountpoint refcount. RCU and vfsmount locks
    are dropped. This is termed "dropping rcu-walk". If the dentry refcount does
    not match, we can not drop rcu-walk gracefully at the current point in the
    lokup, so instead return -ECHILD (for want of a better errno). This signals the
    path walking code to re-do the entire lookup with a ref-walk.

    Aside from the final dentry, there are other situations that may be encounted
    where we cannot continue rcu-walk. In that case, we drop rcu-walk (ie. take
    a reference on the last good dentry) and continue with a ref-walk. Again, if
    we can drop rcu-walk gracefully, we return -ECHILD and do the whole lookup
    using ref-walk. But it is very important that we can continue with ref-walk
    for most cases, particularly to avoid the overhead of double lookups, and to
    gain the scalability advantages on common path elements (like cwd and root).

    The cases where rcu-walk cannot continue are:
    * NULL dentry (ie. any uncached path element)
    * parent with d_inode->i_op->permission or ACLs
    * dentries with d_revalidate
    * Following links

    In future patches, permission checks and d_revalidate become rcu-walk aware. It
    may be possible eventually to make following links rcu-walk aware.

    Uncached path elements will always require dropping to ref-walk mode, at the
    very least because i_mutex needs to be grabbed, and objects allocated.

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • dcache_lock no longer protects anything. remove it.

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • Make d_count non-atomic and protect it with d_lock. This allows us to ensure a
    0 refcount dentry remains 0 without dcache_lock. It is also fairly natural when
    we start protecting many other dentry members with d_lock.

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • Change d_hash so it may be called from lock-free RCU lookups. See similar
    patch for d_compare for details.

    For in-tree filesystems, this is just a mechanical change.

    Signed-off-by: Nick Piggin

    Nick Piggin
     

08 Dec, 2010

1 commit


29 Oct, 2010

1 commit

  • nameidata_to_filp() drops nd->path or transfers it to opened
    file. In the former case it's a Bad Idea(tm) to do mnt_drop_write()
    on nd->path.mnt, since we might race with umount and vfsmount in
    question might be gone already.

    Fix: don't drop it, then... IOW, have nameidata_to_filp() grab nd->path
    in case it transfers it to file and do path_drop() in callers. After
    they are through with accessing nd->path...

    Reported-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Al Viro
     

26 Oct, 2010

2 commits


18 Aug, 2010

4 commits

  • fs: brlock vfsmount_lock

    Use a brlock for the vfsmount lock. It must be taken for write whenever
    modifying the mount hash or associated fields, and may be taken for read when
    performing mount hash lookups.

    A new lock is added for the mnt-id allocator, so it doesn't need to take
    the heavy vfsmount write-lock.

    The number of atomics should remain the same for fastpath rlock cases, though
    code would be slightly slower due to per-cpu access. Scalability is not not be
    much improved in common cases yet, due to other locks (ie. dcache_lock) getting
    in the way. However path lookups crossing mountpoints should be one case where
    scalability is improved (currently requiring the global lock).

    The slowpath is slower due to use of brlock. On a 64 core, 64 socket, 32 node
    Altix system (high latency to remote nodes), a simple umount microbenchmark
    (mount --bind mnt mnt2 ; umount mnt2 loop 1000 times), before this patch it
    took 6.8s, afterwards took 7.1s, about 5% slower.

    Cc: Al Viro
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    Nick Piggin
     
  • fs: remove extra lookup in __lookup_hash

    Optimize lookup for create operations, where no dentry should often be
    common-case. In cases where it is not, such as unlink, the added overhead
    is much smaller than the removed.

    Also, move comments about __d_lookup racyness to the __d_lookup call site.
    d_lookup is intuitive; __d_lookup is what needs commenting. So in that same
    vein, add kerneldoc comments to __d_lookup and clean up some of the comments:

    - We are interested in how the RCU lookup works here, particularly with
    renames. Make that explicit, and point to the document where it is explained
    in more detail.
    - RCU is pretty standard now, and macros make implementations pretty mindless.
    If we want to know about RCU barrier details, we look in RCU code.
    - Delete some boring legacy comments because we don't care much about how the
    code used to work, more about the interesting parts of how it works now. So
    comments about lazy LRU may be interesting, but would better be done in the
    LRU or refcount management code.

    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    Nick Piggin
     
  • fs: dentry allocation consolidation

    There are 2 duplicate copies of code in dentry allocation in path lookup.
    Consolidate them into a single function.

    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    Nick Piggin
     
  • fs: fix do_lookup false negative

    In do_lookup, if we initially find no dentry, we take the directory i_mutex and
    re-check the lookup. If we find a dentry there, then we revalidate it if
    needed. However if that revalidate asks for the dentry to be invalidated, we
    return -ENOENT from do_lookup. What should happen instead is an attempt to
    allocate and lookup a new dentry.

    This is probably not noticed because it is rare. It is only reached if a
    concurrent create races in first (in which case, the dentry probably won't be
    invalidated anyway), or if the racy __d_lookup has failed due to a
    false-negative (which is very rare).

    Fix this by removing code and have it use the normal reval path.

    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    Nick Piggin
     

11 Aug, 2010

2 commits

  • Add three helpers that retrieve a refcounted copy of the root and cwd
    from the supplied fs_struct.

    get_fs_root()
    get_fs_pwd()
    get_fs_root_and_pwd()

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Miklos Szeredi
     
  • * 'for-linus' of git://git.infradead.org/users/eparis/notify: (132 commits)
    fanotify: use both marks when possible
    fsnotify: pass both the vfsmount mark and inode mark
    fsnotify: walk the inode and vfsmount lists simultaneously
    fsnotify: rework ignored mark flushing
    fsnotify: remove global fsnotify groups lists
    fsnotify: remove group->mask
    fsnotify: remove the global masks
    fsnotify: cleanup should_send_event
    fanotify: use the mark in handler functions
    audit: use the mark in handler functions
    dnotify: use the mark in handler functions
    inotify: use the mark in handler functions
    fsnotify: send fsnotify_mark to groups in event handling functions
    fsnotify: Exchange list heads instead of moving elements
    fsnotify: srcu to protect read side of inode and vfsmount locks
    fsnotify: use an explicit flag to indicate fsnotify_destroy_mark has been called
    fsnotify: use _rcu functions for mark list traversal
    fsnotify: place marks on object in order of group memory address
    vfs/fsnotify: fsnotify_close can delay the final work in fput
    fsnotify: store struct file not struct path
    ...

    Fix up trivial delete/modify conflict in fs/notify/inotify/inotify.c.

    Linus Torvalds
     

02 Aug, 2010

2 commits

  • SELinux needs to pass the MAY_ACCESS flag so it can handle auditting
    correctly. Presently the masking of MAY_* flags is done in the VFS. In
    order to allow LSMs to decide what flags they care about and what flags
    they don't just pass them all and the each LSM mask off what they don't
    need. This patch should contain no functional changes to either the VFS or
    any LSM.

    Signed-off-by: Eric Paris
    Acked-by: Stephen D. Smalley
    Signed-off-by: James Morris

    Eric Paris
     
  • When commit be6d3e56a6b9b3a4ee44a0685e39e595073c6f0d "introduce new LSM hooks
    where vfsmount is available." was proposed, regarding security_path_truncate(),
    only "struct file *" argument (which AppArmor wanted to use) was removed.
    But length and time_attrs arguments are not used by TOMOYO nor AppArmor.
    Thus, let's remove these arguments.

    Signed-off-by: Tetsuo Handa
    Acked-by: Nick Piggin
    Signed-off-by: James Morris

    Tetsuo Handa
     

28 Jul, 2010

1 commit

  • fsnotify was using char * when it passed around the d_name.name string
    internally but it is actually an unsigned char *. This patch switches
    fsnotify to use unsigned and should silence some pointer signess warnings
    which have popped out of xfs. I do not add -Wpointer-sign to the fsnotify
    code as there are still issues with kstrdup and strlen which would pop
    out needless warnings.

    Signed-off-by: Eric Paris

    Eric Paris
     

28 May, 2010

1 commit

  • Commit 1f36f774b22a0ceb7dd33eca626746c81a97b6a5 broke FS_REVAL_DOT semantics.

    In particular, before this patch, the command
    ls -l
    in an NFS mounted directory would always check if the directory on the server
    had changed and if so would flush and refill the pagecache for the dir.
    After this patch, the same "ls -l" will repeatedly return stale date until
    the cached attributes for the directory time out.

    The following patch fixes this by ensuring the d_revalidate is called by
    do_last when "." is being looked-up.
    link_path_walk has already called d_revalidate, but in that case LOOKUP_OPEN
    is not set so nfs_lookup_verify_inode chooses not to do any validation.

    The following patch restores the original behaviour.

    Cc: stable@kernel.org
    Signed-off-by: NeilBrown
    Signed-off-by: Al Viro

    Neil Brown
     

22 May, 2010

1 commit


15 May, 2010

1 commit

  • 1) i_flags simply doesn't work for mount/unlink race prevention;
    we may have many links to file and rm on one of those obviously
    shouldn't prevent bind on top of another later on. To fix it
    right way we need to mark _dentry_ as unsuitable for mounting
    upon; new flag (DCACHE_CANT_MOUNT) is protected by d_flags and
    i_mutex on the inode in question. Set it (with dont_mount(dentry))
    in unlink/rmdir/etc., check (with cant_mount(dentry)) in places
    in namespace.c that used to check for S_DEAD. Setting S_DEAD
    is still needed in places where we used to set it (for directories
    getting killed), since we rely on it for readdir/rmdir race
    prevention.

    2) rename()/mount() protection has another bogosity - we unhash
    the target before we'd checked that it's not a mountpoint. Fixed.

    3) ancient bogosity in pivot_root() - we locked i_mutex on the
    right directory, but checked S_DEAD on the different (and wrong)
    one. Noticed and fixed.

    Signed-off-by: Al Viro

    Al Viro
     

13 May, 2010

1 commit

  • According to specification

    mkdir d; ln -s d a; open("a/", O_NOFOLLOW | O_RDONLY)

    should return success but currently it returns ELOOP. This is a
    regression caused by path lookup cleanup patch series.

    Fix the code to ignore O_NOFOLLOW in case the provided path has trailing
    slashes.

    Cc: Andrew Morton
    Cc: Al Viro
    Reported-by: Marius Tolzmann
    Acked-by: Miklos Szeredi
    Signed-off-by: Jan Kara
    Signed-off-by: Linus Torvalds

    Jan Kara
     

27 Mar, 2010

1 commit


08 Mar, 2010

1 commit


07 Mar, 2010

1 commit


06 Mar, 2010

1 commit

  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (33 commits)
    quota: stop using QUOTA_OK / NO_QUOTA
    dquot: cleanup dquot initialize routine
    dquot: move dquot initialization responsibility into the filesystem
    dquot: cleanup dquot drop routine
    dquot: move dquot drop responsibility into the filesystem
    dquot: cleanup dquot transfer routine
    dquot: move dquot transfer responsibility into the filesystem
    dquot: cleanup inode allocation / freeing routines
    dquot: cleanup space allocation / freeing routines
    ext3: add writepage sanity checks
    ext3: Truncate allocated blocks if direct IO write fails to update i_size
    quota: Properly invalidate caches even for filesystems with blocksize < pagesize
    quota: generalize quota transfer interface
    quota: sb_quota state flags cleanup
    jbd: Delay discarding buffers in journal_unmap_buffer
    ext3: quota_write cross block boundary behaviour
    quota: drop permission checks from xfs_fs_set_xstate/xfs_fs_set_xquota
    quota: split out compat_sys_quotactl support from quota.c
    quota: split out netlink notification support from quota.c
    quota: remove invalid optimization from quota_sync_all
    ...

    Fixed trivial conflicts in fs/namei.c and fs/ufs/inode.c

    Linus Torvalds
     

05 Mar, 2010

9 commits