22 Oct, 2013

1 commit

  • Move kernel-doc notation to immediately before its function to eliminate
    kernel-doc warnings introduced by commit db14fc3abcd5 ("vfs: add
    d_walk()")

    Warning(fs/dcache.c:1343): No description found for parameter 'data'
    Warning(fs/dcache.c:1343): No description found for parameter 'dentry'
    Warning(fs/dcache.c:1343): Excess function parameter 'parent' description in 'check_mount'

    Signed-off-by: Randy Dunlap
    Cc: Miklos Szeredi
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

15 Sep, 2013

1 commit


14 Sep, 2013

1 commit

  • The LRU list changes interacted badly with our nr_dentry_unused
    accounting, and even worse with the new DCACHE_LRU_LIST bit logic.

    This introduces helper functions to make sure everything follows the
    proper dcache d_lru list rules: the dentry cache is complicated by the
    fact that some of the hotpaths don't even want to look at the LRU list
    at all, and the fact that we use the same list entry in the dentry for
    both the LRU list and for our temporary shrinking lists when removing
    things from the LRU.

    The helper functions temporarily have some extra sanity checking for the
    flag bits that have to match the current LRU state of the dentry. We'll
    remove that before the final 3.12 release, but considering how easy it
    is to get wrong, this first cleanup version has some very particular
    sanity checking.

    Acked-by: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
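
    In sketch form, the new helpers keep the DCACHE_LRU_LIST flag, the
    nr_dentry_unused counter and the list membership in lock-step
    (illustrative only, not the literal patch; the WARN_ON_ONCE() calls
    stand in for the temporary sanity checks mentioned above):

        static void d_lru_add(struct dentry *dentry)
        {
                /* must not already be on the LRU or on a shrink list */
                WARN_ON_ONCE(dentry->d_flags &
                             (DCACHE_LRU_LIST | DCACHE_SHRINK_LIST));
                dentry->d_flags |= DCACHE_LRU_LIST;
                this_cpu_inc(nr_dentry_unused);
                WARN_ON_ONCE(!list_lru_add(&dentry->d_sb->s_dentry_lru,
                                           &dentry->d_lru));
        }

        static void d_lru_del(struct dentry *dentry)
        {
                /* must be on the LRU, not on a private shrink list */
                WARN_ON_ONCE(!(dentry->d_flags & DCACHE_LRU_LIST));
                dentry->d_flags &= ~DCACHE_LRU_LIST;
                this_cpu_dec(nr_dentry_unused);
                WARN_ON_ONCE(!list_lru_del(&dentry->d_sb->s_dentry_lru,
                                           &dentry->d_lru));
        }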
     

13 Sep, 2013

7 commits

  • Pull vfs pile 4 from Al Viro:
    "list_lru pile, mostly"

    This came out of Andrew's pile; Al ended up doing the merge work so that
    Andrew didn't have to.

    Additionally, a few fixes.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (42 commits)
    super: fix for destroy lrus
    list_lru: dynamically adjust node arrays
    shrinker: Kill old ->shrink API.
    shrinker: convert remaining shrinkers to count/scan API
    staging/lustre/libcfs: cleanup linux-mem.h
    staging/lustre/ptlrpc: convert to new shrinker API
    staging/lustre/obdclass: convert lu_object shrinker to count/scan API
    staging/lustre/ldlm: convert to shrinkers to count/scan API
    hugepage: convert huge zero page shrinker to new shrinker API
    i915: bail out earlier when shrinker cannot acquire mutex
    drivers: convert shrinkers to new count/scan API
    fs: convert fs shrinkers to new scan/count API
    xfs: fix dquot isolation hang
    xfs-convert-dquot-cache-lru-to-list_lru-fix
    xfs: convert dquot cache lru to list_lru
    xfs: rework buffer dispose list tracking
    xfs-convert-buftarg-lru-to-generic-code-fix
    xfs: convert buftarg LRU to generic code
    fs: convert inode and dentry shrinking to be node aware
    vmscan: per-node deferred work
    ...

    Linus Torvalds
     
  • This avoids the spinlocks and refcounts in the d_path() sequence too
    (used by /proc and various other entities). See commit 8b19e34188a3 for
    the equivalent getcwd() system call path.

    And unlike getcwd(), d_path() doesn't copy the result to user space, so
    I don't need to fear _that_ particular bug happening again.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • It's a pathname. It should use the pathname allocators and
    deallocators, and PATH_MAX instead of PAGE_SIZE. Never mind that the
    two are commonly the same.

    With this, the allocations scale up nicely too, and I can do getcwd()
    system calls at a rate of about 300M/s, with no lock contention
    anywhere.

    Of course, nobody sane does that, especially since getcwd() is
    traditionally a very slow operation in Unix. But this was also the
    simplest way to benchmark the prepend_path() improvements by Waiman, and
    once I saw the profiles I couldn't leave it well enough alone.

    But apart from being a performance improvement (from using per-cpu slab
    allocators instead of the raw page allocator), it's actually a valid and
    real cleanup.

    Signed-off-by: Linus "OCD" Torvalds

    Linus Torvalds
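
    A minimal sketch of the shape of the change (most of the getcwd()
    logic is elided; __getname()/__putname() come from the pathname slab
    and hand back PATH_MAX-sized buffers):

        SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
        {
                char *page = __getname();   /* pathname allocator, PATH_MAX */
                int error = -ENOMEM;

                if (!page)
                        return error;
                /* ... build the path into the PATH_MAX-sized buffer ... */
                __putname(page);            /* instead of the raw page allocator */
                return error;
        }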
     
  • Oops. That wasn't very smart. We don't actually need the RCU lock any
    more by the time we copy the cwd string to user space, but I had
    stupidly surrounded the whole thing with it.

    Introduced by commit 8b19e34188a3 ("vfs: make getcwd() get the root and
    pwd path under rcu")

    Is-a-big-hairy-idiot: Linus Torvalds

    Linus Torvalds
     
  • This allows us to skip all the crazy spinlocks and reference count
    updates, and instead use the fs sequence read-lock to get an atomic
    snapshot of the root and cwd information.

    We might want to make the rule that "prepend_path()" is always called
    with the RCU lock held, but the RCU lock nests fine and this is the
    minimal fix.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
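
    Roughly, the snapshot described above looks like this (sketch;
    current->fs is abbreviated to "fs", and the real patch may wrap this
    in a helper):

        unsigned seq;
        struct path root, pwd;

        rcu_read_lock();                        /* keeps the dentries alive */
        do {
                seq = read_seqcount_begin(&fs->seq);
                root = fs->root;                /* no path_get(), no d_lock */
                pwd = fs->pwd;
        } while (read_seqcount_retry(&fs->seq, seq));

        /* ... prepend_path(&pwd, &root, ...) under the same RCU section,
           with rcu_read_unlock() before anything touches user space ... */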
     
  • Let's not pollute the include files with inline functions that are only
    used in a single place. Especially not if we decide we might want to
    change the semantics of said function to make it more efficient.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • This patch modifies read_seqbegin_or_lock() and need_seqretry() to use
    newly introduced read_seqlock_excl() and read_sequnlock_excl()
    primitives so that they won't change the sequence number even if they
    fall back to take the lock. This is OK as no change to the protected
    data structure is being made.

    It will prevent one fallback to lock taking from cascading into a series
    of lock taking reducing performance because of the sequence number
    change. It will also allow other sequence readers to go forward while
    an exclusive reader lock is taken.

    This patch also updates some of the inaccurate comments in the code.

    Signed-off-by: Waiman Long
    To: Alexander Viro
    Signed-off-by: Linus Torvalds

    Waiman Long
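
    The caller-side pattern these primitives support looks roughly like
    this (a usage sketch, not the seqlock.h implementation):

        int seq = 0;                    /* even: start as a lockless reader */

        again:
        read_seqbegin_or_lock(&rename_lock, &seq);
        /* ... walk the protected structure read-only ... */
        if (need_seqretry(&rename_lock, seq)) {
                seq = 1;                /* odd: fall back to read_seqlock_excl() */
                goto again;
        }
        done_seqretry(&rename_lock, seq);

    With read_seqlock_excl() underneath, the locked retry no longer bumps
    the sequence number, so other lockless readers are not forced into
    taking the lock as well.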
     

11 Sep, 2013

9 commits

  • Now that the shrinker is passing a node in the scan control structure, we
    can pass this to the generic LRU list code to isolate reclaim to the
    lists on matching nodes.

    Signed-off-by: Dave Chinner
    Signed-off-by: Glauber Costa
    Acked-by: Mel Gorman
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Dave Chinner
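
    Sketched, a node-aware shrinker callout just forwards sc->nid to the
    per-node list_lru operations ("demo_lru" and "demo_isolate" are
    hypothetical stand-ins for a real cache and its isolate callback):

        static unsigned long demo_cache_count(struct shrinker *shrink,
                                              struct shrink_control *sc)
        {
                return list_lru_count_node(&demo_lru, sc->nid);
        }

        static unsigned long demo_cache_scan(struct shrinker *shrink,
                                             struct shrink_control *sc)
        {
                unsigned long nr_to_walk = sc->nr_to_scan;

                /* only touch objects living on the node being reclaimed */
                return list_lru_walk_node(&demo_lru, sc->nid, demo_isolate,
                                          NULL, &nr_to_walk);
        }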
     
  • The list_lru implementation has one function, list_lru_dispose_all, with
    only one user (the dentry code). At first, such a function appears to make
    sense because we are really not interested in the result of isolating each
    dentry separately - all of them are going away anyway. However, its
    implementation is buggy in the following way:

    When we call list_lru_dispose_all in fs/dcache.c, we scan all dentries
    marking them with DCACHE_SHRINK_LIST. However, this is done without the
    nlru->lock taken. The immediate result of that is that someone else may
    add or remove the dentry from the LRU at the same time. When list_lru_del
    happens in that scenario we will see an element that is not yet marked
    with DCACHE_SHRINK_LIST (even though it will be in the future) and
    obviously remove it from an lru where the element no longer is. Since
    list_lru_dispose_all will in effect count down nlru's nr_items and
    list_lru_del will do the same, this will lead to an imbalance.

    The solution for this would not be so simple: we can obviously just keep
    the lru_lock taken, but then we have no guarantees that we will be able to
    acquire the dentry lock (dentry->d_lock). To properly solve this, we need
    a communication mechanism between the lru and dentry code, so they can
    coordinate this with each other.

    Such mechanism already exists in the form of the list_lru_walk_cb
    callback. So it is possible to construct a dcache-side prune function
    that does the right thing only by calling list_lru_walk in a loop until no
    more dentries are available.

    With only one user, plus the fact that a sane solution for the problem
    would involve bouncing between dcache and list_lru anyway, I see little
    justification to keep the special case list_lru_dispose_all in tree.

    Signed-off-by: Glauber Costa
    Cc: Michal Hocko
    Acked-by: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Glauber Costa
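
    The replacement pattern on the dcache side looks roughly like this
    (a sketch; the isolate callback moves each dentry onto the private
    dispose list while nlru->lock is held, and the lock is dropped before
    the dentries are actually killed):

        void shrink_dcache_sb(struct super_block *sb)
        {
                long freed;

                do {
                        LIST_HEAD(dispose);

                        freed = list_lru_walk(&sb->s_dentry_lru,
                                              dentry_lru_isolate_shrink,
                                              &dispose, UINT_MAX);
                        shrink_dentry_list(&dispose);
                } while (freed > 0);
        }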
     
  • [glommer@openvz.org: don't reintroduce double decrement of nr_unused_dentries, adapted for new LRU return codes]
    Signed-off-by: Dave Chinner
    Signed-off-by: Glauber Costa
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Convert superblock shrinker to use the new count/scan API, and propagate
    the API changes through to the filesystem callouts. The filesystem
    callouts already use a count/scan API, so it's just changing counters to
    longs to match the VM API.

    This requires the dentry and inode shrinker callouts to be converted to
    the count/scan API. This is mainly a mechanical change.

    [glommer@openvz.org: use mult_frac for fractional proportions, build fixes]
    Signed-off-by: Dave Chinner
    Signed-off-by: Glauber Costa
    Acked-by: Mel Gorman
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton

    Signed-off-by: Al Viro

    Dave Chinner
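
    In sketch form, the new shrinker API splits the old ->shrink()
    callback in two (reusing the hypothetical demo callbacks from the
    node-aware sketch above):

        static struct shrinker demo_shrinker = {
                .count_objects  = demo_cache_count, /* "how many could you free?" */
                .scan_objects   = demo_cache_scan,  /* "free up to sc->nr_to_scan" */
                .seeks          = DEFAULT_SEEKS,
        };

        /* at init time */
        register_shrinker(&demo_shrinker);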
     
  • One of the big problems with modifying the way the dcache shrinker and LRU
    implementation works is that the LRU is abused in several ways. One of
    these is shrink_dentry_list().

    Basically, we can move a dentry off the LRU onto a different list without
    doing any accounting changes, and then use dentry_lru_prune() to remove it
    from whatever list it is now on to do the LRU accounting at that point.

    This makes it -really hard- to change the LRU implementation. The use of
    the per-sb LRU lock serialises movement of the dentries between the
    different lists and the removal of them, and this is the only reason that
    it works. If we want to break up the dentry LRU lock and lists into, say,
    per-node lists, we remove the only serialisation that allows this lru
    list/dispose list abuse to work.

    To make this work effectively, the dispose list has to be isolated from
    the LRU list - dentries have to be removed from the LRU *before* being
    placed on the dispose list. This means that the LRU accounting and
    isolation is completed before disposal is started, and that means we can
    change the LRU implementation freely in future.

    This means that dentries *must* be marked with DCACHE_SHRINK_LIST when
    they are placed on the dispose list so that we don't think that parent
    dentries found in try_prune_one_dentry() are on the LRU when they are
    actually on the dispose list. This would result in accounting the dentry
    to the LRU a second time. Hence dentry_lru_del() has to handle the
    DCACHE_SHRINK_LIST case.

    Signed-off-by: Dave Chinner
    Signed-off-by: Glauber Costa
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Dave Chinner
     
  • With the dentry LRUs being per-sb structures, there is no real need for
    a global dentry_lru_lock. The locking can be made more fine-grained by
    moving to a per-sb LRU lock, isolating the LRU operations of different
    filesystems completely from each other. The need for this is independent
    of any performance consideration that may arise: in the interest of
    abstracting the lru operations away, it is mandatory that each lru works
    around its own lock instead of a global lock for all of them.

    [glommer@openvz.org: updated changelog ]
    Signed-off-by: Dave Chinner
    Signed-off-by: Glauber Costa
    Reviewed-by: Christoph Hellwig
    Acked-by: Mel Gorman
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton

    Signed-off-by: Al Viro

    Dave Chinner
     
  • Before we split up the dcache_lru_lock, the unused dentry counter needs to
    be made independent of the global dcache_lru_lock. Convert it to per-cpu
    counters to do this.

    Signed-off-by: Dave Chinner
    Signed-off-by: Glauber Costa
    Reviewed-by: Christoph Hellwig
    Acked-by: Mel Gorman
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Dave Chinner
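
    Sketch of the per-cpu form (hot paths only this_cpu_inc()/dec() the
    local counter; the summed value can be momentarily negative because
    of per-cpu skew, hence the clamp when reporting it):

        static DEFINE_PER_CPU(long, nr_dentry_unused);

        static long get_nr_dentry_unused(void)
        {
                int i;
                long sum = 0;

                for_each_possible_cpu(i)
                        sum += per_cpu(nr_dentry_unused, i);
                return sum < 0 ? 0 : sum;
        }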
     
  • This series reworks our current object cache shrinking infrastructure in
    two main ways:

    * Noticing that a lot of users copy and paste their own version of LRU
    lists for objects, we put some effort in providing a generic version.
    It is modeled after the filesystem users: dentries, inodes, and xfs
    (for various tasks), but we expect that other users could benefit in
    the near future with little or no modification. Let us know if you
    have any issues.

    * The underlying list_lru being proposed automatically and
    transparently keeps the elements in per-node lists, and is able to
    manipulate the node lists individually. Given this infrastructure, we
    are able to modify the up-to-now hammer called shrink_slab to proceed
    with node-reclaim instead of always searching memory from all over like
    it has been doing.

    Per-node lru lists are also expected to lead to less contention in the lru
    locks on multi-node scans, since we are now no longer fighting for a
    global lock. The locks usually disappear from the profilers with this
    change.

    Although we have no official benchmarks for this version - be our guest to
    independently evaluate this - earlier versions of this series were
    performance tested (details at
    http://permalink.gmane.org/gmane.linux.kernel.mm/100537) showing no
    visible performance regressions while yielding better qualitative
    behavior in NUMA machines.

    With this infrastructure in place, we can use the list_lru entry point to
    provide memcg isolation and per-memcg targeted reclaim. Historically,
    those two pieces of work have been posted together. This version presents
    only the infrastructure work, deferring the memcg work for a later time,
    so we can focus on getting this part tested. You can see more about the
    history of such work at http://lwn.net/Articles/552769/

    Dave Chinner (18):
    dcache: convert dentry_stat.nr_unused to per-cpu counters
    dentry: move to per-sb LRU locks
    dcache: remove dentries from LRU before putting on dispose list
    mm: new shrinker API
    shrinker: convert superblock shrinkers to new API
    list: add a new LRU list type
    inode: convert inode lru list to generic lru list code.
    dcache: convert to use new lru list infrastructure
    list_lru: per-node list infrastructure
    shrinker: add node awareness
    fs: convert inode and dentry shrinking to be node aware
    xfs: convert buftarg LRU to generic code
    xfs: rework buffer dispose list tracking
    xfs: convert dquot cache lru to list_lru
    fs: convert fs shrinkers to new scan/count API
    drivers: convert shrinkers to new count/scan API
    shrinker: convert remaining shrinkers to count/scan API
    shrinker: Kill old ->shrink API.

    Glauber Costa (7):
    fs: bump inode and dentry counters to long
    super: fix calculation of shrinkable objects for small numbers
    list_lru: per-node API
    vmscan: per-node deferred work
    i915: bail out earlier when shrinker cannot acquire mutex
    hugepage: convert huge zero page shrinker to new shrinker API
    list_lru: dynamically adjust node arrays

    This patch:

    There are situations in very large machines in which we can have a large
    quantity of dirty inodes, unused dentries, etc. This is particularly true
    when unmounting a filesystem, where every live object will eventually be
    discarded.

    Dave Chinner reported a problem with this while experimenting with the
    shrinker revamp patchset. So we believe it is time for a change. This
    patch just moves the counters from int to long; machines where it matters
    have 64-bit longs anyway.

    Signed-off-by: Glauber Costa
    Cc: Dave Chinner
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: Dave Chinner
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Glauber Costa
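
    The "This patch" part is purely mechanical; sketched against the
    existing dentry_stat_t layout:

        struct dentry_stat_t {
                long nr_dentry;
                long nr_unused;
                long age_limit;         /* age in seconds */
                long want_pages;        /* pages requested by system */
                long dummy[2];          /* all of these were ints before */
        };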
     
  • Pull vfs pile 3 (of many) from Al Viro:
    "Waiman's conversion of d_path() and bits related to it,
    kern_path_mountpoint(), several cleanups and fixes (exportfs
    one is -stable fodder, IMO).

    There definitely will be more... ;-/"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    split read_seqretry_or_unlock(), convert d_walk() to resulting primitives
    dcache: Translating dentry into pathname without taking rename_lock
    autofs4 - fix device ioctl mount lookup
    introduce kern_path_mountpoint()
    rename user_path_umountat() to user_path_mountpoint_at()
    take unlazy_walk() into umount_lookup_last()
    Kill indirect include of file.h from eventfd.h, use fdget() in cgroup.c
    prune_super(): sb->s_op is never NULL
    exportfs: don't assume that ->iterate() won't feed us too long entries
    afs: get rid of redundant ->d_name.len checks

    Linus Torvalds
     

10 Sep, 2013

2 commits

  • Separate "check if we need to retry" from "unlock if we are done and
    had seq_writelock"; that allows us to use these guys in d_walk(), where
    we need to recheck every time we ascend back to parent, but do *not*
    want to unlock until the very end. Lift rcu_read_lock/rcu_read_unlock
    out into callers.

    Signed-off-by: Al Viro

    Al Viro
     
  • When running the AIM7's short workload, Linus' lockref patch eliminated
    most of the spinlock contention. However, there were still some left:

    8.46% reaim [kernel.kallsyms] [k] _raw_spin_lock
    |--42.21%-- d_path
    | proc_pid_readlink
    | SyS_readlinkat
    | SyS_readlink
    | system_call
    | __GI___readlink
    |
    |--40.97%-- sys_getcwd
    | system_call
    | __getcwd

    The big one here is the rename_lock (seqlock) contention in d_path()
    and the getcwd system call. This patch will eliminate the need to take
    the rename_lock while translating dentries into the full pathnames.

    The need to take the rename_lock is to make sure that no rename
    operation can be ongoing while the translation is in progress. However,
    only one thread can take the rename_lock thus blocking all the other
    threads that need it even though the translation process won't make
    any change to the dentries.

    This patch will replace the writer's write_seqlock/write_sequnlock
    sequence of the rename_lock of the callers of the prepend_path() and
    __dentry_path() functions with the reader's read_seqbegin/read_seqretry
    sequence within these 2 functions. As a result, the code will have to
    retry if one or more rename operations had been performed. In addition,
    RCU read lock will be taken during the translation process to make sure
    that no dentries will go away. To prevent live-lock from happening,
    the code will switch back to take the rename_lock if read_seqretry()
    fails three times.

    To further reduce spinlock contention, this patch does not take the
    dentry's d_lock when copying the filename from the dentries. Instead,
    it treats the name pointer and length as unreliable and just copies
    the string byte by byte until it hits a null byte or the end of the
    string as specified by the length. This should avoid stepping into an
    invalid memory address. The error cases are left to be handled by
    the sequence number check.

    The following code refactorings were also made:
    1. Move prepend('/') into prepend_name() to remove one conditional
    check.
    2. Move the global root check in prepend_path() back to the top of
    the while loop.

    With this patch, the _raw_spin_lock will now account for only 1.2%
    of the total CPU cycles for the short workload. This patch also has
    the effect of reducing the effect of running perf on its profile
    since the perf command itself can be a heavy user of the d_path()
    function depending on the complexity of the workload.

    When taking the perf profile of the high-systime workload, the amount
    of spinlock contention contributed by running perf without this patch
    was about 16%. With this patch, the spinlock contention caused by
    the running of perf will go away and we will have a more accurate
    perf profile.

    Signed-off-by: Waiman Long
    Signed-off-by: Al Viro

    Waiman Long
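
    The lockless name copy described above, in sketch form (close to the
    idea rather than the literal patch):

        static int prepend_name(char **buffer, int *buflen, struct qstr *name)
        {
                const char *dname = ACCESS_ONCE(name->name);    /* unreliable */
                u32 dlen = ACCESS_ONCE(name->len);              /* unreliable */
                char *p;

                *buflen -= dlen + 1;
                if (*buflen < 0)
                        return -ENAMETOOLONG;
                p = *buffer -= dlen + 1;
                *p++ = '/';                     /* prepend('/') folded in here */
                while (dlen--) {
                        char c = *dname++;
                        if (!c)                 /* stop early on a NUL byte; */
                                break;          /* the seq check catches races */
                        *p++ = c;
                }
                return 0;
        }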
     

09 Sep, 2013

2 commits

  • This simplifies the RCU to refcounting code in particular.

    I was originally intending to leave this for later, but walking through
    all the dput() logic (see previous commit), I realized that the dput()
    "might_sleep()" check was misleadingly weak. And I removed it as
    misleading, both for performance profiling and for debugging.

    However, the might_sleep() debugging case is actually true: the final
    dput() can indeed sleep, if the inode of the dentry that you are
    releasing ends up sleeping at iput time (see dentry_iput()). So the
    problem with the might_sleep() in dput() wasn't that it wasn't true, it
    was that it wasn't actually testing and triggering on the interesting
    case.

    In particular, just about *any* dput() can indeed sleep, if you happen
    to race with another thread deleting the file in question, and you then
    lose the race to be the last dput() for that file. But because it's
    a very rare race, the debugging code would never trigger it in practice.

    Why is this problematic? The new d_rcu_to_refcount() (see commit
    15570086b590: "vfs: reimplement d_rcu_to_refcount() using
    lockref_get_or_lock()") does a dput() for the failure case, and it does
    it under the RCU lock. So potentially sleeping really is a bug.

    But there's no way I'm going to fix this with the previous complicated
    "lockref_get_or_lock()" interface. And rather than revert to the old
    and crufty nested dentry locking code (which did get this right by
    delaying the reference count updates until they were verified to be
    safe), let's make forward progress.

    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • This is me being a bit OCD after all the dentry optimization work this
    merge window: profiles end up showing 'dput()' as a rather expensive
    operation, and there were two unrelated bad reasons for that.

    The first reason was reading d_lockref.count for debugging purposes,
    which touches the lockref cacheline (for reads) before we really need to.
    More importantly, the debugging test in question is _wrong_, and has
    hidden bugs. It's true that we can only sleep when the count goes down
    to zero, but the test as-is hides the much more subtle bug that happens
    if we race with somebody else deleting the file.

    Anyway we _will_ touch that cacheline, but let's do it for a write and
    in the right routine (ie in "lockref_put_or_lock()") which annotates the
    costs better. So remove the misleading debug code.

    The other was an unnecessary access to the cacheline that contains the
    d_lru list, just to check whether we already were on the LRU list or
    not. This is exactly what we have d_flags for, so that we can avoid
    touching extra cache lines for the common case. So just add another bit
    for "is this dentry on the LRU".

    Finally, mark the tests properly likely/unlikely, so that the common
    fast-paths are dense in the instruction stream.

    This makes the profiles look much saner.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
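
    Sketch of the d_flags test: the flag lives in a word that is already
    hot, so the common path never has to look at the d_lru list_head at
    all ("__dentry_lru_add" stands in for whatever the slow path is
    called):

        static void dentry_lru_add(struct dentry *dentry)
        {
                if (unlikely(!(dentry->d_flags & DCACHE_LRU_LIST)))
                        __dentry_lru_add(dentry);   /* sets the bit, links in */
        }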
     

06 Sep, 2013

4 commits

  • We check submounts before doing d_drop() on a non-empty directory dentry in
    NFS (have_submounts()), but we do not exclude a racing mount. Nor do we
    prevent mounts from being added to the disconnected subtree using relative
    paths after the d_drop().

    This patch fixes these issues by checking for unlinked (unhashed, non-root)
    ancestors before proceeding with the mount. This is done with rename
    seqlock taken for write and with ->d_lock grabbed on each ancestor in turn,
    including our dentry itself. This ensures that only one of
    check_submounts_and_drop() or has_unlinked_ancestor() can succeed.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Miklos Szeredi
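
    A sketch of the ancestor check described above (hypothetical shape;
    the real patch integrates this with the mount-side locking):

        static bool has_unlinked_ancestor(struct dentry *dentry)
        {
                struct dentry *p;

                /* caller holds rename_lock for write, so d_parent is stable */
                for (p = dentry; !IS_ROOT(p); p = p->d_parent) {
                        bool unlinked;

                        spin_lock(&p->d_lock);
                        unlinked = d_unhashed(p);
                        spin_unlock(&p->d_lock);
                        if (unlinked)
                                return true;
                }
                return false;
        }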
     
  • We check submounts before doing d_drop() on a non-empty directory dentry in
    NFS (have_submounts()), but we do not exclude a racing mount.

    Process A: have_submounts() -> returns false
    Process B: mount() -> success
    Process A: d_drop()

    This patch prepares the ground for the fix by doing the following
    operations all under the same rename lock:

    have_submounts()
    shrink_dcache_parent()
    d_drop()

    This is actually an optimization since have_submounts() and
    shrink_dcache_parent() both traverse the same dentry tree separately.

    Signed-off-by: Miklos Szeredi
    CC: David Howells
    CC: Steven Whitehouse
    CC: Trond Myklebust
    CC: Greg Kroah-Hartman
    Signed-off-by: Al Viro

    Miklos Szeredi
     
  • This one replaces three instances of open-coded tree walking (have_submounts,
    select_parent, d_genocide) with a common helper.

    In addition to slightly reducing the kernel size, this simplifies the
    callers and makes them less bug prone.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Miklos Szeredi
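
    The shape of the common helper and a converted caller, sketched
    (illustrative; check_mount() is the callback the kernel-doc fix at the
    top of this log refers to):

        enum d_walk_ret {
                D_WALK_CONTINUE,
                D_WALK_QUIT,
                D_WALK_NORETRY,
                D_WALK_SKIP,
        };

        static void d_walk(struct dentry *parent, void *data,
                           enum d_walk_ret (*enter)(void *, struct dentry *),
                           void (*finish)(void *));

        static enum d_walk_ret check_mount(void *data, struct dentry *dentry)
        {
                int *ret = data;

                if (d_mountpoint(dentry)) {
                        *ret = 1;
                        return D_WALK_QUIT;     /* found one, stop walking */
                }
                return D_WALK_CONTINUE;
        }

        int have_submounts(struct dentry *parent)
        {
                int ret = 0;

                d_walk(parent, &ret, check_mount, NULL);
                return ret;
        }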
     
  • It shouldn't matter when we decrement the refcount during the walk as long
    as we do it exactly once.

    Restructure d_genocide() to do the killing on entering the dentry instead
    of when leaving it. This helps creating a common helper for tree walking.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Miklos Szeredi
     

05 Sep, 2013

1 commit

  • Pull vfs pile 1 from Al Viro:
    "Unfortunately, this merge window it'll have a be a lot of small piles -
    my fault, actually, for not keeping #for-next in anything that would
    resemble a sane shape ;-/

    This pile: assorted fixes (the first 3 are -stable fodder, IMO) and
    cleanups + %pd/%pD formats (dentry/file pathname, up to 4 last
    components) + several long-standing patches from various folks.

    There definitely will be a lot more (starting with Miklos'
    check_submount_and_drop() series)"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
    direct-io: Handle O_(D)SYNC AIO
    direct-io: Implement generic deferred AIO completions
    add formats for dentry/file pathnames
    kvm eventfd: switch to fdget
    powerpc kvm: use fdget
    switch fchmod() to fdget
    switch epoll_ctl() to fdget
    switch copy_module_from_fd() to fdget
    git simplify nilfs check for busy subtree
    ibmasmfs: don't bother passing superblock when not needed
    don't pass superblock to hypfs_{mkdir,create*}
    don't pass superblock to hypfs_diag_create_files
    don't pass superblock to hypfs_vm_create_files()
    oprofile: get rid of pointless forward declarations of struct super_block
    oprofilefs_create_...() do not need superblock argument
    oprofilefs_mkdir() doesn't need superblock argument
    don't bother with passing superblock to oprofile_create_stats_files()
    oprofile: don't bother with passing superblock to ->create_files()
    don't bother passing sb to oprofile_create_files()
    coh901318: don't open-code simple_read_from_buffer()
    ...

    Linus Torvalds
     

04 Sep, 2013

1 commit

  • The d_prune dentry operation is used to notify the filesystem when the
    VFS is about to prune a hashed dentry from the dcache. There are three
    code paths that prune dentries: shrink_dcache_for_umount_subtree(),
    prune_dcache_sb() and d_prune_aliases(). For the d_prune_aliases()
    case, the VFS unhashes the dentry first, then calls the d_prune dentry
    operation. This confuses ceph_d_prune() (ceph uses the d_prune dentry
    operation to maintain a flag indicating whether the complete contents
    of a directory are in the dcache; pruning an unhashed dentry does not
    affect the directory's completeness).

    This patch fixes the issue by calling the d_prune dentry operation
    in d_prune_aliases(), before unhashing the dentry. Also make the VFS
    call the d_prune dentry operation only for hashed dentries, to avoid
    calling the d_prune dentry operation twice when a dentry is pruned
    by d_prune_aliases().

    Signed-off-by: Yan, Zheng
    Signed-off-by: Al Viro

    Yan, Zheng
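
    Sketched, the ordering inside d_prune_aliases() becomes (fragment,
    under dentry->d_lock):

        /* notify the filesystem while the dentry is still hashed ... */
        if (dentry->d_flags & DCACHE_OP_PRUNE)
                dentry->d_op->d_prune(dentry);
        /* ... and only then take it out of the hash */
        __d_drop(dentry);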
     

03 Sep, 2013

2 commits

  • This moves __d_rcu_to_refcount() from the dcache header into fs/namei.c
    and re-implements it using the lockref infrastructure instead. It also
    adds a lot of comments about what is actually going on, because turning
    a dentry that was looked up using RCU into a long-lived reference
    counted entry is one of the more subtle parts of the rcu walk.

    We also used to be _particularly_ subtle in unlazy_walk() where we
    re-validate both the dentry and its parent using the same sequence
    count. We used to do it by nesting the locks and then verifying the
    sequence count just once.

    That was silly, because nested locking is expensive, but the sequence
    count check is not. So this just re-validates the dentry and the parent
    separately, avoiding the nested locking, and making the lockref lookup
    possible.

    Acked-by: Waiman Long
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • A valid parent pointer is always going to have a non-zero reference
    count, but if we look up the parent optimistically without locking, we
    have to protect against the (very unlikely) race against renaming
    changing the parent from under us.

    We do that by using lockref_get_not_zero(), and then re-checking the
    parent pointer after getting a valid reference.

    [ This is a re-implementation of a chunk from the original patch by
    Waiman Long: "dcache: Enable lockless update of dentry's refcount".
    I've completely rewritten the patch-series and split it up, but I'm
    attributing this part to Waiman as it's close enough to his earlier
    patch - Linus ]

    Signed-off-by: Waiman Long
    Signed-off-by: Linus Torvalds

    Waiman Long
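
    Sketch of the optimistic path (the helper name is made up for the
    illustration; the slow path keeps the existing locked sequence):

        /* fast path only; returns NULL if the caller must fall back to locking */
        static struct dentry *dget_parent_fastpath(struct dentry *dentry)
        {
                struct dentry *ret;

                rcu_read_lock();
                ret = ACCESS_ONCE(dentry->d_parent);
                if (!lockref_get_not_zero(&ret->d_lockref)) {
                        rcu_read_unlock();
                        return NULL;    /* count was zero: take the locks */
                }
                rcu_read_unlock();

                /* a concurrent rename may have changed d_parent: re-check */
                if (likely(ret == ACCESS_ONCE(dentry->d_parent)))
                        return ret;

                dput(ret);      /* we pinned the wrong dentry, undo and retry */
                return NULL;
        }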
     

29 Aug, 2013

1 commit

  • This just replaces the dentry count/lock combination with the lockref
    structure that contains both a count and a spinlock, and does the
    mechanical conversion to use the lockref infrastructure.

    There are no semantic changes here, it's purely syntactic. The
    reference lockref implementation uses the spinlock exactly the same way
    that the old dcache code did, and the bulk of this patch is just
    expanding the internal "d_count" use in the dcache code to use
    "d_lockref.count" instead.

    This is purely preparation for the real change to make the reference
    count updates be lockless during the 3.12 merge window.

    [ As with the previous commit, this is a rewritten version of a concept
    originally from Waiman, so credit goes to him, blame for any errors
    goes to me.

    Waiman's patch had some semantic differences for taking advantage of
    the lockless update in dget_parent(), while this patch is
    intentionally a pure search-and-replace change with no semantic
    changes. - Linus ]

    Signed-off-by: Waiman Long
    Signed-off-by: Linus Torvalds

    Waiman Long
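
    The conversion is of this form throughout fs/dcache.c (diff-style
    sketch of the pattern):

        -       unsigned int d_count;           /* sat next to d_lock before */
        +       struct lockref d_lockref;       /* lock and count kept together */

        -       dentry->d_count++;              /* under dentry->d_lock */
        +       dentry->d_lockref.count++;      /* same locking, new home */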
     

25 Aug, 2013

1 commit


04 Jul, 2013

1 commit

  • Pull second set of VFS changes from Al Viro:
    "Assorted f_pos race fixes, making do_splice_direct() safe to call with
    i_mutex on parent, O_TMPFILE support, Jeff's locks.c series,
    ->d_hash/->d_compare calling conventions changes from Linus, misc
    stuff all over the place."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    Document ->tmpfile()
    ext4: ->tmpfile() support
    vfs: export lseek_execute() to modules
    lseek_execute() doesn't need an inode passed to it
    block_dev: switch to fixed_size_llseek()
    cpqphp_sysfs: switch to fixed_size_llseek()
    tile-srom: switch to fixed_size_llseek()
    proc_powerpc: switch to fixed_size_llseek()
    ubi/cdev: switch to fixed_size_llseek()
    pci/proc: switch to fixed_size_llseek()
    isapnp: switch to fixed_size_llseek()
    lpfc: switch to fixed_size_llseek()
    locks: give the blocked_hash its own spinlock
    locks: add a new "lm_owner_key" lock operation
    locks: turn the blocked_list into a hashtable
    locks: convert fl_link to a hlist_node
    locks: avoid taking global lock if possible when waking up blocked waiters
    locks: protect most of the file_lock handling with i_lock
    locks: encapsulate the fl_link list handling
    locks: make "added" in __posix_lock_file a bool
    ...

    Linus Torvalds
     

29 Jun, 2013

3 commits


14 Jun, 2013

1 commit

  • I've restricted atomic_open to only operate on regular files, although
    I still don't understand why atomic_open should not be possible also for
    directories on GFS2. That can always be added in later though, if it
    makes sense.

    The ->atomic_open function can be passed negative dentries, which
    in most cases means either ENOENT (->lookup) or a call to d_instantiate
    (->create). In the GFS2 case though, we need to actually perform the
    look up, since we do not know whether there has been a new inode created
    on another node. The look up calls d_splice_alias which then tries to
    rehash the dentry - so the solution here is to simply check for that
    in d_splice_alias. The same issue is likely to affect any other cluster
    filesystem implementing ->atomic_open.

    Signed-off-by: Steven Whitehouse
    Cc: Al Viro
    Cc: "J. Bruce Fields"
    Cc: Jeff Layton

    Steven Whitehouse
     

05 May, 2013

2 commits

  • Use list_move() instead of list_del() + list_add().

    Signed-off-by: Wei Yongjun
    Signed-off-by: Al Viro

    Wei Yongjun
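
    I.e., the pattern is (sketch, with the target list shown generically):

        -       list_del(&dentry->d_lru);
        -       list_add(&dentry->d_lru, list);
        +       list_move(&dentry->d_lru, list);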
     
  • When pruning a dentry, its ancestor dentry can also be pruned. But
    the ancestor dentry does not go through dput(), so it does not get
    put on the dentry LRU. Hence associating d_prune with removing the
    dentry from the LRU is wrong.

    The fix is to remove dentry_lru_prune() and call the filesystem's d_prune()
    callback directly when pruning dentries.

    Signed-off-by: Yan, Zheng
    Signed-off-by: Al Viro

    Yan, Zheng