04 Jul, 2013

1 commit

  • Pull second set of VFS changes from Al Viro:
    "Assorted f_pos race fixes, making do_splice_direct() safe to call with
    i_mutex on parent, O_TMPFILE support, Jeff's locks.c series,
    ->d_hash/->d_compare calling conventions changes from Linus, misc
    stuff all over the place."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    Document ->tmpfile()
    ext4: ->tmpfile() support
    vfs: export lseek_execute() to modules
    lseek_execute() doesn't need an inode passed to it
    block_dev: switch to fixed_size_llseek()
    cpqphp_sysfs: switch to fixed_size_llseek()
    tile-srom: switch to fixed_size_llseek()
    proc_powerpc: switch to fixed_size_llseek()
    ubi/cdev: switch to fixed_size_llseek()
    pci/proc: switch to fixed_size_llseek()
    isapnp: switch to fixed_size_llseek()
    lpfc: switch to fixed_size_llseek()
    locks: give the blocked_hash its own spinlock
    locks: add a new "lm_owner_key" lock operation
    locks: turn the blocked_list into a hashtable
    locks: convert fl_link to a hlist_node
    locks: avoid taking global lock if possible when waking up blocked waiters
    locks: protect most of the file_lock handling with i_lock
    locks: encapsulate the fl_link list handling
    locks: make "added" in __posix_lock_file a bool
    ...

    Linus Torvalds
     

29 Jun, 2013

3 commits


14 Jun, 2013

1 commit

  • I've restricted atomic_open to only operate on regular files, although
    I still don't understand why atomic_open should not be possible also for
    directories on GFS2. That can always be added in later though, if it
    makes sense.

    The ->atomic_open function can be passed negative dentries, which
    in most cases means either ENOENT (->lookup) or a call to d_instantiate
    (->create). In the GFS2 case though, we need to actually perform the
    look up, since we do not know whether there has been a new inode created
    on another node. The look up calls d_splice_alias which then tries to
    rehash the dentry - so the solution here is to simply check for that
    in d_splice_alias. The same issue is likely to affect any other cluster
    filesystem implementing ->atomic_open

    Signed-off-by: Steven Whitehouse
    Cc: Al Viro
    Cc: "J. Bruce Fields"
    Cc: Jeff Layton

    Steven Whitehouse
     

05 May, 2013

2 commits

  • Using list_move() instead of list_del() + list_add().

    Signed-off-by: Wei Yongjun
    Signed-off-by: Al Viro

    Wei Yongjun
     
  • When pruning a dentry, its ancestor dentry can also be pruned. But
    the ancestor dentry does not go through dput(), so it does not get
    put on the dentry LRU. Hence associating d_prune with removing the
    dentry from the LRU is the wrong.

    The fix is remove dentry_lru_prune(). Call file system's d_prune()
    callback directly when pruning dentries.

    Signed-off-by: Yan, Zheng
    Signed-off-by: Al Viro

    Yan, Zheng
     

01 May, 2013

1 commit

  • Call cond_resched() in shrink_dcache_parent() to maintain interactivity.

    Before this patch:

    void shrink_dcache_parent(struct dentry * parent)
    {
    while ((found = select_parent(parent, &dispose)) != 0)
    shrink_dentry_list(&dispose);
    }

    select_parent() populates the dispose list with dentries which
    shrink_dentry_list() then deletes. select_parent() carefully uses
    need_resched() to avoid doing too much work at once. But neither
    shrink_dcache_parent() nor its called functions call cond_resched(). So
    once need_resched() is set select_parent() will return single dentry
    dispose list which is then deleted by shrink_dentry_list(). This is
    inefficient when there are a lot of dentry to process. This can cause
    softlockup and hurts interactivity on non preemptable kernels.

    This change adds cond_resched() in shrink_dcache_parent(). The benefit
    of this is that need_resched() is quickly cleared so that future calls
    to select_parent() are able to efficiently return a big batch of dentry.

    These additional cond_resched() do not seem to impact performance, at
    least for the workload below.

    Here is a program which can cause soft lockup if other system activity
    sets need_resched().

    int main()
    {
    struct rlimit rlim;
    int i;
    int f[100000];
    char buf[20];
    struct timeval t1, t2;
    double diff;

    /* cleanup past run */
    system("rm -rf x");

    /* boost nfile rlimit */
    rlim.rlim_cur = 200000;
    rlim.rlim_max = 200000;
    if (setrlimit(RLIMIT_NOFILE, &rlim))
    err(1, "setrlimit");

    /* make directory for files */
    if (mkdir("x", 0700))
    err(1, "mkdir");

    if (gettimeofday(&t1, NULL))
    err(1, "gettimeofday");

    /* populate directory with open files */
    for (i = 0; i < 100000; i++) {
    snprintf(buf, sizeof(buf), "x/%d", i);
    f[i] = open(buf, O_CREAT);
    if (f[i] == -1)
    err(1, "open");
    }

    /* close some of the files */
    for (i = 0; i < 85000; i++)
    close(f[i]);

    /* unlink all files, even open ones */
    system("rm -rf x");

    if (gettimeofday(&t2, NULL))
    err(1, "gettimeofday");

    diff = (((double)t2.tv_sec * 1000000 + t2.tv_usec) -
    ((double)t1.tv_sec * 1000000 + t1.tv_usec));

    printf("done: %g elapsed\n", diff/1e6);
    return 0;
    }

    Signed-off-by: Greg Thelen
    Signed-off-by: Dave Chinner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     

27 Mar, 2013

1 commit

  • ... lest we get livelocks between path_is_under() and d_path() and friends.

    The thing is, wrt fairness lglocks are more similar to rwsems than to rwlocks;
    it is possible to have thread B spin on attempt to take lock shared while thread
    A is already holding it shared, if B is on lower-numbered CPU than A and there's
    a thread C spinning on attempt to take the same lock exclusive.

    As the result, we need consistent ordering between vfsmount_lock (lglock) and
    rename_lock (seq_lock), even though everything that takes both is going to take
    vfsmount_lock only shared.

    Spotted-by: Brad Spengler
    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Al Viro
     

28 Feb, 2013

1 commit

  • I'm not sure why, but the hlist for each entry iterators were conceived

    list_for_each_entry(pos, head, member)

    The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

    Why did they need an extra pos parameter? I'm not quite sure. Not only
    they don't really need it, it also prevents the iterator from looking
    exactly like the list iterator, which is unfortunate.

    Besides the semantic patch, there was some manual work required:

    - Fix up the actual hlist iterators in linux/list.h
    - Fix up the declaration of other iterators based on the hlist ones.
    - A very small amount of places were using the 'node' parameter, this
    was modified to use 'obj->member' instead.
    - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
    properly, so those had to be fixed up manually.

    The semantic patch which is mostly the work of Peter Senna Tschudin is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;

    [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
    [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: fix warnings]
    [akpm@linux-foudnation.org: redo intrusive kvm changes]
    Tested-by: Peter Senna Tschudin
    Acked-by: Paul E. McKenney
    Signed-off-by: Sasha Levin
    Cc: Wu Fengguang
    Cc: Marcelo Tosatti
    Cc: Gleb Natapov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

26 Feb, 2013

2 commits

  • The following set of operations on a NFS client and server will cause

    server# mkdir a
    client# cd a
    server# mv a a.bak
    client# sleep 30 # (or whatever the dir attrcache timeout is)
    client# stat .
    stat: cannot stat `.': Stale NFS file handle

    Obviously, we should not be getting an ESTALE error back there since the
    inode still exists on the server. The problem is that the lookup code
    will call d_revalidate on the dentry that "." refers to, because NFS has
    FS_REVAL_DOT set.

    nfs_lookup_revalidate will see that the parent directory has changed and
    will try to reverify the dentry by redoing a LOOKUP. That of course
    fails, so the lookup code returns ESTALE.

    The problem here is that d_revalidate is really a bad fit for this case.
    What we really want to know at this point is whether the inode is still
    good or not, but we don't really care what name it goes by or whether
    the dcache is still valid.

    Add a new d_op->d_weak_revalidate operation and have complete_walk call
    that instead of d_revalidate. The intent there is to allow for a
    "weaker" d_revalidate that just checks to see whether the inode is still
    good. This is also gives us an opportunity to kill off the FS_REVAL_DOT
    special casing.

    [AV: changed method name, added note in porting, fixed confusion re
    having it possibly called from RCU mode (it won't be)]

    Cc: NeilBrown
    Signed-off-by: Jeff Layton
    Signed-off-by: Al Viro

    Jeff Layton
     
  • * calling conventions change - ERR_PTR() is returned on ->d_hash() errors;
    NULL is just for dcache miss now.
    * exported, open-coded instances in ncpfs and cifs converted.

    Signed-off-by: Al Viro

    Al Viro
     

23 Feb, 2013

4 commits


21 Dec, 2012

2 commits

  • NFS appears to use d_obtain_alias() to create the root dentry rather than
    d_make_root. This can cause 'prepend_path()' to complain that the root
    has a weird name if an NFS filesystem is lazily unmounted. e.g. if
    "/mnt" is an NFS mount then

    { cd /mnt; umount -l /mnt ; ls -l /proc/self/cwd; }

    will cause a WARN message like
    WARNING: at /home/git/linux/fs/dcache.c:2624 prepend_path+0x1d7/0x1e0()
    ...
    Root dentry has weird name <>

    to appear in kernel logs.

    So change d_obtain_alias() to use "/" rather than "" as the anonymous
    name.

    Signed-off-by: NeilBrown
    Cc: Trond Myklebust
    Cc: Al Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    NeilBrown
     
  • The code that relied on that flag was ripped out of btrfs quite some
    time ago, and never added back. Josef indicated that he was going to
    take a different approach to the problem in btrfs, and that we
    could just eliminate this flag.

    Cc: Josef Bacik
    Signed-off-by: Jeff Layton
    Signed-off-by: Al Viro

    Jeff Layton
     

03 Oct, 2012

1 commit

  • Pull vfs update from Al Viro:

    - big one - consolidation of descriptor-related logics; almost all of
    that is moved to fs/file.c

    (BTW, I'm seriously tempted to rename the result to fd.c. As it is,
    we have a situation when file_table.c is about handling of struct
    file and file.c is about handling of descriptor tables; the reasons
    are historical - file_table.c used to be about a static array of
    struct file we used to have way back).

    A lot of stray ends got cleaned up and converted to saner primitives,
    disgusting mess in android/binder.c is still disgusting, but at least
    doesn't poke so much in descriptor table guts anymore. A bunch of
    relatively minor races got fixed in process, plus an ext4 struct file
    leak.

    - related thing - fget_light() partially unuglified; see fdget() in
    there (and yes, it generates the code as good as we used to have).

    - also related - bits of Cyrill's procfs stuff that got entangled into
    that work; _not_ all of it, just the initial move to fs/proc/fd.c and
    switch of fdinfo to seq_file.

    - Alex's fs/coredump.c spiltoff - the same story, had been easier to
    take that commit than mess with conflicts. The rest is a separate
    pile, this was just a mechanical code movement.

    - a few misc patches all over the place. Not all for this cycle,
    there'll be more (and quite a few currently sit in akpm's tree)."

    Fix up trivial conflicts in the android binder driver, and some fairly
    simple conflicts due to two different changes to the sock_alloc_file()
    interface ("take descriptor handling from sock_alloc_file() to callers"
    vs "net: Providing protocol type via system.sockprotoname xattr of
    /proc/PID/fd entries" adding a dentry name to the socket)

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
    MAX_LFS_FILESIZE should be a loff_t
    compat: fs: Generic compat_sys_sendfile implementation
    fs: push rcu_barrier() from deactivate_locked_super() to filesystems
    btrfs: reada_extent doesn't need kref for refcount
    coredump: move core dump functionality into its own file
    coredump: prevent double-free on an error path in core dumper
    usb/gadget: fix misannotations
    fcntl: fix misannotations
    ceph: don't abuse d_delete() on failure exits
    hypfs: ->d_parent is never NULL or negative
    vfs: delete surplus inode NULL check
    switch simple cases of fget_light to fdget
    new helpers: fdget()/fdput()
    switch o2hb_region_dev_write() to fget_light()
    proc_map_files_readdir(): don't bother with grabbing files
    make get_file() return its argument
    vhost_set_vring(): turn pollstart/pollstop into bool
    switch prctl_set_mm_exe_file() to fget_light()
    switch xfs_find_handle() to fget_light()
    switch xfs_swapext() to fget_light()
    ...

    Linus Torvalds
     

30 Sep, 2012

1 commit

  • IBM reported a deadlock in select_parent(). This was found to be caused
    by taking rename_lock when already locked when restarting the tree
    traversal.

    There are two cases when the traversal needs to be restarted:

    1) concurrent d_move(); this can only happen when not already locked,
    since taking rename_lock protects against concurrent d_move().

    2) racing with final d_put() on child just at the moment of ascending
    to parent; rename_lock doesn't protect against this rare race, so it
    can happen when already locked.

    Because of case 2, we need to be able to handle restarting the traversal
    when rename_lock is already held. This patch fixes all three callers of
    try_to_ascend().

    IBM reported that the deadlock is gone with this patch.

    [ I rewrote the patch to be smaller and just do the "goto again" if the
    lock was already held, but credit goes to Miklos for the real work.
    - Linus ]

    Signed-off-by: Miklos Szeredi
    Cc: Al Viro
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

28 Sep, 2012

1 commit


27 Sep, 2012

1 commit

  • Each iteration of d_delete we reload inode from dentry->d_inode and
    then call S_ISDIR(inode-i_mode), so inode cannot possibly be NULL
    shortly afterwards unless something went horribly wrong.

    Signed-off-by: Alan Cox
    Signed-off-by: Al Viro

    Alan Cox
     

19 Sep, 2012

1 commit

  • IBM reported a soft lockup after applying the fix for the rename_lock
    deadlock. Commit c83ce989cb5f ("VFS: Fix the nfs sillyrename regression
    in kernel 2.6.38") was found to be the culprit.

    The nfs sillyrename fix used DCACHE_DISCONNECTED to indicate that the
    dentry was killed. This flag can be set on non-killed dentries too,
    which results in infinite retries when trying to traverse the dentry
    tree.

    This patch introduces a separate flag: DCACHE_DENTRY_KILLED, which is
    only set in d_kill() and makes try_to_ascend() test only this flag.

    IBM reported successful test results with this patch.

    Signed-off-by: Miklos Szeredi
    Cc: Trond Myklebust
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

14 Jul, 2012

3 commits


09 Jun, 2012

1 commit

  • This reverts commit 7732a557b1342c6e6966efb5f07effcf99f56167 (and commit
    3f50fff4dace23d3cfeb195d5cd4ee813cee68b7, which was a follow-up
    cleanup).

    We're chasing an elusive bug that Dave Jones can apparently reproduce
    using his system call fuzzer tool, and that looks like some kind of
    locking ordering problem on the directory i_mutex chain. Our i_mutex
    locking is rather complex, and depends on the topological ordering of
    the directories, which is why we have been very wary of splicing
    directory entries around.

    Of course, we really don't want to ever see aliased unconnected
    directories anyway, so none of this should ever happen, but this revert
    aims to basically get us back to a known older state.

    Bruce points to some of the previous discussion at

    http://marc.info/?i=

    and in particular a long post from Neil:

    http://marc.info/?i=

    It should be noted that it's possible that Dave's problems come from
    other changes altohgether, including possibly just the fact that Dave
    constantly is teachning his fuzzer new tricks. So what appears to be a
    new bug could in fact be an old one that just gets newly triggered, but
    reverting these patches as "still under heavy discussion" is the right
    thing regardless.

    Requested-by: Al Viro
    Acked-by: J. Bruce Fields
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

31 May, 2012

2 commits

  • Nobody sets want_disconn any more.

    Reported-by: Peng Tao
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Al Viro

    J. Bruce Fields
     
  • A directory should never have more than one dentry pointing to it.

    But d_splice_alias() will add one if it finds a directory with an
    already-existing non-DISCONNECTED dentry.

    I can't find an obvious reproducer, but I also can't see what prevents
    d_splice_alias() from encountering such a case.

    It therefore seems safest to allow d_splice_alias to use any dentry it
    finds.

    (Prior to the removal of dentry_unhash() from vfs_rmdir(), around v3.0,
    this could cause an nfsd deadlock like this:

    - Somebody attempts to remove a non-empty directory.
    - The dentry_unhash() in vfs_rmdir() unhashes the dentry
    pointing to the non-empty directory.
    - ->rmdir() then fails with -ENOTEMPTY
    - Before the vfs_rmdir() caller reaches dput(), an nfsd process
    in rename looks up the directory by filehandle; at the end of
    that lookup, this dentry is found by d_alloc_anon(), and a
    reference is taken on it, preventing dput() from removing it.
    - A regular lookup of the directory calls d_splice_alias(),
    finds only an unhashed (not a DISCONNECTED) dentry, and
    insteads adds a new one, so the directory now has two
    dentries.
    - The nfsd process in rename, which was previously looking up
    the source directory of the rename, now looks up the target
    directory (which is the same), and gets the dentry newly
    created by the previous lookup.
    - The rename, seeing two different dentries, assumes this is a
    cross-directory rename and attempts to take the i_mutex on the
    directory twice.

    That reproducer no longer exists, but I don't think there was anything
    fundamentally incorrect about the vfs_rmdir() behavior there, so I think
    the real fault was here in d_splice_alias().)

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Al Viro

    J. Bruce Fields
     

30 May, 2012

1 commit

  • lglocks and brlocks are currently generated with some complicated macros
    in lglock.h. But there's no reason to not just use common utility
    functions and put all the data into a common data structure.

    In preparation, this patch changes the API to look more like normal
    function calls with pointers, not magic macros.

    The patch is rather large because I move over all users in one go to keep
    it bisectable. This impacts the VFS somewhat in terms of lines changed.
    But no actual behaviour change.

    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Andi Kleen
    Cc: Al Viro
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Rusty Russell
    Signed-off-by: Al Viro

    Andi Kleen
     

24 May, 2012

1 commit

  • UDP stack needs a minimum hash size value for proper operation and also
    uses alloc_large_system_hash() for proper NUMA distribution of its hash
    tables and automatic sizing depending on available system memory.

    On some low memory situations, udp_table_init() must ignore the
    alloc_large_system_hash() result and reallocs a bigger memory area.

    As we cannot easily free old hash table, we leak it and kmemleak can
    issue a warning.

    This patch adds a low limit parameter to alloc_large_system_hash() to
    solve this problem.

    We then specify UDP_HTABLE_SIZE_MIN for UDP/UDPLite hash table
    allocation.

    Reported-by: Mark Asselstine
    Reported-by: Tim Bird
    Signed-off-by: Eric Dumazet
    Cc: Paul Gortmaker
    Signed-off-by: David S. Miller

    Tim Bird
     

22 May, 2012

2 commits

  • This reverts commit 8c01a529b861ba97c7d78368e6a5d4d42e946f75.

    It turns out the d_unhashed() check isn't unnecessary after all: while
    it's true that unhashing will increment the sequence numbers, that does
    not necessarily invalidate the RCU lookup, because it might have seen
    the dentry pointer (before it got unhashed), but by the time it loaded
    the sequence number, it could have seen the *new* sequence number (after
    it got unhashed).

    End result: we might look up an unhashed dentry that is about to be
    freed, with the sequence number never indicating anything bad about it.
    So checking that the dentry is still hashed (*after* reading the sequence
    number) is indeed the proper fix, and was never unnecessary.

    Reported-by: Dave Jones
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Miklos Szeredi points out that we need to also worry about memory
    odering when doing the dentry name comparison asynchronously with RCU.

    In particular, doing a rename can do a memcpy() of one dentry name over
    another, and we want to make sure that any unlocked reader will always
    see the proper terminating NUL character, so that it won't ever run off
    the allocation.

    Rather than having to be extra careful with the name copy or at lookup
    time for each character, this resolves the issue by making sure that all
    names that are inlined in the dentry always have a NUL character at the
    end of the name allocation. If we do that at dentry allocation time, we
    know that no future name copy will ever change that final NUL to
    anything else, so there are no memory ordering issues.

    So even if a concurrent rename ends up overwriting the NUL character
    that terminates the original name, we always know that there is one
    final NUL at the end, and there is no worry about the lockless RCU
    lookup traversing the name too far.

    The out-of-line allocations are never copied over, so we can just make
    sure that we write the name (with terminating NULL) and do a write
    barrier before we expose the name to anything else by setting it in the
    dentry.

    Reported-by: Miklos Szeredi
    Cc: Al Viro
    Cc: Nick Piggin
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

11 May, 2012

4 commits

  • This allows comparing hash and len in one operation on 64-bit
    architectures. Right now only __d_lookup_rcu() takes advantage of this,
    since that is the case we care most about.

    The use of anonymous struct/unions hides the alternate 64-bit approach
    from most users, the exception being a few cases where we initialize a
    'struct qstr' with a static initializer. This makes the problematic
    cases use a new QSTR_INIT() helper function for that (but initializing
    just the name pointer with a "{ .name = xyzzy }" initializer remains
    valid, as does just copying another qstr structure).

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • All callers do want to check the dentry length, but some of them can
    check the length and the hash together, so doing it in dentry_cmp() can
    be counter-productive.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Commit 12f8ad4b0533 ("vfs: clean up __d_lookup_rcu() and dentry_cmp()
    interfaces") did the careful ACCESS_ONCE() of the dentry name only for
    the word-at-a-time case, even though the issue is generic.

    Admittedly I don't really see gcc ever reloading the value in the middle
    of the loop, so the ACCESS_ONCE() protects us from a fairly theoretical
    issue. But better safe than sorry.

    Also, this consolidates the common parts of the word-at-a-time and
    bytewise logic, which includes checking the length. We'll be changing
    that later.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • The check for d_unhashed() is not strictly incorrect, but at the same
    time it is also not sensible. The actual dentry removal from the dentry
    hash chains is totally asynchronous to the __d_lookup_rcu() logic, and
    we depend on __d_drop() updating the sequence number to invalidate any
    lookup of an unhashed dentry.

    So checking d_unhashed() is not incorrect, but it's not useful either:
    the code has to work correctly even without it. So just remove it.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

05 May, 2012

1 commit

  • The calling conventions for __d_lookup_rcu() and dentry_cmp() are
    annoying in different ways, and there is actually one single underlying
    reason for both of the annoyances.

    The fundamental reason is that we do the returned dentry sequence number
    check inside __d_lookup_rcu() instead of doing it in the caller. This
    results in two annoyances:

    - __d_lookup_rcu() now not only needs to return the dentry and the
    sequence number that goes along with the lookup, it also needs to
    return the inode pointer that was validated by that sequence number
    check.

    - and because we did the sequence number check early (to validate the
    name pointer and length) we also couldn't just pass the dentry itself
    to dentry_cmp(), we had to pass the counted string that contained the
    name.

    So that sequence number decision caused two separate ugly calling
    conventions.

    Both of these problems would be solved if we just did the sequence
    number check in the caller instead. There's only one caller, and that
    caller already has to do the sequence number check for the parent
    anyway, so just do that.

    That allows us to stop returning the dentry->d_inode in that in-out
    argument (pointer-to-pointer-to-inode), so we can make the inode
    argument just a regular input inode pointer. The caller can just load
    the inode from dentry->d_inode, and then do the sequence number check
    after that to make sure that it's synchronized with the name we looked
    up.

    And it allows us to just pass in the dentry to dentry_cmp(), which is
    what all the callers really wanted. Sure, dentry_cmp() has to be a bit
    careful about the dentry (which is not stable during RCU lookup), but
    that's actually very simple.

    And now that dentry_cmp() can clearly see that the first string argument
    is a dentry, we can use the direct word access for that, instead of the
    careful unaligned zero-padding. The dentry name is always properly
    aligned, since it is a single path component that is either embedded
    into the dentry itself, or was allocated with kmalloc() (see __d_alloc).

    Finally, this also uninlines the nasty slow-case for dentry comparisons:
    that one *does* need to do a sequence number check, since it will call
    in to the low-level filesystems, and we want to give those a stable
    inode pointer and path component length/start arguments. Doing an extra
    sequence check for that slow case is not a problem, though.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

04 May, 2012

1 commit

  • It turns out that there are more cases than CONFIG_DEBUG_PAGEALLOC that
    can have holes in the kernel address space: it seems to happen easily
    with Xen, and it looks like the AMD gart64 code will also punch holes
    dynamically.

    Actually hitting that case is still very unlikely, so just do the
    access, and take an exception and fix it up for the very unlikely case
    of it being a page-crosser with no next page.

    And hey, this abstraction might even help other architectures that have
    other issues with unaligned word accesses than the possible missing next
    page. IOW, this could do the byte order magic too.

    Peter Anvin fixed a thinko in the shifting for the exception case.

    Reported-and-tested-by: Jana Saout
    Cc: Peter Anvin
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

29 Mar, 2012

1 commit

  • In d_materialise_unique() there are 3 subcases to the 'aliased dentry'
    case; in two subcases the inode i_lock is properly released but this
    does not occur in the -ELOOP subcase.

    This seems to have been introduced by commit 1836750115f2 ("fix loop
    checks in d_materialise_unique()").

    Signed-off-by: Michel Lespinasse
    Cc: stable@vger.kernel.org # v3.0+
    [ Added a comment, and moved the unlock to where we generate the -ELOOP,
    which seems to be more natural.

    You probably can't actually trigger this without a buggy network file
    server - d_materialize_unique() is for finding aliases on non-local
    filesystems, and the d_ancestor() case is for a hardlinked directory
    loop.

    But we should be robust in the case of such buggy servers anyway. ]
    Signed-off-by: Linus Torvalds

    Michel Lespinasse