21 Dec, 2012

2 commits

  • NFS appears to use d_obtain_alias() to create the root dentry rather than
    d_make_root. This can cause 'prepend_path()' to complain that the root
    has a weird name if an NFS filesystem is lazily unmounted. e.g. if
    "/mnt" is an NFS mount then

    { cd /mnt; umount -l /mnt ; ls -l /proc/self/cwd; }

    will cause a WARN message like
    WARNING: at /home/git/linux/fs/dcache.c:2624 prepend_path+0x1d7/0x1e0()
    ...
    Root dentry has weird name <>

    to appear in kernel logs.

    So change d_obtain_alias() to use "/" rather than "" as the anonymous
    name.

    Signed-off-by: NeilBrown
    Cc: Trond Myklebust
    Cc: Al Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    NeilBrown
     
  • The code that relied on that flag was ripped out of btrfs quite some
    time ago, and never added back. Josef indicated that he was going to
    take a different approach to the problem in btrfs, and that we
    could just eliminate this flag.

    Cc: Josef Bacik
    Signed-off-by: Jeff Layton
    Signed-off-by: Al Viro

    Jeff Layton
     

03 Oct, 2012

1 commit

  • Pull vfs update from Al Viro:

    - big one - consolidation of descriptor-related logics; almost all of
    that is moved to fs/file.c

    (BTW, I'm seriously tempted to rename the result to fd.c. As it is,
    we have a situation when file_table.c is about handling of struct
    file and file.c is about handling of descriptor tables; the reasons
    are historical - file_table.c used to be about a static array of
    struct file we used to have way back).

    A lot of stray ends got cleaned up and converted to saner primitives,
    disgusting mess in android/binder.c is still disgusting, but at least
    doesn't poke so much in descriptor table guts anymore. A bunch of
    relatively minor races got fixed in process, plus an ext4 struct file
    leak.

    - related thing - fget_light() partially unuglified; see fdget() in
    there (and yes, it generates the code as good as we used to have).

    - also related - bits of Cyrill's procfs stuff that got entangled into
    that work; _not_ all of it, just the initial move to fs/proc/fd.c and
    switch of fdinfo to seq_file.

    - Alex's fs/coredump.c splitoff - the same story, had been easier to
    take that commit than mess with conflicts. The rest is a separate
    pile, this was just a mechanical code movement.

    - a few misc patches all over the place. Not all for this cycle,
    there'll be more (and quite a few currently sit in akpm's tree).

    Fix up trivial conflicts in the android binder driver, and some fairly
    simple conflicts due to two different changes to the sock_alloc_file()
    interface ("take descriptor handling from sock_alloc_file() to callers"
    vs "net: Providing protocol type via system.sockprotoname xattr of
    /proc/PID/fd entries" adding a dentry name to the socket)

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
    MAX_LFS_FILESIZE should be a loff_t
    compat: fs: Generic compat_sys_sendfile implementation
    fs: push rcu_barrier() from deactivate_locked_super() to filesystems
    btrfs: reada_extent doesn't need kref for refcount
    coredump: move core dump functionality into its own file
    coredump: prevent double-free on an error path in core dumper
    usb/gadget: fix misannotations
    fcntl: fix misannotations
    ceph: don't abuse d_delete() on failure exits
    hypfs: ->d_parent is never NULL or negative
    vfs: delete surplus inode NULL check
    switch simple cases of fget_light to fdget
    new helpers: fdget()/fdput()
    switch o2hb_region_dev_write() to fget_light()
    proc_map_files_readdir(): don't bother with grabbing files
    make get_file() return its argument
    vhost_set_vring(): turn pollstart/pollstop into bool
    switch prctl_set_mm_exe_file() to fget_light()
    switch xfs_find_handle() to fget_light()
    switch xfs_swapext() to fget_light()
    ...

    Linus Torvalds
     

30 Sep, 2012

1 commit

  • IBM reported a deadlock in select_parent(). This was found to be caused
    by taking rename_lock when already locked when restarting the tree
    traversal.

    There are two cases when the traversal needs to be restarted:

    1) concurrent d_move(); this can only happen when not already locked,
    since taking rename_lock protects against concurrent d_move().

    2) racing with final d_put() on child just at the moment of ascending
    to parent; rename_lock doesn't protect against this rare race, so it
    can happen when already locked.

    Because of case 2, we need to be able to handle restarting the traversal
    when rename_lock is already held. This patch fixes all three callers of
    try_to_ascend().

    IBM reported that the deadlock is gone with this patch.

    [ I rewrote the patch to be smaller and just do the "goto again" if the
    lock was already held, but credit goes to Miklos for the real work.
    - Linus ]

    Signed-off-by: Miklos Szeredi
    Cc: Al Viro
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

28 Sep, 2012

1 commit


27 Sep, 2012

1 commit

  • Each iteration of d_delete we reload inode from dentry->d_inode and
    then call S_ISDIR(inode->i_mode), so inode cannot possibly be NULL
    shortly afterwards unless something went horribly wrong.

    Signed-off-by: Alan Cox
    Signed-off-by: Al Viro

    Alan Cox
     

19 Sep, 2012

1 commit

  • IBM reported a soft lockup after applying the fix for the rename_lock
    deadlock. Commit c83ce989cb5f ("VFS: Fix the nfs sillyrename regression
    in kernel 2.6.38") was found to be the culprit.

    The nfs sillyrename fix used DCACHE_DISCONNECTED to indicate that the
    dentry was killed. This flag can be set on non-killed dentries too,
    which results in infinite retries when trying to traverse the dentry
    tree.

    This patch introduces a separate flag: DCACHE_DENTRY_KILLED, which is
    only set in d_kill() and makes try_to_ascend() test only this flag.

    IBM reported successful test results with this patch.

    Signed-off-by: Miklos Szeredi
    Cc: Trond Myklebust
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
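
    A minimal sketch of the flag split described above, assuming made-up
    flag values and helper names (SK_*, must_restart_old/new are
    illustrative, not the kernel's definitions). It only shows why testing
    a dedicated "killed" bit stops the endless retries on a dentry that is
    disconnected but still alive.

```c
/* Hypothetical flag values; the kernel's actual bit assignments differ. */
#define SK_DISCONNECTED  0x01u	/* may also be set on live dentries */
#define SK_DENTRY_KILLED 0x02u	/* set only when the dentry is killed */

/* Old test: restarts whenever DISCONNECTED is set, even on a live dentry. */
static int must_restart_old(unsigned int flags)
{
	return (flags & SK_DISCONNECTED) != 0;
}

/* New test: restarts only for dentries that were really killed. */
static int must_restart_new(unsigned int flags)
{
	return (flags & SK_DENTRY_KILLED) != 0;
}
```

    With a live dentry that happens to carry the DISCONNECTED bit, the old
    test asks for a restart on every pass while the new one lets the
    traversal continue.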
     

14 Jul, 2012

3 commits


09 Jun, 2012

1 commit

  • This reverts commit 7732a557b1342c6e6966efb5f07effcf99f56167 (and commit
    3f50fff4dace23d3cfeb195d5cd4ee813cee68b7, which was a follow-up
    cleanup).

    We're chasing an elusive bug that Dave Jones can apparently reproduce
    using his system call fuzzer tool, and that looks like some kind of
    locking ordering problem on the directory i_mutex chain. Our i_mutex
    locking is rather complex, and depends on the topological ordering of
    the directories, which is why we have been very wary of splicing
    directory entries around.

    Of course, we really don't want to ever see aliased unconnected
    directories anyway, so none of this should ever happen, but this revert
    aims to basically get us back to a known older state.

    Bruce points to some of the previous discussion at

    http://marc.info/?i=

    and in particular a long post from Neil:

    http://marc.info/?i=

    It should be noted that it's possible that Dave's problems come from
    other changes altogether, including possibly just the fact that Dave
    constantly is teaching his fuzzer new tricks. So what appears to be a
    new bug could in fact be an old one that just gets newly triggered, but
    reverting these patches as "still under heavy discussion" is the right
    thing regardless.

    Requested-by: Al Viro
    Acked-by: J. Bruce Fields
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

31 May, 2012

2 commits

  • Nobody sets want_disconn any more.

    Reported-by: Peng Tao
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Al Viro

    J. Bruce Fields
     
  • A directory should never have more than one dentry pointing to it.

    But d_splice_alias() will add one if it finds a directory with an
    already-existing non-DISCONNECTED dentry.

    I can't find an obvious reproducer, but I also can't see what prevents
    d_splice_alias() from encountering such a case.

    It therefore seems safest to allow d_splice_alias to use any dentry it
    finds.

    (Prior to the removal of dentry_unhash() from vfs_rmdir(), around v3.0,
    this could cause an nfsd deadlock like this:

    - Somebody attempts to remove a non-empty directory.
    - The dentry_unhash() in vfs_rmdir() unhashes the dentry
    pointing to the non-empty directory.
    - ->rmdir() then fails with -ENOTEMPTY
    - Before the vfs_rmdir() caller reaches dput(), an nfsd process
    in rename looks up the directory by filehandle; at the end of
    that lookup, this dentry is found by d_alloc_anon(), and a
    reference is taken on it, preventing dput() from removing it.
    - A regular lookup of the directory calls d_splice_alias(),
    finds only an unhashed (not a DISCONNECTED) dentry, and
    instead adds a new one, so the directory now has two
    dentries.
    - The nfsd process in rename, which was previously looking up
    the source directory of the rename, now looks up the target
    directory (which is the same), and gets the dentry newly
    created by the previous lookup.
    - The rename, seeing two different dentries, assumes this is a
    cross-directory rename and attempts to take the i_mutex on the
    directory twice.

    That reproducer no longer exists, but I don't think there was anything
    fundamentally incorrect about the vfs_rmdir() behavior there, so I think
    the real fault was here in d_splice_alias().)

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Al Viro

    J. Bruce Fields
     

30 May, 2012

1 commit

  • lglocks and brlocks are currently generated with some complicated macros
    in lglock.h. But there's no reason to not just use common utility
    functions and put all the data into a common data structure.

    In preparation, this patch changes the API to look more like normal
    function calls with pointers, not magic macros.

    The patch is rather large because I move over all users in one go to keep
    it bisectable. This impacts the VFS somewhat in terms of lines changed.
    But no actual behaviour change.

    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Andi Kleen
    Cc: Al Viro
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Rusty Russell
    Signed-off-by: Al Viro

    Andi Kleen
     

24 May, 2012

1 commit

  • UDP stack needs a minimum hash size value for proper operation and also
    uses alloc_large_system_hash() for proper NUMA distribution of its hash
    tables and automatic sizing depending on available system memory.

    In some low-memory situations, udp_table_init() must ignore the
    alloc_large_system_hash() result and reallocate a bigger memory area.

    As we cannot easily free the old hash table, we leak it and kmemleak
    can issue a warning.

    This patch adds a low limit parameter to alloc_large_system_hash() to
    solve this problem.

    We then specify UDP_HTABLE_SIZE_MIN for UDP/UDPLite hash table
    allocation.

    Reported-by: Mark Asselstine
    Reported-by: Tim Bird
    Signed-off-by: Eric Dumazet
    Cc: Paul Gortmaker
    Signed-off-by: David S. Miller

    Tim Bird
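
    The effect of a low limit can be sketched as a simple clamp on the
    proposed entry count. This is an illustration only:
    clamp_hash_entries() is a hypothetical helper, and treating a limit of
    0 as "no limit" is an assumption of the sketch, not a statement about
    alloc_large_system_hash() itself.

```c
/* Clamp a proposed hash-table entry count into [low_limit, high_limit].
 * A limit of 0 means "no limit" in this sketch. */
static unsigned long clamp_hash_entries(unsigned long numentries,
					unsigned long low_limit,
					unsigned long high_limit)
{
	if (low_limit && numentries < low_limit)
		numentries = low_limit;
	if (high_limit && numentries > high_limit)
		numentries = high_limit;
	return numentries;
}
```

    A caller with a hard minimum passes it as low_limit and never needs to
    throw the first allocation away, which is what removes the leak.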
     

22 May, 2012

2 commits

  • This reverts commit 8c01a529b861ba97c7d78368e6a5d4d42e946f75.

    It turns out the d_unhashed() check isn't unnecessary after all: while
    it's true that unhashing will increment the sequence numbers, that does
    not necessarily invalidate the RCU lookup, because it might have seen
    the dentry pointer (before it got unhashed), but by the time it loaded
    the sequence number, it could have seen the *new* sequence number (after
    it got unhashed).

    End result: we might look up an unhashed dentry that is about to be
    freed, with the sequence number never indicating anything bad about it.
    So checking that the dentry is still hashed (*after* reading the sequence
    number) is indeed the proper fix, and was never unnecessary.

    Reported-by: Dave Jones
    Signed-off-by: Linus Torvalds

    Linus Torvalds
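
    The race above can be walked through with a single-threaded model. It
    is a sketch under loud assumptions: there is no real RCU and no memory
    barrier here, and the struct and helper names are invented. It just
    shows that when the unhash lands before the reader samples the
    sequence number, a later sequence recheck sees nothing wrong, so only
    the hashed check can reject the dentry.

```c
/* Single-threaded model only: no real RCU, no memory barriers. */
struct seq_dentry_sketch {
	unsigned int seq;	/* bumped whenever the dentry is unhashed */
	int hashed;		/* 1 while the dentry sits on a hash chain */
};

/* Writer side: unhash the dentry and bump its sequence number. */
static void unhash_sketch(struct seq_dentry_sketch *d)
{
	d->hashed = 0;
	d->seq++;
}

/* Reader side: sample the sequence number, then check that the dentry is
 * still hashed.  Returns 1 if the lookup may go on using the dentry. */
static int lookup_step_sketch(const struct seq_dentry_sketch *d,
			      unsigned int *seqp)
{
	*seqp = d->seq;
	if (!d->hashed)
		return 0;
	return 1;
}
```

    Dropping the hashed test from lookup_step_sketch() would make the model
    accept the dentry, because the sampled sequence number already is the
    new one.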
     
  • Miklos Szeredi points out that we need to also worry about memory
    ordering when doing the dentry name comparison asynchronously with RCU.

    In particular, doing a rename can do a memcpy() of one dentry name over
    another, and we want to make sure that any unlocked reader will always
    see the proper terminating NUL character, so that it won't ever run off
    the allocation.

    Rather than having to be extra careful with the name copy or at lookup
    time for each character, this resolves the issue by making sure that all
    names that are inlined in the dentry always have a NUL character at the
    end of the name allocation. If we do that at dentry allocation time, we
    know that no future name copy will ever change that final NUL to
    anything else, so there are no memory ordering issues.

    So even if a concurrent rename ends up overwriting the NUL character
    that terminates the original name, we always know that there is one
    final NUL at the end, and there is no worry about the lockless RCU
    lookup traversing the name too far.

    The out-of-line allocations are never copied over, so we can just make
    sure that we write the name (with terminating NUL) and do a write
    barrier before we expose the name to anything else by setting it in the
    dentry.

    Reported-by: Miklos Szeredi
    Cc: Al Viro
    Cc: Nick Piggin
    Signed-off-by: Linus Torvalds

    Linus Torvalds
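
    A userspace sketch of the "final NUL in the allocation" idea, assuming
    a made-up buffer size and invented names (dentry_sketch,
    INLINE_NAME_LEN). It is not the kernel's layout; it only shows that if
    name copies never write the last byte, a reader scanning for NUL can
    never leave the buffer.

```c
#include <string.h>

#define INLINE_NAME_LEN 32	/* hypothetical size of the inline name buffer */

struct dentry_sketch {
	char iname[INLINE_NAME_LEN];	/* short names live inline */
};

/* At allocation time, pin a NUL in the very last byte of the buffer. */
static void dentry_sketch_init(struct dentry_sketch *d, const char *name)
{
	size_t len = strlen(name);

	if (len > INLINE_NAME_LEN - 1)
		len = INLINE_NAME_LEN - 1;	/* sketch only: no out-of-line names */
	memset(d->iname, 0, sizeof(d->iname));
	memcpy(d->iname, name, len);
	d->iname[INLINE_NAME_LEN - 1] = '\0';	/* the sentinel that never changes */
}

/* A rename-style copy never writes the final byte, so the sentinel survives. */
static void dentry_sketch_copy_name(struct dentry_sketch *dst,
				    const struct dentry_sketch *src)
{
	memcpy(dst->iname, src->iname, INLINE_NAME_LEN - 1);
}
```

    Whatever intermediate state a concurrent reader observes during the
    copy, iname[INLINE_NAME_LEN - 1] is always NUL in this model.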
     

11 May, 2012

4 commits

  • This allows comparing hash and len in one operation on 64-bit
    architectures. Right now only __d_lookup_rcu() takes advantage of this,
    since that is the case we care most about.

    The use of anonymous struct/unions hides the alternate 64-bit approach
    from most users, the exception being a few cases where we initialize a
    'struct qstr' with a static initializer. This makes the problematic
    cases use a new QSTR_INIT() helper function for that (but initializing
    just the name pointer with a "{ .name = xyzzy }" initializer remains
    valid, as does just copying another qstr structure).

    Signed-off-by: Linus Torvalds

    Linus Torvalds
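
    The layout trick can be sketched in userspace roughly as follows.
    qstr_sketch, QSTR_SKETCH_INIT() and qstr_sketch_equal() are
    illustrative stand-ins, not the kernel definitions, and the sketch
    relies on C11 anonymous struct/union members.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical userspace stand-in for the kernel's 'struct qstr'. */
struct qstr_sketch {
	union {
		struct {
			uint32_t hash;	/* name hash */
			uint32_t len;	/* name length */
		};
		uint64_t hash_len;	/* both fields viewed as one 64-bit word */
	};
	const char *name;
};

/* Illustrative static initializer, in the spirit of QSTR_INIT(). */
#define QSTR_SKETCH_INIT(n, l) { { { .len = (l) } }, .name = (n) }

/* One 64-bit compare covers hash and len; bytes are compared only after. */
static int qstr_sketch_equal(const struct qstr_sketch *a,
			     const struct qstr_sketch *b)
{
	if (a->hash_len != b->hash_len)
		return 0;
	return memcmp(a->name, b->name, a->len) == 0;
}
```

    Code that only touches .hash, .len or .name keeps compiling unchanged;
    only static initializers that set the length need the helper macro.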
     
  • All callers do want to check the dentry length, but some of them can
    check the length and the hash together, so doing it in dentry_cmp() can
    be counter-productive.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Commit 12f8ad4b0533 ("vfs: clean up __d_lookup_rcu() and dentry_cmp()
    interfaces") did the careful ACCESS_ONCE() of the dentry name only for
    the word-at-a-time case, even though the issue is generic.

    Admittedly I don't really see gcc ever reloading the value in the middle
    of the loop, so the ACCESS_ONCE() protects us from a fairly theoretical
    issue. But better safe than sorry.

    Also, this consolidates the common parts of the word-at-a-time and
    bytewise logic, which includes checking the length. We'll be changing
    that later.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • The check for d_unhashed() is not strictly incorrect, but at the same
    time it is also not sensible. The actual dentry removal from the dentry
    hash chains is totally asynchronous to the __d_lookup_rcu() logic, and
    we depend on __d_drop() updating the sequence number to invalidate any
    lookup of an unhashed dentry.

    So checking d_unhashed() is not incorrect, but it's not useful either:
    the code has to work correctly even without it. So just remove it.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

05 May, 2012

1 commit

  • The calling conventions for __d_lookup_rcu() and dentry_cmp() are
    annoying in different ways, and there is actually one single underlying
    reason for both of the annoyances.

    The fundamental reason is that we do the returned dentry sequence number
    check inside __d_lookup_rcu() instead of doing it in the caller. This
    results in two annoyances:

    - __d_lookup_rcu() now not only needs to return the dentry and the
    sequence number that goes along with the lookup, it also needs to
    return the inode pointer that was validated by that sequence number
    check.

    - and because we did the sequence number check early (to validate the
    name pointer and length) we also couldn't just pass the dentry itself
    to dentry_cmp(), we had to pass the counted string that contained the
    name.

    So that sequence number decision caused two separate ugly calling
    conventions.

    Both of these problems would be solved if we just did the sequence
    number check in the caller instead. There's only one caller, and that
    caller already has to do the sequence number check for the parent
    anyway, so just do that.

    That allows us to stop returning the dentry->d_inode in that in-out
    argument (pointer-to-pointer-to-inode), so we can make the inode
    argument just a regular input inode pointer. The caller can just load
    the inode from dentry->d_inode, and then do the sequence number check
    after that to make sure that it's synchronized with the name we looked
    up.

    And it allows us to just pass in the dentry to dentry_cmp(), which is
    what all the callers really wanted. Sure, dentry_cmp() has to be a bit
    careful about the dentry (which is not stable during RCU lookup), but
    that's actually very simple.

    And now that dentry_cmp() can clearly see that the first string argument
    is a dentry, we can use the direct word access for that, instead of the
    careful unaligned zero-padding. The dentry name is always properly
    aligned, since it is a single path component that is either embedded
    into the dentry itself, or was allocated with kmalloc() (see __d_alloc).

    Finally, this also uninlines the nasty slow-case for dentry comparisons:
    that one *does* need to do a sequence number check, since it will call
    in to the low-level filesystems, and we want to give those a stable
    inode pointer and path component length/start arguments. Doing an extra
    sequence check for that slow case is not a problem, though.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

04 May, 2012

1 commit

  • It turns out that there are more cases than CONFIG_DEBUG_PAGEALLOC that
    can have holes in the kernel address space: it seems to happen easily
    with Xen, and it looks like the AMD gart64 code will also punch holes
    dynamically.

    Actually hitting that case is still very unlikely, so just do the
    access, and take an exception and fix it up for the very unlikely case
    of it being a page-crosser with no next page.

    And hey, this abstraction might even help other architectures that have
    other issues with unaligned word accesses than the possible missing next
    page. IOW, this could do the byte order magic too.

    Peter Anvin fixed a thinko in the shifting for the exception case.

    Reported-and-tested-by: Jana Saout
    Cc: Peter Anvin
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

29 Mar, 2012

1 commit

  • In d_materialise_unique() there are 3 subcases to the 'aliased dentry'
    case; in two subcases the inode i_lock is properly released but this
    does not occur in the -ELOOP subcase.

    This seems to have been introduced by commit 1836750115f2 ("fix loop
    checks in d_materialise_unique()").

    Signed-off-by: Michel Lespinasse
    Cc: stable@vger.kernel.org # v3.0+
    [ Added a comment, and moved the unlock to where we generate the -ELOOP,
    which seems to be more natural.

    You probably can't actually trigger this without a buggy network file
    server - d_materialise_unique() is for finding aliases on non-local
    filesystems, and the d_ancestor() case is for a hardlinked directory
    loop.

    But we should be robust in the case of such buggy servers anyway. ]
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

25 Mar, 2012

1 commit

  • Pull cleanup of fs/ and lib/ users of module.h from Paul Gortmaker:
    "Fix up files in fs/ and lib/ dirs to only use module.h if they really
    need it.

    These are trivial in scope vs the work done previously. We now have
    things where any few remaining cleanups can be farmed out to arch or
    subsystem maintainers, and I have done so when possible. What is
    remaining here represents the bits that don't clearly lie within a
    single arch/subsystem boundary, like the fs dir and the lib dir.

    Some duplicate includes arising from overlapping fixes from
    independent subsystem maintainer submissions are also quashed."

    Fix up trivial conflicts due to clashes with other include file cleanups
    (including some due to the previous bug.h cleanup pull).

    * tag 'module-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux:
    lib: reduce the use of module.h wherever possible
    fs: reduce the use of module.h wherever possible
    includecheck: delete any duplicate instances of module.h

    Linus Torvalds
     

23 Mar, 2012

1 commit

  • Fix kernel-doc warnings in fs/dcache.c:

    Warning(fs/dcache.c:1743): No description found for parameter 'seqp'
    Warning(fs/dcache.c:1743): Excess function parameter 'seq' description in '__d_lookup_rcu'

    Signed-off-by: Randy Dunlap
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

22 Mar, 2012

1 commit

  • Pull vfs pile 1 from Al Viro:
    "This is _not_ all; in particular, Miklos' and Jan's stuff is not there
    yet."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (64 commits)
    ext4: initialization of ext4_li_mtx needs to be done earlier
    debugfs-related mode_t whack-a-mole
    hfsplus: add an ioctl to bless files
    hfsplus: change finder_info to u32
    hfsplus: initialise userflags
    qnx4: new helper - try_extent()
    qnx4: get rid of qnx4_bread/qnx4_getblk
    take removal of PF_FORKNOEXEC to flush_old_exec()
    trim includes in inode.c
    um: uml_dup_mmap() relies on ->mmap_sem being held, but activate_mm() doesn't hold it
    um: embed ->stub_pages[] into mmu_context
    gadgetfs: list_for_each_safe() misuse
    ocfs2: fix leaks on failure exits in module_init
    ecryptfs: make register_filesystem() the last potential failure exit
    ntfs: forgets to unregister sysctls on register_filesystem() failure
    logfs: missing cleanup on register_filesystem() failure
    jfs: mising cleanup on register_filesystem() failure
    make configfs_pin_fs() return root dentry on success
    configfs: configfs_create_dir() has parent dentry in dentry->d_parent
    configfs: sanitize configfs_create()
    ...

    Linus Torvalds
     

21 Mar, 2012

1 commit


20 Mar, 2012

2 commits

  • * branch 'dcache-word-accesses':
    vfs: use 'unsigned long' accesses for dcache name comparison and hashing

    This does the name hashing and lookup using word-sized accesses when
    that is efficient, namely on x86 (although any little-endian machine
    with good unaligned accesses would do).

    It does very much depend on little-endian logic, but it's a very hot
    couple of functions under some real loads, and this patch improves the
    performance of __d_lookup_rcu() and link_path_walk() by up to about 30%,
    giving a 10% improvement on some very pathname-heavy benchmarks.

    Because we do make unaligned accesses past the filename, the
    optimization is disabled when CONFIG_DEBUG_PAGEALLOC is active, and we
    effectively depend on the fact that on x86 we don't really ever have the
    last page of usable RAM followed immediately by any IO memory (due to
    ACPI tables, BIOS buffer areas etc).

    Some of the bit operations we do are a bit "subtle". It's commented,
    but you do need to really think about the code. Or just consider it
    black magic.

    Thanks to people on G+ for some of the optimized bit tricks.

    Linus Torvalds
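
    For a feel of the kind of bit trick involved, here is a small
    userspace sketch of the well-known "does this word contain a zero
    byte" test, used to scan a string eight bytes at a time. It is not the
    dcache code: the names are invented, loads go through memcpy() so they
    are alignment-safe, and the caller is assumed to guarantee the buffer
    is readable up to the next 8-byte step past the NUL.

```c
#include <stdint.h>
#include <string.h>

#define ONES  UINT64_C(0x0101010101010101)
#define HIGHS UINT64_C(0x8080808080808080)

/* Nonzero if any byte of 'w' is zero. */
static uint64_t has_zero_byte(uint64_t w)
{
	return (w - ONES) & ~w & HIGHS;
}

/* Length of a NUL-terminated string, scanning 8 bytes per step.
 * Assumes 'buf' is readable in whole 8-byte steps up to and past the NUL. */
static size_t strlen_wordwise_sketch(const char *buf)
{
	size_t n = 0;
	uint64_t w;

	for (;;) {
		memcpy(&w, buf + n, sizeof(w));	/* unaligned-safe load */
		if (has_zero_byte(w))
			break;
		n += sizeof(w);
	}
	while (buf[n])	/* finish bytewise, independent of byte order */
		n++;
	return n;
}
```

    Locating the zero byte inside the word directly, rather than finishing
    bytewise as done here, is where the byte-order dependence comes in.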
     
  • For some odd historical reason, the final mixing round for the dentry
    cache hash table lookup had an insane "xor with big constant" logic. In
    two places.

    The big constant that is being xor'ed is GOLDEN_RATIO_PRIME, which is a
    fairly random-looking number that is designed to be *multiplied* with so
    that the bits get spread out over a whole long-word.

    But xor'ing with it is insane. It doesn't really even change the hash -
    it really only shifts the hash around in the hash table. To make
    matters worse, the insane big constant is different on 32-bit and 64-bit
    builds, even though the name hash bits we use are always 32-bit (and the
    bits from the pointer we mix in effectively are too).

    It's all total voodoo programming, in other words.

    Now, some testing and analysis of the hash chains shows that the rest of
    the hash function seems to be fairly good. It does pick the right bits
    of the parent dentry pointer, for example, and while it's generally a
    bad idea to use an xor to mix down the upper bits (because if there is a
    repeating pattern, the xor can cause "destructive interference"), it
    seems to not have been a disaster.

    For example, replacing the hash with the normal "hash_long()" code (that
    uses the GOLDEN_RATIO_PRIME constant correctly, btw) actually just makes
    the hash worse. The hand-picked hash knew which bits of the pointer had
    the highest entropy, and hash_long() ends up mixing bits less optimally
    at least in some trivial tests.

    So the hash function overall seems fine, it just has that really odd
    "shift result around by a constant xor".

    So get rid of the silly xor, and replace the down-mixing of the bits
    with an add instead of an xor that tends to not have the same kind of
    destructive interference issues. Some stats on the resulting hash
    chains shows that they look statistically identical before and after,
    but the code is simpler and no longer makes you go "WTF?".

    Also, the incoming hash really is just "unsigned int", not a long, and
    there's no real point to worry about the high 26 bits of the dentry
    pointer for the 64-bit case, because they are all going to be identical
    anyway.

    So also change the hashing to be done in the more natural 'unsigned int'
    that is the real size of the actual hashed data anyway.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
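
    The resulting mixing step can be sketched like this. The table size
    and cache-line constant are hypothetical stand-ins chosen for the
    example, and d_hash_sketch() returns a bucket index rather than a
    pointer into a real hash table.

```c
#include <stdint.h>

#define SKETCH_HASHBITS   10u	/* hypothetical table of 1 << 10 buckets */
#define SKETCH_HASHMASK   ((1u << SKETCH_HASHBITS) - 1)
#define SKETCH_CACHE_LINE 64u	/* stand-in for the cache line size */

/* Mix the parent pointer into the 32-bit name hash and pick a bucket. */
static unsigned int d_hash_sketch(const void *parent, unsigned int hash)
{
	/* Low pointer bits are constant for aligned objects: divide them away. */
	hash += (unsigned int)((uintptr_t)parent / SKETCH_CACHE_LINE);
	/* Fold the upper bits down with an add; no constant xor anywhere. */
	hash = hash + (hash >> SKETCH_HASHBITS);
	return hash & SKETCH_HASHMASK;
}
```

    Everything is done in 'unsigned int', matching the point above that
    the incoming name hash is only 32 bits wide.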
     

09 Mar, 2012

1 commit


05 Mar, 2012

1 commit

  • It's only used inside fs/dcache.c, and we're going to play games with it
    for the word-at-a-time patches. This time we really don't even want to
    export it, because it really is an internal function to fs/dcache.c, and
    has been since it was introduced.

    Having it in that extremely hot header file (it's included in pretty
    much everything, thanks to ) is a disaster for testing
    different versions, and is utterly pointless.

    We really should have some kind of header file diet thing, where we
    figure out which parts of header files are really better off private and
    only result in more expensive compiles.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

03 Mar, 2012

1 commit

  • These don't change any semantics, but they clean up the code a bit and
    mark some arguments appropriately 'const'.

    They came up as I was doing the word-at-a-time dcache name accessor
    code, and cleaning this up now allows me to send out a smaller relevant
    interesting patch for the experimental stuff.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

29 Feb, 2012

1 commit


14 Feb, 2012

1 commit

  • When the number of dentry cache hash table entries gets too high
    (2147483648 entries), as happens by default on a 16TB system, use of a
    signed integer in the dcache_init() initialization loop prevents the
    dentry_hashtable from getting initialized, causing a panic in
    __d_lookup(). Fix this in dcache_init() and similar areas.

    Signed-off-by: Dimitri Sivanich
    Acked-by: David S. Miller
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Dimitri Sivanich
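
    Why the loop never runs can be shown with a tiny sketch. The helper
    names are invented, and the signed case leans on an
    implementation-defined conversion that common compilers resolve by
    wrapping to a negative value; that is an assumption of the example.

```c
#include <stdint.h>

/* Would "for (loop = 0; loop < (1 << shift); loop++)" run at all when the
 * counter and bound are 32-bit signed?  The out-of-range conversion below is
 * implementation-defined in C; common compilers wrap it to a negative value. */
static int loop_runs_signed(unsigned int shift)
{
	int32_t bound = (int32_t)(UINT32_C(1) << shift);
	int32_t loop = 0;

	return loop < bound;
}

/* Same check with an unsigned counter and bound. */
static int loop_runs_unsigned(unsigned int shift)
{
	uint32_t bound = UINT32_C(1) << shift;
	uint32_t loop = 0;

	return loop < bound;
}
```

    With shift == 31 (2147483648 entries) the signed variant sees a
    negative bound and skips the initialization entirely, while the
    unsigned variant iterates as intended.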
     

14 Jan, 2012

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
    ceph: ensure prealloc_blob is in place when removing xattr
    rbd: initialize snap_rwsem in rbd_add()
    ceph: enable/disable dentry complete flags via mount option
    vfs: export symbol d_find_any_alias()
    ceph: always initialize the dentry in open_root_dentry()
    libceph: remove useless return value for osd_client __send_request()
    ceph: avoid iput() while holding spinlock in ceph_dir_fsync
    ceph: avoid useless dget/dput in encode_fh
    ceph: dereference pointer after checking for NULL
    crush: fix force for non-root TAKE
    ceph: remove unnecessary d_fsdata conditional checks
    ceph: Use kmemdup rather than duplicating its implementation

    Fix up conflicts in fs/ceph/super.c (d_alloc_root() failure handling vs
    always initialize the dentry in open_root_dentry)

    Linus Torvalds
     

13 Jan, 2012

1 commit


11 Jan, 2012

1 commit

  • Two (or more) concurrent calls of shrink_dcache_parent() on the same dentry may
    cause shrink_dcache_parent() to loop forever.

    Here's what appears to happen:

    1 - CPU0: select_parent(P) finds C and puts it on dispose list, returns 1

    2 - CPU1: select_parent(P) locks P->d_lock

    3 - CPU0: shrink_dentry_list() locks C->d_lock
    dentry_kill(C) tries to lock P->d_lock but fails, unlocks C->d_lock

    4 - CPU1: select_parent(P) locks C->d_lock,
    moves C from dispose list being processed on CPU0 to the new
    dispose list, returns 1

    5 - CPU0: shrink_dentry_list() finds dispose list empty, returns

    6 - Goto 2 with CPU0 and CPU1 switched

    Basically select_parent() steals the dentry from shrink_dentry_list() and thinks
    it found a new one, causing shrink_dentry_list() to think it's making progress
    and loop over and over.

    One way to trigger this is to make udev call stat() on the sysfs file while it
    is going away.

    Having a file in /lib/udev/rules.d/ with only this one rule seems to do the trick:

    ATTR{vendor}=="0x8086", ATTR{device}=="0x10ca", ENV{PCI_SLOT_NAME}="%k", ENV{MATCHADDR}="$attr{address}", RUN+="/bin/true"

    Then execute the following loop:

    while true; do
    echo -bond0 > /sys/class/net/bonding_masters
    echo +bond0 > /sys/class/net/bonding_masters
    echo -bond1 > /sys/class/net/bonding_masters
    echo +bond1 > /sys/class/net/bonding_masters
    done

    One fix would be to check all callers and prevent concurrent calls to
    shrink_dcache_parent(). But I think a better solution is to stop the
    stealing behavior.

    This patch adds a new dentry flag that is set when the dentry is added to the
    dispose list. The flag is cleared in dentry_lru_del() in case the dentry gets a
    new reference just before being pruned.

    If the dentry has this flag, select_parent() will skip it and let
    shrink_dentry_list() retry pruning it. With select_parent() skipping those
    dentries there will not be the appearance of progress (new dentries found) when
    there is none, hence shrink_dcache_parent() will not loop forever.

    The flag is also set in prune_dcache_sb() for consistency as suggested by
    Linus.

    Signed-off-by: Miklos Szeredi
    CC: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Miklos Szeredi
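
    The "no stealing" rule can be modelled in a few lines. This is a
    sketch with invented names and a flat array instead of a dentry tree,
    with no locking at all; it only shows that children already claimed
    for disposal no longer count as newly found work.

```c
#include <stddef.h>

#define SK_SHRINK_LIST 0x1u	/* hypothetical "already on a dispose list" flag */

struct child_sketch {
	unsigned int flags;
};

/* Collect unflagged children for disposal; return how many were newly found.
 * Children already claimed by another shrinker are skipped, so they do not
 * count as progress. */
static size_t select_parent_sketch(struct child_sketch *kids, size_t n)
{
	size_t found = 0;

	for (size_t i = 0; i < n; i++) {
		if (kids[i].flags & SK_SHRINK_LIST)
			continue;
		kids[i].flags |= SK_SHRINK_LIST;
		found++;
	}
	return found;
}
```

    A second caller walking the same children gets 0 back and stops,
    instead of repeatedly "finding" entries another CPU is already
    pruning.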
     

10 Jan, 2012

2 commits

  • d_alloc_root() with iput() in case of allocation failure...

    Signed-off-by: Al Viro

    Al Viro
     
  • select_parent currently abuses the dentry cache LRU to provide
    cleanup features for child dentries that need to be freed. It moves
    them to the tail of the LRU, then tells shrink_dcache_parent() to
    call __shrink_dcache_sb() to unconditionally move them to a dispose
    list (as DCACHE_REFERENCED is ignored). __shrink_dcache_sb() has to
    relock the dentries to move them off the LRU onto the dispose list,
    but otherwise does not touch the dentries that select_parent() moved
    to the tail of the LRU. It then passes the dispose list to
    shrink_dentry_list() which tries to free the dentries.

    IOWs, the use of __shrink_dcache_sb() is superfluous - we can build
    exactly the same list of dentries for disposal directly in
    select_parent() and call shrink_dentry_list() instead of calling
    __shrink_dcache_sb() to do that. This means that we avoid long holds
    on the lru lock walking the LRU moving dentries to the dispose list.
    We also avoid the need to relock each dentry just to move it off the
    LRU, reducing the number of times we lock each dentry to dispose of
    them in shrink_dcache_parent() from 3 to 2 times.

    Further, we remove one of the two callers of __shrink_dcache_sb().
    This also means that __shrink_dcache_sb can be moved back into
    prune_dcache_sb() and we no longer have to handle referenced
    dentries conditionally, simplifying the code.

    Signed-off-by: Dave Chinner
    Signed-off-by: Linus Torvalds
    Signed-off-by: Al Viro

    Dave Chinner