14 Jan, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
    rbd: fix cleanup when trying to mount inexistent image
    net/ceph: make ceph_msgr_wq non-reentrant
    ceph: fsc->*_wq's aren't used in memory reclaim path
    ceph: Always free allocated memory in osdmap_decode()
    ceph: Makefile: Remove unnessary code
    ceph: associate requests with opening sessions
    ceph: drop redundant r_mds field
    ceph: implement DIRLAYOUTHASH feature to get dir layout from MDS
    ceph: add dir_layout to inode

    Linus Torvalds
     

13 Jan, 2011

6 commits

  • fsc->*_wq's aren't depended upon during memory reclaim. Convert to
    alloc_workqueue() w/o WQ_MEM_RECLAIM.

    Signed-off-by: Tejun Heo
    Cc: Sage Weil
    Cc: ceph-devel@vger.kernel.org
    Signed-off-by: Sage Weil

    Tejun Heo
     
  • Remove the if/else conditional: the code is in mainline, so there is no
    need for it.

    Also change the Makefile to use -y instead of -objs,
    because -objs is deprecated and not mentioned in
    Documentation/kbuild/makefiles.txt.

    Signed-off-by: Tracey Dent
    Signed-off-by: Sage Weil

    Tracey Dent
     
  • Associate requests with sessions that aren't yet open. This makes the
    debugfs mdsc request list more informative.

    Signed-off-by: Sage Weil

    Sage Weil
     
  • The r_mds field is redundant, since we can find the same information at
    r_session->s_mds, and when r_session is NULL then r_mds is meaningless.

    Signed-off-by: Sage Weil

    Sage Weil
     
  • This implements the DIRLAYOUTHASH protocol feature, which passes the dir
    layout over the wire from the MDS. This gives the client knowledge
    of the correct hash function to use for mapping dentries among dir
    fragments.

    Note that if this feature is _not_ present on the client but is on the
    MDS, the client may misdirect requests. This will result in a forward
    and degrade performance. It may also result in inaccurate NFS filehandle
    generation, which will prevent fh resolution when the inode is not present
    in the client cache and the parent directories have been fragmented.

    Signed-off-by: Sage Weil

    Sage Weil
     
  • Add a ceph_dir_layout to the inode, and calculate dentry hash values based
    on the parent directory's specified dir_hash function. This is needed
    because the old default Linux dcache hash function is extremely weak and
    leads to a poor distribution of files among dir fragments.

    Signed-off-by: Sage Weil

    Sage Weil
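
The idea in the commit above (pick the dentry hash according to the parent directory's layout) can be sketched in userspace C as follows. This is a minimal illustration under stated assumptions, not the kernel code: the type names, the enum values, the additive hash, the use of FNV-1a as the "stronger" function, and the modulo mapping onto fragments are all hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hash selectors; not Ceph's actual identifiers. */
enum dir_hash_type {
    DIR_HASH_ADDITIVE, /* deliberately weak: sum of the name's bytes */
    DIR_HASH_FNV1A     /* stronger stand-in, NOT the function Ceph uses */
};

/* Hypothetical per-directory layout carried by the inode. */
struct dir_layout {
    enum dir_hash_type dl_dir_hash;
};

static uint32_t hash_additive(const char *name, size_t len)
{
    uint32_t h = 0;
    for (size_t i = 0; i < len; i++)
        h += (unsigned char)name[i];
    return h;
}

static uint32_t hash_fnv1a(const char *name, size_t len)
{
    uint32_t h = 2166136261u;          /* FNV-1a 32-bit offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)name[i];
        h *= 16777619u;                /* FNV-1a 32-bit prime */
    }
    return h;
}

/* Hash a dentry name with whatever function the parent directory specifies. */
uint32_t dentry_hash(const struct dir_layout *dl, const char *name, size_t len)
{
    assert(dl != NULL && name != NULL);
    switch (dl->dl_dir_hash) {
    case DIR_HASH_FNV1A:
        return hash_fnv1a(name, len);
    case DIR_HASH_ADDITIVE:
    default:
        return hash_additive(name, len);
    }
}

/* Simplified mapping of a dentry onto one of nfrags directory fragments. */
unsigned int frag_for_dentry(const struct dir_layout *dl, const char *name,
                             size_t len, unsigned int nfrags)
{
    return nfrags ? dentry_hash(dl, name, len) % nfrags : 0;
}
```

With the additive hash, names that are permutations of each other ("ab" and "ba") always land in the same fragment, which is the kind of poor distribution a weak hash produces.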
     

07 Jan, 2011

8 commits

  • Signed-off-by: Nick Piggin

    Nick Piggin
     
  • Require filesystems be aware of .d_revalidate being called in rcu-walk
    mode (nd->flags & LOOKUP_RCU). For now do a simple push down, returning
    -ECHILD from all implementations.

    Signed-off-by: Nick Piggin

    Nick Piggin
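
The "simple push down" described above amounts to each .d_revalidate implementation bailing out with -ECHILD when called in rcu-walk mode. A userspace mock of that shape is sketched below; the struct definitions, the flag value, and the function name are assumptions made so the example compiles on its own, and only the LOOKUP_RCU test and the -ECHILD return come from the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Illustrative value; the real constant lives in the kernel headers. */
#define LOOKUP_RCU 0x0040u

/* Cut-down stand-ins for the kernel structures. */
struct nameidata {
    unsigned int flags;
};

struct dentry {
    int still_valid; /* hypothetical: result of the filesystem's own check */
};

/*
 * Hypothetical filesystem d_revalidate: refuse to run in rcu-walk mode
 * and return -ECHILD so the caller can retry in ref-walk mode.
 */
int example_d_revalidate(struct dentry *dentry, struct nameidata *nd)
{
    assert(dentry != NULL);

    if (nd && (nd->flags & LOOKUP_RCU))
        return -ECHILD;

    /* Normal (ref-walk) path: 1 means valid, 0 means invalid. */
    return dentry->still_valid ? 1 : 0;
}
```

The NULL check on nd is a defensive choice in this mock, not something the commit message states.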
     
  • Reduce some branches and memory accesses in dcache lookup by adding dentry
    flags to indicate common d_ops are set, rather than having to check them.
    This saves a pointer memory access (dentry->d_op) in common path lookup
    situations, and saves another pointer load and branch in cases where we
    have d_op but not the particular operation.

    Patched with:

    git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i

    Signed-off-by: Nick Piggin

    Nick Piggin
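
A rough userspace sketch of the flag caching described above: d_set_d_op() records in d_flags which operations are present, so the lookup path can test a flag bit instead of loading dentry->d_op. The flag values and the reduced operations table here are assumptions for illustration; only the d_set_d_op() name and the general idea come from the commit message.

```c
#include <assert.h>
#include <stddef.h>

struct dentry;

/* Cut-down stand-in for the kernel's operations table. */
struct dentry_operations {
    int (*d_revalidate)(struct dentry *);
    int (*d_hash)(struct dentry *);
    int (*d_compare)(struct dentry *);
    int (*d_delete)(struct dentry *);
};

/* Flag values are illustrative, not copied from the kernel headers. */
#define DCACHE_OP_HASH       0x1000u
#define DCACHE_OP_COMPARE    0x2000u
#define DCACHE_OP_REVALIDATE 0x4000u
#define DCACHE_OP_DELETE     0x8000u
#define DCACHE_OP_MASK       0xf000u

struct dentry {
    unsigned int d_flags;
    const struct dentry_operations *d_op;
};

/* Install the ops table and cache which operations it provides. */
void d_set_d_op(struct dentry *dentry, const struct dentry_operations *op)
{
    assert(dentry != NULL);

    dentry->d_op = op;
    dentry->d_flags &= ~DCACHE_OP_MASK;
    if (!op)
        return;
    if (op->d_hash)
        dentry->d_flags |= DCACHE_OP_HASH;
    if (op->d_compare)
        dentry->d_flags |= DCACHE_OP_COMPARE;
    if (op->d_revalidate)
        dentry->d_flags |= DCACHE_OP_REVALIDATE;
    if (op->d_delete)
        dentry->d_flags |= DCACHE_OP_DELETE;
}

/* Fast-path check: reads only d_flags, never dereferences d_op. */
int needs_revalidate(const struct dentry *dentry)
{
    return (dentry->d_flags & DCACHE_OP_REVALIDATE) != 0;
}

/* Example operation used to populate a table. */
int always_valid(struct dentry *dentry)
{
    (void)dentry;
    return 1;
}
```

Callers that previously assigned dentry->d_op directly would go through d_set_d_op() instead, which is what the sed expression in the commit rewrites.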
     
  • RCU free the struct inode. This will allow:

    - Subsequent store-free path walking patch. The inode must be consulted for
    permissions when walking, so an RCU inode reference is a must.
    - sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
    to take i_lock no longer need to take sb_inode_list_lock to walk the list in
    the first place. This will simplify and optimize locking.
    - Could remove some nested trylock loops in dcache code
    - Could potentially simplify things a bit in VM land. Do not need to take the
    page lock to follow page->mapping.

    The downside of this is the performance cost of using RCU. In a simple
    creat/unlink microbenchmark, performance drops by about 10% due to inability to
    reuse cache-hot slab objects. As iterations increase and RCU freeing starts
    kicking over, this increases to about 20%.

    In cases where inode lifetimes are longer (ie. many inodes may be allocated
    during the average life span of a single inode), a lot of this cache reuse is
    not applicable, so the regression caused by this patch is smaller.

    The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
    however this adds some complexity to list walking and store-free path walking,
    so I prefer to implement this at a later date, if it is shown to be a win in
    real situations. I haven't found a regression in any non-micro benchmark so I
    doubt it will be a problem.

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • dcache_lock no longer protects anything. Remove it.

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • Protect d_subdirs and d_child with d_lock, except in filesystems that aren't
    using dcache_lock for these anyway (eg. using i_mutex).

    Note: if we change the locking rule in future so that ->d_child protection is
    provided only with ->d_parent->d_lock, it may allow us to reduce some locking.
    But it would be an exception to an otherwise regular locking scheme, so we'd
    have to see some good results. Probably not worthwhile.

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • Protect d_unhashed(dentry) condition with d_lock. This means keeping
    DCACHE_UNHASHED bit in sync with hash manipulations.

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • Make d_count non-atomic and protect it with d_lock. This allows us to ensure a
    0 refcount dentry remains 0 without dcache_lock. It is also fairly natural when
    we start protecting many other dentry members with d_lock.

    Signed-off-by: Nick Piggin

    Nick Piggin
     

18 Dec, 2010

2 commits


16 Dec, 2010

1 commit


07 Dec, 2010

1 commit


02 Dec, 2010

4 commits


20 Nov, 2010

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
    ceph: fix readdir EOVERFLOW on 32-bit archs
    ceph: fix frag offset for non-leftmost frags
    ceph: fix dangling pointer
    ceph: explicitly specify page alignment in network messages
    ceph: make page alignment explicit in osd interface
    ceph: fix comment, remove extraneous args
    ceph: fix update of ctime from MDS
    ceph: fix version check on racing inode updates
    ceph: fix uid/gid on resent mds requests
    ceph: fix rdcache_gen usage and invalidate
    ceph: re-request max_size if cap auth changes
    ceph: only let auth caps update max_size
    ceph: fix open for write on clustered mds
    ceph: fix bad pointer dereference in ceph_fill_trace
    ceph: fix small seq message skipping
    Revert "ceph: update issue_seq on cap grant"

    Linus Torvalds
     

19 Nov, 2010

1 commit

  • One of the readdir filldir_t callers was passing the raw ceph 64-bit ino
    instead of the hashed 32-bit one, producing an EOVERFLOW in the filler
    callback. Fix this by calling the ceph_vino_to_ino() helper to do the
    conversion.

    Reported-by: Jan Smets
    Tested-by: Jan Smets
    Signed-off-by: Sage Weil

    Sage Weil
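
To make the 64-bit versus 32-bit distinction concrete, here is one plausible way a 64-bit inode number can be folded into 32 bits. This is a sketch under assumptions: the function name is hypothetical, and the XOR fold and the zero-avoidance step are illustrative choices, not a statement of what ceph_vino_to_ino() does internally.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical fold of a 64-bit inode number into 32 bits:
 * XOR the high half into the low half, and avoid returning 0.
 */
uint32_t fold_ino64_to_ino32(uint64_t ino64)
{
    uint32_t ino = (uint32_t)(ino64 & 0xffffffffu);

    ino ^= (uint32_t)(ino64 >> 32);
    if (ino == 0)
        ino = 1; /* assumption: 0 is reserved as "no inode" */

    assert(ino != 0);
    return ino;
}
```

A readdir caller that hands the raw 64-bit value to a callback expecting a value that fits in 32 bits is the mismatch that produced the EOVERFLOW; passing the folded value avoids it.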
     

18 Nov, 2010

1 commit


12 Nov, 2010

2 commits


10 Nov, 2010

2 commits


09 Nov, 2010

2 commits

  • The client can have a newer ctime than the MDS due to AUTH_EXCL and
    XATTR_EXCL caps as well; update the check in ceph_fill_file_time
    appropriately.

    This fixes cases where ctime/mtime goes backward under the right sequence
    of local updates (e.g. chmod) and mds replies (e.g. subsequent stat that
    goes to the MDS).

    Signed-off-by: Sage Weil

    Sage Weil
     
  • We may get updates on the same inode from multiple MDSs; generally we only
    pay attention if the update is newer than what we already have. The
    exception is when an MDS sends unstable information, in which case we
    always update.

    The old > check got this wrong when our version was odd (e.g. 3) and the
    reply version was even (e.g. 2): the older stale (v2) info would be
    applied. Fixed and clarified the comment.

    Signed-off-by: Sage Weil

    Sage Weil
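
The off-by-one in the version check can be illustrated with a small sketch. It assumes that odd versions mark unstable information and even versions mark stable information, and that the old check masked off the low bit before comparing; both are inferences from the example in the commit message, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Reconstructed old check: skip the update only if ours is strictly newer. */
bool skip_update_old(uint64_t ours, uint64_t theirs)
{
    return (ours & ~1ULL) > theirs;
}

/* Sketch of the fix: also skip when our stable base equals their version. */
bool skip_update_fixed(uint64_t ours, uint64_t theirs)
{
    return (ours & ~1ULL) >= theirs;
}

/* Self-check mirroring the example in the commit message. */
void check_version_examples(void)
{
    /* ours = 3 (odd), theirs = 2 (even): the old check applied stale v2. */
    assert(!skip_update_old(3, 2));
    assert(skip_update_fixed(3, 2));

    /* ours = 2, theirs = 2: nothing new, skip. */
    assert(skip_update_fixed(2, 2));

    /* ours = 3, theirs = 3: unstable info at our version, update. */
    assert(!skip_update_fixed(3, 3));
}
```

The case (ours = 3, theirs = 2) is the one the commit message calls out: the masked value 2 is not greater than 2, so the old comparison let the stale reply through.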
     

08 Nov, 2010

6 commits

  • MDS requests can be rebuilt and resent in non-process context, but were
    filling in uid/gid from current_fsuid/gid. Put that information in the
    request struct on request setup.

    This fixes incorrect (and root) uid/gid getting set for requests that
    are forwarded between MDSs, usually due to metadata migrations.

    Signed-off-by: Sage Weil

    Sage Weil
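
A sketch of the pattern behind this fix: capture the caller's credentials once at request setup, then use the stored values whenever the message is (re)built, since a rebuild may run in a worker context whose "current" credentials are unrelated. All names below, including the current_cred stand-in, are hypothetical and exist only to make the example self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct cred {
    uint32_t fsuid;
    uint32_t fsgid;
};

/* Stand-in for the credentials of whatever context is running right now. */
static struct cred current_cred = { 0, 0 };

/* Hypothetical request: remembers who issued it. */
struct mds_request {
    uint32_t r_uid;
    uint32_t r_gid;
};

/* Hypothetical wire header filled in on every (re)send. */
struct request_head {
    uint32_t caller_uid;
    uint32_t caller_gid;
};

/* Runs in the issuing process: snapshot its credentials. */
void request_setup(struct mds_request *req)
{
    assert(req != NULL);
    req->r_uid = current_cred.fsuid;
    req->r_gid = current_cred.fsgid;
}

/* May run in any context: must not consult current_cred. */
void request_build_head(const struct mds_request *req,
                        struct request_head *head)
{
    assert(req != NULL && head != NULL);
    head->caller_uid = req->r_uid;
    head->caller_gid = req->r_gid;
}
```

If request_build_head() read current_cred instead, a resend from a worker running as root would stamp uid/gid 0 on the request, which matches the symptom described in the commit message.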
     
  • We used to use rdcache_gen to indicate whether we "might" have cached
    pages. Now we just look at the mapping to determine that. However, some
    old behavior remains from that transition.

    First, rdcache_gen == 0 no longer means we have no pages. That can happen
    at any time (presumably when we carry FILE_CACHE). We should not reset it
    to zero, and we should not check that it is zero.

    That means that the only purpose for rdcache_revoking is to resolve races
    between new issues of FILE_CACHE and an async invalidate. If they are
    equal, we should invalidate. On success, we decrement rdcache_revoking,
    so that it is no longer equal to rdcache_gen. Similarly, if we succeed
    in doing a sync invalidate, set revoking = gen - 1. (This is a small
    optimization to avoid doing unnecessary invalidate work and does not
    affect correctness.)

    Signed-off-by: Sage Weil

    Sage Weil
     
  • If the auth cap migrates to another MDS, clear requested_max_size so that
    we resend any pending max_size increase requests. This fixes potential
    hangs on writes that extend a file and race with a cap migration between
    MDSs.

    Signed-off-by: Sage Weil

    Sage Weil
     
  • Only the auth MDS has a meaningful max_size value for us, so only update it
    in fill_inode if we're being issued an auth cap. Otherwise, a random
    stat result from a non-auth MDS can clobber a meaningful max_size, get
    the client/MDS cap state out of sync, and make writes hang.

    Specifically, even if the client re-requests a larger max_size (which it
    will), the MDS won't respond because as far as it knows we already have a
    sufficiently large value.

    Signed-off-by: Sage Weil

    Sage Weil
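
The rule in this commit reduces to a guard on the max_size update. A minimal sketch, with hypothetical struct and function names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical slice of the client's per-inode state. */
struct inode_state {
    uint64_t max_size;
};

/*
 * Apply max_size from an MDS reply only when that reply issues an auth cap;
 * values from non-auth MDSs are ignored so they cannot clobber a good one.
 */
void fill_max_size(struct inode_state *ci, uint64_t reported_max_size,
                   bool issued_auth_cap)
{
    assert(ci != NULL);
    if (issued_auth_cap)
        ci->max_size = reported_max_size;
}
```

Without the guard, a stat reply from a non-auth MDS could overwrite max_size with a stale value while the auth MDS still believes the client holds the larger one.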
     
  • Normally when we open a file we already have a cap, and simply update the
    wanted set. However, if we open a file for write, but don't have an auth
    cap, that doesn't work; we need to open a new cap with the auth MDS. Only
    reuse existing caps if we are opening for read or the existing cap is auth.

    Signed-off-by: Sage Weil

    Sage Weil
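
The reuse condition stated in the last sentence of this commit message can be written as a small predicate. The function and parameter names are hypothetical; the logic is only what the message states.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide whether an open can reuse a cap we already hold, or whether a new
 * cap must be opened with the auth MDS.
 */
bool can_reuse_existing_cap(bool have_cap, bool cap_is_auth, bool want_write)
{
    /* A cap cannot be auth if we do not hold one at all. */
    assert(have_cap || !cap_is_auth);

    if (!have_cap)
        return false;
    /* Reads can use any cap; writes need the auth cap. */
    return !want_write || cap_is_auth;
}
```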
     
  • We dereference *in a few lines down, but only set it on rename. It is
    apparently pretty rare for this to trigger, but I have been hitting it
    with clustered MDSs.

    Signed-off-by: Sage Weil

    Sage Weil
     

29 Oct, 2010

1 commit


28 Oct, 2010

1 commit

  • This reverts commit d91f2438d881514e4a923fd786dbd94b764a9440.

    The intent of issue_seq is to distinguish between mds->client messages that
    (re)create the cap and those that do not, which means we should _only_ be
    updating that value in the create paths. By updating it in handle_cap_grant,
    we reset it to zero, which then breaks release.

    The larger question is what workload/problem made me think it should be
    updated here...

    Signed-off-by: Sage Weil

    Sage Weil