14 Jan, 2012

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
    ceph: ensure prealloc_blob is in place when removing xattr
    rbd: initialize snap_rwsem in rbd_add()
    ceph: enable/disable dentry complete flags via mount option
    vfs: export symbol d_find_any_alias()
    ceph: always initialize the dentry in open_root_dentry()
    libceph: remove useless return value for osd_client __send_request()
    ceph: avoid iput() while holding spinlock in ceph_dir_fsync
    ceph: avoid useless dget/dput in encode_fh
    ceph: dereference pointer after checking for NULL
    crush: fix force for non-root TAKE
    ceph: remove unnecessary d_fsdata conditional checks
    ceph: Use kmemdup rather than duplicating its implementation

    Fix up conflicts in fs/ceph/super.c (d_alloc_root() failure handling vs
    always initialize the dentry in open_root_dentry)

    Linus Torvalds
     

04 Jan, 2012

1 commit

  • Seeing that just about every destructor got that INIT_LIST_HEAD() copied into
    it, there is no point whatsoever keeping this INIT_LIST_HEAD in inode_init_once();
    the cost of taking it into inode_init_always() will be negligible for pipes
    and sockets and negative for everything else. Not to mention the removal of
    boilerplate code from ->destroy_inode() instances...

    Signed-off-by: Al Viro

    Al Viro
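
    A minimal sketch of the consolidation described above, assuming the
    2012-era VFS (elisions marked in comments); this is an illustration,
    not the literal diff:

    int inode_init_always(struct super_block *sb, struct inode *inode)
    {
            /* ... existing per-allocation field setup ... */
            INIT_LIST_HEAD(&inode->i_dentry);  /* moved from inode_init_once() */
            return 0;
    }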
     

08 Dec, 2011

1 commit

  • We have been using i_lock to protect all kinds of data structures in the
    ceph_inode_info struct, including lists of inodes that we need to iterate
    over while avoiding races with inode destruction. That requires grabbing
    a reference to the inode while the list lock is held, but igrab() now
    takes i_lock to check the inode flags.

    Changing the list lock ordering would be a painful process.

    However, using a ceph-specific i_ceph_lock in the ceph inode instead of
    i_lock is a simple mechanical change and avoids the ordering constraints
    imposed by igrab() (see the sketch below).

    Reported-by: Amon Ott
    Signed-off-by: Sage Weil

    Sage Weil
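
    A minimal sketch of why the dedicated lock helps (the list-walk
    context is illustrative): igrab() takes i_lock internally, so it is
    now safe to call with the ceph list lock held:

    struct ceph_inode_info *ci;     /* an entry found on a ceph-internal list */
    struct inode *inode;

    spin_lock(&ci->i_ceph_lock);
    inode = igrab(&ci->vfs_inode);  /* takes i_lock, not i_ceph_lock */
    spin_unlock(&ci->i_ceph_lock);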
     

06 Nov, 2011

2 commits

  • If we queue a work item that calls iput(), make sure we ihold() before
    attempting to queue work. Otherwise our queued work might miraculously run
    before we notice the queue_work() succeeded and call ihold(), allowing the
    inode to be destroyed.

    That is, instead of

    if (queue_work(...))
            ihold();

    we need to do

    ihold();
    if (!queue_work(...))
            iput();

    Reported-by: Amon Ott
    Signed-off-by: Sage Weil

    Sage Weil
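
    A minimal sketch of the corrected pattern in context; the workqueue
    (my_wq) and work item (ci->i_work) names are illustrative, not the
    literal ceph code:

    static void ceph_queue_inode_work(struct inode *inode)
    {
            struct ceph_inode_info *ci = ceph_inode(inode);

            ihold(inode);                         /* pin before queueing */
            if (!queue_work(my_wq, &ci->i_work))  /* false: already queued */
                    iput(inode);                  /* drop the extra ref */
    }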
     
  • We used to use a flag on the directory inode to track whether the dcache
    contents for a directory were a complete cached copy. Switch to a dentry
    flag CEPH_D_COMPLETE that is safely updated by ->d_prune() (sketched below).

    Signed-off-by: Sage Weil

    Sage Weil
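
    A minimal sketch of the ->d_prune() hook, following the shape of the
    patch (treat the helper name ceph_dir_clear_complete() and the checks
    as an approximation):

    static void ceph_d_prune(struct dentry *dentry)
    {
            /* an unhashed dentry cannot affect the parent's view */
            if (IS_ROOT(dentry) || d_unhashed(dentry))
                    return;

            /* d_lock is held by the caller, so d_parent is stable */
            ceph_dir_clear_complete(dentry->d_parent->d_inode);
    }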
     

26 Oct, 2011

1 commit

  • This reverts commit c9af9fb68e01eb2c2165e1bc45cfeeed510c64e6.

    We need to block and truncate all pages in order to reliably invalidate
    them. Otherwise, we could:

    - have some uptodate pages in the cache
    - queue an invalidate
    - write(2) locks some pages
    - invalidate_work skips them
    - write(2) only overwrites part of the page
    - page now dirty and uptodate
    -> partial leakage of invalidated data

    It's not entirely clear why we started skipping locked pages in the first
    place. I just ran this through fsx and didn't see any problems.

    Signed-off-by: Sage Weil

    Sage Weil
     

27 Jul, 2011

5 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits)
    ceph: document unlocked d_parent accesses
    ceph: explicitly reference rename old_dentry parent dir in request
    ceph: document locking for ceph_set_dentry_offset
    ceph: avoid d_parent in ceph_dentry_hash; fix ceph_encode_fh() hashing bug
    ceph: protect d_parent access in ceph_d_revalidate
    ceph: protect access to d_parent
    ceph: handle racing calls to ceph_init_dentry
    ceph: set dir complete frag after adding capability
    rbd: set blk_queue request sizes to object size
    ceph: set up readahead size when rsize is not passed
    rbd: cancel watch request when releasing the device
    ceph: ignore lease mask
    ceph: fix ceph_lookup_open intent usage
    ceph: only link open operations to directory unsafe list if O_CREAT|O_TRUNC
    ceph: fix bad parent_inode calc in ceph_lookup_open
    ceph: avoid carrying Fw cap during write into page cache
    libceph: don't time out osd requests that haven't been received
    ceph: report f_bfree based on kb_avail rather than diffing.
    ceph: only queue capsnap if caps are dirty
    ceph: fix snap writeback when racing with writes
    ...

    Linus Torvalds
     
  • Reviewed-by: Yehuda Sadeh
    Signed-off-by: Sage Weil

    Sage Weil
     
  • d_parent is protected by d_lock: use it when looking up a dentry's parent
    directory inode. Also take a reference and drop it in the caller to avoid
    a use-after-free (see the sketch below).

    Reported-by: Al Viro
    Reviewed-by: Yehuda Sadeh
    Signed-off-by: Sage Weil

    Sage Weil
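
    A minimal sketch of the rule, modeled on the helper this series adds
    (the name ceph_get_dentry_parent_inode and details may differ):

    struct inode *ceph_get_dentry_parent_inode(struct dentry *dentry)
    {
            struct inode *dir = NULL;

            spin_lock(&dentry->d_lock);
            if (dentry->d_parent) {
                    dir = dentry->d_parent->d_inode;
                    ihold(dir);     /* caller drops this with iput() */
            }
            spin_unlock(&dentry->d_lock);
            return dir;
    }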
     
  • Currently ceph_add_cap clears the complete bit if we are newly issued the
    FILE_SHARED cap, which is normally the case for a newly issued cap on a new
    directory. That means we clear the just-set bit. Move the check that sets
    the flag to after the cap is added/updated.

    Reviewed-by: Yehuda Sadeh
    Signed-off-by: Sage Weil

    Sage Weil
     
  • The lease mask is no longer used (and it changed a while back). Instead,
    use a non-zero duration to indicate that there is a lease being issued.

    Reviewed-by: Yehuda Sadeh
    Signed-off-by: Sage Weil

    Sage Weil
     

12 May, 2011

1 commit

  • We increment i_wrbuffer_ref when taking the Fb cap. This breaks
    the dirty page accounting and causes looping in
    __ceph_do_pending_vmtruncate, and the ceph client hangs.

    This bug can be reproduced occasionally by running blogbench.

    Add a new field i_wb_ref to the inode and dedicate it to Fb reference
    counting (see the sketch below).

    Signed-off-by: Henry C Chang
    Signed-off-by: Sage Weil

    Henry C Chang
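
    A minimal sketch of the dedicated counter, simplified from the
    cap-reference bookkeeping (CEPH_CAP_FILE_BUFFER is the Fb bit):

    if (got & CEPH_CAP_FILE_BUFFER) {
            if (ci->i_wb_ref == 0)
                    igrab(&ci->vfs_inode);  /* first Fb ref pins the inode */
            ci->i_wb_ref++;  /* i_wrbuffer_ref stays dedicated to dirty pages */
    }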
     

16 Mar, 2011

1 commit

  • d_move puts the renamed dentry at the end of d_subdirs, screwing with our
    cached dentry directory offsets. We were just clearing I_COMPLETE to avoid
    any possibility of trouble. However, assigning the renamed dentry an
    offset at the end of the directory (to match its new d_subdirs position)
    is sufficient to maintain correct behavior and hold onto I_COMPLETE.

    This is especially important for workloads like rsync, which renames files
    into place. Before, we would lose I_COMPLETE and do MDS lookups for each
    file. With this patch we only talk to the MDS on create and rename.

    Signed-off-by: Sage Weil

    Sage Weil
     

13 Jan, 2011

2 commits

  • This implements the DIRLAYOUTHASH protocol feature, which passes the dir
    layout over the wire from the MDS. This gives the client knowledge
    of the correct hash function to use for mapping dentries among dir
    fragments.

    Note that if this feature is _not_ present on the client but is on the
    MDS, the client may misdirect requests. This will result in a forward
    and degrade performance. It may also result in inaccurate NFS filehandle
    generation, which will prevent fh resolution when the inode is not present
    in the client cache and the parent directories have been fragmented.

    Signed-off-by: Sage Weil

    Sage Weil
     
  • Add a ceph_dir_layout to the inode, and calculate dentry hash values based
    on the parent directory's specified dir_hash function. This is needed
    because the old default Linux dcache hash function is extremely weak and
    leads to a poor distribution of files among dir fragments (see the sketch
    below).

    Signed-off-by: Sage Weil

    Sage Weil
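
    A minimal sketch of the hash selection, following the shape of the
    patch (ceph_str_hash() and CEPH_STR_HASH_LINUX come from libceph):

    unsigned ceph_dentry_hash(struct dentry *dn)
    {
            struct ceph_inode_info *dci = ceph_inode(dn->d_parent->d_inode);

            switch (dci->i_dir_layout.dl_dir_hash) {
            case 0:                          /* backward compatibility */
            case CEPH_STR_HASH_LINUX:
                    return dn->d_name.hash;  /* default dcache hash */
            default:
                    return ceph_str_hash(dci->i_dir_layout.dl_dir_hash,
                                         dn->d_name.name, dn->d_name.len);
            }
    }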
     

07 Jan, 2011

5 commits

  • Signed-off-by: Nick Piggin

    Nick Piggin
     
  • RCU free the struct inode. This will allow:

    - Subsequent store-free path walking patch. The inode must be consulted
      for permissions when walking, so an RCU inode reference is a must.
    - sb_inode_list_lock to be moved inside i_lock because sb list walkers
      who want to take i_lock no longer need to take sb_inode_list_lock to
      walk the list in the first place. This will simplify and optimize
      locking.
    - Could remove some nested trylock loops in dcache code.
    - Could potentially simplify things a bit in VM land. Do not need to
      take the page lock to follow page->mapping.

    The downside of this is the performance cost of using RCU. In a simple
    creat/unlink microbenchmark, performance drops by about 10% due to inability to
    reuse cache-hot slab objects. As iterations increase and RCU freeing starts
    kicking over, this increases to about 20%.

    In cases where inode lifetimes are longer (ie. many inodes may be allocated
    during the average life span of a single inode), a lot of this cache reuse is
    not applicable, so the regression caused by this patch is smaller.

    The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
    however this adds some complexity to list walking and store-free path walking,
    so I prefer to implement this at a later date, if it is shown to be a win in
    real situations. I haven't found a regression in any non-micro benchmark so I
    doubt it will be a problem.

    Signed-off-by: Nick Piggin

    Nick Piggin
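
    A minimal sketch of the RCU deferral, following the shape of the
    patch (the real change also overlays i_rcu with i_dentry in a union):

    static void i_callback(struct rcu_head *head)
    {
            struct inode *inode = container_of(head, struct inode, i_rcu);
            kmem_cache_free(inode_cachep, inode);
    }

    static void destroy_inode(struct inode *inode)
    {
            /* ... teardown that needs process context ... */
            call_rcu(&inode->i_rcu, i_callback);  /* free after a grace period */
    }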
     
  • dcache_lock no longer protects anything. Remove it.

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • Protect d_subdirs and d_child with d_lock, except in filesystems that aren't
    using dcache_lock for these anyway (eg. using i_mutex).

    Note: if we change the locking rule in future so that ->d_child protection is
    provided only with ->d_parent->d_lock, it may allow us to reduce some locking.
    But it would be an exception to an otherwise regular locking scheme, so we'd
    have to see some good results. Probably not worthwhile.

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • Make d_count non-atomic and protect it with d_lock. This allows us to ensure a
    0 refcount dentry remains 0 without dcache_lock. It is also fairly natural when
    we start protecting many other dentry members with d_lock.

    Signed-off-by: Nick Piggin

    Nick Piggin
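
    A minimal sketch of taking a reference under the new rule, following
    the shape of the patch:

    /* caller already holds dentry->d_lock */
    static inline struct dentry *dget_dlock(struct dentry *dentry)
    {
            if (dentry)
                    dentry->d_count++;  /* plain int now, not atomic_t */
            return dentry;
    }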
     

20 Nov, 2010

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
    ceph: fix readdir EOVERFLOW on 32-bit archs
    ceph: fix frag offset for non-leftmost frags
    ceph: fix dangling pointer
    ceph: explicitly specify page alignment in network messages
    ceph: make page alignment explicit in osd interface
    ceph: fix comment, remove extraneous args
    ceph: fix update of ctime from MDS
    ceph: fix version check on racing inode updates
    ceph: fix uid/gid on resent mds requests
    ceph: fix rdcache_gen usage and invalidate
    ceph: re-request max_size if cap auth changes
    ceph: only let auth caps update max_size
    ceph: fix open for write on clustered mds
    ceph: fix bad pointer dereference in ceph_fill_trace
    ceph: fix small seq message skipping
    Revert "ceph: update issue_seq on cap grant"

    Linus Torvalds
     

10 Nov, 2010

1 commit

  • We used to infer alignment of IOs within a page based on the file offset,
    which assumed they matched. This broke with direct IO that was not aligned
    to pages (e.g., 512-byte aligned IO). We were also trusting the alignment
    specified in the OSD reply, which could have been adjusted by the server.

    Explicitly specify the page alignment when setting up OSD IO requests
    (see the sketch below).

    Signed-off-by: Sage Weil

    Sage Weil
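
    A minimal sketch of the distinction (off is the file offset and data
    the user buffer of a direct IO; variable names are illustrative):

    int page_align;

    /* before: alignment within the page inferred from the file offset */
    page_align = off & ~PAGE_MASK;                 /* wrong for O_DIRECT */

    /* after: the buffer's own alignment is passed down explicitly */
    page_align = (unsigned long)data & ~PAGE_MASK;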
     

09 Nov, 2010

2 commits

  • The client can have a newer ctime than the MDS due to AUTH_EXCL and
    XATTR_EXCL caps as well; update the check in ceph_fill_file_time
    appropriately (sketched below).

    This fixes cases where ctime/mtime goes backward under the right sequence
    of local updates (e.g. chmod) and mds replies (e.g. subsequent stat that
    goes to the MDS).

    Signed-off-by: Sage Weil

    Sage Weil
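
    A minimal sketch of the widened check (ctime is the MDS-supplied value
    passed to ceph_fill_file_time; cap constants from ceph_fs.h):

    if (issued & (CEPH_CAP_FILE_EXCL | CEPH_CAP_FILE_WR |
                  CEPH_CAP_AUTH_EXCL | CEPH_CAP_XATTR_EXCL)) {
            /* the client may legitimately hold a newer ctime */
            if (timespec_compare(ctime, &inode->i_ctime) > 0)
                    inode->i_ctime = *ctime;
    }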
     
  • We may get updates on the same inode from multiple MDSs; generally we only
    pay attention if the update is newer than what we already have. The
    exception is when an MDS sends unstable information, in which case we
    always update.

    The old > check got this wrong when our version was odd (e.g. 3) and the
    reply version was even (e.g. 2): the older stale (v2) info would be
    applied (see the sketch below). Fixed and clarified the comment.

    Signed-off-by: Sage Weil

    Sage Weil
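
    A minimal sketch of the corrected comparison; the MDS reports an odd
    version for unstable values, so masking off the low bit lets our newer
    unstable 3 beat a stale 2:

    /*  ours  theirs  result
     *   2      2     skip
     *   3      2     skip   (the old '>' check wrongly applied this)
     *   3      3     apply
     */
    if (le64_to_cpu(info->version) > 0 &&
        (ci->i_version & ~1) >= le64_to_cpu(info->version))
            goto no_change;  /* keep the info we already have */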
     

08 Nov, 2010

3 commits

  • We used to use rdcache_gen to indicate whether we "might" have cached
    pages. Now we just look at the mapping to determine that. However, some
    old behavior remains from that transition.

    First, rdcache_gen == 0 no longer means we have no pages. That can happen
    at any time (presumably when we carry FILE_CACHE). We should not reset it
    to zero, and we should not check that it is zero.

    That means that the only purpose for rdcache_revoking is to resolve races
    between new issues of FILE_CACHE and an async invalidate. If they are
    equal, we should invalidate. On success, we decrement rdcache_revoking,
    so that it is no longer equal to rdcache_gen (see the sketch below).
    Similarly, if we succeed in doing a sync invalidate, set revoking = gen - 1.
    (This is a small optimization to avoid doing unnecessary invalidate work
    and does not affect correctness.)

    Signed-off-by: Sage Weil

    Sage Weil
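
    A minimal sketch of the async-invalidate race check described above,
    simplified from the worker:

    u32 orig_gen;

    spin_lock(&inode->i_lock);
    orig_gen = ci->i_rdcache_gen;
    spin_unlock(&inode->i_lock);

    truncate_inode_pages(inode->i_mapping, 0);

    spin_lock(&inode->i_lock);
    if (orig_gen == ci->i_rdcache_gen &&
        orig_gen == ci->i_rdcache_revoking)
            ci->i_rdcache_revoking--;  /* done; no longer == rdcache_gen */
    spin_unlock(&inode->i_lock);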
     
  • Only the auth MDS has a meaningful max_size value for us, so only update it
    in fill_inode if we're being issued an auth cap. Otherwise, a random
    stat result from a non-auth MDS can clobber a meaningful max_size, get
    the client/MDS cap state out of sync, and make writes hang (see the sketch
    below).

    Specifically, even if the client re-requests a larger max_size (which it
    will), the MDS won't respond because as far as it knows we already have a
    sufficiently large value.

    Signed-off-by: Sage Weil

    Sage Weil
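
    A minimal sketch of the guard (cap_is_auth stands in for however the
    auth-cap test reaches fill_inode; the exact plumbing is elided):

    /* only an auth cap carries an authoritative max_size */
    if (cap_is_auth &&
        ci->i_max_size != le64_to_cpu(info->max_size))
            ci->i_max_size = le64_to_cpu(info->max_size);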
     
  • We dereference *in a few lines down, but only set it on rename. It is
    apparently pretty rare for this to trigger, but I have been hitting it
    with clustered MDSs.

    Signed-off-by: Sage Weil

    Sage Weil