08 Dec, 2011

1 commit

  • We have been using i_lock to protect all kinds of data structures in the
    ceph_inode_info struct, including lists of inodes that we need to iterate
    over while avoiding races with inode destruction. That requires grabbing
    a reference to the inode while holding the list lock, but igrab() now
    takes i_lock to check the inode flags.

    Changing the list lock ordering would be a painful process.

    However, using a ceph-specific i_ceph_lock in the ceph inode instead of
    i_lock is a simple mechanical change and avoids the ordering constraints
    imposed by igrab().

    Reported-by: Amon Ott
    Signed-off-by: Sage Weil

    Sage Weil
     

27 Jul, 2011

2 commits

  • We carry a pin on the parent directory for the rename source and dest
    dentries. For the source it's r_locked_dir; we need to explicitly
    reference the old_dentry parent as well, since the dentry's d_parent may
    change between when the request was created and pinned and when it is
    freed.

    Reviewed-by: Yehuda Sadeh
    Signed-off-by: Sage Weil

    Sage Weil
     
  • The lease mask is no longer used (and it changed a while back). Instead,
    use a non-zero duration to indicate that there is a lease being issued.

    Reviewed-by: Yehuda Sadeh
    Signed-off-by: Sage Weil

    Sage Weil
     

25 May, 2011

1 commit

  • In e9964c10 we changed cap flushing to do a delicate dance because some
    inodes on the cap_dirty list could be in a migrating state (got EXPORT but
    not IMPORT) in which we couldn't actually flush and move from
    dirty->flushing, breaking the while (!empty) { process first } loop
    structure. It worked for a single sync thread, but was not reentrant and
    triggered infinite loops when multiple syncers came along.

    Instead, move inodes with dirty caps to a separate cap_dirty_migrating
    list when in the limbo export-but-no-import state, allowing us to go back
    to the simple loop structure (which was reentrant). This is cleaner and
    more robust.

    Audited the cap_dirty users and this looks fine:
    list_empty(&ci->i_dirty_item) is still a reliable indicator of whether we
    have dirty caps (which list we're on is irrelevant) and list_del_init()
    calls still do the right thing.

    Signed-off-by: Sage Weil

    Sage Weil
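
A minimal userspace sketch of the two-list idea described above, not the kernel code: the `demo_*` names and the singly linked list are illustrative assumptions. Because migrating inodes never land on the dirty list, the "while not empty, process first" loop always makes progress.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an inode with dirty caps (not the kernel struct). */
struct demo_inode {
    struct demo_inode *next;
    bool migrating;   /* got EXPORT but not yet IMPORT */
    bool flushed;
};

struct demo_list {
    struct demo_inode *head;
};

static void demo_push(struct demo_list *l, struct demo_inode *in)
{
    in->next = l->head;
    l->head = in;
}

static struct demo_inode *demo_pop(struct demo_list *l)
{
    struct demo_inode *in = l->head;

    if (in) {
        l->head = in->next;
        in->next = NULL;
    }
    return in;
}

/* Pick the list by migration state when an inode becomes dirty. */
static void demo_mark_dirty(struct demo_list *dirty,
                            struct demo_list *migrating,
                            struct demo_inode *in)
{
    demo_push(in->migrating ? migrating : dirty, in);
}

/* Simple loop: every entry on 'dirty' is flushable, so each pass removes one. */
static int demo_flush_dirty(struct demo_list *dirty)
{
    struct demo_inode *in;
    int flushed = 0;

    while ((in = demo_pop(dirty)) != NULL) {
        assert(!in->migrating);
        in->flushed = true;
        flushed++;
    }
    return flushed;
}
```

A second caller of `demo_flush_dirty()` simply finds an empty list and returns, which is the reentrancy property the commit message refers to.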
     

13 Jan, 2011

2 commits

  • The r_mds field is redundant, since we can find the same information at
    r_session->s_mds, and when r_session is NULL then r_mds is meaningless.

    Signed-off-by: Sage Weil

    Sage Weil
     
  • This implements the DIRLAYOUTHASH protocol feature, which passes the dir
    layout over the wire from the MDS. This gives the client knowledge
    of the correct hash function to use for mapping dentries among dir
    fragments.

    Note that if this feature is _not_ present on the client but is on the
    MDS, the client may misdirect requests. Each misdirected request is
    forwarded to the correct MDS, which degrades performance. It may also
    result in inaccurate NFS filehandle generation, which will prevent fh
    resolution when the inode is not present in the client cache and the
    parent directories have been fragmented.

    Signed-off-by: Sage Weil

    Sage Weil
     

02 Dec, 2010

1 commit


08 Nov, 2010

1 commit

  • MDS requests can be rebuilt and resent in non-process context, but were
    filling in uid/gid from current_fsuid/gid. Put that information in the
    request struct on request setup.

    This fixes incorrect (and root) uid/gid getting set for requests that
    are forwarded between MDSs, usually due to metadata migrations.

    Signed-off-by: Sage Weil

    Sage Weil
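
A minimal sketch of the fix described above, assuming a global `cr_current` as a stand-in for the calling task's credentials. All `cr_*` names are hypothetical. The point is only that the wire header is built from values captured at setup, so a later rebuild in another context cannot pick up the wrong identity.

```c
#include <assert.h>
#include <stddef.h>

struct cr_cred {
    unsigned int uid, gid;
};

/* Stand-in for "the current task"; a worker context would look like root. */
static struct cr_cred cr_current = { 1000, 1000 };

struct cr_request {
    unsigned int r_uid, r_gid;   /* captured once, at request setup */
};

struct cr_wire_head {
    unsigned int caller_uid, caller_gid;
};

/* Runs in the caller's context: remember who issued the request. */
static void cr_request_setup(struct cr_request *req)
{
    assert(req != NULL);
    req->r_uid = cr_current.uid;
    req->r_gid = cr_current.gid;
}

/* May run in any context (e.g. on resend): never consults cr_current. */
static void cr_build_head(const struct cr_request *req,
                          struct cr_wire_head *head)
{
    assert(req != NULL && head != NULL);
    head->caller_uid = req->r_uid;
    head->caller_gid = req->r_gid;
}
```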
     

21 Oct, 2010

1 commit

  • This factors out protocol and low-level storage parts of ceph into a
    separate libceph module living in net/ceph and include/linux/ceph. This
    is mostly a matter of moving files around. However, a few key pieces
    of the interface change as well:

    - ceph_client becomes ceph_fs_client and ceph_client, where the latter
    captures the mon and osd clients, and the fs_client gets the mds client
    and file system specific pieces.
    - Mount option parsing and debugfs setup is correspondingly broken into
    two pieces.
    - The mon client gets a generic handler callback for otherwise unknown
    messages (mds map, in this case).
    - The basic supported/required feature bits can be expanded (and are by
    ceph_fs_client).

    No functional change, aside from some subtle error handling cases that got
    cleaned up in the refactoring process.

    Signed-off-by: Sage Weil

    Yehuda Sadeh
     

23 Aug, 2010

1 commit

  • The use of a completion when waiting for session shutdown during umount is
    inappropriate, given the complexity of the condition. With multiple MDSs,
    this resulted in the umount thread spinning, often preventing the session
    close message from being processed.

    Switch to a waitqueue and define a condition helper. This cleans things
    up nicely.

    Signed-off-by: Sage Weil

    Sage Weil
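
A sketch of what such a condition helper might look like, assuming a fixed-size table of session slots where NULL means "closed". The `wq_*` names and the table layout are illustrative, not taken from the commit. In the kernel, a predicate of this kind would be the condition argument to a `wait_event()`-style macro. Only the predicate is shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define WQ_MAX_SESSIONS 4

struct wq_session {
    int mds;
};

struct wq_client {
    struct wq_session *sessions[WQ_MAX_SESSIONS];   /* NULL = closed */
};

/* True once every session slot has been torn down. */
static bool wq_done_closing_sessions(const struct wq_client *c)
{
    assert(c != NULL);
    for (size_t i = 0; i < WQ_MAX_SESSIONS; i++)
        if (c->sessions[i])
            return false;
    return true;
}
```

Re-evaluating one predicate on every wakeup copes with several sessions closing in any order, which a single completion does not express well.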
     

02 Aug, 2010

4 commits


17 Jul, 2010

1 commit


11 Jun, 2010

2 commits


18 May, 2010

3 commits

  • We invalidate I_COMPLETE and dentry leases in two places: on aborted mds
    request and on request replay. Use common helper to avoid duplicate code.

    Signed-off-by: Sage Weil

    Sage Weil
     
  • When we abort requests we need to prevent fill_trace et al from doing
    anything that relies on locks held by the VFS caller. This fixes a race
    between the reply handler and the abort code, ensuring that we continue
    holding the dir mutex until the reply handler completes.

    Signed-off-by: Sage Weil

    Sage Weil
     
  • We would occasionally BUG out in the reply handler because r_reply was
    nonzero, due to a race with ceph_mdsc_do_request temporarily setting
    r_reply to an ERR_PTR value. This is unnecessary, messy, and also wrong
    in the EIO case.

    Clean up by consistently using r_err for errors and r_reply for messages.
    Also fix the abort logic to trigger consistently for all errors that return
    to the caller early (e.g., EIO from timeout case). If an abort races with
    a reply, use the result from the reply.

    Also fix locking for r_err, r_reply update in the reply handler.

    Signed-off-by: Sage Weil

    Sage Weil
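
A minimal sketch of the "errors in r_err, messages in r_reply" split, with locking omitted. The field names follow the commit message, while the `rq_*` types and functions are hypothetical. It also shows the stated rule that a reply wins when it races with an abort.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct rq_msg {
    int result;              /* result carried by the reply */
};

struct rq_request {
    int r_err;               /* errors only, never a message pointer */
    struct rq_msg *r_reply;  /* messages only, never an ERR_PTR-style value */
    bool r_aborted;
};

/* Reply handler: record the first reply that arrives. */
static void rq_handle_reply(struct rq_request *req, struct rq_msg *msg)
{
    assert(req != NULL && msg != NULL);
    if (!req->r_reply)
        req->r_reply = msg;
}

/* Caller side, after waiting; wait_err < 0 means the wait ended early. */
static int rq_finish(struct rq_request *req, int wait_err)
{
    assert(req != NULL);
    if (req->r_reply)                 /* abort raced with a reply: use reply */
        return req->r_reply->result;
    if (wait_err < 0) {               /* any early return marks the abort */
        req->r_err = wait_err;
        req->r_aborted = true;
    }
    return req->r_err;
}
```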
     

18 Feb, 2010

1 commit

  • We need to be able to iterate over all caps on a session with a
    possibly slow callback on each cap. To allow this, we used to
    prevent cap reordering while we were iterating. However, we were
    not safe from races with removal: removing the 'next' cap would
    make the next pointer from list_for_each_entry_safe be invalid,
    and cause a lock up or similar badness.

    Instead, we keep an iterator pointer in the session pointing to
    the current cap. As before, we avoid reordering. For removal,
    if the cap isn't the current cap we are iterating over, we are
    fine. If it is, we clear cap->ci (to mark the cap as pending
    removal) but leave it in the session list. In iterate_caps, we
    can safely finish removal and get the next cap pointer.

    While we're at it, clean up put_cap to not take a cap reservation
    context, as it was never used.

    Signed-off-by: Sage Weil

    Sage Weil
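
The iterator scheme above can be sketched in userspace roughly as follows. This is an illustration under assumptions: the `it_*` names, the doubly linked list, and the demo callback are all hypothetical, and the session locking that the kernel drops around the callback is omitted.

```c
#include <assert.h>
#include <stddef.h>

struct it_cap {
    struct it_cap *prev, *next;
    void *ci;                 /* owning inode; NULL marks "pending removal" */
    int id;
};

struct it_session {
    struct it_cap *head, *tail;
    struct it_cap *iter;      /* cap currently being visited, or NULL */
    int ncaps;
};

typedef void (*it_cb)(struct it_session *s, struct it_cap *cap, void *arg);

static void it_add(struct it_session *s, struct it_cap *cap, void *ci)
{
    cap->ci = ci;
    cap->next = NULL;
    cap->prev = s->tail;
    if (s->tail)
        s->tail->next = cap;
    else
        s->head = cap;
    s->tail = cap;
    s->ncaps++;
}

static void it_unlink(struct it_session *s, struct it_cap *cap)
{
    if (cap->prev)
        cap->prev->next = cap->next;
    else
        s->head = cap->next;
    if (cap->next)
        cap->next->prev = cap->prev;
    else
        s->tail = cap->prev;
    cap->prev = cap->next = NULL;
    s->ncaps--;
}

/* Removal: the cap under iteration is only marked and stays linked. */
static void it_remove(struct it_session *s, struct it_cap *cap)
{
    cap->ci = NULL;
    if (cap != s->iter)
        it_unlink(s, cap);
}

/* Visit every cap; the callback may remove any cap, including the current. */
static int it_iterate(struct it_session *s, it_cb cb, void *arg)
{
    struct it_cap *cap = s->head, *next;
    int visited = 0;

    while (cap) {
        s->iter = cap;
        cb(s, cap, arg);
        visited++;
        next = cap->next;     /* read only after the callback has run */
        s->iter = NULL;
        if (cap->ci == NULL)  /* removal was deferred to us: finish it */
            it_unlink(s, cap);
        cap = next;
    }
    return visited;
}

/* Demo callback: while visiting cap 1, remove it and its successor. */
static void it_demo_cb(struct it_session *s, struct it_cap *cap, void *arg)
{
    struct it_cap *caps = arg;    /* array of caps with ids 1, 2, 3 */

    if (cap->id == 1) {
        it_remove(s, &caps[1]);   /* the 'next' cap: unlinked immediately */
        it_remove(s, cap);        /* the current cap: only marked */
    }
}
```

The key difference from a `list_for_each_entry_safe`-style walk is that no `next` pointer is cached across the callback, so removing the following cap cannot leave the walker holding a dangling pointer.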
     

17 Feb, 2010

2 commits


26 Jan, 2010

1 commit

  • Previously, if the MDS request was interrupted, we would unregister the
    request and ignore any reply. This could cause the caps or other cache
    state to become out of sync. (For instance, aborting dbench and doing
    rm -r on clients would complain about a non-empty directory because the
    client didn't realize its aborted file create request completed.)

    Even if we don't unregister, we still can't process the reply normally because
    we are no longer holding the caller's locks (like the dir i_mutex).

    So, mark aborted operations with r_aborted, and in the reply handler, be
    sure to process all the caps. Do not process the namespace changes,
    though, since we will no longer hold the dir i_mutex. The dentry lease
    state can also be ignored as it's more forgiving.

    Signed-off-by: Sage Weil

    Sage Weil
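
A minimal sketch of the reply-handler policy described above. The `r_aborted` flag is named in the commit message. The `ab_*` types and the counters that stand in for real cache updates are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct ab_request {
    bool r_aborted;           /* caller gave up and dropped its locks */
};

/* Counters standing in for the real cache updates. */
struct ab_state {
    int caps_applied;
    int namespace_applied;
    int leases_applied;
};

static void ab_handle_reply(const struct ab_request *req, struct ab_state *st)
{
    assert(req != NULL && st != NULL);

    st->caps_applied++;       /* always: keeps cap state in sync with the MDS */
    if (req->r_aborted)
        return;               /* dir i_mutex is no longer held by the caller */

    st->namespace_applied++;  /* needs the caller's directory lock */
    st->leases_applied++;     /* safe to skip on abort, as noted above */
}
```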
     

24 Dec, 2009

1 commit


08 Dec, 2009

1 commit


19 Nov, 2009

2 commits

  • When we open a monitor session, we send an initial AUTH message listing
    the auth protocols we support, our entity name, and (possibly) a previously
    assigned global_id. The monitor chooses a protocol and responds with an
    initial message.

    Initially implement AUTH_NONE, a dummy protocol that provides no security,
    but works within the new framework. It generates 'authorizers' that are
    used when connecting to (mds, osd) services that simply state our entity
    name and global_id.

    This is a wire protocol change.

    Signed-off-by: Sage Weil

    Sage Weil
     
  • Unwind initialization if we get ENOMEM during client initialization.

    Signed-off-by: Sage Weil

    Sage Weil
     

13 Nov, 2009

1 commit


11 Nov, 2009

1 commit


10 Nov, 2009

1 commit

  • We were using the cap_gen to track both stale caps (caps that timed out
    due to temporarily losing touch with the mds) and dead caps that did not
    reconnect after an MDS failure. Introduce a recon_gen counter to track
    reconnections to restarted MDSs and kill dead caps based on that instead.

    Rename gen to cap_gen while we're at it to make it more clear which is
    which.

    Signed-off-by: Sage Weil

    Sage Weil
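
A rough sketch of tracking the two conditions with separate counters. It assumes that a cap is stale when its recorded `cap_gen` lags the session's, and dead when its recorded `recon_gen` lags the session's. Those comparison rules and the `gen_*` names are assumptions made for illustration, not details confirmed by the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct gen_session {
    unsigned int cap_gen;     /* bumped when caps time out (stale) */
    unsigned int recon_gen;   /* bumped on reconnect to a restarted MDS */
};

struct gen_cap {
    unsigned int cap_gen;     /* session cap_gen when the cap was issued */
    unsigned int recon_gen;   /* session recon_gen when last reconnected */
};

/* Stale: we temporarily lost touch with the MDS; the cap may be renewed. */
static bool gen_cap_is_stale(const struct gen_session *s,
                             const struct gen_cap *cap)
{
    assert(s != NULL && cap != NULL);
    return cap->cap_gen < s->cap_gen;
}

/* Dead: the cap was not reconnected after an MDS restart; drop it. */
static bool gen_cap_is_dead(const struct gen_session *s,
                            const struct gen_cap *cap)
{
    assert(s != NULL && cap != NULL);
    return cap->recon_gen < s->recon_gen;
}
```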
     

07 Oct, 2009

1 commit

  • The MDS (metadata server) client is responsible for submitting
    requests to the MDS cluster and parsing the response. We decide which
    MDS to submit each request to based on cached information about the
    current partition of the directory hierarchy across the cluster. A
    stateful session is opened with each MDS before we submit requests to
    it, and a mutex is used to control the ordering of messages within
    each session.

    An MDS request may generate two responses. The first indicates the
    operation was a success and returns any result. A second reply is
    sent when the operation commits to disk. Note that locking on the MDS
    ensures that the results of updates are visible only to the updating
    client before the operation commits. Requests are linked to the
    containing directory so that an fsync will wait for them to commit.

    If an MDS fails and/or recovers, we resubmit requests as needed. We
    also reconnect existing capabilities to a recovering MDS to
    reestablish that shared session state. Old dentry leases are
    invalidated.

    Signed-off-by: Sage Weil

    Sage Weil
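
A small sketch of the two-reply model described above, under the assumption that an fsync on a directory waits for every linked request that has not yet seen its commit reply. The `tr_*` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct tr_request {
    bool got_result;   /* first reply seen: the result can be returned */
    bool committed;    /* commit reply seen: the operation is on disk */
};

/* Either reply delivers the result; only the commit reply marks it durable. */
static void tr_handle_reply(struct tr_request *req, bool is_commit)
{
    assert(req != NULL);
    req->got_result = true;
    if (is_commit)
        req->committed = true;
}

/* Requests linked to a directory that an fsync would still have to wait on. */
static size_t tr_fsync_pending(const struct tr_request *reqs, size_t n)
{
    size_t pending = 0;

    for (size_t i = 0; i < n; i++)
        if (!reqs[i].committed)
            pending++;
    return pending;
}
```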