05 Nov, 2020

1 commit

  • Some messages sent by the MDS entail a session sequence number
    increment, and the MDS will drop certain types of requests on the floor
    when the sequence numbers don't match.

    In particular, a REQUEST_CLOSE message can cross with one of the
    sequence morphing messages from the MDS which can cause the client to
    stall, waiting for a response that will never come.

    Originally, this meant an up to 5s delay before the recurring workqueue
    job kicked in and resent the request, but a recent change made it so
    that the client would never resend, causing a 60s stall unmounting and
    sometimes a blocklisting event.

    Add a new helper for incrementing the session sequence and then testing
    to see whether a REQUEST_CLOSE needs to be resent, and move the handling
    of CEPH_MDS_SESSION_CLOSING into that function. Change all of the
    bare sequence counter increments to use the new helper.

    Reorganize check_session_state with a switch statement. It should no
    longer be called when the session is CLOSING, so throw a warning if it
    ever is (but still handle that case sanely).

    [ idryomov: whitespace, pr_err() call fixup ]

    URL: https://tracker.ceph.com/issues/47563
    Fixes: fa9967734227 ("ceph: fix potential mdsc use-after-free crash")
    Reported-by: Patrick Donnelly
    Signed-off-by: Jeff Layton
    Reviewed-by: Ilya Dryomov
    Reviewed-by: Xiubo Li
    Signed-off-by: Ilya Dryomov

    Jeff Layton
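
    The helper described above can be sketched in user space as follows.
    This is only an illustration under stated assumptions: the struct, the
    enum, and the names inc_session_seq() and send_request_close() are
    hypothetical stand-ins, not the kernel's actual definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's session states. */
enum session_state { SESSION_OPEN, SESSION_CLOSING, SESSION_CLOSED };

struct session {
    enum session_state state;
    uint64_t seq;            /* session sequence number */
    unsigned close_requests; /* REQUEST_CLOSE messages sent so far */
};

/* Pretend to send REQUEST_CLOSE carrying the current sequence number. */
static void send_request_close(struct session *s)
{
    s->close_requests++;
}

/*
 * Bump the sequence number for a sequence-changing MDS message. If a close
 * is already in flight, the MDS may have dropped it because the sequence
 * numbers no longer matched, so resend it with the new value.
 * Returns true if a resend happened.
 */
static bool inc_session_seq(struct session *s)
{
    s->seq++;
    if (s->state == SESSION_CLOSING) {
        send_request_close(s);
        return true;
    }
    return false;
}
```

    Routing every increment through one function is what keeps the resend
    check from being forgotten at individual call sites.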
     

12 Oct, 2020

1 commit


24 Aug, 2020

1 commit

  • Tuan and Ulrich mentioned that they were hitting a problem on s390x,
    which has a 32-bit ino_t value, even though it's a 64-bit arch (for
    historical reasons).

    I think the current handling of inode numbers in the ceph driver is
    wrong. It tries to use 32-bit inode numbers on 32-bit arches, but that's
    actually not a problem. 32-bit arches can deal with 64-bit inode numbers
    just fine when userland code is compiled with LFS support (the common
    case these days).

    What we really want to do is just use 64-bit numbers everywhere, unless
    someone has mounted with the ino32 mount option. In that case, we want
    to ensure that we hash the inode number down to something that will fit
    in 32 bits before presenting the value to userland.

    Add new helper functions that do this, and only do the conversion before
    presenting these values to userland in getattr and readdir.

    The inode table hashvalue is changed to just cast the inode number to
    unsigned long, as low-order bits are the most likely to vary anyway.

    While it's not strictly required, we do want to put something in
    inode->i_ino. Instead of basing it on BITS_PER_LONG, however, base it on
    the size of the ino_t type.

    NOTE: This is a user-visible change on 32-bit arches:

    1/ inode numbers will be seen to have changed between kernel versions.
    32-bit arches will see large inode numbers now instead of the hashed
    ones they saw before.

    2/ any really old software not built with LFS support may start failing
    stat() calls with -EOVERFLOW on inode numbers >2^32. Nothing much we
    can do about these, but hopefully the intersection of people running
    such code on ceph will be very small.

    The workaround for both problems is to mount with "-o ino32".

    [ idryomov: changelog tweak ]

    URL: https://tracker.ceph.com/issues/46828
    Reported-by: Ulrich Weigand
    Reported-and-Tested-by: Tuan Hoang1
    Signed-off-by: Jeff Layton
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Jeff Layton
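
    A minimal sketch of the "use 64 bits everywhere, hash down only for
    ino32 mounts" idea. The XOR fold and the replacement value for zero are
    assumptions made for illustration, and the function names are
    hypothetical; the commit message does not spell out the hash.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Fold a 64-bit inode number into 32 bits by XOR-ing its two halves.
 * Zero is avoided because userland tends to treat inode 0 as invalid;
 * the replacement value 2 is an arbitrary choice for this sketch.
 */
static uint32_t ino_to_ino32(uint64_t ino)
{
    uint32_t folded = (uint32_t)(ino & 0xffffffffu) ^ (uint32_t)(ino >> 32);

    return folded ? folded : 2;
}

/* Value presented to userland, e.g. from getattr or readdir. */
static uint64_t present_ino(uint64_t ino, bool ino32_mount)
{
    return ino32_mount ? ino_to_ino32(ino) : ino;
}
```

    The conversion happens only at the point where a value leaves for
    userland; everything internal keeps the full 64-bit number.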
     

03 Aug, 2020

2 commits

  • This will send the caps/read/write/metadata metrics to any available MDS
    once per second, which will be the same as the userland client. It will
    skip the MDS sessions which don't support the metric collection, as the
    MDSs will close socket connections when they get an unknown type
    message.

    We can disable the metric sending via the disable_send_metrics module
    parameter.

    [ jlayton: fix up endianness bug in ceph_mdsc_send_metrics() ]

    URL: https://tracker.ceph.com/issues/43215
    Signed-off-by: Xiubo Li
    Signed-off-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Xiubo Li
     
  • And remove the unused mdsc parameter to simplify the code.

    Signed-off-by: Xiubo Li
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Xiubo Li
     

01 Jun, 2020

5 commits

  • It makes no sense to check the caps when reconnecting to the MDS. And
    the async dirop caps will be put by their _cb() function, so checking
    them when releasing the requests makes no sense either.

    URL: https://tracker.ceph.com/issues/45635
    Signed-off-by: Xiubo Li
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Xiubo Li
     
  • The mdsc->cap_dirty_lock is not held while walking the list in
    ceph_kick_flushing_caps, which is not safe.

    ceph_early_kick_flushing_caps does something similar, but the
    s_mutex is held while it's called and I think that guards against
    changes to the list.

    Ensure we hold the s_mutex when calling ceph_kick_flushing_caps,
    and add some clarifying comments.

    Signed-off-by: Jeff Layton
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • This is a per-sb list now, but that makes it difficult to tell when
    the cap is the last dirty one associated with the session. Switch
    this to be a per-session list, but continue using the
    mdsc->cap_dirty_lock to protect the lists.

    This list is only ever walked in ceph_flush_dirty_caps, so change that
    to walk the sessions array and then flush the caps for inodes on each
    session's list.

    If the auth cap ever changes while the inode has dirty caps, then
    move the inode to the appropriate session for the new auth_cap. Also,
    ensure that we never remove an auth cap while the inode is still on the
    s_cap_dirty list.

    Signed-off-by: Jeff Layton
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • Add a new "r_ended" field to struct ceph_mds_request and use that to
    maintain the average latency of MDS requests.

    URL: https://tracker.ceph.com/issues/43215
    Signed-off-by: Xiubo Li
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Xiubo Li
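
    A rough sketch of how an end timestamp can feed a running average.
    Only "r_ended" comes from the commit message; r_started, the metric
    struct, the microsecond unit, and the function names are assumptions
    for illustration.

```c
#include <stdint.h>

struct request {
    uint64_t r_started; /* submit time in microseconds (assumed field) */
    uint64_t r_ended;   /* reply time in microseconds */
};

struct latency_metric {
    uint64_t total; /* number of completed requests */
    uint64_t sum;   /* summed latency of those requests */
};

/* Account for one completed request. */
static void update_latency(struct latency_metric *m, const struct request *req)
{
    m->total++;
    m->sum += req->r_ended - req->r_started;
}

/* Average latency so far, or 0 if nothing has completed yet. */
static uint64_t average_latency(const struct latency_metric *m)
{
    return m->total ? m->sum / m->total : 0;
}
```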
     
  • For dentry leases, only count the hit/miss info triggered from the vfs
    calls. For the cases like request reply handling and ceph_trim_dentries,
    ignore them.

    For now, these are only viewable using debugfs. Future patches will
    allow the client to send the stats to the MDS.

    The output looks like:

    item total miss hit
    -------------------------------------------------
    d_lease 11 7 141

    URL: https://tracker.ceph.com/issues/43215
    Signed-off-by: Xiubo Li
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Xiubo Li
     

14 Apr, 2020

1 commit

  • The new async dirops callback routines can pass ERR_PTR values to
    ceph_mdsc_free_path, which could cause an oops. Make ceph_mdsc_free_path
    ignore ERR_PTR values. Also, ensure that the pr_warn messages look sane
    even if ceph_mdsc_build_path fails.

    Reported-by: Dan Carpenter
    Signed-off-by: Jeff Layton
    Reviewed-by: Ilya Dryomov
    Signed-off-by: Ilya Dryomov

    Jeff Layton
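
    The fix can be illustrated in user space as below. err_ptr() and
    is_err() imitate the kernel's ERR_PTR()/IS_ERR() convention of encoding
    small negative errno values in the top of the pointer range; free_path()
    is a hypothetical stand-in for ceph_mdsc_free_path, not its real body.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static void *err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

/* True if the pointer falls in the range reserved for error codes. */
static int is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/*
 * Free a path buffer, ignoring NULL and error-encoded pointers.
 * Returns 1 if something was freed, 0 otherwise.
 */
static int free_path(char *path)
{
    if (path == NULL || is_err(path))
        return 0;
    free(path);
    return 1;
}
```

    With the check inside the free helper, callbacks can pass along
    whatever the path builder returned without testing it first.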
     

30 Mar, 2020

6 commits

  • Add new request field to hold the delegated inode number. Encode that
    into the message when it's set.

    Signed-off-by: Jeff Layton
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • Starting in Octopus, the MDS will hand out caps that allow the client
    to do asynchronous file creates under certain conditions. As part of
    that, the MDS will delegate ranges of inode numbers to the client.

    Add the infrastructure to decode these ranges, and stuff them into an
    xarray for later consumption by the async creation code.

    Because the xarray code currently only handles unsigned long indexes,
    and those are 32-bits on 32-bit arches, we only enable the decoding when
    running on a 64-bit arch.

    Signed-off-by: Jeff Layton
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • Track and correctly handle directory caps for asynchronous operations.
    Add aliases for Frc caps that we now designate as Dcu caps (when dealing
    with directories).

    Unlike file caps, we don't reclaim these when the session goes away, and
    instead preemptively release them. In-flight async dirops are instead
    handled during reconnect phase. The client needs to re-do a synchronous
    operation in order to re-get directory caps.

    Signed-off-by: Jeff Layton
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • When we issue an async create, we must ensure that any later on-the-wire
    requests involving it wait for the create reply.

    Expand i_ceph_flags to be an unsigned long, and add a new bit that
    MDS requests can wait on. If the bit is set in the inode when sending
    caps, then don't send it and just return that it has been delayed.

    Signed-off-by: Jeff Layton
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • ...and ensure that such requests are never queued. The MDS needs to
    know that a request is asynchronous, so add flags and proper
    infrastructure for that.

    Also, delegated inode numbers and directory caps are associated with the
    session, so ensure that async requests are always transmitted on the
    first attempt and are never queued to wait for session reestablishment.

    If it does end up looking like we'll need to queue the request, then
    have it return -EJUKEBOX so the caller can reattempt with a synchronous
    request.

    Signed-off-by: Jeff Layton
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Jeff Layton
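
    The fallback described above might look like this from the caller's
    side. This is a user-space sketch: EJUKEBOX is a kernel-internal error
    code that userspace headers do not provide, so it is defined locally,
    and the session struct and submit_*() functions are hypothetical.

```c
#include <stdbool.h>

/* Kernel-internal error code, defined here only for this sketch. */
#define EJUKEBOX 528

struct session {
    bool open; /* is the MDS session currently established? */
};

/* Async requests must go out immediately; refuse rather than queue. */
static int submit_async(const struct session *s)
{
    if (!s->open)
        return -EJUKEBOX;
    return 0;
}

/* Sync requests may wait; pretend the wait re-established the session. */
static int submit_sync(struct session *s)
{
    s->open = true;
    return 0;
}

/* Try the async path first and retry synchronously on -EJUKEBOX. */
static int do_create(struct session *s, bool *went_sync)
{
    int err = submit_async(s);

    *went_sync = false;
    if (err == -EJUKEBOX) {
        *went_sync = true;
        err = submit_sync(s);
    }
    return err;
}
```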
     
  • This shrinks the struct size by 16 bytes.

    Signed-off-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     

27 Jan, 2020

4 commits


10 Dec, 2019

1 commit


16 Sep, 2019

2 commits

  • It's only used to keep count of caps being trimmed, but that requires
    that we hold the session->s_mutex to prevent multiple trimming
    operations from running concurrently.

    We can achieve the same effect using an integer on the stack, which
    allows us to (eventually) not need the s_mutex.

    Signed-off-by: Jeff Layton
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • CEPH_MDS_SESSION_{RESTARTING,RECONNECTING} are for MDS failover;
    they are sub-states of CEPH_MDS_SESSION_OPEN. So __close_session()
    should send a close request for sessions in these two states.

    Signed-off-by: "Yan, Zheng"
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Yan, Zheng
     

08 Jul, 2019

4 commits


08 May, 2019

4 commits

  • Nothing calls ceph_mdsc_submit_request today, but in later patches we'll
    need to be able to call this separately.

    Have the helper return an int so we can check the r_err under the mutex,
    and have the caller just check the error code from the submit. Also move
    the acquisition of CEPH_CAP_PIN references into the same function.

    Signed-off-by: Jeff Layton
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • Al suggested we get rid of the kmalloc here and just use __getname
    and __putname to get a full PATH_MAX pathname buffer.

    Since we build the path in reverse, we continue to return a pointer
    to the beginning of the string and the length, and add a new helper
    to free the thing at the end.

    Suggested-by: Al Viro
    Signed-off-by: Jeff Layton
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Jeff Layton
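
    A user-space sketch of building a path in reverse inside a PATH_MAX
    buffer and freeing it from the returned pointer and length. malloc() and
    free() stand in for __getname and __putname, and build_path() and
    free_path() are hypothetical names.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#ifndef PATH_MAX
#define PATH_MAX 4096
#endif

/*
 * Build "a/b/c" from components listed leaf-first ("c", "b", "a"),
 * filling buf (PATH_MAX bytes) from its end. Returns a pointer into buf
 * at the start of the string and stores its length in *plen, or returns
 * NULL if the path does not fit.
 */
static char *build_path(char *buf, const char *const *leaf_first, size_t n,
                        size_t *plen)
{
    size_t pos = PATH_MAX - 1;

    buf[pos] = '\0';
    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(leaf_first[i]);
        size_t need = len + (i ? 1 : 0); /* separator after the first */

        if (need > pos)
            return NULL;
        if (i)
            buf[--pos] = '/';
        pos -= len;
        memcpy(buf + pos, leaf_first[i], len);
    }
    *plen = PATH_MAX - 1 - pos;
    return buf + pos;
}

/* Recover the start of the buffer from (path, len) and release it. */
static void free_path(char *path, size_t len)
{
    if (path)
        free(path - (PATH_MAX - 1 - len));
}
```

    Because the string always ends at the last byte of the buffer, the
    pointer and the length together are enough to find the buffer start.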
     
  • Signed-off-by: Jeff Layton
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • The CephFS kernel client does not enforce quotas set in a directory that
    isn't visible from the mount point. For example, given the path
    '/dir1/dir2', if quotas are set in 'dir1' and the filesystem is mounted with

    mount -t ceph ::/dir1/ /mnt

    then the client won't be able to access 'dir1' inode, even if 'dir2' belongs
    to a quota realm that points to it.

    This patch fixes this issue by simply doing an MDS LOOKUPINO operation for
    unknown inodes. Any inode reference obtained this way will be added to a
    list in ceph_mds_client, and will only be released when the filesystem is
    umounted.

    Link: https://tracker.ceph.com/issues/38482
    Reported-by: Hendrik Peyerl
    Signed-off-by: Luis Henriques
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Luis Henriques
     

06 Mar, 2019

8 commits

  • If the number of caps exceeds the limit, ceph_trim_dentries() also trims
    dentries with valid leases. Trimming a dentry releases its reference to
    the associated inode, which may evict the inode and release caps.

    By default, there is no limit for caps count.

    Signed-off-by: "Yan, Zheng"
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Yan, Zheng
     
  • The previous commit makes the VFS delete a stale dentry when the last
    reference is dropped. A lease can also become invalid while the
    corresponding dentry has no references. This patch makes cephfs
    periodically scan the lease lists and delete the corresponding dentry
    if its lease is invalid.

    There are two types of lease: dentry lease and dir lease. A dentry
    lease has a lifetime and applies to a single dentry. A dentry lease is
    added to the tail of a list when it's updated, so leases at the front
    of the list expire first. A dir lease is CEPH_CAP_FILE_SHARED on the
    directory inode; it applies to all dentries in the directory. Dentries
    that have dir leases are added to another list. Dentries in that list
    are periodically checked in a round-robin manner.

    Signed-off-by: "Yan, Zheng"
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Yan, Zheng
     
  • When pending cap releases fill up one message, queue a work item to send
    the cap release message. (The old way was to send cap releases every
    5 seconds.)

    Signed-off-by: "Yan, Zheng"
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Yan, Zheng
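
    The "flush as soon as a message is full" policy can be sketched as
    below. The batch size, the struct, and queue_cap_release() are
    hypothetical; the counter bump stands in for queueing the work item.

```c
#include <stdbool.h>

#define CAPS_PER_RELEASE 4 /* hypothetical number of releases per message */

struct session {
    unsigned pending; /* cap releases queued but not yet sent */
    unsigned flushes; /* how many times the flush work was kicked */
};

/*
 * Queue one cap release. When the pending releases fill a whole message,
 * kick the flush immediately instead of waiting for a periodic timer.
 * Returns true if a flush was triggered.
 */
static bool queue_cap_release(struct session *s)
{
    s->pending++;
    if (s->pending < CAPS_PER_RELEASE)
        return false;
    s->pending = 0;
    s->flushes++;
    return true;
}
```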
     
  • Link: http://tracker.ceph.com/issues/37576
    Signed-off-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Yan, Zheng
     
  • In a versioned reply, inodestat, dirstat and lease are encoded with
    version, compat_version and struct_len.

    Based on a patch from Jos Collin.

    Link: http://tracker.ceph.com/issues/26936
    Signed-off-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Yan, Zheng
     
  • ceph_getattr() returns a zero dev ID for head inodes and sets the dev ID
    to the snapid directly for snapshot inodes. This is not good because
    userspace utilities may consider a device ID of 0 invalid, and a snapid
    may conflict with another device's ID.

    This patch introduces a "snapids to anonymous bdev IDs" map. We create a
    new mapping when we see a snapid for the first time. We trim an unused
    mapping after it has been idle for 5 minutes.

    Link: http://tracker.ceph.com/issues/22353
    Signed-off-by: "Yan, Zheng"
    Acked-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Yan, Zheng
     
  • Signed-off-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Yan, Zheng
     
  • Signed-off-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Yan, Zheng