05 Nov, 2020

1 commit

  • Some messages sent by the MDS entail a session sequence number
    increment, and the MDS will drop certain types of requests on the floor
    when the sequence numbers don't match.

    In particular, a REQUEST_CLOSE message can cross with one of the
    sequence morphing messages from the MDS which can cause the client to
    stall, waiting for a response that will never come.

    Originally, this meant a delay of up to 5s before the recurring
    workqueue job kicked in and resent the request, but a recent change
    made it so that the client would never resend, causing a 60s stall
    when unmounting and sometimes a blocklisting event.

    Add a new helper for incrementing the session sequence and then testing
    to see whether a REQUEST_CLOSE needs to be resent, and move the handling
    of CEPH_MDS_SESSION_CLOSING into that function. Change all of the
    bare sequence counter increments to use the new helper.

    Reorganize check_session_state with a switch statement. It should no
    longer be called when the session is CLOSING, so throw a warning if it
    ever is (but still handle that case sanely).

    [ idryomov: whitespace, pr_err() call fixup ]

    URL: https://tracker.ceph.com/issues/47563
    Fixes: fa9967734227 ("ceph: fix potential mdsc use-after-free crash")
    Reported-by: Patrick Donnelly
    Signed-off-by: Jeff Layton
    Reviewed-by: Ilya Dryomov
    Reviewed-by: Xiubo Li
    Signed-off-by: Ilya Dryomov

    Jeff Layton
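
    The helper described in this commit can be pictured with a minimal
    userspace sketch. The struct layout, state names and function names
    below are hypothetical stand-ins, not the kernel's definitions; the
    sketch only shows the idea of bumping the sequence number and resending
    the close request when the session is in the CLOSING state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified session states; names only echo the commit text. */
enum session_state {
    SESSION_OPEN,
    SESSION_CLOSING,
    SESSION_CLOSED,
};

struct session {
    enum session_state state;
    uint64_t seq;         /* session sequence number */
    unsigned close_sent;  /* number of REQUEST_CLOSE messages sent so far */
};

/* Stand-in for sending REQUEST_CLOSE with the current sequence number. */
static void send_request_close(struct session *s)
{
    s->close_sent++;
}

/*
 * Bump the sequence number. If a close is already in flight, resend it so
 * the peer sees a REQUEST_CLOSE that carries the new sequence number.
 * Returns true when a resend happened.
 */
static bool inc_session_seq(struct session *s)
{
    assert(s != NULL);
    s->seq++;
    if (s->state == SESSION_CLOSING) {
        send_request_close(s);
        return true;
    }
    return false;
}
```

    Every place that previously incremented the counter directly would call
    inc_session_seq() instead, so the resend check cannot be forgotten.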
     

12 Oct, 2020

3 commits

  • The error_string key in the metadata map of the MClientSession message
    is intended for humans, but unfortunately became part of the on-wire
    format with the introduction of recover_session=clean mode in commit
    131d7eb4faa1 ("ceph: auto reconnect after blacklisted").

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Since Nautilus, the MDS tracks dirfrags whose child inodes have caps in
    the open file table. When the MDS recovers, it prefetches all of these
    dirfrags, which avoids using backtraces to load inodes. But dirfrag
    prefetch may load lots of useless inodes into cache and make the MDS
    run out of memory.

    Recent MDS versions add an option that disables dirfrag prefetch. When
    dirfrag prefetch is disabled, the recovering MDS only prefetches the
    corresponding dir inodes. Including the inodes' parent/d_name in the
    cap reconnect message can help the MDS load inodes into its cache.

    Signed-off-by: "Yan, Zheng"
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Yan, Zheng
     

05 Aug, 2020

2 commits

  • Most session messages contain a feature mask, but the MDS will
    routinely send a REJECT message with one that is zero-length.

    Commit 0fa8263367db ("ceph: fix endianness bug when handling MDS
    session feature bits") fixed the decoding of the feature mask,
    but failed to account for the MDS sending a zero-length feature
    mask. This causes REJECT message decoding to fail.

    Skip trying to decode a feature mask if the word count is zero.

    Cc: stable@vger.kernel.org
    URL: https://tracker.ceph.com/issues/46823
    Fixes: 0fa8263367db ("ceph: fix endianness bug when handling MDS session feature bits")
    Signed-off-by: Jeff Layton
    Reviewed-by: Ilya Dryomov
    Tested-by: Patrick Donnelly
    Signed-off-by: Ilya Dryomov

    Jeff Layton
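
    The fix amounts to treating a zero-length feature mask as valid. A
    minimal userspace sketch follows, assuming a little-endian 32-bit byte
    length followed by the mask bytes; that layout and all function names
    are illustrative assumptions, not the kernel decoder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read little-endian integers byte by byte, independent of host endianness. */
static uint32_t get_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static uint64_t get_le64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/*
 * Decode a length-prefixed feature mask (assumed layout: le32 length,
 * then that many bytes, of which the first 8 hold the feature bits).
 * Returns 0 on success, -1 on a truncated or undersized buffer.
 * A zero length is valid and yields *features == 0.
 */
static int decode_features(const uint8_t *p, const uint8_t *end,
                           uint64_t *features)
{
    uint32_t len;

    assert(p != NULL && end != NULL && features != NULL && p <= end);
    *features = 0;
    if ((size_t)(end - p) < 4)
        return -1;
    len = get_le32(p);
    p += 4;
    if (len == 0)
        return 0;  /* e.g. a REJECT message: nothing to decode */
    if (len < 8 || (size_t)(end - p) < len)
        return -1;
    *features = get_le64(p);
    return 0;
}
```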
     
  • When doing some tests with multiple mds, we were seeing many mds
    forwarding requests between them, causing clients to resend.

    If the request is a modification operation and the mode is set to
    USE_AUTH_MDS, then the auth mds should be selected to handle the
    request. If auth mds for frag is already set, then it should be returned
    directly without further processing.

    The current logic is wrong because it only returns directly if
    mode is USE_AUTH_MDS, but we want to do that for all modes. If we don't,
    then when the frag's mds is not equal to cap session's mds, the request
    will get sent to the wrong MDS needlessly.

    Drop the mode check in this condition.

    Signed-off-by: Yanhu Cao
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Yanhu Cao
     

03 Aug, 2020

8 commits


01 Jun, 2020

7 commits

  • It makes no sense to check the caps when reconnecting to the MDS. And
    the async dirop caps will be put by their _cb() function, so checking
    them when releasing the requests makes no sense either.

    URL: https://tracker.ceph.com/issues/45635
    Signed-off-by: Xiubo Li
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Xiubo Li
     
  • send_mds_reconnect takes the s_mutex while the mdsc->mutex is already
    held. That inverts the locking order documented in mds_client.h. Drop
    the mdsc->mutex, acquire the s_mutex and then reacquire the mdsc->mutex
    to prevent a deadlock.

    URL: https://tracker.ceph.com/issues/45609
    Signed-off-by: Xiubo Li
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Xiubo Li
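
    The drop-and-reacquire pattern from this commit can be sketched in
    userspace with pthread mutexes. The structs and the function name are
    hypothetical; only the lock ordering (session mutex first, client
    mutex second) is taken from the commit message.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct client {
    pthread_mutex_t mutex;    /* stands in for mdsc->mutex */
    int reconnects;
};

struct session {
    pthread_mutex_t s_mutex;  /* stands in for the session's s_mutex */
};

/*
 * Called with c->mutex held, returns with c->mutex held.
 * Taking s_mutex while already holding c->mutex would invert the
 * documented order, so release the client mutex first, then take both
 * locks in the documented order.
 */
static void reconnect_session(struct client *c, struct session *s)
{
    assert(c != NULL && s != NULL);
    pthread_mutex_unlock(&c->mutex);
    pthread_mutex_lock(&s->s_mutex);
    pthread_mutex_lock(&c->mutex);

    c->reconnects++;          /* work that needs both locks */

    pthread_mutex_unlock(&s->s_mutex);
}
```

    Because the client mutex is dropped for a moment, any state read
    before the call has to be revalidated afterwards.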
     
  • The mdsc->cap_dirty_lock is not held while walking the list in
    ceph_kick_flushing_caps, which is not safe.

    ceph_early_kick_flushing_caps does something similar, but the
    s_mutex is held while it's called and I think that guards against
    changes to the list.

    Ensure we hold the s_mutex when calling ceph_kick_flushing_caps,
    and add some clarifying comments.

    Signed-off-by: Jeff Layton
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • This is a per-sb list now, but that makes it difficult to tell when
    the cap is the last dirty one associated with the session. Switch
    this to be a per-session list, but continue using the
    mdsc->cap_dirty_lock to protect the lists.

    This list is only ever walked in ceph_flush_dirty_caps, so change that
    to walk the sessions array and then flush the caps for inodes on each
    session's list.

    If the auth cap ever changes while the inode has dirty caps, then
    move the inode to the appropriate session for the new auth_cap. Also,
    ensure that we never remove an auth cap while the inode is still on the
    s_cap_dirty list.

    Signed-off-by: Jeff Layton
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • Signed-off-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • Add a new "r_ended" field to struct ceph_mds_request and use that to
    maintain the average latency of MDS requests.

    URL: https://tracker.ceph.com/issues/43215
    Signed-off-by: Xiubo Li
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Xiubo Li
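
    One simple way to maintain an average from start and end timestamps is
    to keep a running count and a running sum. The sketch below assumes
    microsecond timestamps and uses hypothetical struct and function names;
    it is not the kernel's metric code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct mds_metric {
    uint64_t total;           /* completed requests */
    uint64_t latency_sum_us;  /* sum of (ended - started) */
};

/* Record one completed request from its start and end timestamps. */
static void metric_update(struct mds_metric *m, uint64_t started_us,
                          uint64_t ended_us)
{
    assert(m != NULL && ended_us >= started_us);
    m->total++;
    m->latency_sum_us += ended_us - started_us;
}

/* Average latency in microseconds, or 0 before any request completed. */
static uint64_t metric_avg_us(const struct mds_metric *m)
{
    return m->total ? m->latency_sum_us / m->total : 0;
}
```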
     
  • For dentry leases, only count the hit/miss info triggered from the vfs
    calls. For the cases like request reply handling and ceph_trim_dentries,
    ignore them.

    For now, these are only viewable using debugfs. Future patches will
    allow the client to send the stats to the MDS.

    The output looks like:

    item          total           miss            hit
    -------------------------------------------------
    d_lease       11              7               141

    URL: https://tracker.ceph.com/issues/43215
    Signed-off-by: Xiubo Li
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Xiubo Li
     

05 May, 2020

1 commit


30 Mar, 2020

12 commits

  • Add i_last_rd and i_last_wr to ceph_inode_info. These fields are
    used to track the last time the client acquired read/write caps for
    the inode.

    If there is no read/write on an inode for 'caps_wanted_delay_max'
    seconds, __ceph_caps_file_wanted() does not request caps for read/write
    even if there are open files.

    Call __ceph_touch_fmode() for dir operations. __ceph_caps_file_wanted()
    calculates a dir's wanted caps according to the last dir read or
    modification. If there was a recent dir read, the dir inode wants
    CEPH_CAP_ANY_SHARED caps. If there was a recent dir modification, it
    also wants CEPH_CAP_FILE_EXCL.

    Readdir is a special case. Dir inode wants CEPH_CAP_FILE_EXCL after
    readdir, as with that, modifications do not need to release
    CEPH_CAP_FILE_SHARED or invalidate all dentry leases issued by readdir.

    Signed-off-by: "Yan, Zheng"
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Yan, Zheng
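
    The idle check for regular files can be sketched as follows. The bit
    values, struct layout and function names are hypothetical; the sketch
    only illustrates that an open file stops wanting caps once its last
    read or write is at least 'caps_wanted_delay_max' seconds old.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WANT_RD 0x1u  /* hypothetical "wants read caps" bit */
#define WANT_WR 0x2u  /* hypothetical "wants write caps" bit */

struct inode_info {
    uint64_t i_last_rd;   /* time of last read, in seconds */
    uint64_t i_last_wr;   /* time of last write, in seconds */
    unsigned nr_open_rd;  /* files open for reading */
    unsigned nr_open_wr;  /* files open for writing */
};

/* True if the last access is less than delay_max seconds old. */
static bool recently_used(uint64_t last, uint64_t now, uint64_t delay_max)
{
    assert(now >= last);
    return now - last < delay_max;
}

/* Open files only want caps while the inode was used recently. */
static unsigned caps_file_wanted(const struct inode_info *ci, uint64_t now,
                                 uint64_t delay_max)
{
    unsigned wanted = 0;

    assert(ci != NULL);
    if (ci->nr_open_rd && recently_used(ci->i_last_rd, now, delay_max))
        wanted |= WANT_RD;
    if (ci->nr_open_wr && recently_used(ci->i_last_wr, now, delay_max))
        wanted |= WANT_WR;
    return wanted;
}
```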
     
  • Original code only renews caps for inodes with CEPH_I_CAP_DROPPED flag,
    which indicates that mds has closed the session and caps were dropped.
    Remove this flag in preparation for not requesting caps for idle open
    files.

    Signed-off-by: "Yan, Zheng"
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Yan, Zheng
     
  • If a create is done, then typically we'll end up writing to the file
    soon afterward. We don't want to wait for the reply before doing that
    when doing an async create, so that means we need the layout for the
    new file before we've gotten the response from the MDS.

    All files created in a directory will initially inherit the same layout,
    so copy off the requisite info from the first synchronous create in the
    directory, and save it in a new i_cached_layout field. Zero out the
    layout when we lose Dc caps in the dir.

    Signed-off-by: Jeff Layton
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • Add new request field to hold the delegated inode number. Encode that
    into the message when it's set.

    Signed-off-by: Jeff Layton
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • Starting in Octopus, the MDS will hand out caps that allow the client
    to do asynchronous file creates under certain conditions. As part of
    that, the MDS will delegate ranges of inode numbers to the client.

    Add the infrastructure to decode these ranges, and stuff them into an
    xarray for later consumption by the async creation code.

    Because the xarray code currently only handles unsigned long indexes,
    and those are 32-bits on 32-bit arches, we only enable the decoding when
    running on a 64-bit arch.

    Signed-off-by: Jeff Layton
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • Track and correctly handle directory caps for asynchronous operations.
    Add aliases for Frc caps that we now designate as Dcu caps (when dealing
    with directories).

    Unlike file caps, we don't reclaim these when the session goes away, and
    instead preemptively release them. In-flight async dirops are instead
    handled during reconnect phase. The client needs to re-do a synchronous
    operation in order to re-get directory caps.

    Signed-off-by: Jeff Layton
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • When we issue an async create, we must ensure that any later on-the-wire
    requests involving it wait for the create reply.

    Expand i_ceph_flags to be an unsigned long, and add a new bit that
    MDS requests can wait on. If the bit is set in the inode when sending
    caps, then don't send it and just return that it has been delayed.

    Signed-off-by: Jeff Layton
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • ...and ensure that such requests are never queued. The MDS needs to
    know that a request is asynchronous, so add flags and the proper
    infrastructure for that.

    Also, delegated inode numbers and directory caps are associated with the
    session, so ensure that async requests are always transmitted on the
    first attempt and are never queued to wait for session reestablishment.

    If it does end up looking like we'll need to queue the request, then
    have it return -EJUKEBOX so the caller can reattempt with a synchronous
    request.

    Signed-off-by: Jeff Layton
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Jeff Layton
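
    The fallback described here can be sketched in userspace. EJUKEBOX is
    a kernel-internal errno, so it is defined locally; the structs and
    function names are hypothetical and the "queue" is reduced to a
    counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#ifndef EJUKEBOX
#define EJUKEBOX 528  /* kernel-internal value; absent from userspace headers */
#endif

struct session { bool open; };
struct request { bool async; };

/*
 * Send the request if the session is usable. A synchronous request may be
 * queued to wait for the session; an asynchronous one must never be
 * queued, so it fails with -EJUKEBOX instead.
 */
static int submit_request(const struct session *s, const struct request *req,
                          unsigned *queued)
{
    assert(s != NULL && req != NULL && queued != NULL);
    if (s->open)
        return 0;           /* transmitted on the first attempt */
    if (req->async)
        return -EJUKEBOX;   /* caller should retry synchronously */
    (*queued)++;            /* sync request waits for the session */
    return 0;
}

/* Try an async create first and fall back to a synchronous request. */
static int do_create(const struct session *s, unsigned *queued)
{
    struct request req = { .async = true };
    int err = submit_request(s, &req, queued);

    if (err == -EJUKEBOX) {
        req.async = false;
        err = submit_request(s, &req, queued);
    }
    return err;
}
```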
     
  • req->r_timeout is only used during mounting, so this error will
    be more accurate.

    URL: https://tracker.ceph.com/issues/44215
    Signed-off-by: Xiubo Li
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Xiubo Li
     
  • On my machine (x86_64) this struct is 952 bytes, which gets rounded up
    to 1024 by kmalloc. Move this to a dedicated slabcache, so we can
    allocate them without the extra 72 bytes of overhead per allocation.

    Signed-off-by: Jeff Layton
    Reviewed-by: Ilya Dryomov
    Signed-off-by: Ilya Dryomov

    Jeff Layton
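
    The arithmetic behind the 72 bytes can be checked with a small sketch.
    It assumes a general-purpose allocator that rounds sizes up to the
    next power of two, which is a simplification of real kmalloc size
    classes; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest power-of-two bucket (starting at 8 bytes) that fits size. */
static size_t pow2_bucket(size_t size)
{
    size_t bucket = 8;

    assert(size > 0);
    while (bucket < size)
        bucket <<= 1;
    return bucket;
}

/* Bytes lost per allocation when size is rounded up to its bucket. */
static size_t bucket_waste(size_t size)
{
    return pow2_bucket(size) - size;
}
```

    For a 952-byte object this gives a 1024-byte bucket and 72 wasted
    bytes, which a dedicated cache sized for the struct avoids.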
     
  • These bits will have new meaning for directory inodes.

    Signed-off-by: "Yan, Zheng"
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Yan, Zheng
     
  • When the unsafe reply to a request comes in, the request is put on the
    r_unsafe_dir inode's list. In future patches, we're going to need to
    wait on requests that may not have gotten an unsafe reply yet.

    Change __register_request to put the entry on the dir inode's list when
    the pointer is set in the request, and don't check the
    CEPH_MDS_R_GOT_UNSAFE flag when unregistering it.

    The only place that uses this list today is the fsync codepath, and
    with the coming changes, we'll want to wait on all operations whether
    they have gotten an unsafe reply or not.

    Signed-off-by: Jeff Layton
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     

06 Feb, 2020

1 commit

  • Pull ceph fixes from Ilya Dryomov:

    - a set of patches that fixes various corner cases in mount and umount
    code (Xiubo Li). This has to do with choosing an MDS, distinguishing
    between laggy and down MDSes and parsing the server path.

    - inode initialization fixes (Jeff Layton). The one included here
    mostly concerns things like open_by_handle() and there is another one
    that will come through Al.

    - copy_file_range() now uses the new copy-from2 op (Luis Henriques).
    The existing copy-from op turned out to be infeasible for generic
    filesystem use; we disable the copy offload if OSDs don't support
    copy-from2.

    - a patch to link "rbd" and "block" devices together in sysfs (Hannes
    Reinecke)

    ... and a smattering of cleanups from Xiubo, Jeff and Chengguang.

    * tag 'ceph-for-5.6-rc1' of https://github.com/ceph/ceph-client: (25 commits)
    rbd: set the 'device' link in sysfs
    ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c
    ceph: print name of xattr in __ceph_{get,set}xattr() douts
    ceph: print r_direct_hash in hex in __choose_mds() dout
    ceph: use copy-from2 op in copy_file_range
    ceph: close holes in structs ceph_mds_session and ceph_mds_request
    rbd: work around -Wuninitialized warning
    ceph: allocate the correct amount of extra bytes for the session features
    ceph: rename get_session and switch to use ceph_get_mds_session
    ceph: remove the extra slashes in the server path
    ceph: add possible_max_rank and make the code more readable
    ceph: print dentry offset in hex and fix xattr_version type
    ceph: only touch the caps which have the subset mask requested
    ceph: don't clear I_NEW until inode metadata is fully populated
    ceph: retry the same mds later after the new session is opened
    ceph: check availability of mds cluster on mount after wait timeout
    ceph: keep the session state until it is released
    ceph: add __send_request helper
    ceph: ensure we have a new cap before continuing in fill_inode
    ceph: drop unused ttl_from parameter from fill_inode
    ...

    Linus Torvalds
     

05 Feb, 2020

1 commit

  • Pull vfs timestamp updates from Al Viro:
    "More 64bit timestamp work"

    * 'imm.timestamp' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    kernfs: don't bother with timestamp truncation
    fs: Do not overload update_time
    fs: Delete timespec64_trunc()
    fs: ubifs: Eliminate timespec64_trunc() usage
    fs: ceph: Delete timespec64_trunc() usage
    fs: cifs: Delete usage of timespec64_trunc
    fs: fat: Eliminate timespec64_trunc() usage
    utimes: Clamp the timestamps in notify_change()

    Linus Torvalds
     

27 Jan, 2020

4 commits