05 Nov, 2020
1 commit
-
Some messages sent by the MDS entail a session sequence number
increment, and the MDS will drop certain types of requests on the floor
when the sequence numbers don't match.In particular, a REQUEST_CLOSE message can cross with one of the
sequence morphing messages from the MDS which can cause the client to
stall, waiting for a response that will never come.Originally, this meant an up to 5s delay before the recurring workqueue
job kicked in and resent the request, but a recent change made it so
that the client would never resend, causing a 60s stall unmounting and
sometimes a blockisting event.Add a new helper for incrementing the session sequence and then testing
to see whether a REQUEST_CLOSE needs to be resent, and move the handling
of CEPH_MDS_SESSION_CLOSING into that function. Change all of the
bare sequence counter increments to use the new helper.Reorganize check_session_state with a switch statement. It should no
longer be called when the session is CLOSING, so throw a warning if it
ever is (but still handle that case sanely).[ idryomov: whitespace, pr_err() call fixup ]
URL: https://tracker.ceph.com/issues/47563
Fixes: fa9967734227 ("ceph: fix potential mdsc use-after-free crash")
Reported-by: Patrick Donnelly
Signed-off-by: Jeff Layton
Reviewed-by: Ilya Dryomov
Reviewed-by: Xiubo Li
Signed-off-by: Ilya Dryomov
12 Oct, 2020
3 commits
-
error_string key in the metadata map of MClientSession message
is intended for humans, but unfortunately became part of the on-wire
format with the introduction of recover_session=clean mode in commit
131d7eb4faa1 ("ceph: auto reconnect after blacklisted").Signed-off-by: Ilya Dryomov
-
Signed-off-by: Ilya Dryomov
-
Since nautilus, MDS tracks dirfrags whose child inodes have caps in open
file table. When MDS recovers, it prefetches all of these dirfrags. This
avoids using backtrace to load inodes. But dirfrags prefetch may load
lots of useless inodes into cache, and make MDS run out of memory.Recent MDS adds an option that disables dirfrags prefetch. When dirfrags
prefetch is disabled. Recovering MDS only prefetches corresponding dir
inodes. Including inodes' parent/d_name in cap reconnect message can
help MDS to load inodes into its cache.Signed-off-by: "Yan, Zheng"
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov
05 Aug, 2020
2 commits
-
Most session messages contain a feature mask, but the MDS will
routinely send a REJECT message with one that is zero-length.Commit 0fa8263367db ("ceph: fix endianness bug when handling MDS
session feature bits") fixed the decoding of the feature mask,
but failed to account for the MDS sending a zero-length feature
mask. This causes REJECT message decoding to fail.Skip trying to decode a feature mask if the word count is zero.
Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/46823
Fixes: 0fa8263367db ("ceph: fix endianness bug when handling MDS session feature bits")
Signed-off-by: Jeff Layton
Reviewed-by: Ilya Dryomov
Tested-by: Patrick Donnelly
Signed-off-by: Ilya Dryomov -
When doing some tests with multiple mds, we were seeing many mds
forwarding requests between them, causing clients to resend.If the request is a modification operation and the mode is set to
USE_AUTH_MDS, then the auth mds should be selected to handle the
request. If auth mds for frag is already set, then it should be returned
directly without further processing.The current logic is wrong because it only returns directly if
mode is USE_AUTH_MDS, but we want to do that for all modes. If we don't,
then when the frag's mds is not equal to cap session's mds, the request
will get sent to the wrong MDS needlessly.Drop the mode check in this condition.
Signed-off-by: Yanhu Cao
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov
03 Aug, 2020
8 commits
-
If the ceph_mdsc_init() fails, it will free the mdsc already.
Reported-by: syzbot+b57f46d8d6ea51960b8c@syzkaller.appspotmail.com
Signed-off-by: Xiubo Li
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov -
Send metric flags to the MDS, indicating what metrics the client
supports. Currently that consists of cap statistics, and read, write and
metadata latencies.URL: https://tracker.ceph.com/issues/43435
Signed-off-by: Xiubo Li
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov -
This will send the caps/read/write/metadata metrics to any available MDS
once per second, which will be the same as the userland client. It will
skip the MDS sessions which don't support the metric collection, as the
MDSs will close socket connections when they get an unknown type
message.We can disable the metric sending via the disable_send_metrics module
parameter.[ jlayton: fix up endianness bug in ceph_mdsc_send_metrics() ]
URL: https://tracker.ceph.com/issues/43215
Signed-off-by: Xiubo Li
Signed-off-by: Jeff Layton
Signed-off-by: Ilya Dryomov -
If the session is already in closed state, we should skip it.
Signed-off-by: Xiubo Li
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov -
Make sure the delayed work stopped before releasing the resources.
cancel_delayed_work_sync() will only guarantee that the work finishes
executing if the work is already in the ->worklist. That means after
the cancel_delayed_work_sync() returns, it will leave the work requeued
if it was rearmed at the end. That can lead to a use after free once the
work struct is freed.Fix it by flushing the delayed work instead of trying to cancel it, and
ensure that the work doesn't rearm if the mdsc is stopping.URL: https://tracker.ceph.com/issues/46293
Signed-off-by: Xiubo Li
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov -
...and let the errnos bubble up to the callers.
Signed-off-by: Xiubo Li
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov -
This will help to reduce using the global mdsc->mutex lock in many
places.Signed-off-by: Xiubo Li
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov -
And remove the unsed mdsc parameter to simplify the code.
Signed-off-by: Xiubo Li
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov
01 Jun, 2020
7 commits
-
It make no sense to check the caps when reconnecting to mds. And
for the async dirop caps, they will be put by its _cb() function,
so when releasing the requests, it will make no sense too.URL: https://tracker.ceph.com/issues/45635
Signed-off-by: Xiubo Li
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov -
send_mds_reconnect takes the s_mutex while the mdsc->mutex is already
held. That inverts the locking order documented in mds_client.h. Drop
the mdsc->mutex, acquire the s_mutex and then reacquire the mdsc->mutex
to prevent a deadlock.URL: https://tracker.ceph.com/issues/45609
Signed-off-by: Xiubo Li
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov -
The mdsc->cap_dirty_lock is not held while walking the list in
ceph_kick_flushing_caps, which is not safe.ceph_early_kick_flushing_caps does something similar, but the
s_mutex is held while it's called and I think that guards against
changes to the list.Ensure we hold the s_mutex when calling ceph_kick_flushing_caps,
and add some clarifying comments.Signed-off-by: Jeff Layton
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov -
This is a per-sb list now, but that makes it difficult to tell when
the cap is the last dirty one associated with the session. Switch
this to be a per-session list, but continue using the
mdsc->cap_dirty_lock to protect the lists.This list is only ever walked in ceph_flush_dirty_caps, so change that
to walk the sessions array and then flush the caps for inodes on each
session's list.If the auth cap ever changes while the inode has dirty caps, then
move the inode to the appropriate session for the new auth_cap. Also,
ensure that we never remove an auth cap while the inode is still on the
s_cap_dirty list.Signed-off-by: Jeff Layton
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov -
Signed-off-by: Jeff Layton
Signed-off-by: Ilya Dryomov -
Add a new "r_ended" field to struct ceph_mds_request and use that to
maintain the average latency of MDS requests.URL: https://tracker.ceph.com/issues/43215
Signed-off-by: Xiubo Li
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov -
For dentry leases, only count the hit/miss info triggered from the vfs
calls. For the cases like request reply handling and ceph_trim_dentries,
ignore them.For now, these are only viewable using debugfs. Future patches will
allow the client to send the stats to the MDS.The output looks like:
item total miss hit
-------------------------------------------------
d_lease 11 7 141URL: https://tracker.ceph.com/issues/43215
Signed-off-by: Xiubo Li
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov
05 May, 2020
1 commit
-
Eduard reported a problem mounting cephfs on s390 arch. The feature
mask sent by the MDS is little-endian, so we need to convert it
before storing and testing against it.Cc: stable@vger.kernel.org
Reported-and-Tested-by: Eduard Shishkin
Signed-off-by: Jeff Layton
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov
30 Mar, 2020
12 commits
-
Add i_last_rd and i_last_wr to ceph_inode_info. These fields are
used to track the last time the client acquired read/write caps for
the inode.If there is no read/write on an inode for 'caps_wanted_delay_max'
seconds, __ceph_caps_file_wanted() does not request caps for read/write
even there are open files.Call __ceph_touch_fmode() for dir operations. __ceph_caps_file_wanted()
calculates dir's wanted caps according to last dir read/modification. If
there is recent dir read, dir inode wants CEPH_CAP_ANY_SHARED caps. If
there is recent dir modification, also wants CEPH_CAP_FILE_EXCL.Readdir is a special case. Dir inode wants CEPH_CAP_FILE_EXCL after
readdir, as with that, modifications do not need to release
CEPH_CAP_FILE_SHARED or invalidate all dentry leases issued by readdir.Signed-off-by: "Yan, Zheng"
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov -
Original code only renews caps for inodes with CEPH_I_CAP_DROPPED flag,
which indicates that mds has closed the session and caps were dropped.
Remove this flag in preparation for not requesting caps for idle open
files.Signed-off-by: "Yan, Zheng"
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov -
If a create is done, then typically we'll end up writing to the file
soon afterward. We don't want to wait for the reply before doing that
when doing an async create, so that means we need the layout for the
new file before we've gotten the response from the MDS.All files created in a directory will initially inherit the same layout,
so copy off the requisite info from the first synchronous create in the
directory, and save it in a new i_cached_layout field. Zero out the
layout when we lose Dc caps in the dir.Signed-off-by: Jeff Layton
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov -
Add new request field to hold the delegated inode number. Encode that
into the message when it's set.Signed-off-by: Jeff Layton
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov -
Starting in Octopus, the MDS will hand out caps that allow the client
to do asynchronous file creates under certain conditions. As part of
that, the MDS will delegate ranges of inode numbers to the client.Add the infrastructure to decode these ranges, and stuff them into an
xarray for later consumption by the async creation code.Because the xarray code currently only handles unsigned long indexes,
and those are 32-bits on 32-bit arches, we only enable the decoding when
running on a 64-bit arch.Signed-off-by: Jeff Layton
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov -
Track and correctly handle directory caps for asynchronous operations.
Add aliases for Frc caps that we now designate at Dcu caps (when dealing
with directories).Unlike file caps, we don't reclaim these when the session goes away, and
instead preemptively release them. In-flight async dirops are instead
handled during reconnect phase. The client needs to re-do a synchronous
operation in order to re-get directory caps.Signed-off-by: Jeff Layton
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov -
When we issue an async create, we must ensure that any later on-the-wire
requests involving it wait for the create reply.Expand i_ceph_flags to be an unsigned long, and add a new bit that
MDS requests can wait on. If the bit is set in the inode when sending
caps, then don't send it and just return that it has been delayed.Signed-off-by: Jeff Layton
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov -
...and ensure that such requests are never queued. The MDS has need to
know that a request is asynchronous so add flags and proper
infrastructure for that.Also, delegated inode numbers and directory caps are associated with the
session, so ensure that async requests are always transmitted on the
first attempt and are never queued to wait for session reestablishment.If it does end up looking like we'll need to queue the request, then
have it return -EJUKEBOX so the caller can reattempt with a synchronous
request.Signed-off-by: Jeff Layton
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov -
req->r_timeout is only used during mounting, so this error will
be more accurate.URL: https://tracker.ceph.com/issues/44215
Signed-off-by: Xiubo Li
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov -
On my machine (x86_64) this struct is 952 bytes, which gets rounded up
to 1024 by kmalloc. Move this to a dedicated slabcache, so we can
allocate them without the extra 72 bytes of overhead per.Signed-off-by: Jeff Layton
Reviewed-by: Ilya Dryomov
Signed-off-by: Ilya Dryomov -
These bits will have new meaning for directory inodes.
Signed-off-by: "Yan, Zheng"
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov -
When the unsafe reply to a request comes in, the request is put on the
r_unsafe_dir inode's list. In future patches, we're going to need to
wait on requests that may not have gotten an unsafe reply yet.Change __register_request to put the entry on the dir inode's list when
the pointer is set in the request, and don't check the
CEPH_MDS_R_GOT_UNSAFE flag when unregistering it.The only place that uses this list today is fsync codepath, and with
the coming changes, we'll want to wait on all operations whether it has
gotten an unsafe reply or not.Signed-off-by: Jeff Layton
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov
06 Feb, 2020
1 commit
-
Pull ceph fixes from Ilya Dryomov:
- a set of patches that fixes various corner cases in mount and umount
code (Xiubo Li). This has to do with choosing an MDS, distinguishing
between laggy and down MDSes and parsing the server path.- inode initialization fixes (Jeff Layton). The one included here
mostly concerns things like open_by_handle() and there is another one
that will come through Al.- copy_file_range() now uses the new copy-from2 op (Luis Henriques).
The existing copy-from op turned out to be infeasible for generic
filesystem use; we disable the copy offload if OSDs don't support
copy-from2.- a patch to link "rbd" and "block" devices together in sysfs (Hannes
Reinecke)... and a smattering of cleanups from Xiubo, Jeff and Chengguang.
* tag 'ceph-for-5.6-rc1' of https://github.com/ceph/ceph-client: (25 commits)
rbd: set the 'device' link in sysfs
ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c
ceph: print name of xattr in __ceph_{get,set}xattr() douts
ceph: print r_direct_hash in hex in __choose_mds() dout
ceph: use copy-from2 op in copy_file_range
ceph: close holes in structs ceph_mds_session and ceph_mds_request
rbd: work around -Wuninitialized warning
ceph: allocate the correct amount of extra bytes for the session features
ceph: rename get_session and switch to use ceph_get_mds_session
ceph: remove the extra slashes in the server path
ceph: add possible_max_rank and make the code more readable
ceph: print dentry offset in hex and fix xattr_version type
ceph: only touch the caps which have the subset mask requested
ceph: don't clear I_NEW until inode metadata is fully populated
ceph: retry the same mds later after the new session is opened
ceph: check availability of mds cluster on mount after wait timeout
ceph: keep the session state until it is released
ceph: add __send_request helper
ceph: ensure we have a new cap before continuing in fill_inode
ceph: drop unused ttl_from parameter from fill_inode
...
05 Feb, 2020
1 commit
-
Pull vfs timestamp updates from Al Viro:
"More 64bit timestamp work"* 'imm.timestamp' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
kernfs: don't bother with timestamp truncation
fs: Do not overload update_time
fs: Delete timespec64_trunc()
fs: ubifs: Eliminate timespec64_trunc() usage
fs: ceph: Delete timespec64_trunc() usage
fs: cifs: Delete usage of timespec64_trunc
fs: fat: Eliminate timespec64_trunc() usage
utimes: Clamp the timestamps in notify_change()
27 Jan, 2020
4 commits
-
It's hard to read, especially when it is:
ceph: __choose_mds 00000000b7bc9c15 is_hash=1 (-271041095) mode 0
At the same time, switch to __func__ to get rid of the checkpatch
warning.Signed-off-by: Xiubo Li
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov -
The total bytes may potentially be larger than 8.
Signed-off-by: Xiubo Li
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov -
Just in case the session's refcount reach 0 and is releasing, and
if we get the session without checking it, we may encounter kernel
crash.Rename get_session to ceph_get_mds_session and make it global.
Signed-off-by: Xiubo Li
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov -
The m_num_mds here is actually the number for MDSs which are in
up:active status, and it will be duplicated to m_num_active_mds,
so remove it.Add possible_max_rank to the mdsmap struct and this will be
the correctly possible largest rank boundary.Remove the special case for one mds in __mdsmap_get_random_mds(),
because the validate mds rank may not always be 0.Signed-off-by: Xiubo Li
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov