05 Nov, 2020
1 commit
-
Some messages sent by the MDS entail a session sequence number
increment, and the MDS will drop certain types of requests on the floor
when the sequence numbers don't match.In particular, a REQUEST_CLOSE message can cross with one of the
sequence morphing messages from the MDS which can cause the client to
stall, waiting for a response that will never come.Originally, this meant an up to 5s delay before the recurring workqueue
job kicked in and resent the request, but a recent change made it so
that the client would never resend, causing a 60s stall unmounting and
sometimes a blockisting event.Add a new helper for incrementing the session sequence and then testing
to see whether a REQUEST_CLOSE needs to be resent, and move the handling
of CEPH_MDS_SESSION_CLOSING into that function. Change all of the
bare sequence counter increments to use the new helper.Reorganize check_session_state with a switch statement. It should no
longer be called when the session is CLOSING, so throw a warning if it
ever is (but still handle that case sanely).[ idryomov: whitespace, pr_err() call fixup ]
URL: https://tracker.ceph.com/issues/47563
Fixes: fa9967734227 ("ceph: fix potential mdsc use-after-free crash")
Reported-by: Patrick Donnelly
Signed-off-by: Jeff Layton
Reviewed-by: Ilya Dryomov
Reviewed-by: Xiubo Li
Signed-off-by: Ilya Dryomov
12 Oct, 2020
1 commit
-
Signed-off-by: Jeff Layton
Signed-off-by: Ilya Dryomov
24 Aug, 2020
1 commit
-
Tuan and Ulrich mentioned that they were hitting a problem on s390x,
which has a 32-bit ino_t value, even though it's a 64-bit arch (for
historical reasons).I think the current handling of inode numbers in the ceph driver is
wrong. It tries to use 32-bit inode numbers on 32-bit arches, but that's
actually not a problem. 32-bit arches can deal with 64-bit inode numbers
just fine when userland code is compiled with LFS support (the common
case these days).What we really want to do is just use 64-bit numbers everywhere, unless
someone has mounted with the ino32 mount option. In that case, we want
to ensure that we hash the inode number down to something that will fit
in 32 bits before presenting the value to userland.Add new helper functions that do this, and only do the conversion before
presenting these values to userland in getattr and readdir.The inode table hashvalue is changed to just cast the inode number to
unsigned long, as low-order bits are the most likely to vary anyway.While it's not strictly required, we do want to put something in
inode->i_ino. Instead of basing it on BITS_PER_LONG, however, base it on
the size of the ino_t type.NOTE: This is a user-visible change on 32-bit arches:
1/ inode numbers will be seen to have changed between kernel versions.
32-bit arches will see large inode numbers now instead of the hashed
ones they saw before.2/ any really old software not built with LFS support may start failing
stat() calls with -EOVERFLOW on inode numbers >2^32. Nothing much we
can do about these, but hopefully the intersection of people running
such code on ceph will be very small.The workaround for both problems is to mount with "-o ino32".
[ idryomov: changelog tweak ]
URL: https://tracker.ceph.com/issues/46828
Reported-by: Ulrich Weigand
Reported-and-Tested-by: Tuan Hoang1
Signed-off-by: Jeff Layton
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov
03 Aug, 2020
2 commits
-
This will send the caps/read/write/metadata metrics to any available MDS
once per second, which will be the same as the userland client. It will
skip the MDS sessions which don't support the metric collection, as the
MDSs will close socket connections when they get an unknown type
message.We can disable the metric sending via the disable_send_metrics module
parameter.[ jlayton: fix up endianness bug in ceph_mdsc_send_metrics() ]
URL: https://tracker.ceph.com/issues/43215
Signed-off-by: Xiubo Li
Signed-off-by: Jeff Layton
Signed-off-by: Ilya Dryomov -
And remove the unsed mdsc parameter to simplify the code.
Signed-off-by: Xiubo Li
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov
01 Jun, 2020
5 commits
-
It make no sense to check the caps when reconnecting to mds. And
for the async dirop caps, they will be put by its _cb() function,
so when releasing the requests, it will make no sense too.URL: https://tracker.ceph.com/issues/45635
Signed-off-by: Xiubo Li
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov -
The mdsc->cap_dirty_lock is not held while walking the list in
ceph_kick_flushing_caps, which is not safe.ceph_early_kick_flushing_caps does something similar, but the
s_mutex is held while it's called and I think that guards against
changes to the list.Ensure we hold the s_mutex when calling ceph_kick_flushing_caps,
and add some clarifying comments.Signed-off-by: Jeff Layton
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov -
This is a per-sb list now, but that makes it difficult to tell when
the cap is the last dirty one associated with the session. Switch
this to be a per-session list, but continue using the
mdsc->cap_dirty_lock to protect the lists.This list is only ever walked in ceph_flush_dirty_caps, so change that
to walk the sessions array and then flush the caps for inodes on each
session's list.If the auth cap ever changes while the inode has dirty caps, then
move the inode to the appropriate session for the new auth_cap. Also,
ensure that we never remove an auth cap while the inode is still on the
s_cap_dirty list.Signed-off-by: Jeff Layton
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov -
Add a new "r_ended" field to struct ceph_mds_request and use that to
maintain the average latency of MDS requests.URL: https://tracker.ceph.com/issues/43215
Signed-off-by: Xiubo Li
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov -
For dentry leases, only count the hit/miss info triggered from the vfs
calls. For the cases like request reply handling and ceph_trim_dentries,
ignore them.For now, these are only viewable using debugfs. Future patches will
allow the client to send the stats to the MDS.The output looks like:
item total miss hit
-------------------------------------------------
d_lease 11 7 141URL: https://tracker.ceph.com/issues/43215
Signed-off-by: Xiubo Li
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov
14 Apr, 2020
1 commit
-
The new async dirops callback routines can pass ERR_PTR values to
ceph_mdsc_free_path, which could cause an oops. Make ceph_mdsc_free_path
ignore ERR_PTR values. Also, ensure that the pr_warn messages look sane
even if ceph_mdsc_build_path fails.Reported-by: Dan Carpenter
Signed-off-by: Jeff Layton
Reviewed-by: Ilya Dryomov
Signed-off-by: Ilya Dryomov
30 Mar, 2020
6 commits
-
Add new request field to hold the delegated inode number. Encode that
into the message when it's set.Signed-off-by: Jeff Layton
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov -
Starting in Octopus, the MDS will hand out caps that allow the client
to do asynchronous file creates under certain conditions. As part of
that, the MDS will delegate ranges of inode numbers to the client.Add the infrastructure to decode these ranges, and stuff them into an
xarray for later consumption by the async creation code.Because the xarray code currently only handles unsigned long indexes,
and those are 32-bits on 32-bit arches, we only enable the decoding when
running on a 64-bit arch.Signed-off-by: Jeff Layton
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov -
Track and correctly handle directory caps for asynchronous operations.
Add aliases for Frc caps that we now designate at Dcu caps (when dealing
with directories).Unlike file caps, we don't reclaim these when the session goes away, and
instead preemptively release them. In-flight async dirops are instead
handled during reconnect phase. The client needs to re-do a synchronous
operation in order to re-get directory caps.Signed-off-by: Jeff Layton
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov -
When we issue an async create, we must ensure that any later on-the-wire
requests involving it wait for the create reply.Expand i_ceph_flags to be an unsigned long, and add a new bit that
MDS requests can wait on. If the bit is set in the inode when sending
caps, then don't send it and just return that it has been delayed.Signed-off-by: Jeff Layton
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov -
...and ensure that such requests are never queued. The MDS has need to
know that a request is asynchronous so add flags and proper
infrastructure for that.Also, delegated inode numbers and directory caps are associated with the
session, so ensure that async requests are always transmitted on the
first attempt and are never queued to wait for session reestablishment.If it does end up looking like we'll need to queue the request, then
have it return -EJUKEBOX so the caller can reattempt with a synchronous
request.Signed-off-by: Jeff Layton
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov -
This shrinks the struct size by 16 bytes.
Signed-off-by: Jeff Layton
Signed-off-by: Ilya Dryomov
27 Jan, 2020
4 commits
-
Move s_ref up to plug a 4 byte hole, which plugs another.
Move r_kref to shave 8 bytes off per request on x86_64.Signed-off-by: Jeff Layton
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov -
The total bytes may potentially be larger than 8.
Signed-off-by: Xiubo Li
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov -
Just in case the session's refcount reach 0 and is releasing, and
if we get the session without checking it, we may encounter kernel
crash.Rename get_session to ceph_get_mds_session and make it global.
Signed-off-by: Xiubo Li
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov -
When reconnecting the session but if it is denied by the MDS due
to client was in blacklist or something else, kclient will receive
a session close reply, and we will never see the important log:"ceph: mds%d reconnect denied"
And with the confusing log:
"ceph: handle_session mds0 close 0000000085804730 state ??? seq 0"
Let's keep the session state until its memories is released.
Signed-off-by: Xiubo Li
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov
10 Dec, 2019
1 commit
-
Add some visibility of tasks that are waiting for caps to the "caps"
debugfs file. Display the tgid of the waiting task, inode number, and
the caps the task needs and wants.Signed-off-by: Jeff Layton
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov
16 Sep, 2019
2 commits
-
It's only used to keep count of caps being trimmed, but that requires
that we hold the session->s_mutex to prevent multiple trimming
operations from running concurrently.We can achieve the same effect using an integer on the stack, which
allows us to (eventually) not need the s_mutex.Signed-off-by: Jeff Layton
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov -
CEPH_MDS_SESSION_{RESTARTING,RECONNECTING} are for for mds failover,
they are sub-states of CEPH_MDS_SESSION_OPEN. So __close_session()
should send close request for session in these two state.Signed-off-by: "Yan, Zheng"
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov
08 Jul, 2019
4 commits
-
Signed-off-by: Jeff Layton
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov -
Signed-off-by: Jeff Layton
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov -
Signed-off-by: "Yan, Zheng"
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov -
MDS InodeStat v3 wire structures include a trailing snapshot creation
time member. Unmarshall this and retain it for a future vxattr.Signed-off-by: David Disseldorp
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov
08 May, 2019
4 commits
-
Nothing calls ceph_mdsc_submit_request today, but in later patches we'll
need to be able to call this separately.Have the helper return an int so we can check the r_err under the mutex,
and have the caller just check the error code from the submit. Also move
the acquisition of CEPH_CAP_PIN references into the same function.Signed-off-by: Jeff Layton
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov -
Al suggested we get rid of the kmalloc here and just use __getname
and __putname to get a full PATH_MAX pathname buffer.Since we build the path in reverse, we continue to return a pointer
to the beginning of the string and the length, and add a new helper
to free the thing at the end.Suggested-by: Al Viro
Signed-off-by: Jeff Layton
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov -
Signed-off-by: Jeff Layton
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov -
The CephFS kernel client does not enforce quotas set in a directory that
isn't visible from the mount point. For example, given the path
'/dir1/dir2', if quotas are set in 'dir1' and the filesystem is mounted withmount -t ceph ::/dir1/ /mnt
then the client won't be able to access 'dir1' inode, even if 'dir2' belongs
to a quota realm that points to it.This patch fixes this issue by simply doing an MDS LOOKUPINO operation for
unknown inodes. Any inode reference obtained this way will be added to a
list in ceph_mds_client, and will only be released when the filesystem is
umounted.Link: https://tracker.ceph.com/issues/38482
Reported-by: Hendrik Peyerl
Signed-off-by: Luis Henriques
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov
06 Mar, 2019
8 commits
-
If number of caps exceed the limit, ceph_trim_dentires() also trim
dentries with valid leases. Trimming dentry releases references to
associated inode, which may evict inode and release caps.By default, there is no limit for caps count.
Signed-off-by: "Yan, Zheng"
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov -
Previous commit make VFS delete stale dentry when last reference is
dropped. Lease also can become invalid when corresponding dentry has
no reference. This patch make cephfs periodically scan lease list,
delete corresponding dentry if lease is invalid.There are two types of lease, dentry lease and dir lease. dentry lease
has life time and applies to singe dentry. Dentry lease is added to tail
of a list when it's updated, leases at front of the list will expire
first. Dir lease is CEPH_CAP_FILE_SHARED on directory inode, it applies
to all dentries in the directory. Dentries have dir leases are added to
another list. Dentries in the list are periodically checked in a round
robin manner.Signed-off-by: "Yan, Zheng"
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov -
When pending cap releases fill up one message, start a work to send
cap release message. (old way is sending cap releases every 5 seconds)Signed-off-by: "Yan, Zheng"
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov -
Link: http://tracker.ceph.com/issues/37576
Signed-off-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov -
In versioned reply, inodestat, dirstat and lease are encoded with
version, compat_version and struct_len.Based on a patch from Jos Collin .
Link: http://tracker.ceph.com/issues/26936
Signed-off-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov -
ceph_getattr() return zero dev ID for head inodes and set dev ID to
snapid directly for snaphost inodes. This is not good because userspace
utilities may consider device ID of 0 as invalid, snapid may conflict
with other device's ID.This patch introduces "snapids to anonymous bdev IDs" map. we create a
new mapping when we see a snapid for the first time. we trim unused
mapping after it is ilde for 5 minutes.Link: http://tracker.ceph.com/issues/22353
Signed-off-by: "Yan, Zheng"
Acked-by: Jeff Layton
Signed-off-by: Ilya Dryomov -
Signed-off-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov -
Signed-off-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov