12 Feb, 2020
3 commits
-
For the old mount API, the module parameter parsing function is
called in ceph_mount(), just after the default POSIX ACL flag is set,
so POSIX ACLs can be enabled or disabled via the mount option.

But for the new mount API, the module parameter parsing function is
called before ceph_get_tree(), so POSIX ACLs will always be enabled.

Fixes: 82995cc6c5ae ("libceph, rbd, ceph: convert to use the new mount API")
Signed-off-by: Xiubo Li
Reviewed-by: Ilya Dryomov
Signed-off-by: Ilya Dryomov
-
syzbot reported that 4fbc0c711b24 ("ceph: remove the extra slashes in
the server path") had caused a regression where an allocation could be
done under a spinlock -- compare_mount_options() is called by sget_fc()
with sb_lock held.

We don't really need the supplied server path, so canonicalize it
in place and compare it directly. To make this work, the leading
slash is kept around and the logic in ceph_real_mount() to skip it
is restored. CEPH_MSG_CLIENT_SESSION now reports the same (i.e.
canonicalized) path, with the leading slash of course.

Fixes: 4fbc0c711b24 ("ceph: remove the extra slashes in the server path")
Reported-by: syzbot+98704a51af8e3d9425a9@syzkaller.appspotmail.com
Signed-off-by: Ilya Dryomov
Reviewed-by: Jeff Layton
-
In O_APPEND & O_DIRECT mode, the data from different writers may
overlap each other, since they take the shared lock.

For example, both Writer1 and Writer2 are in O_APPEND and O_DIRECT
mode:

          Writer1                         Writer2

     shared_lock()                   shared_lock()
     getattr(CAP_SIZE)               getattr(CAP_SIZE)
     iocb->ki_pos = EOF              iocb->ki_pos = EOF
     write(data1)
                                     write(data2)
     shared_unlock()                 shared_unlock()

The data2 will overlap the data1 from the same file offset, the
old EOF.

Switch to exclusive lock instead when O_APPEND is specified.
Signed-off-by: Xiubo Li
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov
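
The race in the commit above can be reproduced without threads. What
follows is a minimal userspace sketch, not the kernel code: struct
sim_file and every function name in it are hypothetical stand-ins. It
replays the two orderings by hand: under a shared lock both writers
sample the old EOF before either one writes, while under an exclusive
lock each writer samples and writes as a single unit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory stand-in for a file; not a kernel structure. */
struct sim_file {
	char data[64];
	size_t size;		/* current EOF */
};

/* Step 1 of an append: sample the current EOF (getattr(CAP_SIZE) above). */
static size_t sample_eof(const struct sim_file *f)
{
	return f->size;
}

/* Step 2: write len bytes at pos and extend EOF if the write goes past it. */
static void write_at(struct sim_file *f, size_t pos, const char *buf, size_t len)
{
	assert(pos + len <= sizeof(f->data));
	memcpy(f->data + pos, buf, len);
	if (pos + len > f->size)
		f->size = pos + len;
}

/* Shared lock: both writers sample EOF before either of them writes. */
static size_t append_interleaved(struct sim_file *f, const char *a, const char *b)
{
	size_t pos_a = sample_eof(f);
	size_t pos_b = sample_eof(f);	/* the same old EOF as pos_a */

	write_at(f, pos_a, a, strlen(a));
	write_at(f, pos_b, b, strlen(b));	/* lands on top of a */
	return f->size;
}

/* Exclusive lock: each writer samples EOF and writes without interleaving. */
static size_t append_serialized(struct sim_file *f, const char *a, const char *b)
{
	write_at(f, sample_eof(f), a, strlen(a));
	write_at(f, sample_eof(f), b, strlen(b));
	return f->size;
}
```

Starting from an empty file, append_interleaved() with two 4-byte
buffers leaves a 4-byte file holding only the second buffer, whereas
append_serialized() leaves both buffers back to back.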
09 Feb, 2020
1 commit
-
Pull vfs file system parameter updates from Al Viro:
"Saner fs_parser.c guts and data structures. The system-wide registry
of syntax types (string/enum/int32/oct32/.../etc.) is gone and so is
the horror switch() in fs_parse() that would have to grow another case
every time something got added to that system-wide registry.

New syntax types can be added by filesystems easily now, and their
namespace is that of functions - not of system-wide enum members. IOW,
they can be shared or kept private and if some turn out to be widely
useful, we can make them common library helpers, etc., without having
to do anything whatsoever to fs_parse() itself.

And we already get that kind of requests - the thing that finally
pushed me into doing that was "oh, and let's add one for timeouts -
things like 15s or 2h". If some filesystem really wants that, let them
do it. Without somebody having to play gatekeeper for the variants
blessed by direct support in fs_parse(), TYVM.

Quite a bit of boilerplate is gone. And IMO the data structures make a
lot more sense now. -200LoC, while we are at it"

* 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (25 commits)
tmpfs: switch to use of invalfc()
cgroup1: switch to use of errorfc() et.al.
procfs: switch to use of invalfc()
hugetlbfs: switch to use of invalfc()
cramfs: switch to use of errofc() et.al.
gfs2: switch to use of errorfc() et.al.
fuse: switch to use errorfc() et.al.
ceph: use errorfc() and friends instead of spelling the prefix out
prefix-handling analogues of errorf() and friends
turn fs_param_is_... into functions
fs_parse: handle optional arguments sanely
fs_parse: fold fs_parameter_desc/fs_parameter_spec
fs_parser: remove fs_parameter_description name field
add prefix to fs_context->log
ceph_parse_param(), ceph_parse_mon_ips(): switch to passing fc_log
new primitive: __fs_parse()
switch rbd and libceph to p_log-based primitives
struct p_log, variants of warnf() et.al. taking that one instead
teach logfc() to handle prefices, give it saner calling conventions
get rid of cg_invalf()
...
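
To illustrate the "syntax types are functions" idea from the pull
message above, here is a rough userspace sketch of a filesystem-private
timeout type that accepts values like 15s or 2h. All names here
(demo_result, demo_param_type, demo_param_is_timeout, demo_type_spec,
demo_parse_param) are hypothetical and only loosely modelled on the
description; the kernel's own structures and signatures differ, and the
upper bound on the number is an arbitrary choice to keep the
multiplication from overflowing.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result holder for the sketch. */
struct demo_result {
	long seconds;
};

/* A syntax type is just a function: 0 on success, negative errno on failure. */
typedef int demo_param_type(const char *value, struct demo_result *res);

/* Private syntax type: digits followed by one of s, m or h ("15s", "2h"). */
static int demo_param_is_timeout(const char *value, struct demo_result *res)
{
	char *end;
	long n;

	if (!value || *value < '0' || *value > '9')
		return -EINVAL;
	errno = 0;
	n = strtol(value, &end, 10);
	/* require exactly one unit character and a sane magnitude */
	if (errno || n > 100000 || end[0] == '\0' || end[1] != '\0')
		return -EINVAL;

	switch (end[0]) {
	case 's': res->seconds = n;        break;
	case 'm': res->seconds = n * 60;   break;
	case 'h': res->seconds = n * 3600; break;
	default:  return -EINVAL;
	}
	return 0;
}

/* Each spec names its type by function pointer, not by a global enum. */
struct demo_type_spec {
	const char *name;
	demo_param_type *type;
};

static const struct demo_type_spec demo_type_specs[] = {
	{ "timeout", demo_param_is_timeout },
};

/* The generic parser never needs to know which syntax types exist. */
static int demo_parse_param(const char *name, const char *value,
			    struct demo_result *res)
{
	size_t i;

	for (i = 0; i < sizeof(demo_type_specs) / sizeof(demo_type_specs[0]); i++)
		if (strcmp(demo_type_specs[i].name, name) == 0)
			return demo_type_specs[i].type(value, res);
	return -ENOENT;
}
```

Adding another syntax type in this sketch means writing one more
function and one more table entry; demo_parse_param() itself stays
untouched, which is the property the pull message is after.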
08 Feb, 2020
6 commits
-
Signed-off-by: Al Viro
-
Don't bother with "mixed" options that would allow both the
form with and without argument (i.e. both -o foo and -o foo=bar).
Rather than trying to shove both into a single fs_parameter_spec,
allow having with-argument and no-argument specs with the same
name and teach fs_parse to handle that.

There are very few options of that sort, and they are actually
easier to handle that way - callers end up with less postprocessing.

Signed-off-by: Al Viro
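
A small sketch of the idea in the commit above, under the assumption
that a spec table may simply list the same name twice: once as a flag
and once as an option taking an argument. The types and names
(demo_opt_spec, demo_lookup_opt, the DEMO_OPT_* values) are hypothetical
and not the kernel's fs_parameter_spec.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum {
	DEMO_OPT_FOO_FLAG  = 1,	/* -o foo     */
	DEMO_OPT_FOO_VALUE = 2,	/* -o foo=bar */
};

struct demo_opt_spec {
	const char *name;
	bool takes_arg;
	int opt;
};

/* Two specs share one name; they differ only in whether they take an argument. */
static const struct demo_opt_spec demo_opt_specs[] = {
	{ "foo", false, DEMO_OPT_FOO_FLAG },
	{ "foo", true,  DEMO_OPT_FOO_VALUE },
};

/* value == NULL means the option was given without "=something". */
static int demo_lookup_opt(const char *name, const char *value)
{
	bool has_arg = (value != NULL);
	size_t i;

	for (i = 0; i < sizeof(demo_opt_specs) / sizeof(demo_opt_specs[0]); i++) {
		const struct demo_opt_spec *s = &demo_opt_specs[i];

		if (s->takes_arg == has_arg && strcmp(s->name, name) == 0)
			return s->opt;
	}
	return -1;	/* no spec of that name and form */
}
```

The caller gets a distinct option number for each form, so it does not
have to inspect the parameter again to find out which one it was given.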
-
The former contains nothing but a pointer to an array of the latter...
Signed-off-by: Al Viro
-
Unused now.
Signed-off-by: Eric Sandeen
Acked-by: David Howells
Signed-off-by: Al Viro
-
... turning it into struct p_log embedded into fs_context. Initialize
the prefix with fs_type->name, turning fs_parse() into a trivial
inline wrapper for __fs_parse().

This makes fs_parameter_description->name completely unused.
Signed-off-by: Al Viro
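
The arrangement described in the commit above can be pictured roughly
as follows. This is a userspace sketch with hypothetical names
(demo_log, demo_context, demo_parse_with_log, demo_fs_parse); it only
mirrors the shape of the change, with the log embedded in the context,
its prefix taken from the filesystem type name, and the context-level
parser reduced to a one-line wrapper.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical miniature of a prefixed log. */
struct demo_log {
	const char *prefix;
	char last[128];		/* last formatted message, kept for inspection */
};

struct demo_fs_type {
	const char *name;
};

struct demo_context {
	const struct demo_fs_type *fs_type;
	struct demo_log log;	/* embedded, so it always exists */
};

/* The prefix comes from the filesystem type, so no separate name is needed. */
static void demo_context_init(struct demo_context *fc,
			      const struct demo_fs_type *type)
{
	fc->fs_type = type;
	fc->log.prefix = type->name;
	fc->log.last[0] = '\0';
}

/* Core primitive: it needs only the log, not the whole context. */
static int demo_parse_with_log(struct demo_log *log, const char *key)
{
	if (strcmp(key, "known") == 0)
		return 0;
	snprintf(log->last, sizeof(log->last), "%s: Unknown parameter '%s'",
		 log->prefix, key);
	return -1;
}

/* Context-level entry point: a trivial inline wrapper. */
static inline int demo_fs_parse(struct demo_context *fc, const char *key)
{
	return demo_parse_with_log(&fc->log, key);
}
```

Callers that have no full context, only a log, can still use the core
primitive directly, which is the point of splitting the two.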
-
... and now errorf() et.al. are never called with NULL fs_context,
so we can get rid of conditional in those.

Signed-off-by: Al Viro
07 Feb, 2020
2 commits
-
no real difference now
Signed-off-by: Al Viro
-
Don't do a single array; attach them to fsparam_enum() entry
instead. And don't bother trying to embed the names into those -
it actually loses memory, with no real speedup worth mentioning.

Simplifies validation as well.
Signed-off-by: Al Viro
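
As a sketch of what "attach them to the fsparam_enum() entry" could
look like, the following userspace example gives every enum parameter
its own table of allowed values. The structure and function names
(demo_constant, demo_enum_param, demo_lookup_enum) and the option names
"mode" and "policy" are made up for illustration and are not the
kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct demo_constant {
	const char *name;
	int value;
};

/* Each enum parameter points at its own table, terminated by a NULL name. */
struct demo_enum_param {
	const char *name;
	const struct demo_constant *values;
};

static const struct demo_constant demo_mode_values[] = {
	{ "auto", 0 }, { "fast", 1 }, { "safe", 2 }, { NULL, 0 },
};

/* The same value name may mean something else for another parameter. */
static const struct demo_constant demo_policy_values[] = {
	{ "auto", 7 }, { "strict", 8 }, { NULL, 0 },
};

static const struct demo_enum_param demo_enum_params[] = {
	{ "mode",   demo_mode_values },
	{ "policy", demo_policy_values },
};

/* Returns the mapped value, or -1 for an unknown parameter or value. */
static int demo_lookup_enum(const char *param, const char *value)
{
	size_t i;

	for (i = 0; i < sizeof(demo_enum_params) / sizeof(demo_enum_params[0]); i++) {
		const struct demo_constant *c;

		if (strcmp(demo_enum_params[i].name, param) != 0)
			continue;
		for (c = demo_enum_params[i].values; c->name; c++)
			if (strcmp(c->name, value) == 0)
				return c->value;
		return -1;	/* known parameter, value not in its table */
	}
	return -1;
}
```

Validation becomes local in this layout: a value is checked only
against the table of the parameter it was given for.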
06 Feb, 2020
1 commit
-
Pull ceph fixes from Ilya Dryomov:
 - a set of patches that fixes various corner cases in mount and umount
   code (Xiubo Li). This has to do with choosing an MDS, distinguishing
   between laggy and down MDSes and parsing the server path.

 - inode initialization fixes (Jeff Layton). The one included here
   mostly concerns things like open_by_handle() and there is another one
   that will come through Al.

 - copy_file_range() now uses the new copy-from2 op (Luis Henriques).
   The existing copy-from op turned out to be infeasible for generic
   filesystem use; we disable the copy offload if OSDs don't support
   copy-from2.

 - a patch to link "rbd" and "block" devices together in sysfs (Hannes
   Reinecke)

... and a smattering of cleanups from Xiubo, Jeff and Chengguang.
* tag 'ceph-for-5.6-rc1' of https://github.com/ceph/ceph-client: (25 commits)
rbd: set the 'device' link in sysfs
ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c
ceph: print name of xattr in __ceph_{get,set}xattr() douts
ceph: print r_direct_hash in hex in __choose_mds() dout
ceph: use copy-from2 op in copy_file_range
ceph: close holes in structs ceph_mds_session and ceph_mds_request
rbd: work around -Wuninitialized warning
ceph: allocate the correct amount of extra bytes for the session features
ceph: rename get_session and switch to use ceph_get_mds_session
ceph: remove the extra slashes in the server path
ceph: add possible_max_rank and make the code more readable
ceph: print dentry offset in hex and fix xattr_version type
ceph: only touch the caps which have the subset mask requested
ceph: don't clear I_NEW until inode metadata is fully populated
ceph: retry the same mds later after the new session is opened
ceph: check availability of mds cluster on mount after wait timeout
ceph: keep the session state until it is released
ceph: add __send_request helper
ceph: ensure we have a new cap before continuing in fill_inode
ceph: drop unused ttl_from parameter from fill_inode
...
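
One of the patches listed above, "ceph: remove the extra slashes in the
server path", makes the client ignore repeated slashes in the server
path. A minimal userspace sketch of that kind of cleanup is shown
below. The function name demo_canonicalize_path is hypothetical, and
this is not the kernel's implementation; it only shows that the
rewrite can be done in place, without allocating, because the result is
never longer than the input.

```c
#include <assert.h>
#include <string.h>

/*
 * Collapse runs of '/' into a single '/' and drop a trailing '/',
 * rewriting the buffer in place. A lone "/" is preserved.
 */
static void demo_canonicalize_path(char *path)
{
	char *src = path;
	char *dst = path;

	while (*src) {
		if (*src == '/') {
			/* advance to the last slash of the run */
			while (src[1] == '/')
				src++;
		}
		*dst++ = *src++;
	}

	/* drop the trailing slash unless the whole path is "/" */
	if (dst - path > 1 && dst[-1] == '/')
		dst--;
	*dst = '\0';
}
```

With this sketch, "//mydir1///mydir//" and "/mydir1/mydir/" both become
"/mydir1/mydir", so a plain strcmp() is enough to tell whether two
mounts name the same server path.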
05 Feb, 2020
1 commit
-
Pull vfs timestamp updates from Al Viro:
"More 64bit timestamp work"

* 'imm.timestamp' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
kernfs: don't bother with timestamp truncation
fs: Do not overload update_time
fs: Delete timespec64_trunc()
fs: ubifs: Eliminate timespec64_trunc() usage
fs: ceph: Delete timespec64_trunc() usage
fs: cifs: Delete usage of timespec64_trunc
fs: fat: Eliminate timespec64_trunc() usage
utimes: Clamp the timestamps in notify_change()
27 Jan, 2020
23 commits
-
All of these functions are only called from CephFS, so move them into
ceph.ko, and drop the exports.

Signed-off-by: Jeff Layton
Reviewed-by: Ilya Dryomov
Signed-off-by: Ilya Dryomov
-
Signed-off-by: Jeff Layton
Signed-off-by: Ilya Dryomov
-
It's hard to read, especially when it is:

  ceph: __choose_mds 00000000b7bc9c15 is_hash=1 (-271041095) mode 0

At the same time, switch to __func__ to get rid of the checkpatch
warning.

Signed-off-by: Xiubo Li
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov
-
Instead of using the copy-from operation, switch copy_file_range to the
new copy-from2 operation, which allows sending the truncate_seq and
truncate_size parameters.

If an OSD does not support the copy-from2 operation, it will return
-EOPNOTSUPP. In that case, the kernel client will stop trying to do
remote object copies for this fs client and will always use the generic
VFS copy_file_range.

Signed-off-by: Luis Henriques
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov
-
Move s_ref up to plug a 4 byte hole, which plugs another.
Move r_kref to shave 8 bytes off per request on x86_64.

Signed-off-by: Jeff Layton
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov
-
The total bytes may potentially be larger than 8.
Signed-off-by: Xiubo Li
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov
-
The session's refcount may already have reached 0 with the release in
progress; if we get the session without checking for that, we may
hit a kernel crash.

Rename get_session to ceph_get_mds_session and make it global.

Signed-off-by: Xiubo Li
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov
-
It's possible to pass the mount helper a server path that has more
than one contiguous slash character. For example:

  $ mount -t ceph 192.168.195.165:40176:/// /mnt/cephfs/

On the MDS server side, the extra slashes of the server path will be
treated as snap dir, and then we can get the following debug logs:

  ceph: mount opening path //
  ceph: open_root_inode opening '//'
  ceph: fill_trace 0000000059b8a3bc is_dentry 0 is_target 1
  ceph: alloc_inode 00000000dc4ca00b
  ceph: get_inode created new inode 00000000dc4ca00b 1.ffffffffffffffff ino 1
  ceph: get_inode on 1=1.ffffffffffffffff got 00000000dc4ca00b

And then when creating any new file or directory under the mount
point, we can hit the following BUG_ON in ceph_fill_trace():

  BUG_ON(ceph_snap(dir) != dvino.snap);

Have the client ignore the extra slashes in the server path when
mounting. This will also canonicalize the path, so that identical mounts
can be consolidated.

1) "//mydir1///mydir//"
2) "/mydir1/mydir"
3) "/mydir1/mydir/"

Regardless of the internal treatment of these paths, the kernel still
stores the original string including the leading '/' for presentation
to userland.

URL: https://tracker.ceph.com/issues/42771
Signed-off-by: Xiubo Li
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov
-
The m_num_mds here is actually the number of MDSs which are in
up:active status, which duplicates m_num_active_mds, so remove it.

Add possible_max_rank to the mdsmap struct; this will be the correct
boundary for the largest possible rank.

Remove the special case for one mds in __mdsmap_get_random_mds(),
because the valid mds rank may not always be 0.

Signed-off-by: Xiubo Li
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov
-
In the debug logs, di->offset and ctx->pos are printed in hex in
some places and in decimal in others, which makes them a little
hard to read.

The xattr version is of u64 type, so using a shorter type may
truncate it.

Signed-off-by: Xiubo Li
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov
-
We shouldn't touch the caps that do not have any subset of the
requested mask.

Signed-off-by: Xiubo Li
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov
-
Currently, we could have an open-by-handle (or NFS server) call
into the filesystem and start working with an inode before it's
properly filled out.

Don't clear I_NEW until we have filled out the inode, and discard it
properly if that fails. Note that we occasionally take an extra
reference to the inode to ensure that we don't put the last reference in
discard_new_inode, but rather leave it for ceph_async_iput.

Signed-off-by: Jeff Layton
Signed-off-by: Ilya Dryomov
-
If max_mds > 1 and a request is submitted that chooses a random mds
rank, and the corresponding session is not opened yet, the request will
wait until the session has been opened and then be resent.

Every time the request goes through __do_request, it will release the
req->session first and choose a random one again, which may be a
completely different rank than the one it just waited on.

In the worst case, it will open all the mds sessions one by one
before the request can be successfully sent out.

Signed-off-by: Xiubo Li
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov
-
If all the MDS daemons are down for some reason, then the first mount
attempt will fail with EIO after the mount request times out. A mount
attempt will also fail with EIO if all of the MDSs are laggy.

This patch changes the code to return -EHOSTUNREACH in these situations
and adds a pr_info error message to help the admin determine the cause.

URL: https://tracker.ceph.com/issues/4386
Signed-off-by: Xiubo Li
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov
-
When reconnecting the session, if the MDS denies it because the client
is blacklisted or for some other reason, kclient will receive
a session close reply, and we will never see the important log:

  "ceph: mds%d reconnect denied"

Instead, we get the confusing log:

  "ceph: handle_session mds0 close 0000000085804730 state ??? seq 0"

Let's keep the session state until its memory is released.

Signed-off-by: Xiubo Li
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov
-
Signed-off-by: Xiubo Li
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov
-
If the caller passes in a NULL cap_reservation, and we can't allocate
one, then ensure that we fail gracefully.

Signed-off-by: Jeff Layton
Signed-off-by: Ilya Dryomov
-
Signed-off-by: Jeff Layton
Signed-off-by: Ilya Dryomov
-
During umount, if there are no unsafe requests in the mdsc but some
requests are still in flight and have not received a reply yet, and
the remaining requests are all safe ones, then even after all of them
have been unregistered from the mdsc, the umount still has to wait for
mount_timeout seconds.

Signed-off-by: Xiubo Li
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov
-
Even if an MDS is in up:active state, it may still be laggy. Skip
the laggy MDSs here.

Signed-off-by: Xiubo Li
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov
-
If max_mds > 1 in the MDS cluster, there is no standby MDS and all
the max_mds MDSs are in up:active state, then when one of the
up:active MDSs dies, m->m_num_laggy in kclient will be 1.
The mount will then fail without considering the other healthy MDSs.

There may be some MDSs still "in" the cluster but not in up:active
state; we will ignore them. Only when all the up:active MDSs in
the cluster are laggy will the cluster be treated as unavailable.

When max_mds is decreased, the cluster will not stop the extra
up:active MDSs immediately, so there will be some latency. During
that time the number of up:active MDSs will be larger than max_mds,
so later the m_info memory will always be reallocated.

Pick out the up:active MDSs as m_num_mds here and allocate the
needed memory once.

Signed-off-by: Xiubo Li
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov
-
ceph_pagelist_encode_string() will not fail in the reserved case,
and we do not check the err code here anyway, so remove the
unnecessary assignment.

Signed-off-by: Chengguang Xu
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov
-
We print session's refcount in debug message inside
ceph_put_mds_session() and get_session(), so we don't have to
print it in con_get()/__ceph_lookup_mds_session()/con_put().

Signed-off-by: Chengguang Xu
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov
22 Jan, 2020
1 commit
-
Currently, we just assume that it will stick around by virtue of the
submitter's reference, but later patches will allow the syscall to
return early and we can't rely on that reference at that point.

While I'm not aware of any reports of it, Xiubo pointed out that this
may fix a use-after-free. If the wait for a reply times out or is
canceled via signal, and then the reply comes in after the syscall
returns, the client can end up trying to access r_parent without a
reference.

Take an extra reference to the inode when setting r_parent and release
it when releasing the request.

Cc: stable@vger.kernel.org
Signed-off-by: Jeff Layton
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov
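
The ownership rule in the commit above can be sketched in userspace as
follows. The structures and helpers (demo_inode, demo_request,
demo_request_set_parent, demo_request_release) are hypothetical, and
the plain int counter stands in for the kernel's real, atomic inode
reference counting; the sketch only shows who holds which reference
and when.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the counter is a plain int, not atomic. */
struct demo_inode {
	int refcount;
};

struct demo_request {
	struct demo_inode *parent;	/* plays the role of r_parent */
};

static void demo_inode_get(struct demo_inode *inode)
{
	inode->refcount++;
}

static void demo_inode_put(struct demo_inode *inode)
{
	assert(inode->refcount > 0);
	inode->refcount--;
}

/* The request takes its own reference instead of borrowing the submitter's. */
static void demo_request_set_parent(struct demo_request *req,
				    struct demo_inode *dir)
{
	demo_inode_get(dir);
	req->parent = dir;
}

/* The request's reference is dropped only when the request itself goes away. */
static void demo_request_release(struct demo_request *req)
{
	if (req->parent) {
		demo_inode_put(req->parent);
		req->parent = NULL;
	}
}
```

If the submitter drops its reference early, for example after a timeout
or a signal, the inode stays pinned by the request until
demo_request_release() runs, so a late reply can still look at the
parent safely.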
10 Dec, 2019
2 commits
-
Show the laggy state.
Signed-off-by: Xiubo Li
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov
-
__ceph_is_any_caps is a duplicate helper.
Signed-off-by: Xiubo Li
Reviewed-by: Jeff Layton
Signed-off-by: Ilya Dryomov