12 Feb, 2020

3 commits

  • For the old mount API, the module parameters parsing function is
    called in ceph_mount(), just after the default posix acl flag is
    set, so posix acl can be enabled or disabled via the mount option.

    But for the new mount API, the module parameters parsing function
    is called before ceph_get_tree(), so posix acl will always be
    enabled.

    Fixes: 82995cc6c5ae ("libceph, rbd, ceph: convert to use the new mount API")
    Signed-off-by: Xiubo Li
    Reviewed-by: Ilya Dryomov
    Signed-off-by: Ilya Dryomov

    Xiubo Li
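
    A minimal user-space sketch of the ordering problem described in
    this commit message. The struct, the "acl"/"noacl" option strings
    and every demo_* function are hypothetical stand-ins chosen for
    illustration; they are not the kernel's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the flag discussed above. */
struct demo_opts {
    bool posix_acl;
};

/* Hypothetical parser: "noacl" clears the flag, "acl" sets it. */
static void demo_parse_option(struct demo_opts *o, const char *opt)
{
    assert(o && opt);
    if (strcmp(opt, "noacl") == 0)
        o->posix_acl = false;
    else if (strcmp(opt, "acl") == 0)
        o->posix_acl = true;
}

/* Old-API ordering: set the default, then parse, so the option wins. */
bool demo_default_then_parse(const char *opt)
{
    struct demo_opts o = { .posix_acl = false };

    o.posix_acl = true;             /* default */
    demo_parse_option(&o, opt);
    return o.posix_acl;
}

/* New-API ordering before the fix: parse first, default applied later. */
bool demo_parse_then_default(const char *opt)
{
    struct demo_opts o = { .posix_acl = false };

    demo_parse_option(&o, opt);
    o.posix_acl = true;             /* default overwrites the user's choice */
    return o.posix_acl;
}
```

    With "noacl", the first ordering leaves the flag cleared, while the
    second one ends up with it set regardless of the option.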
     
  • syzbot reported that 4fbc0c711b24 ("ceph: remove the extra slashes in
    the server path") had caused a regression where an allocation could be
    done under a spinlock -- compare_mount_options() is called by sget_fc()
    with sb_lock held.

    We don't really need the supplied server path, so canonicalize it
    in place and compare it directly. To make this work, the leading
    slash is kept around and the logic in ceph_real_mount() to skip it
    is restored. CEPH_MSG_CLIENT_SESSION now reports the same (i.e.
    canonicalized) path, with the leading slash of course.

    Fixes: 4fbc0c711b24 ("ceph: remove the extra slashes in the server path")
    Reported-by: syzbot+98704a51af8e3d9425a9@syzkaller.appspotmail.com
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Jeff Layton

    Ilya Dryomov
     
  • In O_APPEND & O_DIRECT mode, data from different writers may
    overlap, since the writers take the shared lock.

    For example, both Writer1 and Writer2 are in O_APPEND and O_DIRECT
    mode:

    Writer1                 Writer2

    shared_lock()           shared_lock()
    getattr(CAP_SIZE)       getattr(CAP_SIZE)
    iocb->ki_pos = EOF      iocb->ki_pos = EOF
    write(data1)
                            write(data2)
    shared_unlock()         shared_unlock()

    data2 will overwrite data1, since both writes start at the same
    file offset, the old EOF.

    Switch to exclusive lock instead when O_APPEND is specified.

    Signed-off-by: Xiubo Li
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Xiubo Li
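
    The interleaving in this commit message can be replayed
    deterministically in user space. This is only a sketch: the
    in-memory demo_file and the demo_* functions are hypothetical, and
    the two orderings are hard-coded instead of using real locks.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory "file"; size plays the role of EOF. */
struct demo_file {
    char data[64];
    size_t size;
};

/* Write len bytes at pos and extend the size if needed. */
static void demo_write_at(struct demo_file *f, size_t pos,
                          const char *buf, size_t len)
{
    assert(pos + len <= sizeof(f->data));
    memcpy(f->data + pos, buf, len);
    if (pos + len > f->size)
        f->size = pos + len;
}

/* Shared-lock interleaving: both writers sample EOF before either writes. */
size_t demo_append_shared(struct demo_file *f, const char *d1, const char *d2)
{
    size_t pos1 = f->size;      /* Writer1: ki_pos = EOF */
    size_t pos2 = f->size;      /* Writer2: ki_pos = EOF, the same value */

    demo_write_at(f, pos1, d1, strlen(d1));
    demo_write_at(f, pos2, d2, strlen(d2));     /* lands on top of d1 */
    return f->size;
}

/* Exclusive-lock ordering: each writer samples EOF and writes in one step. */
size_t demo_append_exclusive(struct demo_file *f, const char *d1, const char *d2)
{
    size_t pos1 = f->size;

    demo_write_at(f, pos1, d1, strlen(d1));

    size_t pos2 = f->size;      /* sees the EOF that Writer1 produced */

    demo_write_at(f, pos2, d2, strlen(d2));
    return f->size;
}
```

    In the shared case the file grows by one write only and data1 is
    lost; in the exclusive case both writes are appended in turn.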
     

09 Feb, 2020

1 commit

  • Pull vfs file system parameter updates from Al Viro:
    "Saner fs_parser.c guts and data structures. The system-wide registry
    of syntax types (string/enum/int32/oct32/.../etc.) is gone and so is
    the horror switch() in fs_parse() that would have to grow another case
    every time something got added to that system-wide registry.

    New syntax types can be added by filesystems easily now, and their
    namespace is that of functions - not of system-wide enum members. IOW,
    they can be shared or kept private and if some turn out to be widely
    useful, we can make them common library helpers, etc., without having
    to do anything whatsoever to fs_parse() itself.

    And we already get that kind of requests - the thing that finally
    pushed me into doing that was "oh, and let's add one for timeouts -
    things like 15s or 2h". If some filesystem really wants that, let them
    do it. Without somebody having to play gatekeeper for the variants
    blessed by direct support in fs_parse(), TYVM.

    Quite a bit of boilerplate is gone. And IMO the data structures make a
    lot more sense now. -200LoC, while we are at it"

    * 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (25 commits)
    tmpfs: switch to use of invalfc()
    cgroup1: switch to use of errorfc() et.al.
    procfs: switch to use of invalfc()
    hugetlbfs: switch to use of invalfc()
    cramfs: switch to use of errofc() et.al.
    gfs2: switch to use of errorfc() et.al.
    fuse: switch to use errorfc() et.al.
    ceph: use errorfc() and friends instead of spelling the prefix out
    prefix-handling analogues of errorf() and friends
    turn fs_param_is_... into functions
    fs_parse: handle optional arguments sanely
    fs_parse: fold fs_parameter_desc/fs_parameter_spec
    fs_parser: remove fs_parameter_description name field
    add prefix to fs_context->log
    ceph_parse_param(), ceph_parse_mon_ips(): switch to passing fc_log
    new primitive: __fs_parse()
    switch rbd and libceph to p_log-based primitives
    struct p_log, variants of warnf() et.al. taking that one instead
    teach logfc() to handle prefices, give it saner calling conventions
    get rid of cg_invalf()
    ...

    Linus Torvalds
     

08 Feb, 2020

6 commits


07 Feb, 2020

2 commits


06 Feb, 2020

1 commit

  • Pull ceph fixes from Ilya Dryomov:

    - a set of patches that fixes various corner cases in mount and umount
    code (Xiubo Li). This has to do with choosing an MDS, distinguishing
    between laggy and down MDSes and parsing the server path.

    - inode initialization fixes (Jeff Layton). The one included here
    mostly concerns things like open_by_handle() and there is another one
    that will come through Al.

    - copy_file_range() now uses the new copy-from2 op (Luis Henriques).
    The existing copy-from op turned out to be infeasible for generic
    filesystem use; we disable the copy offload if OSDs don't support
    copy-from2.

    - a patch to link "rbd" and "block" devices together in sysfs (Hannes
    Reinecke)

    ... and a smattering of cleanups from Xiubo, Jeff and Chengguang.

    * tag 'ceph-for-5.6-rc1' of https://github.com/ceph/ceph-client: (25 commits)
    rbd: set the 'device' link in sysfs
    ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c
    ceph: print name of xattr in __ceph_{get,set}xattr() douts
    ceph: print r_direct_hash in hex in __choose_mds() dout
    ceph: use copy-from2 op in copy_file_range
    ceph: close holes in structs ceph_mds_session and ceph_mds_request
    rbd: work around -Wuninitialized warning
    ceph: allocate the correct amount of extra bytes for the session features
    ceph: rename get_session and switch to use ceph_get_mds_session
    ceph: remove the extra slashes in the server path
    ceph: add possible_max_rank and make the code more readable
    ceph: print dentry offset in hex and fix xattr_version type
    ceph: only touch the caps which have the subset mask requested
    ceph: don't clear I_NEW until inode metadata is fully populated
    ceph: retry the same mds later after the new session is opened
    ceph: check availability of mds cluster on mount after wait timeout
    ceph: keep the session state until it is released
    ceph: add __send_request helper
    ceph: ensure we have a new cap before continuing in fill_inode
    ceph: drop unused ttl_from parameter from fill_inode
    ...

    Linus Torvalds
     

05 Feb, 2020

1 commit

  • Pull vfs timestamp updates from Al Viro:
    "More 64bit timestamp work"

    * 'imm.timestamp' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    kernfs: don't bother with timestamp truncation
    fs: Do not overload update_time
    fs: Delete timespec64_trunc()
    fs: ubifs: Eliminate timespec64_trunc() usage
    fs: ceph: Delete timespec64_trunc() usage
    fs: cifs: Delete usage of timespec64_trunc
    fs: fat: Eliminate timespec64_trunc() usage
    utimes: Clamp the timestamps in notify_change()

    Linus Torvalds
     

27 Jan, 2020

23 commits

  • All of these functions are only called from CephFS, so move them into
    ceph.ko, and drop the exports.

    Signed-off-by: Jeff Layton
    Reviewed-by: Ilya Dryomov
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • Signed-off-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • It's hard to read, especially when it is:

    ceph: __choose_mds 00000000b7bc9c15 is_hash=1 (-271041095) mode 0

    At the same time, switch to __func__ to get rid of the checkpatch
    warning.

    Signed-off-by: Xiubo Li
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Xiubo Li
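
    To illustrate why the decimal form above is hard to read, here is a
    small user-space sketch that formats the same 32-bit value in hex.
    The helper name and the "0x%x" format are assumptions made for this
    example; the exact format string used in the kernel is not shown
    in this log.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format a 32-bit hash in hex; returns what snprintf() returns. */
int demo_format_hash(char *buf, size_t len, uint32_t hash)
{
    assert(buf && len > 0);
    return snprintf(buf, len, "0x%x", (unsigned int)hash);
}
```

    Passing (uint32_t)-271041095, the value from the log line quoted
    above, yields an unsigned hex string instead of a negative decimal.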
     
  • Instead of using the copy-from operation, switch copy_file_range to the
    new copy-from2 operation, which allows sending the truncate_seq and
    truncate_size parameters.

    If an OSD does not support the copy-from2 operation it will return
    -EOPNOTSUPP. In that case, the kernel client will stop trying to do
    remote object copies for this fs client and will always use the generic
    VFS copy_file_range.

    Signed-off-by: Luis Henriques
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Luis Henriques
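
    The fallback behaviour described in this commit message can be
    sketched in user space as follows. All names here (demo_client,
    demo_remote_copy, demo_copy_file_range) are hypothetical, and the
    OSD is modelled by a plain boolean.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-client state: whether remote object copy is still tried. */
struct demo_client {
    bool copy_offload_enabled;
    int remote_calls;
};

/* Hypothetical OSD call; osd_supports_op models copy-from2 support. */
static int demo_remote_copy(struct demo_client *c, bool osd_supports_op)
{
    c->remote_calls++;
    return osd_supports_op ? 0 : -EOPNOTSUPP;
}

/* Returns 1 if the remote copy was used, 0 if the generic path was used. */
int demo_copy_file_range(struct demo_client *c, bool osd_supports_op)
{
    assert(c);
    if (c->copy_offload_enabled) {
        int ret = demo_remote_copy(c, osd_supports_op);

        if (ret == 0)
            return 1;
        if (ret == -EOPNOTSUPP)
            c->copy_offload_enabled = false;    /* stop trying for this client */
    }
    return 0;   /* generic copy path */
}
```

    After the first -EOPNOTSUPP, later calls skip the remote attempt
    entirely, so the OSD is not asked again for this client.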
     
  • Move s_ref up to plug a 4 byte hole, which plugs another.
    Move r_kref to shave 8 bytes off per request on x86_64.

    Signed-off-by: Jeff Layton
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Jeff Layton
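
    As a generic illustration of plugging a padding hole by reordering
    fields, consider the two hypothetical layouts below. The field
    names are invented for this sketch and do not mirror the kernel
    structs; the exact sizes depend on the ABI.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout with holes on a typical LP64 ABI. */
struct demo_holey {
    int   state;    /* 4 bytes, then padding before the pointer */
    void *ptr;
    int   ref;      /* 4 bytes, then tail padding */
};

/* Same fields, with ref moved up into what used to be padding. */
struct demo_packed {
    int   state;
    int   ref;
    void *ptr;
};

/* The reordered layout is never larger than the original one. */
_Static_assert(sizeof(struct demo_packed) <= sizeof(struct demo_holey),
               "reordering must not grow the struct");
```

    Tools such as pahole report these holes directly; the sketch only
    shows the effect through sizeof.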
     
  • The total bytes may potentially be larger than 8.

    Signed-off-by: Xiubo Li
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Xiubo Li
     
  • If the session's refcount has reached 0 and the session is being
    released, getting the session without checking the refcount may
    crash the kernel.

    Rename get_session to ceph_get_mds_session and make it global.

    Signed-off-by: Xiubo Li
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Xiubo Li
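
    A user-space sketch of taking a reference only while the count is
    still non-zero, written with C11 atomics. The kernel provides
    refcount_inc_not_zero() for this pattern; demo_session and
    demo_get_session below are hypothetical and only imitate the idea.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct demo_session {
    atomic_int ref;
};

/* Take a reference only if the count is still non-zero. */
struct demo_session *demo_get_session(struct demo_session *s)
{
    int old;

    assert(s);
    old = atomic_load(&s->ref);
    while (old != 0) {
        if (atomic_compare_exchange_weak(&s->ref, &old, old + 1))
            return s;
        /* a failed exchange reloaded old; check it again and retry */
    }
    return NULL;    /* the session is already being released */
}
```

    A caller that receives NULL must not touch the session, because its
    memory may be freed at any moment.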
     
  • It's possible to pass the mount helper a server path that has more
    than one contiguous slash character. For example:

    $ mount -t ceph 192.168.195.165:40176:/// /mnt/cephfs/

    On the MDS side, the extra slashes in the server path will be
    treated as a snap dir, and we then get the following debug logs:

    ceph: mount opening path //
    ceph: open_root_inode opening '//'
    ceph: fill_trace 0000000059b8a3bc is_dentry 0 is_target 1
    ceph: alloc_inode 00000000dc4ca00b
    ceph: get_inode created new inode 00000000dc4ca00b 1.ffffffffffffffff ino 1
    ceph: get_inode on 1=1.ffffffffffffffff got 00000000dc4ca00b

    And then when creating any new file or directory under the mount
    point, we can hit the following BUG_ON in ceph_fill_trace():

    BUG_ON(ceph_snap(dir) != dvino.snap);

    Have the client ignore the extra slashes in the server path when
    mounting. This will also canonicalize the path, so that identical mounts
    can be consolidated.

    1) "//mydir1///mydir//"
    2) "/mydir1/mydir"
    3) "/mydir1/mydir/"

    Regardless of the internal treatment of these paths, the kernel still
    stores the original string including the leading '/' for presentation
    to userland.

    URL: https://tracker.ceph.com/issues/42771
    Signed-off-by: Xiubo Li
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Xiubo Li
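
    One way to collapse the three example paths above into a single
    form is sketched below in user-space C. The function name is
    hypothetical and this is not the kernel's implementation; this
    variant keeps the leading slash and edits the string in place.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Collapse runs of '/' in place and drop a trailing '/', keeping the
 * leading one. Returns the new length.
 */
size_t demo_canonicalize_path(char *path)
{
    size_t out = 0;

    assert(path);
    for (size_t in = 0; path[in] != '\0'; in++) {
        if (path[in] == '/' && out > 0 && path[out - 1] == '/')
            continue;               /* skip a repeated slash */
        path[out++] = path[in];
    }
    if (out > 1 && path[out - 1] == '/')
        out--;                      /* drop the trailing slash, but keep "/" */
    path[out] = '\0';
    return out;
}
```

    Because the output index never runs ahead of the input index, no
    allocation or second buffer is needed.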
     
  • The m_num_mds here is actually the number of MDSs in the up:active
    state, which duplicates m_num_active_mds, so remove it.

    Add possible_max_rank to the mdsmap struct; it is the correct upper
    bound on the possible ranks.

    Remove the special case for one mds in __mdsmap_get_random_mds(),
    because the valid mds rank may not always be 0.

    Signed-off-by: Xiubo Li
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Xiubo Li
     
  • In the debug logs, di->offset and ctx->pos are printed in hex in
    some places and in decimal in others, which makes them a little
    hard to read.

    The xattr version is a u64; using a shorter type may truncate it.

    Signed-off-by: Xiubo Li
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Xiubo Li
     
  • Caps that have no subset of the requested mask should not be
    touched.

    Signed-off-by: Xiubo Li
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Xiubo Li
     
  • Currently, we could have an open-by-handle (or NFS server) call
    into the filesystem and start working with an inode before it's
    properly filled out.

    Don't clear I_NEW until we have filled out the inode, and discard it
    properly if that fails. Note that we occasionally take an extra
    reference to the inode to ensure that we don't put the last reference in
    discard_new_inode, but rather leave it for ceph_async_iput.

    Signed-off-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • If max_mds > 1 and a request is submitted that chooses a random mds
    rank, and the corresponding session is not opened yet, the request
    will wait until the session has been opened and then be resent.

    Every time the request goes through __do_request, it will release the
    req->session first and choose a random one again, which may be a
    completely different rank than the one it just waited on.

    In the worst case, it will open all the mds sessions one by one just
    before the request can be successfully sent out.

    Signed-off-by: Xiubo Li
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Xiubo Li
     
  • If all the MDS daemons are down for some reason, then the first mount
    attempt will fail with EIO after the mount request times out. A mount
    attempt will also fail with EIO if all of the MDSs are laggy.

    This patch changes the code to return -EHOSTUNREACH in these situations
    and adds a pr_info error message to help the admin determine the cause.

    URL: https://tracker.ceph.com/issues/4386
    Signed-off-by: Xiubo Li
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Xiubo Li
     
  • When a session reconnect is denied by the MDS, because the client
    is blacklisted or for some other reason, kclient will receive a
    session close reply, and we will never see the important log:

    "ceph: mds%d reconnect denied"

    Instead, we get the confusing log:

    "ceph: handle_session mds0 close 0000000085804730 state ??? seq 0"

    Let's keep the session state until its memory is released.

    Signed-off-by: Xiubo Li
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Xiubo Li
     
  • Signed-off-by: Xiubo Li
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Xiubo Li
     
  • If the caller passes in a NULL cap_reservation, and we can't allocate
    one then ensure that we fail gracefully.

    Signed-off-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • Signed-off-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • During umount, if there are no unsafe requests in the mdsc but some
    requests are still in flight and have not received a reply yet, and
    the remaining requests are all safe ones, then even after all of
    them have been unregistered from the mdsc, the umount still has to
    wait for mount_timeout seconds.

    Signed-off-by: Xiubo Li
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Xiubo Li
     
  • Even if an MDS is in the up:active state, it may still be laggy.
    Skip the laggy MDSs.

    Signed-off-by: Xiubo Li
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Xiubo Li
     
  • If max_mds > 1 in the MDS cluster, there is no standby MDS and all
    max_mds MDSs are in the up:active state, then when one of the
    up:active MDSs dies, m->m_num_laggy in kclient will be 1 and the
    mount will fail without considering the other healthy MDSs.

    There may be some MDSs still "in" the cluster but not in the
    up:active state; we will ignore them. Only when all the up:active
    MDSs in the cluster are laggy will the cluster be treated as
    unavailable.

    When max_mds is decreased, the cluster does not stop the extra
    up:active MDSs immediately, so there is some latency. During that
    time the number of up:active MDSs is larger than max_mds, so the
    m_info memory will always be reallocated later.

    Pick out the up:active MDSs as m_num_mds and allocate the needed
    memory once.

    Signed-off-by: Xiubo Li
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Xiubo Li
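
    The availability rule stated in this commit message can be sketched
    as a small user-space check. The demo_mds struct and the function
    name are hypothetical; the kernel's mdsmap fields are not
    reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-MDS record; field names are illustrative. */
struct demo_mds {
    bool up_active;
    bool laggy;
};

/*
 * The cluster counts as available if at least one up:active MDS is
 * not laggy. MDSs that are not up:active are ignored.
 */
bool demo_cluster_available(const struct demo_mds *mds, size_t n)
{
    assert(mds || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (mds[i].up_active && !mds[i].laggy)
            return true;
    }
    return false;
}
```

    With two up:active MDSs of which one is laggy, the check still
    reports the cluster as available, which is the case the commit
    message wants to handle.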
     
  • ceph_pagelist_encode_string() will not fail in the reserved case,
    and we do not check the error code here anyway, so remove the
    unnecessary assignment.

    Signed-off-by: Chengguang Xu
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Chengguang Xu
     
  • We print session's refcount in debug message inside
    ceph_put_mds_session() and get_session(), so we don't have to
    print it in con_get()/__ceph_lookup_mds_session()/con_put().

    Signed-off-by: Chengguang Xu
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Chengguang Xu
     

22 Jan, 2020

1 commit

  • Currently, we just assume that it will stick around by virtue of the
    submitter's reference, but later patches will allow the syscall to
    return early and we can't rely on that reference at that point.

    While I'm not aware of any reports of it, Xiubo pointed out that this
    may fix a use-after-free. If the wait for a reply times out or is
    canceled via signal, and then the reply comes in after the syscall
    returns, the client can end up trying to access r_parent without a
    reference.

    Take an extra reference to the inode when setting r_parent and release
    it when releasing the request.

    Cc: stable@vger.kernel.org
    Signed-off-by: Jeff Layton
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Jeff Layton
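
    The reference handling described in this commit message follows a
    common pattern, sketched here in user space. The plain integer
    refcount and all demo_* names are hypothetical simplifications and
    stand in for the kernel's inode reference helpers.

```c
#include <assert.h>
#include <stddef.h>

struct demo_inode {
    int refcount;
};

struct demo_request {
    struct demo_inode *parent;
};

static void demo_iget(struct demo_inode *i)
{
    i->refcount++;
}

static void demo_iput(struct demo_inode *i)
{
    assert(i->refcount > 0);
    i->refcount--;
}

/* The request takes its own reference when the parent is set. */
void demo_set_parent(struct demo_request *req, struct demo_inode *dir)
{
    demo_iget(dir);
    req->parent = dir;
}

/* The reference is dropped only when the request itself is released. */
void demo_release_request(struct demo_request *req)
{
    if (req->parent) {
        demo_iput(req->parent);
        req->parent = NULL;
    }
}
```

    Even if the submitter drops its own reference early, the parent
    stays pinned until demo_release_request() runs.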
     

10 Dec, 2019

2 commits