25 Oct, 2020

1 commit

  • Pull misc vfs updates from Al Viro:
    "Assorted stuff all over the place (the largest group here is
    Christoph's stat cleanups)"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: remove KSTAT_QUERY_FLAGS
    fs: remove vfs_stat_set_lookup_flags
    fs: move vfs_fstatat out of line
    fs: implement vfs_stat and vfs_lstat in terms of vfs_fstatat
    fs: remove vfs_statx_fd
    fs: omfs: use kmemdup() rather than kmalloc+memcpy
    [PATCH] reduce boilerplate in fsid handling
    fs: Remove duplicated flag O_NDELAY occurring twice in VALID_OPEN_FLAGS
    selftests: mount: add nosymfollow tests
    Add a "nosymfollow" mount option.

    Linus Torvalds
     

12 Oct, 2020

2 commits


19 Sep, 2020

1 commit


05 Aug, 2020

1 commit

  • When doing some testing recently, I hit some page allocation failures
    on mount, when creating the wb_pagevec_pool for the mount. That
    requires 128k (32 contiguous pages), and after thrashing the memory
    during an xfstests run, sometimes that would fail.

    128k for each mount seems like a lot to hold in reserve for a rainy
    day, so let's change this to a global mempool that gets allocated
    when the module is plugged in.

    Signed-off-by: Jeff Layton
    Reviewed-by: Ilya Dryomov
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     

03 Aug, 2020

2 commits

  • Drop duplicated words "down" and "the" in fs/ceph/.

    [ idryomov: merge into a single patch ]

    Signed-off-by: Randy Dunlap
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Randy Dunlap
     
  • This will send the caps/read/write/metadata metrics to any available MDS
    once per second, which will be the same as the userland client. It will
    skip the MDS sessions which don't support the metric collection, as the
    MDSs will close socket connections when they get an unknown type
    message.

    We can disable the metric sending via the disable_send_metrics module
    parameter.

    [ jlayton: fix up endianness bug in ceph_mdsc_send_metrics() ]

    URL: https://tracker.ceph.com/issues/43215
    Signed-off-by: Xiubo Li
    Signed-off-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Xiubo Li
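
The session-selection rule in the commit above can be sketched in Python. This is a minimal illustration, not the kernel code: the `Session` record, its `supports_metrics` flag and `pick_metric_session()` are hypothetical names, and choosing the first usable session is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Session:
    """Hypothetical stand-in for an MDS session (not the kernel struct)."""
    mds_rank: int
    supports_metrics: bool


def pick_metric_session(sessions, disable_send_metrics=False) -> Optional[Session]:
    """Return the session that should receive this tick's metrics, if any.

    Sessions whose MDS does not support metric collection are skipped,
    because such an MDS would close the connection on an unknown
    message type. Picking the first usable session is an assumption.
    """
    if disable_send_metrics:
        return None
    for session in sessions:
        if session.supports_metrics:
            return session
    return None


sessions = [Session(0, False), Session(1, True), Session(2, True)]
chosen = pick_metric_session(sessions)
```

Here `chosen` is the rank 1 session; with `disable_send_metrics=True` nothing is selected.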
     

30 Mar, 2020

2 commits

  • The MDS is getting a new lock-caching facility that will allow it
    to cache the necessary locks to allow asynchronous directory operations.
    Since the CEPH_CAP_FILE_* caps are currently unused on directories,
    we can repurpose those bits for this purpose.

    When performing an unlink, if we have Fx on the parent directory,
    and CEPH_CAP_DIR_UNLINK (aka Fr), and we know that the dentry being
    removed is the primary link, then we can fire off an unlink
    request immediately and don't need to wait on reply before returning.

    In that situation, just fix up the dcache and link count and return
    immediately after issuing the call to the MDS. This does mean that we
    need to hold an extra reference to the inode being unlinked, and extra
    references to the caps to avoid races. Those references are put and
    error handling is done in the r_callback routine.

    If the operation ends up failing, then set a writeback error on both
    the directory inode and the inode itself; it can be fetched later by
    an fsync on the dir.

    The behavior of dir caps is slightly different from caps on normal
    files. Because these are just considered an optimization, if the
    session is reconnected, we will not automatically reclaim them. They
    are instead considered lost until we do another synchronous op in the
    parent directory.

    Async dirops are enabled via the "nowsync" mount option, which is
    patterned after the xfs "wsync" mount option. For now, the default
    is "wsync", but eventually we may flip that.

    Signed-off-by: Jeff Layton
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Jeff Layton
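
The precondition for firing off an asynchronous unlink, as described in the commit above, can be sketched as a single predicate. The bit values below are hypothetical placeholders rather than the kernel's real CEPH_CAP_* constants, and `can_unlink_async()` is an illustrative name.

```python
# Hypothetical bit values chosen for illustration only; the kernel's
# real CEPH_CAP_* constants are not reproduced here.
CAP_FILE_EXCL = 1 << 0    # "Fx" on the parent directory
CAP_DIR_UNLINK = 1 << 1   # "Fr", repurposed as CEPH_CAP_DIR_UNLINK


def can_unlink_async(dir_caps: int, is_primary_link: bool, nowsync: bool) -> bool:
    """Return True when the unlink may be issued without waiting for a reply.

    All of the following must hold: the mount uses "nowsync", both caps
    are held on the parent directory, and the dentry is the primary link.
    """
    needed = CAP_FILE_EXCL | CAP_DIR_UNLINK
    has_caps = (dir_caps & needed) == needed
    return bool(nowsync and is_primary_link and has_caps)
```

If any condition fails, the sketch falls back to the synchronous path, which matches the current "wsync" default.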
     
  • On my machine (x86_64) this struct is 952 bytes, which gets rounded up
    to 1024 by kmalloc. Move this to a dedicated slabcache, so we can
    allocate them without the extra 72 bytes of overhead per allocation.

    Signed-off-by: Jeff Layton
    Reviewed-by: Ilya Dryomov
    Signed-off-by: Ilya Dryomov

    Jeff Layton
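
The arithmetic behind the 72-byte figure can be checked with a short Python sketch. It assumes kmalloc serves this size range from power-of-two size classes, which is a simplification; `next_power_of_two()` is an illustrative helper, not a kernel function.

```python
def next_power_of_two(size: int) -> int:
    """Smallest power of two that is >= size (size must be positive)."""
    if size <= 0:
        raise ValueError("size must be positive")
    return 1 << (size - 1).bit_length()


struct_size = 952                          # figure quoted in the commit message
bucket = next_power_of_two(struct_size)    # size class under the assumption above
wasted = bucket - struct_size              # bytes lost per allocation
```

A dedicated slab cache sizes its objects to the struct itself, so this per-object padding goes away.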
     

12 Feb, 2020

2 commits

  • For the old mount API, the module parameters parsing function is
    called in ceph_mount(), just after the default posix acl flag is set,
    so we can enable or disable it via the mount option.

    But for the new mount API, the module parameters parsing function is
    called before ceph_get_tree(), so the posix acl will always be
    enabled.

    Fixes: 82995cc6c5ae ("libceph, rbd, ceph: convert to use the new mount API")
    Signed-off-by: Xiubo Li
    Reviewed-by: Ilya Dryomov
    Signed-off-by: Ilya Dryomov

    Xiubo Li
     
  • syzbot reported that 4fbc0c711b24 ("ceph: remove the extra slashes in
    the server path") had caused a regression where an allocation could be
    done under a spinlock -- compare_mount_options() is called by sget_fc()
    with sb_lock held.

    We don't really need the supplied server path, so canonicalize it
    in place and compare it directly. To make this work, the leading
    slash is kept around and the logic in ceph_real_mount() to skip it
    is restored. CEPH_MSG_CLIENT_SESSION now reports the same (i.e.
    canonicalized) path, with the leading slash of course.

    Fixes: 4fbc0c711b24 ("ceph: remove the extra slashes in the server path")
    Reported-by: syzbot+98704a51af8e3d9425a9@syzkaller.appspotmail.com
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Jeff Layton

    Ilya Dryomov
     

09 Feb, 2020

1 commit

  • Pull vfs file system parameter updates from Al Viro:
    "Saner fs_parser.c guts and data structures. The system-wide registry
    of syntax types (string/enum/int32/oct32/.../etc.) is gone and so is
    the horror switch() in fs_parse() that would have to grow another case
    every time something got added to that system-wide registry.

    New syntax types can be added by filesystems easily now, and their
    namespace is that of functions - not of system-wide enum members. IOW,
    they can be shared or kept private and if some turn out to be widely
    useful, we can make them common library helpers, etc., without having
    to do anything whatsoever to fs_parse() itself.

    And we already get that kind of requests - the thing that finally
    pushed me into doing that was "oh, and let's add one for timeouts -
    things like 15s or 2h". If some filesystem really wants that, let them
    do it. Without somebody having to play gatekeeper for the variants
    blessed by direct support in fs_parse(), TYVM.

    Quite a bit of boilerplate is gone. And IMO the data structures make a
    lot more sense now. -200LoC, while we are at it"

    * 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (25 commits)
    tmpfs: switch to use of invalfc()
    cgroup1: switch to use of errorfc() et.al.
    procfs: switch to use of invalfc()
    hugetlbfs: switch to use of invalfc()
    cramfs: switch to use of errofc() et.al.
    gfs2: switch to use of errorfc() et.al.
    fuse: switch to use errorfc() et.al.
    ceph: use errorfc() and friends instead of spelling the prefix out
    prefix-handling analogues of errorf() and friends
    turn fs_param_is_... into functions
    fs_parse: handle optional arguments sanely
    fs_parse: fold fs_parameter_desc/fs_parameter_spec
    fs_parser: remove fs_parameter_description name field
    add prefix to fs_context->log
    ceph_parse_param(), ceph_parse_mon_ips(): switch to passing fc_log
    new primitive: __fs_parse()
    switch rbd and libceph to p_log-based primitives
    struct p_log, variants of warnf() et.al. taking that one instead
    teach logfc() to handle prefices, give it saner calling conventions
    get rid of cg_invalf()
    ...

    Linus Torvalds
     

08 Feb, 2020

6 commits


07 Feb, 2020

2 commits


27 Jan, 2020

3 commits

  • Instead of using the copy-from operation, switch copy_file_range to the
    new copy-from2 operation, which allows sending the truncate_seq and
    truncate_size parameters.

    If an OSD does not support the copy-from2 operation it will return
    -EOPNOTSUPP. In that case, the kernel client will stop trying to do
    remote object copies for this fs client and will always use the generic
    VFS copy_file_range.

    Signed-off-by: Luis Henriques
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Luis Henriques
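
The fallback policy described above (try the remote copy, and on -EOPNOTSUPP switch this client to the generic path for good) can be sketched as follows. `CopyClient` and its callables are hypothetical; the real logic lives in kernel C.

```python
import errno


class CopyClient:
    """Hypothetical model of one fs client's copy_file_range policy."""

    def __init__(self, remote_copy, generic_copy):
        self.remote_copy = remote_copy      # callable returning 0 or -errno
        self.generic_copy = generic_copy    # generic VFS-style fallback
        self.remote_disabled = False

    def copy(self):
        if not self.remote_disabled:
            ret = self.remote_copy()
            if ret != -errno.EOPNOTSUPP:
                return ("remote", ret)
            # The OSD does not support copy-from2: stop trying remote
            # object copies for the lifetime of this client.
            self.remote_disabled = True
        return ("generic", self.generic_copy())


remote_calls = []


def old_osd_copy():
    """Simulates an OSD that predates copy-from2."""
    remote_calls.append(1)
    return -errno.EOPNOTSUPP


client = CopyClient(old_osd_copy, lambda: 0)
first = client.copy()
second = client.copy()
```

After the first failure the remote path is never attempted again, so `remote_calls` holds a single entry even though two copies were made.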
     
  • It's possible to pass the mount helper a server path that has more
    than one contiguous slash character. For example:

    $ mount -t ceph 192.168.195.165:40176:/// /mnt/cephfs/

    On the MDS side, the extra slashes in the server path are treated as
    a snap dir, and we then get the following debug logs:

    ceph: mount opening path //
    ceph: open_root_inode opening '//'
    ceph: fill_trace 0000000059b8a3bc is_dentry 0 is_target 1
    ceph: alloc_inode 00000000dc4ca00b
    ceph: get_inode created new inode 00000000dc4ca00b 1.ffffffffffffffff ino 1
    ceph: get_inode on 1=1.ffffffffffffffff got 00000000dc4ca00b

    And then when creating any new file or directory under the mount
    point, we can hit the following BUG_ON in ceph_fill_trace():

    BUG_ON(ceph_snap(dir) != dvino.snap);

    Have the client ignore the extra slashes in the server path when
    mounting. This will also canonicalize the path, so that identical mounts
    can be consolidated.

    1) "//mydir1///mydir//"
    2) "/mydir1/mydir"
    3) "/mydir1/mydir/"

    Regardless of the internal treatment of these paths, the kernel still
    stores the original string including the leading '/' for presentation
    to userland.

    URL: https://tracker.ceph.com/issues/42771
    Signed-off-by: Xiubo Li
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Xiubo Li
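
The canonicalization described in the commit above can be illustrated with a small Python sketch. It only collapses repeated slashes and drops a trailing slash; `canonicalize_server_path()` is an illustrative name, and handling of "." or ".." components is deliberately left out.

```python
def canonicalize_server_path(path: str) -> str:
    """Collapse runs of '/' and drop any trailing '/', keeping the leading one."""
    parts = [p for p in path.split("/") if p]
    return "/" + "/".join(parts)


examples = ["//mydir1///mydir//", "/mydir1/mydir", "/mydir1/mydir/"]
canonical = {canonicalize_server_path(p) for p in examples}
```

All three example paths from the commit message reduce to the same string, which is what allows identical mounts to be matched.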
     
  • If all the MDS daemons are down for some reason, then the first mount
    attempt will fail with EIO after the mount request times out. A mount
    attempt will also fail with EIO if all of the MDSs are laggy.

    This patch changes the code to return -EHOSTUNREACH in these situations
    and adds a pr_info error message to help the admin determine the cause.

    URL: https://tracker.ceph.com/issues/4386
    Signed-off-by: Xiubo Li
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Xiubo Li
     

10 Dec, 2019

1 commit

  • Most of these values should never be negative, so convert them to
    unsigned values. Add some sanity checking to the parsed values, and
    clean up some unneeded casts.

    Note that while caps_max should never be negative, this patch leaves
    it signed, since this value ends up later being compared to a signed
    counter. Just ensure that userland never passes in a negative value
    for caps_max.

    Signed-off-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     

28 Nov, 2019

1 commit

  • Convert the ceph filesystem to the new internal mount API as the old
    one will be obsoleted and removed. This allows greater flexibility in
    communication of mount parameters between userspace, the VFS and the
    filesystem.

    See Documentation/filesystems/mount_api.txt for more information.

    [ Numerous string handling, leak and regression fixes; rbd conversion
    was particularly broken and had to be redone almost from scratch. ]

    Signed-off-by: David Howells
    Signed-off-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    David Howells
     

08 Nov, 2019

1 commit


26 Sep, 2019

1 commit

  • Pull ceph updates from Ilya Dryomov:
    "The highlights are:

    - automatic recovery of a blacklisted filesystem session (Zheng Yan).
    This is disabled by default and can be enabled by mounting with the
    new "recover_session=clean" option.

    - serialize buffered reads and O_DIRECT writes (Jeff Layton). Care is
    taken to avoid serializing O_DIRECT reads and writes with each
    other, this is based on the exclusion scheme from NFS.

    - handle large osdmaps better in the face of fragmented memory
    (myself)

    - don't limit what security.* xattrs can be get or set (Jeff Layton).
    We were overly restrictive here, unnecessarily preventing things
    like file capability sets stored in security.capability from
    working.

    - allow copy_file_range() within the same inode and across different
    filesystems within the same cluster (Luis Henriques)"

    * tag 'ceph-for-5.4-rc1' of git://github.com/ceph/ceph-client: (41 commits)
    ceph: call ceph_mdsc_destroy from destroy_fs_client
    libceph: use ceph_kvmalloc() for osdmap arrays
    libceph: avoid a __vmalloc() deadlock in ceph_kvmalloc()
    ceph: allow object copies across different filesystems in the same cluster
    ceph: include ceph_debug.h in cache.c
    ceph: move static keyword to the front of declarations
    rbd: pull rbd_img_request_create() dout out into the callers
    ceph: reconnect connection if session hang in opening state
    libceph: drop unused con parameter of calc_target()
    ceph: use release_pages() directly
    rbd: fix response length parameter for encoded strings
    ceph: allow arbitrary security.* xattrs
    ceph: only set CEPH_I_SEC_INITED if we got a MAC label
    ceph: turn ceph_security_invalidate_secctx into static inline
    ceph: add buffered/direct exclusionary locking for reads and writes
    libceph: handle OSD op ceph_pagelist_append() errors
    ceph: don't return a value from void function
    ceph: don't freeze during write page faults
    ceph: update the mtime when truncating up
    ceph: fix indentation in __get_snap_name()
    ...

    Linus Torvalds
     

16 Sep, 2019

4 commits

  • They're always called in succession.

    Signed-off-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • Make the client use OSD replies and session messages to infer whether it
    has been blacklisted. If it has, the client reconnects to the cluster
    using a new entity addr. Auto reconnect is limited to once every 30
    minutes.

    Auto reconnect is disabled by default. It can be enabled or disabled via
    the recover_session= mount option. In 'clean' mode, the client drops
    any dirty data/metadata, invalidates page caches and invalidates all
    writable file handles. After reconnect, file locks become stale because
    the MDS loses track of them. If an inode contains any stale file locks,
    reads and writes on the inode are not allowed until applications release
    all stale file locks.

    Signed-off-by: "Yan, Zheng"
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Yan, Zheng
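
The "once every 30 minutes" limit on automatic reconnects can be sketched as a simple rate limiter. `AutoReconnect` is a hypothetical class; the current time is passed in explicitly so the behavior is easy to follow, and the kernel's actual bookkeeping may differ.

```python
RECONNECT_INTERVAL = 30 * 60  # seconds, i.e. once every 30 minutes


class AutoReconnect:
    """Hypothetical rate limiter for blacklist recovery."""

    def __init__(self, enabled: bool = False):
        self.enabled = enabled        # auto reconnect is disabled by default
        self.last_attempt = None      # time of the last reconnect, if any

    def should_reconnect(self, blacklisted: bool, now: float) -> bool:
        if not (self.enabled and blacklisted):
            return False
        if (self.last_attempt is not None
                and now - self.last_attempt < RECONNECT_INTERVAL):
            return False              # too soon after the previous attempt
        self.last_attempt = now
        return True


limiter = AutoReconnect(enabled=True)   # as with recover_session=clean
decisions = [limiter.should_reconnect(True, t) for t in (0, 600, 1800)]
```

With attempts at 0, 600 and 1800 seconds, only the first and the last are allowed.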
     
  • Signed-off-by: "Yan, Zheng"
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Yan, Zheng
     
  • It closes MDS sessions, drops all caps and invalidates page caches,
    then uses a new entity address to reconnect to the cluster.

    After reconnect, all dirty data/metadata are dropped and file locks
    are lost silently. Open files continue to work because the client will
    try renewing caps on later reads/writes.

    Signed-off-by: "Yan, Zheng"
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Yan, Zheng
     

30 Aug, 2019

1 commit

  • Fill in the appropriate limits to avoid inconsistencies
    in the vfs cached inode times when timestamps are
    outside the permitted range.

    According to the discussion in
    https://patchwork.kernel.org/patch/8308691/ we agreed to use
    unsigned 32 bit timestamps on ceph.
    Update the limits accordingly.

    Signed-off-by: Deepa Dinamani
    Acked-by: Jeff Layton
    Cc: zyan@redhat.com
    Cc: sage@redhat.com
    Cc: idryomov@gmail.com
    Cc: ceph-devel@vger.kernel.org

    Deepa Dinamani
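
The effect of declaring unsigned 32-bit timestamp limits can be illustrated with a clamp. This is a sketch of the idea only: `TIME_MIN`, `TIME_MAX` and `clamp_timestamp()` are illustrative names, and the values assume whole seconds since the epoch.

```python
TIME_MIN = 0            # lower bound for unsigned 32-bit seconds
TIME_MAX = 2**32 - 1    # upper bound for unsigned 32-bit seconds


def clamp_timestamp(seconds: int) -> int:
    """Clamp a seconds-since-epoch value into the representable range."""
    return max(TIME_MIN, min(seconds, TIME_MAX))
```

A timestamp before the epoch is pinned to 0 and one beyond the 32-bit range is pinned to 4294967295, so the cached inode time agrees with what can be stored.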
     

19 Jul, 2019

1 commit

  • Pull ceph updates from Ilya Dryomov:
    "Lots of exciting things this time!

    - support for rbd object-map and fast-diff features (myself). This
    will speed up reads, discards and things like snap diffs on sparse
    images.

    - ceph.snap.btime vxattr to expose snapshot creation time (David
    Disseldorp). This will be used to integrate with "Restore Previous
    Versions" feature added in Windows 7 for folks who reexport ceph
    through SMB.

    - security xattrs for ceph (Zheng Yan). Only selinux is supported for
    now due to the limitations of ->dentry_init_security().

    - support for MSG_ADDR2, FS_BTIME and FS_CHANGE_ATTR features (Jeff
    Layton). This is actually a single feature bit which was missing
    because of the filesystem pieces. With this in, the kernel client
    will finally be reported as "luminous" by "ceph features" -- it is
    still being reported as "jewel" even though all required Luminous
    features were implemented in 4.13.

    - stop NULL-terminating ceph vxattrs (Jeff Layton). The convention
    with xattrs is to not terminate and this was causing
    inconsistencies with ceph-fuse.

    - change filesystem time granularity from 1 us to 1 ns, again fixing
    an inconsistency with ceph-fuse (Luis Henriques).

    On top of this there are some additional dentry name handling and cap
    flushing fixes from Zheng. Finally, Jeff is formally taking over for
    Zheng as the filesystem maintainer"

    * tag 'ceph-for-5.3-rc1' of git://github.com/ceph/ceph-client: (71 commits)
    ceph: fix end offset in truncate_inode_pages_range call
    ceph: use generic_delete_inode() for ->drop_inode
    ceph: use ceph_evict_inode to cleanup inode's resource
    ceph: initialize superblock s_time_gran to 1
    MAINTAINERS: take over for Zheng as CephFS kernel client maintainer
    rbd: setallochint only if object doesn't exist
    rbd: support for object-map and fast-diff
    rbd: call rbd_dev_mapping_set() from rbd_dev_image_probe()
    libceph: export osd_req_op_data() macro
    libceph: change ceph_osdc_call() to take page vector for response
    libceph: bump CEPH_MSG_MAX_DATA_LEN (again)
    rbd: new exclusive lock wait/wake code
    rbd: quiescing lock should wait for image requests
    rbd: lock should be quiesced on reacquire
    rbd: introduce copyup state machine
    rbd: rename rbd_obj_setup_*() to rbd_obj_init_*()
    rbd: move OSD request allocation into object request state machines
    rbd: factor out __rbd_osd_setup_discard_ops()
    rbd: factor out rbd_osd_setup_copyup()
    rbd: introduce obj_req->osd_reqs list
    ...

    Linus Torvalds
     

13 Jul, 2019

1 commit

  • Pull driver core and debugfs updates from Greg KH:
    "Here is the "big" driver core and debugfs changes for 5.3-rc1

    It's a lot of different patches, all across the tree due to some api
    changes and lots of debugfs cleanups.

    Other than the debugfs cleanups, in this set of changes we have:

    - bus iteration function cleanups

    - scripts/get_abi.pl tool to display and parse Documentation/ABI
    entries in a simple way

    - cleanups to Documentation/ABI/ entries to make them parse easier
    due to typos and other minor things

    - default_attrs use for some ktype users

    - driver model documentation file conversions to .rst

    - compressed firmware file loading

    - deferred probe fixes

    All of these have been in linux-next for a while, with a bunch of
    merge issues that Stephen has been patient with me for"

    * tag 'driver-core-5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (102 commits)
    debugfs: make error message a bit more verbose
    orangefs: fix build warning from debugfs cleanup patch
    ubifs: fix build warning after debugfs cleanup patch
    driver: core: Allow subsystems to continue deferring probe
    drivers: base: cacheinfo: Ensure cpu hotplug work is done before Intel RDT
    arch_topology: Remove error messages on out-of-memory conditions
    lib: notifier-error-inject: no need to check return value of debugfs_create functions
    swiotlb: no need to check return value of debugfs_create functions
    ceph: no need to check return value of debugfs_create functions
    sunrpc: no need to check return value of debugfs_create functions
    ubifs: no need to check return value of debugfs_create functions
    orangefs: no need to check return value of debugfs_create functions
    nfsd: no need to check return value of debugfs_create functions
    lib: 842: no need to check return value of debugfs_create functions
    debugfs: provide pr_fmt() macro
    debugfs: log errors when something goes wrong
    drivers: s390/cio: Fix compilation warning about const qualifiers
    drivers: Add generic helper to match by of_node
    driver_find_device: Unify the match function with class_find_device()
    bus_find_device: Unify the match callback with class_find_device
    ...

    Linus Torvalds
     

08 Jul, 2019

4 commits


03 Jul, 2019

1 commit

  • When calling debugfs functions, there is no need to ever check the
    return value. The function can work or not, but the code logic should
    never do something different based on this.

    This cleanup allows the return value of the functions to be made void,
    as no logic should care if these files succeed or not.

    Cc: "Yan, Zheng"
    Cc: Sage Weil
    Cc: Ilya Dryomov
    Cc: "David S. Miller"
    Cc: ceph-devel@vger.kernel.org
    Signed-off-by: Greg Kroah-Hartman
    Link: https://lore.kernel.org/r/20190612145538.GA18772@kroah.com
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

06 Jun, 2019

1 commit

  • We have three workqueues for inode work. A later patch will introduce
    one more work item for inodes. It's not good to introduce more workqueues
    and add more 'struct work_struct' members to 'struct ceph_inode_info'.

    Signed-off-by: "Yan, Zheng"
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Yan, Zheng