23 Mar, 2020

2 commits

  • Make it so that a CEPH_MSG_DATA_PAGES data item can own pages,
    fixing a bunch of memory leaks for a page vector allocated in
    alloc_msg_with_page_vector(). Currently, only watch-notify
    messages trigger this allocation, and normally the page vector
    is freed either in handle_watch_notify() or by the caller of
    ceph_osdc_notify(). But if the message is freed before that
    (e.g. if the session faults while reading in the message or
    if the notify is stale), we leak the page vector.

    This was supposed to be fixed by switching to a message-owned
    pagelist, but that never happened.

    Fixes: 1907920324f1 ("libceph: support for sending notifies")
    Reported-by: Roman Penyaev
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Roman Penyaev

    Ilya Dryomov
     
  • CEPH_OSDMAP_FULL/NEARFULL aren't set since mimic, so we need to consult
    per-pool flags as well. Unfortunately the backwards compatibility here
    is lacking:

    - the change that deprecated OSDMAP_FULL/NEARFULL went into mimic, but
    was guarded by require_osd_release >= RELEASE_LUMINOUS
    - it was subsequently backported to luminous in v12.2.2, but that makes
    no difference to clients that only check OSDMAP_FULL/NEARFULL because
    require_osd_release is not client-facing -- it is for OSDs

    Since all kernels are affected, the best we can do here is just start
    checking both map flags and pool flags and send that to stable.

    These checks are best effort, so take osdc->lock and look up pool flags
    just once. Remove the FIXME, since filesystem quotas are checked above
    and RADOS quotas are reflected in POOL_FLAG_FULL: when the pool reaches
    its quota, both POOL_FLAG_FULL and POOL_FLAG_FULL_QUOTA are set.

    Cc: stable@vger.kernel.org
    Reported-by: Yanhu Cao
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Jeff Layton
    Acked-by: Sage Weil

    Ilya Dryomov
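
The combined check described above can be sketched as follows. This is a minimal userspace model, not the kernel code: the flag names and bit values are stand-ins chosen for illustration, and `check_full()` is a hypothetical helper. It assumes the caller has already taken one snapshot of the map flags and the pool flags (the commit does this under osdc->lock).

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in bit values for illustration only; not the kernel's definitions. */
#define MAP_FLAG_NEARFULL  (1u << 0)
#define MAP_FLAG_FULL      (1u << 1)
#define POOL_FLAG_FULL     (1ull << 1)
#define POOL_FLAG_NEARFULL (1ull << 11)

struct full_state {
	bool full;
	bool nearfull;
};

/*
 * Hypothetical helper: report "full" or "nearfull" if either the
 * map-wide flag or the per-pool flag says so. Both arguments are
 * assumed to be a single snapshot taken by the caller.
 */
static struct full_state check_full(uint32_t map_flags, uint64_t pool_flags)
{
	struct full_state st;

	st.full = (map_flags & MAP_FLAG_FULL) ||
		  (pool_flags & POOL_FLAG_FULL);
	st.nearfull = (map_flags & MAP_FLAG_NEARFULL) ||
		      (pool_flags & POOL_FLAG_NEARFULL);
	return st;
}
```

With a cluster that no longer sets the map-wide flags, a pool flag alone is enough to report the condition, which is the behaviour the commit is after.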
     

09 Feb, 2020

1 commit

  • Pull vfs file system parameter updates from Al Viro:
    "Saner fs_parser.c guts and data structures. The system-wide registry
    of syntax types (string/enum/int32/oct32/.../etc.) is gone and so is
    the horror switch() in fs_parse() that would have to grow another case
    every time something got added to that system-wide registry.

    New syntax types can be added by filesystems easily now, and their
    namespace is that of functions - not of system-wide enum members. IOW,
    they can be shared or kept private and if some turn out to be widely
    useful, we can make them common library helpers, etc., without having
    to do anything whatsoever to fs_parse() itself.

    And we already get that kind of requests - the thing that finally
    pushed me into doing that was "oh, and let's add one for timeouts -
    things like 15s or 2h". If some filesystem really wants that, let them
    do it. Without somebody having to play gatekeeper for the variants
    blessed by direct support in fs_parse(), TYVM.

    Quite a bit of boilerplate is gone. And IMO the data structures make a
    lot more sense now. -200LoC, while we are at it"

    * 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (25 commits)
    tmpfs: switch to use of invalfc()
    cgroup1: switch to use of errorfc() et.al.
    procfs: switch to use of invalfc()
    hugetlbfs: switch to use of invalfc()
    cramfs: switch to use of errofc() et.al.
    gfs2: switch to use of errorfc() et.al.
    fuse: switch to use errorfc() et.al.
    ceph: use errorfc() and friends instead of spelling the prefix out
    prefix-handling analogues of errorf() and friends
    turn fs_param_is_... into functions
    fs_parse: handle optional arguments sanely
    fs_parse: fold fs_parameter_desc/fs_parameter_spec
    fs_parser: remove fs_parameter_description name field
    add prefix to fs_context->log
    ceph_parse_param(), ceph_parse_mon_ips(): switch to passing fc_log
    new primitive: __fs_parse()
    switch rbd and libceph to p_log-based primitives
    struct p_log, variants of warnf() et.al. taking that one instead
    teach logfc() to handle prefices, give it saner calling conventions
    get rid of cg_invalf()
    ...

    Linus Torvalds
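
To illustrate the idea of syntax types living in the namespace of functions, here is a rough userspace sketch. It is not the kernel's fs_parse() interface: `param_spec`, `parse_param()` and `parse_timeout()` are hypothetical names, and the "15s"/"2h" format is taken from the example in the message above. The point is only that a new type is one more function referenced from a table, with no central switch() to extend.

```c
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* A syntax type is just a function that validates and converts a value. */
typedef int (*param_type_fn)(const char *value, unsigned long *out);

/* Hypothetical parameter table entry: option name plus its type function. */
struct param_spec {
	const char *name;
	param_type_fn type;
};

/*
 * Hypothetical filesystem-private type: "<digits><unit>" with unit
 * s, m or h. Stores the timeout in seconds, or returns -EINVAL.
 */
static int parse_timeout(const char *value, unsigned long *out)
{
	char *end;
	unsigned long n, mult;

	if (!value || *value < '0' || *value > '9')
		return -EINVAL;
	errno = 0;
	n = strtoul(value, &end, 10);
	if (errno)
		return -EINVAL;
	switch (*end) {
	case 's': mult = 1; break;
	case 'm': mult = 60; break;
	case 'h': mult = 3600; break;
	default: return -EINVAL;
	}
	if (end[1] != '\0' || n > ULONG_MAX / mult)
		return -EINVAL;
	*out = n * mult;
	return 0;
}

/* Generic lookup: find the option by name and defer to its type function. */
static int parse_param(const struct param_spec *specs, size_t nr,
		       const char *name, const char *value,
		       unsigned long *out)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (strcmp(specs[i].name, name) == 0)
			return specs[i].type(value, out);
	return -ENOENT;
}
```

A filesystem that wants timeouts would list `{ "timeout", parse_timeout }` in its own table; the generic lookup code stays unchanged.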
     

08 Feb, 2020

5 commits


07 Feb, 2020

1 commit

  • Don't do a single array; attach them to fsparam_enum() entry
    instead. And don't bother trying to embed the names into those -
    it actually loses memory, with no real speedup worth mentioning.

    Simplifies validation as well.

    Signed-off-by: Al Viro

    Al Viro
     

27 Jan, 2020

2 commits

  • All of these functions are only called from CephFS, so move them into
    ceph.ko, and drop the exports.

    Signed-off-by: Jeff Layton
    Reviewed-by: Ilya Dryomov
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • Instead of using the copy-from operation, switch copy_file_range to the
    new copy-from2 operation, which allows sending the truncate_seq and
    truncate_size parameters.

    If an OSD does not support the copy-from2 operation it will return
    -EOPNOTSUPP. In that case, the kernel client will stop trying to do
    remote object copies for this fs client and will always use the generic
    VFS copy_file_range.

    Signed-off-by: Luis Henriques
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Luis Henriques
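
The fallback behaviour described above can be modelled roughly as below. This is a userspace sketch under stated assumptions, not the CephFS code: `fs_client`, `copy_range()` and the two stub functions are hypothetical, and the stub simply pretends to be an OSD that lacks copy-from2.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-client state: once set, remote copies are never retried. */
struct fs_client {
	bool remote_copy_disabled;
};

typedef long (*copy_fn)(long len);

static int remote_calls;

/* Stub standing in for an OSD that does not support copy-from2. */
static long osd_without_copy_from2(long len)
{
	(void)len;
	remote_calls++;
	return -EOPNOTSUPP;
}

/* Stub standing in for the generic VFS copy path. */
static long generic_copy(long len)
{
	return len;
}

/*
 * Try the remote object copy unless it has been disabled for this
 * client. On -EOPNOTSUPP, disable it for good and use the generic path.
 */
static long copy_range(struct fs_client *c, copy_fn remote,
		       copy_fn generic, long len)
{
	long ret;

	if (!c->remote_copy_disabled) {
		ret = remote(len);
		if (ret != -EOPNOTSUPP)
			return ret;
		c->remote_copy_disabled = true;
	}
	return generic(len);
}
```

After the first -EOPNOTSUPP the remote path is skipped entirely, so an old OSD costs one failed attempt per client rather than one per copy.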
     

28 Nov, 2019

1 commit

  • Convert the ceph filesystem to the new internal mount API as the old
    one will be obsoleted and removed. This allows greater flexibility in
    communication of mount parameters between userspace, the VFS and the
    filesystem.

    See Documentation/filesystems/mount_api.txt for more information.

    [ Numerous string handling, leak and regression fixes; rbd conversion
    was particularly broken and had to be redone almost from scratch. ]

    Signed-off-by: David Howells
    Signed-off-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    David Howells
     

25 Nov, 2019

1 commit


16 Sep, 2019

6 commits

  • osdmap has a bunch of arrays that grow linearly with the number of
    OSDs. osd_state, osd_weight and osd_primary_affinity take 4 bytes per
    OSD. osd_addr takes 136 bytes per OSD because of sockaddr_storage.
    The CRUSH workspace area also grows linearly with the number of OSDs.

    Normally these arrays are allocated at client startup. The osdmap is
    usually updated in small incrementals, but once in a while a full map
    may need to be processed. For a cluster with 10000 OSDs, this means
    a bunch of 40K allocations followed by a 1.3M allocation, all of which
    are currently required to be physically contiguous. This results in
    sporadic ENOMEM errors, hanging the client.

    Go back to manually (re)allocating arrays and use ceph_kvmalloc() to
    fall back to non-contiguous allocation when necessary.

    Link: https://tracker.ceph.com/issues/40481
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Jeff Layton

    Ilya Dryomov
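
The sizes quoted above follow directly from the per-OSD costs. A small worked example, using hypothetical helper names and the 4-byte and 136-byte figures from the message:

```c
#include <stddef.h>

/* Per-OSD sizes as stated in the commit message. */
#define SMALL_ENTRY_BYTES 4    /* osd_state, osd_weight, osd_primary_affinity */
#define ADDR_ENTRY_BYTES  136  /* osd_addr, dominated by sockaddr_storage */

/* Bytes needed for one of the 4-bytes-per-OSD arrays. */
static size_t small_array_bytes(size_t num_osds)
{
	return num_osds * SMALL_ENTRY_BYTES;
}

/* Bytes needed for the osd_addr array. */
static size_t addr_array_bytes(size_t num_osds)
{
	return num_osds * ADDR_ENTRY_BYTES;
}
```

For 10000 OSDs this gives 40000 bytes for each small array (the "40K" allocations) and 1360000 bytes for osd_addr (about 1.3 MiB), each of which previously had to be physically contiguous.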
     
  • The vmalloc allocator doesn't fully respect the specified gfp mask:
    while the actual pages are allocated as requested, the page table pages
    are always allocated with GFP_KERNEL. ceph_kvmalloc() may be called
    with GFP_NOFS and GFP_NOIO (for ceph and rbd respectively), so this may
    result in a deadlock.

    There is no real reason for the current PAGE_ALLOC_COSTLY_ORDER logic,
    it's just something that seemed sensible at the time (ceph_kvmalloc()
    predates kvmalloc()). kvmalloc() is smarter: in an attempt to reduce
    long term fragmentation, it first tries to kmalloc non-disruptively.

    Switch to kvmalloc() and set the respective PF_MEMALLOC_* flag using
    the scope API to avoid the deadlock. Note that kvmalloc() needs to be
    passed GFP_KERNEL to enable the fallback.

    Signed-off-by: Ilya Dryomov
    Reviewed-by: Jeff Layton

    Ilya Dryomov
     
  • This bit was omitted from a561372405cf ("libceph: fix PG split vs OSD
    (re)connect race") to avoid backport conflicts.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • osd_req_op_cls_init() and osd_req_op_xattr_init() currently propagate
    ceph_pagelist_alloc() ENOMEM errors but ignore ceph_pagelist_append()
    memory allocation failures. Add these checks and cleanup on error.

    Signed-off-by: David Disseldorp
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    David Disseldorp
     
  • Signed-off-by: "Yan, Zheng"
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Yan, Zheng
     
  • This function also re-opens connections to OSD/MON, and re-sends in-flight
    OSD requests after re-opening connections to OSD.

    Signed-off-by: "Yan, Zheng"
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Yan, Zheng
     

28 Aug, 2019

1 commit

  • In set_secret(), key->tfm is assigned to NULL on line 55, and then
    ceph_crypto_key_destroy(key) is executed.

    ceph_crypto_key_destroy(key)
    crypto_free_sync_skcipher(key->tfm)
    crypto_free_skcipher(&tfm->base);

    This happens to work because crypto_sync_skcipher is a trivial wrapper
    around crypto_skcipher: &tfm->base is still 0 and crypto_free_skcipher()
    handles that. Let's not rely on the layout of crypto_sync_skcipher.

    This bug was found by STCheck, a static analysis tool written by us.

    Fixes: 69d6302b65a8 ("libceph: Remove VLA usage of skcipher").
    Signed-off-by: Jia-Ju Bai
    Reviewed-by: Ilya Dryomov
    Signed-off-by: Ilya Dryomov

    Jia-Ju Bai
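
The layout dependence described above, and the shape of an explicit guard, can be sketched in userspace as follows. The types and functions here are hypothetical stand-ins, not the kernel crypto API: `sync_cipher` wraps `cipher` as its first member, so the address of `base` inside a NULL wrapper happens to compute to 0, and `key_destroy()` avoids depending on that by checking the pointer first.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the wrapped and wrapping cipher types. */
struct cipher {
	int dummy;
};

struct sync_cipher {
	struct cipher base;	/* first member, so its offset is 0 */
};

struct key {
	struct sync_cipher *tfm;
};

static int frees;

/* Stand-in for the free routine; counts calls so the guard can be observed. */
static void free_sync_cipher(struct sync_cipher *tfm)
{
	frees++;
	free(tfm);
}

/* Only call the free routine when there is something to free. */
static void key_destroy(struct key *key)
{
	if (key->tfm) {
		free_sync_cipher(key->tfm);
		key->tfm = NULL;
	}
}
```

With the guard in place, destroying a key whose tfm is already NULL never reaches the free routine, regardless of how the wrapper struct is laid out.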
     

22 Aug, 2019

1 commit

  • We can't rely on ->peer_features in calc_target() because it may be
    called both when the OSD session is established and open and when it's
    not. ->peer_features is not valid unless the OSD session is open. If
    this happens on a PG split (pg_num increase), that could mean we don't
    resend a request that should have been resent, hanging the client
    indefinitely.

    In userspace this was fixed by looking at require_osd_release and
    get_xinfo[osd].features fields of the osdmap. However these fields
    belong to the OSD section of the osdmap, which the kernel doesn't
    decode (only the client section is decoded).

    Instead, let's drop this feature check. It effectively checks for
    luminous, so only pre-luminous OSDs would be affected in that on a PG
    split the kernel might resend a request that should not have been
    resent. Duplicates can occur in other scenarios, so both sides should
    already be prepared for them: see dup/replay logic on the OSD side and
    retry_attempt check on the client side.

    Cc: stable@vger.kernel.org
    Fixes: 7de030d6b10a ("libceph: resend on PG splits if OSD has RESEND_ON_SPLIT")
    Link: https://tracker.ceph.com/issues/41162
    Reported-by: Jerry Lee
    Signed-off-by: Ilya Dryomov
    Tested-by: Jerry Lee
    Reviewed-by: Jeff Layton

    Ilya Dryomov
     

19 Jul, 2019

1 commit

  • Pull ceph updates from Ilya Dryomov:
    "Lots of exciting things this time!

    - support for rbd object-map and fast-diff features (myself). This
    will speed up reads, discards and things like snap diffs on sparse
    images.

    - ceph.snap.btime vxattr to expose snapshot creation time (David
    Disseldorp). This will be used to integrate with "Restore Previous
    Versions" feature added in Windows 7 for folks who reexport ceph
    through SMB.

    - security xattrs for ceph (Zheng Yan). Only selinux is supported for
    now due to the limitations of ->dentry_init_security().

    - support for MSG_ADDR2, FS_BTIME and FS_CHANGE_ATTR features (Jeff
    Layton). This is actually a single feature bit which was missing
    because of the filesystem pieces. With this in, the kernel client
    will finally be reported as "luminous" by "ceph features" -- it is
    still being reported as "jewel" even though all required Luminous
    features were implemented in 4.13.

    - stop NULL-terminating ceph vxattrs (Jeff Layton). The convention
    with xattrs is to not terminate and this was causing
    inconsistencies with ceph-fuse.

    - change filesystem time granularity from 1 us to 1 ns, again fixing
    an inconsistency with ceph-fuse (Luis Henriques).

    On top of this there are some additional dentry name handling and cap
    flushing fixes from Zheng. Finally, Jeff is formally taking over for
    Zheng as the filesystem maintainer"

    * tag 'ceph-for-5.3-rc1' of git://github.com/ceph/ceph-client: (71 commits)
    ceph: fix end offset in truncate_inode_pages_range call
    ceph: use generic_delete_inode() for ->drop_inode
    ceph: use ceph_evict_inode to cleanup inode's resource
    ceph: initialize superblock s_time_gran to 1
    MAINTAINERS: take over for Zheng as CephFS kernel client maintainer
    rbd: setallochint only if object doesn't exist
    rbd: support for object-map and fast-diff
    rbd: call rbd_dev_mapping_set() from rbd_dev_image_probe()
    libceph: export osd_req_op_data() macro
    libceph: change ceph_osdc_call() to take page vector for response
    libceph: bump CEPH_MSG_MAX_DATA_LEN (again)
    rbd: new exclusive lock wait/wake code
    rbd: quiescing lock should wait for image requests
    rbd: lock should be quiesced on reacquire
    rbd: introduce copyup state machine
    rbd: rename rbd_obj_setup_*() to rbd_obj_init_*()
    rbd: move OSD request allocation into object request state machines
    rbd: factor out __rbd_osd_setup_discard_ops()
    rbd: factor out rbd_osd_setup_copyup()
    rbd: introduce obj_req->osd_reqs list
    ...

    Linus Torvalds
     

13 Jul, 2019

1 commit

  • Pull driver core and debugfs updates from Greg KH:
    "Here are the "big" driver core and debugfs changes for 5.3-rc1

    It's a lot of different patches, all across the tree due to some api
    changes and lots of debugfs cleanups.

    Other than the debugfs cleanups, in this set of changes we have:

    - bus iteration function cleanups

    - scripts/get_abi.pl tool to display and parse Documentation/ABI
    entries in a simple way

    - cleanups to Documentation/ABI/ entries to make them parse easier
    due to typos and other minor things

    - default_attrs use for some ktype users

    - driver model documentation file conversions to .rst

    - compressed firmware file loading

    - deferred probe fixes

    All of these have been in linux-next for a while, with a bunch of
    merge issues that Stephen has been patient with me for"

    * tag 'driver-core-5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (102 commits)
    debugfs: make error message a bit more verbose
    orangefs: fix build warning from debugfs cleanup patch
    ubifs: fix build warning after debugfs cleanup patch
    driver: core: Allow subsystems to continue deferring probe
    drivers: base: cacheinfo: Ensure cpu hotplug work is done before Intel RDT
    arch_topology: Remove error messages on out-of-memory conditions
    lib: notifier-error-inject: no need to check return value of debugfs_create functions
    swiotlb: no need to check return value of debugfs_create functions
    ceph: no need to check return value of debugfs_create functions
    sunrpc: no need to check return value of debugfs_create functions
    ubifs: no need to check return value of debugfs_create functions
    orangefs: no need to check return value of debugfs_create functions
    nfsd: no need to check return value of debugfs_create functions
    lib: 842: no need to check return value of debugfs_create functions
    debugfs: provide pr_fmt() macro
    debugfs: log errors when something goes wrong
    drivers: s390/cio: Fix compilation warning about const qualifiers
    drivers: Add generic helper to match by of_node
    driver_find_device: Unify the match function with class_find_device()
    bus_find_device: Unify the match callback with class_find_device
    ...

    Linus Torvalds
     

11 Jul, 2019

1 commit

  • …el/git/dhowells/linux-fs"

    This reverts merge 0f75ef6a9cff49ff612f7ce0578bced9d0b38325 (and thus
    effectively commits

    7a1ade847596 ("keys: Provide KEYCTL_GRANT_PERMISSION")
    2e12256b9a76 ("keys: Replace uid/gid/perm permissions checking with an ACL")

    that the merge brought in).

    It turns out that it breaks booting with an encrypted volume, and Eric
    Biggers reports that it also breaks the fscrypt tests [1] and loading of
    in-kernel X.509 certificates [2].

    The root cause of all the breakage is likely the same, but David Howells
    is off email so rather than try to work it out it's getting reverted in
    order to not impact the rest of the merge window.

    [1] https://lore.kernel.org/lkml/20190710011559.GA7973@sol.localdomain/
    [2] https://lore.kernel.org/lkml/20190710013225.GB7973@sol.localdomain/

    Link: https://lore.kernel.org/lkml/CAHk-=wjxoeMJfeBahnWH=9zShKp2bsVy527vo3_y8HfOdhwAAw@mail.gmail.com/
    Reported-by: Eric Biggers <ebiggers@kernel.org>
    Cc: David Howells <dhowells@redhat.com>
    Cc: James Morris <jmorris@namei.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Linus Torvalds
     

09 Jul, 2019

2 commits

  • Pull keyring ACL support from David Howells:
    "This changes the permissions model used by keys and keyrings to be
    based on an internal ACL by the following means:

    - Replace the permissions mask internally with an ACL that contains a
    list of ACEs, each with a specific subject with a permissions mask.
    Potted default ACLs are available for new keys and keyrings.

    ACE subjects can be macroised to indicate the UID and GID specified
    on the key (which remain). Future commits will be able to add
    additional subject types, such as specific UIDs or domain
    tags/namespaces.

    Also split a number of permissions to give finer control. Examples
    include splitting the revocation permit from the change-attributes
    permit, thereby allowing someone to be granted permission to revoke
    a key without allowing them to change the owner; also the ability
    to join a keyring is split from the ability to link to it, thereby
    stopping a process accessing a keyring by joining it and thus
    acquiring use of possessor permits.

    - Provide a keyctl to allow the granting or denial of one or more
    permits to a specific subject. Direct access to the ACL is not
    granted, and the ACL cannot be viewed"

    * tag 'keys-acl-20190703' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
    keys: Provide KEYCTL_GRANT_PERMISSION
    keys: Replace uid/gid/perm permissions checking with an ACL

    Linus Torvalds
     
  • …/git/dhowells/linux-fs

    Pull keyring namespacing from David Howells:
    "These patches help make keys and keyrings more namespace aware.

    Firstly some miscellaneous patches to make the process easier:

    - Simplify key index_key handling so that the word-sized chunks
    assoc_array requires don't have to be shifted about, making it
    easier to add more bits into the key.

    - Cache the hash value in the key so that we don't have to calculate
    on every key we examine during a search (it involves a bunch of
    multiplications).

    - Allow keyring_search() to search non-recursively.

    Then the main patches:

    - Make it so that keyring names are per-user_namespace from the point
    of view of KEYCTL_JOIN_SESSION_KEYRING so that they're not
    accessible cross-user_namespace.

    keyctl_capabilities() shows KEYCTL_CAPS1_NS_KEYRING_NAME for this.

    - Move the user and user-session keyrings to the user_namespace
    rather than the user_struct. This prevents them propagating
    directly across user_namespaces boundaries (ie. the KEY_SPEC_*
    flags will only pick from the current user_namespace).

    - Make it possible to include the target namespace in which the key
    shall operate in the index_key. This will allow the possibility of
    multiple keys with the same description, but different target
    domains to be held in the same keyring.

    keyctl_capabilities() shows KEYCTL_CAPS1_NS_KEY_TAG for this.

    - Make it so that keys are implicitly invalidated by removal of a
    domain tag, causing them to be garbage collected.

    - Institute a network namespace domain tag that allows keys to be
    differentiated by the network namespace in which they operate. New
    keys that are of a type marked 'KEY_TYPE_NET_DOMAIN' are assigned
    the network domain in force when they are created.

    - Make it so that the desired network namespace can be handed down
    into the request_key() mechanism. This allows AFS, NFS, etc. to
    request keys specific to the network namespace of the superblock.

    This also means that the keys in the DNS record cache are
    thenceforth namespaced, provided network filesystems pass the
    appropriate network namespace down into dns_query().

    For DNS, AFS and NFS are good, whilst CIFS and Ceph are not. Other
    cache keyrings, such as idmapper keyrings, also need to set the
    domain tag - for which they need access to the network namespace of
    the superblock"

    * tag 'keys-namespace-20190627' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
    keys: Pass the network namespace into request_key mechanism
    keys: Network namespace domain tag
    keys: Garbage collect keys for which the domain has been removed
    keys: Include target namespace in match criteria
    keys: Move the user and user-session keyrings to the user_namespace
    keys: Namespace keyring names
    keys: Add a 'recurse' flag for keyring searches
    keys: Cache the hash value to avoid lots of recalculation
    keys: Simplify key description management

    Linus Torvalds
     

08 Jul, 2019

14 commits