05 Mar, 2020

1 commit

  • [ Upstream commit 8e4473bb50a1796c9c32b244e5dbc5ee24ead937 ]

    In O_APPEND & O_DIRECT mode, data from different writers may overlap,
    since the writers take only the shared lock.

    For example, both Writer1 and Writer2 are in O_APPEND and O_DIRECT
    mode:

    Writer1                 Writer2

    shared_lock()           shared_lock()
    getattr(CAP_SIZE)       getattr(CAP_SIZE)
    iocb->ki_pos = EOF      iocb->ki_pos = EOF
    write(data1)
                            write(data2)
    shared_unlock()         shared_unlock()

    data2 then overwrites data1, because both writes start at the same
    file offset, the old EOF.

    Switch to exclusive lock instead when O_APPEND is specified.

    Signed-off-by: Xiubo Li
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Sasha Levin

    Xiubo Li
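
    The race above can be replayed deterministically in user space. This is
    a minimal sketch, not the kernel change itself: struct mem_file and every
    function name below are hypothetical, and the two interleavings are
    written out by hand instead of using real threads and locks.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>

#define FILE_CAP 64

/* Hypothetical in-memory "file"; stands in for the remote file data. */
struct mem_file {
    char data[FILE_CAP];
    size_t eof;
};

enum lock_mode { LOCK_SHARED, LOCK_EXCLUSIVE };

/* The rule from the commit text: O_APPEND writers take the exclusive lock. */
static enum lock_mode direct_write_lock_mode(int open_flags)
{
    return (open_flags & O_APPEND) ? LOCK_EXCLUSIVE : LOCK_SHARED;
}

/* Step 1 of an append: sample the current EOF as the write position. */
static size_t sample_eof(const struct mem_file *f)
{
    return f->eof;
}

/* Step 2 of an append: write at the sampled position, extending EOF. */
static void write_at(struct mem_file *f, size_t pos, const char *buf)
{
    size_t len = strlen(buf);

    assert(pos + len <= FILE_CAP);
    memcpy(f->data + pos, buf, len);
    if (pos + len > f->eof)
        f->eof = pos + len;
}

/* Interleaving a shared lock permits: both writers sample EOF first. */
static void append_both_shared(struct mem_file *f,
                               const char *d1, const char *d2)
{
    size_t pos1 = sample_eof(f);
    size_t pos2 = sample_eof(f);    /* same offset as pos1 */

    write_at(f, pos1, d1);
    write_at(f, pos2, d2);          /* lands on top of d1 */
}

/* Interleaving an exclusive lock forces: sample and write stay together. */
static void append_both_exclusive(struct mem_file *f,
                                  const char *d1, const char *d2)
{
    write_at(f, sample_eof(f), d1);
    write_at(f, sample_eof(f), d2);
}
```

    With a 3-byte file and two 4-byte appends, the shared interleaving ends
    with EOF at 7 and only the second writer's data present; the exclusive
    interleaving ends with EOF at 11 and both writes intact.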
     

24 Feb, 2020

1 commit

  • [ Upstream commit 97820058fb2831a4b203981fa2566ceaaa396103 ]

    If all the MDS daemons are down for some reason, then the first mount
    attempt will fail with EIO after the mount request times out. A mount
    attempt will also fail with EIO if all of the MDS's are laggy.

    This patch changes the code to return -EHOSTUNREACH in these situations
    and adds a pr_info error message to help the admin determine the cause.

    URL: https://tracker.ceph.com/issues/4386
    Signed-off-by: Xiubo Li
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Sasha Levin

    Xiubo Li
     

29 Jan, 2020

1 commit

  • commit 9c1c2b35f1d94de8325344c2777d7ee67492db3b upstream.

    Currently, we just assume that it will stick around by virtue of the
    submitter's reference, but later patches will allow the syscall to
    return early and we can't rely on that reference at that point.

    While I'm not aware of any reports of it, Xiubo pointed out that this
    may fix a use-after-free. If the wait for a reply times out or is
    canceled via signal, and then the reply comes in after the syscall
    returns, the client can end up trying to access r_parent without a
    reference.

    Take an extra reference to the inode when setting r_parent and release
    it when releasing the request.

    Cc: stable@vger.kernel.org
    Signed-off-by: Jeff Layton
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Greg Kroah-Hartman

    Jeff Layton
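
    The reference handling described above follows the usual "take a
    reference when storing a pointer, drop it when the holder goes away"
    pattern. A minimal single-threaded sketch, assuming a plain integer
    refcount; struct inode_ref, struct request and the helper names are
    hypothetical stand-ins, not the kernel's types or functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a refcounted inode. */
struct inode_ref {
    int refcount;
};

/* Hypothetical request; 'parent' plays the role of r_parent. */
struct request {
    struct inode_ref *parent;
};

static void inode_get(struct inode_ref *ino)
{
    ino->refcount++;
}

static void inode_put(struct inode_ref *ino)
{
    assert(ino->refcount > 0);
    ino->refcount--;
}

/* Pin the parent for as long as the request may still be processed. */
static void request_set_parent(struct request *req, struct inode_ref *dir)
{
    inode_get(dir);
    req->parent = dir;
}

/* Drop the request's own reference when the request is released. */
static void request_release(struct request *req)
{
    if (req->parent) {
        inode_put(req->parent);
        req->parent = NULL;
    }
}
```

    Because the request holds its own reference, the submitter can drop its
    reference and return early while a late reply still sees a pinned parent.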
     

18 Dec, 2019

1 commit

  • commit 18bd6caaef4021803dd0d031dc37c2d001d18a5b upstream.

    The ceph_ioctl function is used both for files and directories, but only
    the files support doing that in 32-bit compat mode.

    On the s390 architecture, there is also a problem with invalid 31-bit
    pointers that need to be passed through compat_ptr().

    Use the new compat_ptr_ioctl() to address both issues.

    Note: When backporting this patch to stable kernels, "compat_ioctl:
    add compat_ptr_ioctl()" is needed as well.

    Reviewed-by: "Yan, Zheng"
    Cc: stable@vger.kernel.org
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Greg Kroah-Hartman

    Arnd Bergmann
     

15 Nov, 2019

2 commits

  • Ceph can in some cases issue an async DIO request, in which case we can
    end up calling ceph_end_io_direct before the I/O is actually complete.
    That may allow buffered operations to proceed while DIO requests are
    still in flight.

    Fix this by incrementing the i_dio_count when issuing an async DIO
    request, and decrementing it when tearing down the aio_req.

    Fixes: 321fe13c9398 ("ceph: add buffered/direct exclusionary locking for reads and writes")
    Signed-off-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • Most of the time, we (or the vfs layer) take the inode_lock and then
    acquires caps, but ceph_read_iter does the opposite, and that can lead
    to a deadlock.

    When there are multiple clients treading over the same data, we can end
    up in a situation where a reader takes caps and then tries to acquire
    the inode_lock. Another task holds the inode_lock and issues a request
    to the MDS which needs to revoke the caps, but that can't happen until
    the inode_lock is unwedged.

    Fix this by having ceph_read_iter take the inode_lock earlier, before
    attempting to acquire caps.

    Fixes: 321fe13c9398 ("ceph: add buffered/direct exclusionary locking for reads and writes")
    Link: https://tracker.ceph.com/issues/36348
    Signed-off-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     

08 Nov, 2019

1 commit


05 Nov, 2019

2 commits

  • copy_file_range tries to use the OSD 'copy-from' operation, which simply
    performs a full object copy. Unfortunately, the implementation of this
    system call assumes that stripe_count is always set to 1 and doesn't take
    into account that the data may be striped across an object set. If the
    file layout has stripe_count different from 1, then the destination file
    data will be corrupted.

    For example:

    Consider an 8 MiB file with 4 MiB object size, stripe_count of 2 and
    stripe_size of 2 MiB; the first half of the file will be filled with 'A's
    and the second half will be filled with 'B's:

          0      4M     8M       Obj1     Obj2
          +------+------+       +----+   +----+
    file: | AAAA | BBBB |       | AA |   | AA |
          +------+------+       |----|   |----|
                                | BB |   | BB |
                                +----+   +----+

    If we copy_file_range this file into a new file (which needs to have the
    same file layout!), then it will start by copying the object starting at
    file offset 0 (Obj1). And then it will copy the object starting at file
    offset 4M -- which is Obj1 again.

    Unfortunately, the solution for this is to not allow remote object copies
    to be performed when the file layout stripe_count is not 1 and simply
    fall back to the default (VFS) copy_file_range implementation.

    Cc: stable@vger.kernel.org
    Signed-off-by: Luis Henriques
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Luis Henriques
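
    The 8 MiB example in the copy_file_range commit above can be checked with
    the usual Ceph striping arithmetic. This is a sketch under stated
    assumptions: struct layout and both function names are hypothetical, and
    object indices are zero-based, so index 0 is "Obj1" in the diagram.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout description; field names follow the commit text. */
struct layout {
    uint64_t stripe_unit;   /* "stripe_size" in the example: 2 MiB */
    uint64_t stripe_count;
    uint64_t object_size;
};

/* Map a file offset to the zero-based index of the object storing it. */
static uint64_t object_index(const struct layout *l, uint64_t off)
{
    assert(l->stripe_unit && l->stripe_count);
    assert(l->object_size % l->stripe_unit == 0);

    uint64_t su_per_object = l->object_size / l->stripe_unit;
    uint64_t blockno = off / l->stripe_unit;        /* stripe unit number */
    uint64_t stripeno = blockno / l->stripe_count;  /* row in object set */
    uint64_t stripepos = blockno % l->stripe_count; /* column in object set */
    uint64_t objsetno = stripeno / su_per_object;

    return objsetno * l->stripe_count + stripepos;
}

/* A whole-object copy only matches a contiguous file range when data is
 * not interleaved across several objects. */
static int can_use_object_copy(const struct layout *l)
{
    return l->stripe_count == 1;
}
```

    For the layout in the example, offsets 0 and 4M both map to index 0 and
    offsets 2M and 6M both map to index 1, which is why copying "the object
    at offset 4M" copies Obj1 a second time.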
     
  • If ceph_atomic_open is handed a !d_in_lookup dentry, then that means
    that it already passed d_revalidate so we *know* that it's negative (or
    at least was very recently). Just return -ENOENT in that case.

    This also addresses a subtle bug in dentry handling. Non-O_CREAT opens
    call atomic_open with the parent's i_rwsem shared, but calling
    d_splice_alias on a hashed dentry requires the exclusive lock.

    If ceph_atomic_open receives a hashed, negative dentry on a non-O_CREAT
    open, and another client were to race in and create the file before we
    issue our OPEN, ceph_fill_trace could end up calling d_splice_alias on
    the dentry with the new inode with insufficient locks.

    Cc: stable@vger.kernel.org
    Reported-by: Al Viro
    Signed-off-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     

30 Oct, 2019

3 commits

  • We should not play with dcache without parent locked...

    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro
    Signed-off-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Al Viro
     
  • For RCU case ->d_revalidate() is called with rcu_read_lock() and
    without pinning the dentry passed to it. Which means that it
    can't rely upon ->d_inode remaining stable; that's the reason
    for d_inode_rcu(), actually.

    Make sure we don't reload ->d_inode there.

    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro
    Signed-off-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Al Viro
     
  • KASAN reports a use-after-free when running xfstest generic/531, with the
    following trace:

    [ 293.903362] kasan_report+0xe/0x20
    [ 293.903365] rb_erase+0x1f/0x790
    [ 293.903370] __ceph_remove_cap+0x201/0x370
    [ 293.903375] __ceph_remove_caps+0x4b/0x70
    [ 293.903380] ceph_evict_inode+0x4e/0x360
    [ 293.903386] evict+0x169/0x290
    [ 293.903390] __dentry_kill+0x16f/0x250
    [ 293.903394] dput+0x1c6/0x440
    [ 293.903398] __fput+0x184/0x330
    [ 293.903404] task_work_run+0xb9/0xe0
    [ 293.903410] exit_to_usermode_loop+0xd3/0xe0
    [ 293.903413] do_syscall_64+0x1a0/0x1c0
    [ 293.903417] entry_SYSCALL_64_after_hwframe+0x44/0xa9

    This happens because __ceph_remove_cap() may queue a cap release
    (__ceph_queue_cap_release) which can be scheduled before that cap is
    removed from the inode list with

    rb_erase(&cap->ci_node, &ci->i_caps);

    And, when this finally happens, the use-after-free will occur.

    This can be fixed by removing the cap from the inode list before it is
    removed from the session list, thus eliminating the risk of a UAF.

    Cc: stable@vger.kernel.org
    Signed-off-by: Luis Henriques
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Luis Henriques
     

15 Oct, 2019

1 commit

  • In the future, we're going to want to extend the ceph_reply_info_extra
    for create replies. Currently though, the kernel code doesn't accept an
    extra blob that is larger than the expected data.

    Change the code to skip over any unrecognized fields at the end of the
    extra blob, rather than returning -EIO.

    Cc: stable@vger.kernel.org
    Signed-off-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Jeff Layton
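
    The decoding change above amounts to "parse what you know, then jump to
    the end of the blob". A minimal sketch, assuming a blob with a single
    known 64-bit field in native byte order; struct create_extra and
    parse_create_extra are hypothetical and do not reproduce the real layout
    of ceph_reply_info_extra.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical "extra" payload with one known field. */
struct create_extra {
    uint64_t ino;
};

/* Parse the known field, then advance *p to 'end' no matter how many
 * unrecognized bytes follow. Returns 0, or -EIO if the blob is too short. */
static int parse_create_extra(const uint8_t **p, const uint8_t *end,
                              struct create_extra *out)
{
    assert(*p <= end);
    if ((size_t)(end - *p) < sizeof(out->ino))
        return -EIO;            /* a truncated blob is still an error */

    memcpy(&out->ino, *p, sizeof(out->ino));
    *p += sizeof(out->ino);

    /* Strict variant would be: if (*p != end) return -EIO; */
    *p = end;                   /* skip fields this decoder does not know */
    return 0;
}
```

    A 12-byte blob therefore decodes successfully with the last 4 bytes
    ignored, while a 4-byte blob is still rejected.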
     

26 Sep, 2019

1 commit

  • Pull ceph updates from Ilya Dryomov:
    "The highlights are:

    - automatic recovery of a blacklisted filesystem session (Zheng Yan).
    This is disabled by default and can be enabled by mounting with the
    new "recover_session=clean" option.

    - serialize buffered reads and O_DIRECT writes (Jeff Layton). Care is
    taken to avoid serializing O_DIRECT reads and writes with each
    other, this is based on the exclusion scheme from NFS.

    - handle large osdmaps better in the face of fragmented memory
    (myself)

    - don't limit what security.* xattrs can be get or set (Jeff Layton).
    We were overly restrictive here, unnecessarily preventing things
    like file capability sets stored in security.capability from
    working.

    - allow copy_file_range() within the same inode and across different
    filesystems within the same cluster (Luis Henriques)"

    * tag 'ceph-for-5.4-rc1' of git://github.com/ceph/ceph-client: (41 commits)
    ceph: call ceph_mdsc_destroy from destroy_fs_client
    libceph: use ceph_kvmalloc() for osdmap arrays
    libceph: avoid a __vmalloc() deadlock in ceph_kvmalloc()
    ceph: allow object copies across different filesystems in the same cluster
    ceph: include ceph_debug.h in cache.c
    ceph: move static keyword to the front of declarations
    rbd: pull rbd_img_request_create() dout out into the callers
    ceph: reconnect connection if session hang in opening state
    libceph: drop unused con parameter of calc_target()
    ceph: use release_pages() directly
    rbd: fix response length parameter for encoded strings
    ceph: allow arbitrary security.* xattrs
    ceph: only set CEPH_I_SEC_INITED if we got a MAC label
    ceph: turn ceph_security_invalidate_secctx into static inline
    ceph: add buffered/direct exclusionary locking for reads and writes
    libceph: handle OSD op ceph_pagelist_append() errors
    ceph: don't return a value from void function
    ceph: don't freeze during write page faults
    ceph: update the mtime when truncating up
    ceph: fix indentation in __get_snap_name()
    ...

    Linus Torvalds
     

16 Sep, 2019

26 commits

  • They're always called in succession.

    Signed-off-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • OSDs are able to perform object copies across different pools. Thus,
    there's no need to prevent copy_file_range from doing remote copies if the
    source and destination superblocks are different. Only return -EXDEV if
    they have different fsid (the cluster ID).

    Signed-off-by: Luis Henriques
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Luis Henriques
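
    The check described above compares cluster identity, not superblock
    identity. A minimal sketch, assuming the fsid is a 16-byte cluster ID;
    struct fsid, struct sb and check_remote_copy are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical cluster ID: 16 raw bytes. */
struct fsid {
    unsigned char bytes[16];
};

/* Hypothetical superblock: which cluster it belongs to, plus a mount id. */
struct sb {
    struct fsid cluster;
    int mount_id;
};

/* Remote object copies are fine across mounts of the same cluster;
 * only a different cluster yields -EXDEV. */
static int check_remote_copy(const struct sb *src, const struct sb *dst)
{
    assert(src != NULL && dst != NULL);
    if (memcmp(&src->cluster, &dst->cluster, sizeof(struct fsid)) != 0)
        return -EXDEV;
    return 0;
}
```

    Two superblocks with different mount ids but the same fsid pass the
    check; changing a single fsid byte makes it return -EXDEV.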
     
  • Any file that uses dout() should include ceph_debug.h at the top.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Move the static keyword to the front of declarations of
    snap_handle_length, handle_length and connected_handle_length,
    and resolve the following compiler warnings that can be seen
    when building with warnings enabled (W=1):

    fs/ceph/export.c:38:2: warning:
    ‘static’ is not at beginning of declaration [-Wold-style-declaration]

    fs/ceph/export.c:88:2: warning:
    ‘static’ is not at beginning of declaration [-Wold-style-declaration]

    fs/ceph/export.c:90:2: warning:
    ‘static’ is not at beginning of declaration [-Wold-style-declaration]

    Signed-off-by: Krzysztof Wilczynski
    Signed-off-by: Ilya Dryomov

    Krzysztof Wilczynski
     
  • If a client's MDS session is evicted while in the
    CEPH_MDS_SESSION_OPENING state, the MDS won't send a session message to
    the client, and delayed_work skips sessions in the
    CEPH_MDS_SESSION_OPENING state, so the session hangs forever.

    Allow ceph_con_keepalive to reconnect a session in OPENING to avoid
    session hang. Also, ensure that we skip sessions in RESTARTING and
    REJECTED states since those states can't be resurrected by issuing
    a keepalive.

    Link: https://tracker.ceph.com/issues/41551
    Signed-off-by: Erqi Chen chenerqi@gmail.com
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Erqi Chen
     
  • release_pages() has been available to modules since Oct, 2010,
    when commit 0be8557bcd34 ("fuse: use release_pages()") added
    EXPORT_SYMBOL(release_pages). However, this ceph code was still
    using a workaround.

    Remove the workaround, and call release_pages() directly.

    Signed-off-by: John Hubbard
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    John Hubbard
     
  • Most filesystems don't limit what security.* xattrs can be set or
    fetched. I see no reason that we need to limit that on cephfs either.

    Drop the special xattr handler for "security." xattrs, and allow the
    "other" xattr handler to handle security xattrs as well.

    In addition to fixing xfstest generic/093, this allows us to support
    per-file capabilities (à la setcap(8)).

    Link: https://tracker.ceph.com/issues/41135
    Signed-off-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • __ceph_getxattr will set the CEPH_I_SEC_INITED flag whenever it gets
    any xattr that starts with "security.". We only want to set that flag
    when fetching the MAC label for the currently-active LSM, however.

    Signed-off-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • No need to do an extra jump here. Also add some comments on the endifs.

    Signed-off-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • xfstest generic/451 intermittently fails. The test does O_DIRECT writes
    to a file, and then reads back the result using buffered I/O, while
    running a separate set of tasks that are also doing buffered reads.

    The client will invalidate the cache prior to a direct write, but it's
    easy for one of the other readers' replies to race in and reinstantiate
    the invalidated range with stale data.

    To fix this, we must serialize direct I/O writes and buffered reads.
    We could just sprinkle in some shared locks on the i_rwsem for reads,
    and increase the exclusive footprint on the write side, but that would
    cause O_DIRECT writes to end up serialized vs. other direct requests.

    Instead, borrow the scheme used by nfs.ko. Buffered writes take the
    i_rwsem exclusively, but buffered reads take a shared lock, allowing
    them to run in parallel.

    O_DIRECT requests also take a shared lock, but we need for them to not
    run in parallel with buffered reads. A flag on the ceph_inode_info is
    used to indicate whether it's in direct or buffered I/O mode. When a
    conflicting request is submitted, it will block until the inode can be
    flipped to the necessary mode.

    Link: https://tracker.ceph.com/issues/40985
    Signed-off-by: Jeff Layton
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Jeff Layton
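
    The mode flag described above can be modelled without real locks. This
    is a single-threaded sketch under loud assumptions: struct inode_io and
    both functions are hypothetical, an in-flight counter stands in for
    shared holders of the i_rwsem, and the function returns false at the
    point where the real code would block. Buffered writes, which take the
    lock exclusively, are left out.

```c
#include <assert.h>
#include <stdbool.h>

enum io_mode { IO_BUFFERED, IO_DIRECT };

/* Hypothetical model of the per-inode state. */
struct inode_io {
    enum io_mode mode;      /* kind of I/O currently admitted */
    unsigned int inflight;  /* requests running in the current mode */
};

/* Returns true if the request may proceed now, false where the real
 * code would sleep until the inode can be flipped to the other mode. */
static bool try_start_io(struct inode_io *s, enum io_mode want)
{
    if (s->mode != want) {
        if (s->inflight > 0)
            return false;   /* conflicting I/O is still running */
        s->mode = want;     /* inode is idle: flip the mode */
    }
    s->inflight++;
    return true;
}

/* Called when a request that was admitted by try_start_io() finishes. */
static void end_io(struct inode_io *s)
{
    assert(s->inflight > 0);
    s->inflight--;
}
```

    Requests of the same kind are admitted in parallel, while a request of
    the other kind is held off until the in-flight count drops to zero.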
     
  • This fixes a build warning to that effect.

    Fixes: 1a829ff2a6c3 ("ceph: no need to check return value of debugfs_create functions")
    Signed-off-by: John Hubbard
    Signed-off-by: Ilya Dryomov

    John Hubbard
     
  • Prevent freezing operations during write page faults. This is good
    practice for most filesystems, but especially for ceph since we're
    monkeying with the signal table here.

    Signed-off-by: Jeff Layton
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • If we have Fx caps, and we're truncating the size to be larger, then
    we'll cache the size attribute change, but the mtime won't be updated.

    Move the size handling before the mtime, and add ATTR_MTIME to ia_valid
    in that case to make sure the mtime also gets updated.

    This fixes xfstest generic/313.

    Signed-off-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • Reported-by: kbuild test robot
    Reported-by: Julia Lawall
    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • It doesn't do anything to invalidate the cache when dropping RD caps.

    Signed-off-by: Jeff Layton
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • Nothing sets this flag.

    Signed-off-by: Jeff Layton
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • cap->session is always non-NULL, so we can just do a single test for
    equality w/o testing explicitly for a NULL pointer.

    Signed-off-by: Jeff Layton
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • Currently, this function returns ci->i_dirty_caps, but the callers have
    to check that that isn't 0 before calling this function. Have the
    callers grab that value directly out of the inode, and have
    __mark_caps_flushing return the flush_tid instead.

    Signed-off-by: Jeff Layton
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • We actually need the ci->i_ceph_lock here. The necessity of the s_mutex
    is less clear. Also add a lockdep assertion for the i_ceph_lock.

    Signed-off-by: Jeff Layton
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • It's only used to keep count of caps being trimmed, but that requires
    that we hold the session->s_mutex to prevent multiple trimming
    operations from running concurrently.

    We can achieve the same effect using an integer on the stack, which
    allows us to (eventually) not need the s_mutex.

    Signed-off-by: Jeff Layton
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • It's protected by the s_gen_ttl_lock, so we should fetch under it
    and ensure that we're using the same generation in both places.

    Signed-off-by: Jeff Layton
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • Nothing calls these routines.

    Signed-off-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • We already mark the mapping in that case, and doing this can cause
    false positives to occur at fsync time, as well as spurious read
    errors.

    Signed-off-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • Make the client use OSD replies and session messages to infer whether
    it has been blacklisted. If it has, the client reconnects to the cluster
    using a new entity addr. Auto reconnect is limited to once every 30
    minutes.

    Auto reconnect is disabled by default. It can be enabled or disabled with
    the recover_session= mount option. In 'clean' mode, the client drops
    any dirty data/metadata, invalidates page caches and invalidates all
    writable file handles. After reconnect, file locks become stale because
    the MDS loses track of them. If an inode contains any stale file locks,
    reads and writes on the inode are not allowed until applications release
    all stale file locks.

    Signed-off-by: "Yan, Zheng"
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Yan, Zheng
     
  • Signed-off-by: "Yan, Zheng"
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Yan, Zheng
     
  • After the MDS evicts a session, file locks are silently lost. It's not
    safe to let programs continue to do read/write.

    Signed-off-by: "Yan, Zheng"
    Reviewed-by: Jeff Layton
    Signed-off-by: Ilya Dryomov

    Yan, Zheng