16 May, 2018

1 commit

  • commit 3a15b38fd2efc1d648cb33186bf71e9138c93491 upstream.

    rsize/wsize cap should be applied before ceph_osdc_new_request() is
    called. Otherwise, if the size is limited by the cap instead of the
    stripe unit, ceph_osdc_new_request() would set up an extent op that is
    bigger than what dio_get_pages_alloc() would pin and add to the page
    vector, triggering asserts in the messenger.

    Cc: stable@vger.kernel.org
    Fixes: 95cca2b44e54 ("ceph: limit osd write size")
    Signed-off-by: Ilya Dryomov
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Greg Kroah-Hartman

    Ilya Dryomov
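
The ordering problem described above can be sketched in plain userspace C. This is not the kernel code: the helper names (`to_stripe_end`, `extent_len`, `pages_needed`) and the page size are hypothetical and only illustrate why the cap has to be applied before the extent is sized, so that the extent and the pinned pages agree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SZ 4096u  /* assumed page size for this sketch */

/* Bytes left until the end of the current stripe unit (hypothetical helper). */
static uint64_t to_stripe_end(uint64_t off, uint64_t stripe_unit)
{
    assert(stripe_unit > 0);
    return stripe_unit - (off % stripe_unit);
}

/* Apply the rsize/wsize-style cap first, then limit by the stripe unit. */
static uint64_t extent_len(uint64_t off, uint64_t len, uint64_t cap,
                           uint64_t stripe_unit)
{
    uint64_t su_left = to_stripe_end(off, stripe_unit);

    if (len > cap)
        len = cap;      /* cap applied before the extent is sized */
    return len < su_left ? len : su_left;
}

/* Number of pages needed to cover the byte range [off, off + len). */
static size_t pages_needed(uint64_t off, uint64_t len)
{
    uint64_t first = off / PAGE_SZ;
    uint64_t last = (off + len + PAGE_SZ - 1) / PAGE_SZ;

    return (size_t)(last - first);
}
```

Because the page count is derived from the already capped length, the extent can never describe more bytes than the pages that were pinned for it.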
     

08 Apr, 2018

1 commit

  • commit 85784f9395987a422fa04263e7c0fb13da11eb5c upstream.

    If a page is already locked, attempting to dirty it leads to a deadlock
    in lock_page(). This is what currently happens to ITER_BVEC pages when
    a dio-enabled loop device is backed by ceph:

    $ losetup --direct-io /dev/loop0 /mnt/cephfs/img
    $ xfs_io -c 'pread 0 4k' /dev/loop0

    Follow other file systems and only dirty ITER_IOVEC pages.

    Cc: stable@kernel.org
    Signed-off-by: "Yan, Zheng"
    Reviewed-by: Ilya Dryomov
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Greg Kroah-Hartman

    Yan, Zheng
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

07 Sep, 2017

8 commits


07 Jul, 2017

1 commit

  • The old 'approaching max_size' code expects the MDS to set max_size
    to '2 * reported_size'. This is no longer true. The new code reports
    the file size when half of the previous max_size increment has been used.

    Signed-off-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Yan, Zheng
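
One way to read the rule in this commit message, as a small userspace C sketch. The function name and the exact comparison are assumptions made for illustration (the message only says "half of previous max_size increment"); here the increment is taken to be max_size minus the last reported size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: decide whether the client should report its file
 * size so that max_size can be raised. "endoff" is the end offset of the
 * write being considered.
 */
static bool should_report_size(uint64_t endoff, uint64_t max_size,
                               uint64_t reported_size)
{
    /* The write would reach or pass the current limit. */
    if (endoff >= max_size)
        return true;

    /* Half of the increment granted since the last report has been used. */
    if (max_size > reported_size &&
        endoff >= reported_size + (max_size - reported_size) / 2)
        return true;

    return false;
}

/* Usage example with reported_size = 100 and max_size = 200. */
static void should_report_size_example(void)
{
    assert(!should_report_size(120, 200, 100)); /* under half of the increment */
    assert(should_report_size(150, 200, 100));  /* exactly half used */
    assert(should_report_size(200, 200, 100));  /* at the limit */
}
```

Unlike a check that hard-codes "max_size is twice the reported size", this form does not depend on how large an increment the MDS chooses to grant.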
     

25 May, 2017

1 commit


10 May, 2017

1 commit

  • Pull ceph updates from Ilya Dryomov:
    "The two main items are support for disabling automatic rbd exclusive
    lock transfers from myself and the long awaited -ENOSPC handling
    series from Jeff.

    The former will allow rbd users to take advantage of exclusive lock's
    built-in blacklist/break-lock functionality while staying in control
    of who owns the lock. With the latter in place, we will abort
    filesystem writes on -ENOSPC instead of having them block
    indefinitely.

    Beyond that we've got the usual pile of filesystem fixes from Zheng,
    some refcount_t conversion patches from Elena and a patch for an
    ancient open() flags handling bug from Alexander"

    * tag 'ceph-for-4.12-rc1' of git://github.com/ceph/ceph-client: (31 commits)
    ceph: fix memory leak in __ceph_setxattr()
    ceph: fix file open flags on ppc64
    ceph: choose readdir frag based on previous readdir reply
    rbd: exclusive map option
    rbd: return ResponseMessage result from rbd_handle_request_lock()
    rbd: kill rbd_is_lock_supported()
    rbd: support updating the lock cookie without releasing the lock
    rbd: store lock cookie
    rbd: ignore unlock errors
    rbd: fix error handling around rbd_init_disk()
    rbd: move rbd_unregister_watch() call into rbd_dev_image_release()
    rbd: move rbd_dev_destroy() call out of rbd_dev_image_release()
    ceph: when seeing write errors on an inode, switch to sync writes
    Revert "ceph: SetPageError() for writeback pages if writepages fails"
    ceph: handle epoch barriers in cap messages
    libceph: add an epoch_barrier field to struct ceph_osd_client
    libceph: abort already submitted but abortable requests when map or pool goes full
    libceph: allow requests to return immediately on full conditions if caller wishes
    libceph: remove req->r_replay_version
    ceph: make seeky readdir more efficient
    ...

    Linus Torvalds
     

09 May, 2017

1 commit

  • There are many code paths opencoding kvmalloc. Let's use the helper
    instead. The main difference to kvmalloc is that those users are
    usually not considering all the aspects of the memory allocator. E.g.
    allocation requests
    Reviewed-by: Boris Ostrovsky # Xen bits
    Acked-by: Kees Cook
    Acked-by: Vlastimil Babka
    Acked-by: Andreas Dilger # Lustre
    Acked-by: Christian Borntraeger # KVM/s390
    Acked-by: Dan Williams # nvdim
    Acked-by: David Sterba # btrfs
    Acked-by: Ilya Dryomov # Ceph
    Acked-by: Tariq Toukan # mlx4
    Acked-by: Leon Romanovsky # mlx5
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Herbert Xu
    Cc: Anton Vorontsov
    Cc: Colin Cross
    Cc: Tony Luck
    Cc: "Rafael J. Wysocki"
    Cc: Ben Skeggs
    Cc: Kent Overstreet
    Cc: Santosh Raspatur
    Cc: Hariprasad S
    Cc: Yishai Hadas
    Cc: Oleg Drokin
    Cc: "Yan, Zheng"
    Cc: Alexander Viro
    Cc: Alexei Starovoitov
    Cc: Eric Dumazet
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

04 May, 2017

4 commits

  • The file open flags (O_foo) are platform specific and should never go
    out to an interface that is not local to the system.

    Unfortunately these flags have leaked out onto the wire in the cephfs
    implementation. That led to bogus flags getting transmitted on ppc64.

    This patch converts the kernel view of flags to the ceph view of file
    open flags.

    Fixes: 124e68e74 ("ceph: file operations")
    Signed-off-by: Alexander Graf
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Alexander Graf
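
A minimal sketch of the kind of translation this commit describes, in userspace C. The `WIRE_O_*` names and values are hypothetical placeholders, not the constants used by the ceph protocol; the point is only that each local flag is tested by name and mapped to a fixed value, so the platform's own bit layout never reaches the wire.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>

/* Hypothetical fixed wire values; they do not depend on the local platform. */
#define WIRE_O_RDONLY 0x000u
#define WIRE_O_WRONLY 0x001u
#define WIRE_O_RDWR   0x002u
#define WIRE_O_CREAT  0x040u
#define WIRE_O_EXCL   0x080u
#define WIRE_O_TRUNC  0x200u

/* Convert local open(2) flags to the wire representation. */
static uint32_t flags_to_wire(int flags)
{
    uint32_t wire = 0;

    switch (flags & O_ACCMODE) {
    case O_WRONLY:
        wire |= WIRE_O_WRONLY;
        break;
    case O_RDWR:
        wire |= WIRE_O_RDWR;
        break;
    default:            /* O_RDONLY */
        wire |= WIRE_O_RDONLY;
        break;
    }

    if (flags & O_CREAT)
        wire |= WIRE_O_CREAT;
    if (flags & O_EXCL)
        wire |= WIRE_O_EXCL;
    if (flags & O_TRUNC)
        wire |= WIRE_O_TRUNC;

    /* Flags without a wire equivalent are dropped rather than passed through. */
    return wire;
}

/* Usage example: the result is the same whatever the local flag values are. */
static void flags_to_wire_example(void)
{
    assert(flags_to_wire(O_WRONLY | O_CREAT) == (WIRE_O_WRONLY | WIRE_O_CREAT));
}
```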
     
  • Currently, we don't have a real feedback mechanism in place for when we
    start seeing buffered writeback errors. If writeback is failing, there
    is nothing that prevents an application from continuing to dirty pages
    that aren't being cleaned.

    In the event that we're seeing write errors of any sort occur on an
    inode, have the callback set a flag to force further writes to be
    synchronous. When the next write succeeds, clear the flag to allow
    buffered writeback to continue.

    Since this is just a hint to the write submission mechanism, we only
    take the i_ceph_lock when a lockless check shows that the flag needs to
    be changed.

    Signed-off-by: Jeff Layton
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Jeff Layton
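
The "lockless check, then lock only if the flag must change" idea can be sketched in userspace C as follows. The struct and function names are hypothetical, a pthread mutex stands in for i_ceph_lock, and a C11 atomic is used for the unlocked read; the kernel code uses its own primitives.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_inode {
    pthread_mutex_t lock;    /* stands in for i_ceph_lock */
    atomic_bool force_sync;  /* hint: submit further writes synchronously */
};

static void demo_inode_init(struct demo_inode *in)
{
    assert(in != NULL);
    pthread_mutex_init(&in->lock, NULL);
    atomic_init(&in->force_sync, false);
}

/* Called from the (hypothetical) write completion path with the result. */
static void note_write_result(struct demo_inode *in, int err)
{
    bool want = (err != 0);

    /* Lockless check: skip the lock when the flag already has this value. */
    if (atomic_load_explicit(&in->force_sync, memory_order_relaxed) == want)
        return;

    pthread_mutex_lock(&in->lock);
    atomic_store_explicit(&in->force_sync, want, memory_order_relaxed);
    pthread_mutex_unlock(&in->lock);
}

/* Consulted by the write submission path; it is only a hint. */
static bool must_write_sync(struct demo_inode *in)
{
    return atomic_load_explicit(&in->force_sync, memory_order_relaxed);
}
```

In the common case, where writes keep succeeding and the flag is already clear, the completion path never touches the lock.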
     
  • Usually, when the osd map is flagged as full or the pool is at quota,
    write requests just hang. This is not what we want for cephfs, where
    it would be better to simply report -ENOSPC back to userland instead
    of stalling.

    If the caller knows that it will want an immediate error return instead
    of blocking on a full or at-quota error condition then allow it to set a
    flag to request that behavior.

    Set that flag in ceph_osdc_new_request (since ceph.ko is the only caller),
    and on any other write request from ceph.ko.

    A later patch will deal with requests that were submitted before the new
    map showing the full condition came in.

    Signed-off-by: Jeff Layton
    Reviewed-by: Ilya Dryomov
    Signed-off-by: Ilya Dryomov

    Jeff Layton
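
The opt-in behaviour described in this commit can be pictured with a small decision function. Everything here is a hypothetical stand-in (flag name, enum, function); it only shows that a write request blocks on a full condition by default and fails fast with -ENOSPC when the caller set the flag.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_FLAG_FULL_TRY 0x1u  /* hypothetical: "fail instead of blocking" */

enum demo_action {
    DEMO_SUBMIT,        /* send the request now */
    DEMO_WAIT,          /* park the request until the full condition clears */
    DEMO_FAIL_ENOSPC    /* complete the request immediately with -ENOSPC */
};

/* Decide what to do with a request when it is submitted. */
static enum demo_action on_submit(bool is_write, bool full_or_at_quota,
                                  unsigned int flags)
{
    if (!is_write || !full_or_at_quota)
        return DEMO_SUBMIT;

    if (flags & DEMO_FLAG_FULL_TRY)
        return DEMO_FAIL_ENOSPC;

    return DEMO_WAIT;
}

/* Usage example: the same full condition, with and without the flag. */
static void on_submit_example(void)
{
    assert(on_submit(true, true, 0) == DEMO_WAIT);
    assert(on_submit(true, true, DEMO_FLAG_FULL_TRY) == DEMO_FAIL_ENOSPC);
}
```

This covers only requests submitted after the full condition is known; as the message notes, requests already in flight are handled by a later patch.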
     
  • Signed-off-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Yan, Zheng
     

25 Feb, 2017

2 commits


20 Feb, 2017

2 commits

  • struct ceph_mds_request has an r_locked_dir pointer, which is set to
    indicate the parent inode and that its i_rwsem is locked. In some
    critical places, we need to be able to indicate the parent inode to the
    request handling code, even when its i_rwsem may not be locked.

    Most of the code that operates on r_locked_dir doesn't require that the
    i_rwsem be locked. We only really need it to handle manipulation of the
    dcache. The rest (filling of the inode, updating dentry leases, etc.)
    already has its own locking.

    Add a new r_req_flags bit that indicates whether the parent is locked
    when doing the request, and rename the pointer to "r_parent". For now,
    all the places that set r_parent also set this flag, but that will
    change in a later patch.

    Signed-off-by: Jeff Layton
    Reviewed-by: Yan, Zheng
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     
  • __ceph_caps_mds_wanted() ignores caps from stale sessions, so its
    return value can stay the same across ceph_renew_caps(). This causes
    try_get_cap_refs() to keep calling ceph_renew_caps(). The fix is to
    skip the session validity check in the try_get_cap_refs() case. If
    the session is stale, just let the caps requester sleep.

    Signed-off-by: Yan, Zheng

    Yan, Zheng
     

17 Dec, 2016

1 commit

  • Pull ceph updates from Ilya Dryomov:
    "A varied set of changes:

    - a large rework of cephx auth code to cope with CONFIG_VMAP_STACK
    (myself). Also fixed a deadlock caused by a bogus allocation on the
    writeback path and authorize reply verification.

    - a fix for long stalls during fsync (Jeff Layton). The client now
    has a way to force the MDS log flush, leading to ~100x speedups in
    some synthetic tests.

    - a new [no]require_active_mds mount option (Zheng Yan).

    On mount, we will now check whether any of the MDSes are available
    and bail rather than block if none are. This check can be avoided
    by specifying the "no" option.

    - a couple of MDS cap handling fixes and a few assorted patches
    throughout"

    * tag 'ceph-for-4.10-rc1' of git://github.com/ceph/ceph-client: (32 commits)
    libceph: remove now unused finish_request() wrapper
    libceph: always signal completion when done
    ceph: avoid creating orphan object when checking pool permission
    ceph: properly set issue_seq for cap release
    ceph: add flags parameter to send_cap_msg
    ceph: update cap message struct version to 10
    ceph: define new argument structure for send_cap_msg
    ceph: move xattr initialzation before the encoding past the ceph_mds_caps
    ceph: fix minor typo in unsafe_request_wait
    ceph: record truncate size/seq for snap data writeback
    ceph: check availability of mds cluster on mount
    ceph: fix splice read for no Fc capability case
    ceph: try getting buffer capability for readahead/fadvise
    ceph: fix scheduler warning due to nested blocking
    ceph: fix printing wrong return variable in ceph_direct_read_write()
    crush: include mapper.h in mapper.c
    rbd: silence bogus -Wmaybe-uninitialized warning
    libceph: no need to drop con->mutex for ->get_authorizer()
    libceph: drop len argument of *verify_authorizer_reply()
    libceph: verify authorize reply on connect
    ...

    Linus Torvalds
     

15 Dec, 2016

2 commits

  • Al Viro
     
  • r_safe_completion is currently, and has always been, signaled only if
    on-disk ack was requested. It's there for fsync and syncfs, which wait
    for in-flight writes to flush - all data write requests set ONDISK.

    However, the pool perm check code introduced in 4.2 sends a write
    request with only ACK set. An unfortunately timed syncfs can then hang
    forever: r_safe_completion won't be signaled because only an unsafe
    reply was requested.

    We could patch ceph_osdc_sync() to skip !ONDISK write requests, but
    that is somewhat incomplete and yet another special case. Instead,
    rename this completion to r_done_completion and always signal it when
    the OSD client is done with the request, whether unsafe, safe, or
    error. This is a bit cleaner and helps with the cancellation code.

    Reported-by: Yan, Zheng
    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     

13 Dec, 2016

3 commits

  • When the iov_iter type is ITER_PIPE, copy_page_to_iter() takes a
    reference on the page and adds the page to a pipe_buffer. It also
    sets the pipe_buffer's ops to page_cache_pipe_buf_ops. The confirm
    callback in page_cache_pipe_buf_ops expects the page to come from
    the page cache and be uptodate; otherwise it returns an error.

    In the ceph_sync_read() case, pages are not from the page cache, so
    we can't call copy_page_to_iter() when the iov_iter type is ITER_PIPE.
    The fix is to use iov_iter_get_pages_alloc() to allocate pages for
    the pipe (the code is similar to default_file_splice_read()).
    Signed-off-by: Yan, Zheng

    Yan, Zheng
     
  • In the readahead/fadvise cases, the caller of ceph_readpages does
    not hold the buffer capability, so pages can be added to the page
    cache while there is no buffer capability. This can cause data
    integrity issues.

    Signed-off-by: Yan, Zheng

    Yan, Zheng
     
  • Fix printing wrong return variable for invalidate_inode_pages2_range in
    ceph_direct_read_write().

    Signed-off-by: Zhi Zhang
    Signed-off-by: Ilya Dryomov

    Zhi Zhang
     

11 Nov, 2016

1 commit

  • The splice read/write implementation changed recently. When using
    generic_file_splice_read(), an iov_iter with type == ITER_PIPE is
    passed to the filesystem's read_iter callback. But ceph_sync_read()
    can't serve an ITER_PIPE iov_iter correctly (an ITER_PIPE iov_iter
    expects pages from the page cache).

    Fixing ceph_sync_read() requires a big patch, so use the default
    splice read callback for now.

    Signed-off-by: Yan, Zheng
    Signed-off-by: Ilya Dryomov

    Yan, Zheng
     

29 Oct, 2016

1 commit


16 Oct, 2016

1 commit

  • If __ceph_do_getattr returns an error and the retry_op in
    ceph_read_iter is not READ_INLINE, it's possible to invoke
    __free_page on a page which is NULL, which leads to a crash.
    This can happen when, for example, a process waiting on an MDS reply
    receives SIGTERM.

    Fix this by explicitly checking whether the page is set or not.

    Cc: stable@vger.kernel.org # 3.19+
    Signed-off-by: Nikolay Borisov
    Reviewed-by: Yan, Zheng
    Signed-off-by: Ilya Dryomov

    Nikolay Borisov
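
The shape of this fix, sketched in userspace C. The names are hypothetical; `demo_free_page()` stands in for a free routine that must not be handed NULL (unlike userspace `free()`, which accepts it), and the counter exists only so the example can be checked.

```c
#include <assert.h>
#include <stdlib.h>

static int demo_pages_freed;  /* bookkeeping for the example only */

/* Stand-in for a free routine that must not be called with NULL. */
static void demo_free_page(void *page)
{
    assert(page != NULL);
    free(page);
    demo_pages_freed++;
}

/* Cleanup path: on early errors the page may never have been allocated. */
static void read_cleanup(void *page)
{
    if (page)
        demo_free_page(page);
}
```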
     

11 Oct, 2016

1 commit

  • Pull more vfs updates from Al Viro:
    ">rename2() work from Miklos + current_time() from Deepa"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: Replace current_fs_time() with current_time()
    fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
    fs: Replace CURRENT_TIME with current_time() for inode timestamps
    fs: proc: Delete inode time initializations in proc_alloc_inode()
    vfs: Add current_time() api
    vfs: add note about i_op->rename changes to porting
    fs: rename "rename2" i_op to "rename"
    vfs: remove unused i_op->rename
    fs: make remaining filesystems use .rename2
    libfs: support RENAME_NOREPLACE in simple_rename()
    fs: support RENAME_NOREPLACE for local filesystems
    ncpfs: fix unused variable warning

    Linus Torvalds
     

03 Oct, 2016

1 commit

  • This call can fail if there are dirty pages. The preceding call to
    filemap_write_and_wait_range() will normally remove dirty pages, but
    as inode_lock() is not held over calls to ceph_direct_read_write(), it
    could race with non-direct writes and pages could be dirtied
    immediately after filemap_write_and_wait_range() returns.

    If there are dirty pages, they will be removed by the subsequent call
    to truncate_inode_pages_range(), so having them here is not a problem.

    If the 'ret' value is left holding an error, then in the async IO case
    (aio_req is not NULL) the loop that would normally call
    ceph_osdc_start_request() will see the error in 'ret' and abort all
    requests. This doesn't seem like correct behaviour.

    So use separate 'ret2' instead of overloading 'ret'.

    Signed-off-by: NeilBrown
    Reviewed-by: Jeff Layton
    Reviewed-by: Yan, Zheng

    NeilBrown
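
Why a separate 'ret2' matters can be shown with a small userspace C sketch. The functions are hypothetical stand-ins for the invalidation call and the request submission loop; only the handling of the two return variables mirrors the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in for the cache invalidation call; fails if pages are dirty. */
static int demo_invalidate(int have_dirty_pages)
{
    return have_dirty_pages ? -EBUSY : 0;
}

/*
 * Hypothetical submission loop. Returns the number of requests started.
 * The invalidation result lives in ret2 (and is handed back through
 * *invalidate_err for logging) so that it cannot leak into ret.
 */
static int submit_requests(int have_dirty_pages, int setup_err,
                           int nr_requests, int *invalidate_err)
{
    int ret = setup_err;    /* result of the I/O setup proper */
    int started = 0;
    int ret2;

    assert(invalidate_err != NULL);

    ret2 = demo_invalidate(have_dirty_pages);
    *invalidate_err = ret2; /* report it, but do not abort on it */

    for (int i = 0; i < nr_requests; i++) {
        if (ret < 0)        /* only real setup errors abort the batch */
            break;
        started++;
    }
    return started;
}
```

Had the invalidation result been stored in 'ret', the first call in a case with dirty pages would have started no requests at all.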
     

28 Sep, 2016

1 commit

  • current_fs_time() uses struct super_block* as an argument.
    As per Linus's suggestion, this is changed to take struct
    inode* as a parameter instead. This is because the function
    is primarily meant for vfs inode timestamps.
    Also the function was renamed as per Arnd's suggestion.

    Change all calls to current_fs_time() to use the new
    current_time() function instead. current_fs_time() will be
    deleted.

    Signed-off-by: Deepa Dinamani
    Signed-off-by: Al Viro

    Deepa Dinamani
     

28 Jul, 2016

5 commits