16 May, 2018
1 commit
-
commit 3a15b38fd2efc1d648cb33186bf71e9138c93491 upstream.
rsize/wsize cap should be applied before ceph_osdc_new_request() is
called. Otherwise, if the size is limited by the cap instead of the
stripe unit, ceph_osdc_new_request() would set up an extent op that is
bigger than what dio_get_pages_alloc() would pin and add to the page
vector, triggering asserts in the messenger.
Cc: stable@vger.kernel.org
Fixes: 95cca2b44e54 ("ceph: limit osd write size")
Signed-off-by: Ilya Dryomov
Reviewed-by: "Yan, Zheng"
Signed-off-by: Greg Kroah-Hartman
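The ordering described in this commit can be illustrated with a minimal user-space sketch. It is not the kernel code: the helper names (bytes_to_stripe_end, clamp_io_len) and the parameters max_io and stripe_unit are hypothetical. The point is only that the cap is folded into the length before anything is built from it, so the request and the pinned pages are sized from the same value.

```c
#include <stdint.h>

/* Hypothetical helper: bytes left in the stripe unit that contains `pos`.
 * `stripe_unit` must be non-zero. */
static uint64_t bytes_to_stripe_end(uint64_t pos, uint64_t stripe_unit)
{
    return stripe_unit - (pos % stripe_unit);
}

/*
 * Hypothetical helper: length of one I/O chunk starting at `pos`.
 * The rsize/wsize cap (`max_io`) is applied up front, so a caller that
 * builds a request from the result and pins pages for the result can
 * never end up with a request larger than its page vector.
 */
static uint64_t clamp_io_len(uint64_t pos, uint64_t want,
                             uint64_t max_io, uint64_t stripe_unit)
{
    uint64_t len = want;
    uint64_t to_stripe_end = bytes_to_stripe_end(pos, stripe_unit);

    if (len > max_io)
        len = max_io;           /* limited by the cap */
    if (len > to_stripe_end)
        len = to_stripe_end;    /* limited by the stripe unit */
    return len;
}
```

A caller would compute the length once with clamp_io_len() and pass that same value to both the request setup and the page pinning step.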
08 Apr, 2018
1 commit
-
commit 85784f9395987a422fa04263e7c0fb13da11eb5c upstream.
If a page is already locked, attempting to dirty it leads to a deadlock
in lock_page(). This is what currently happens to ITER_BVEC pages when
a dio-enabled loop device is backed by ceph:

  $ losetup --direct-io /dev/loop0 /mnt/cephfs/img
  $ xfs_io -c 'pread 0 4k' /dev/loop0

Follow other file systems and only dirty ITER_IOVEC pages.
Cc: stable@kernel.org
Signed-off-by: "Yan, Zheng"
Reviewed-by: Ilya Dryomov
Signed-off-by: Ilya Dryomov
Signed-off-by: Greg Kroah-Hartman
02 Nov, 2017
1 commit
-
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if
Reviewed-by: Philippe Ombredanne
Reviewed-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman
07 Sep, 2017
8 commits
-
The script “checkpatch.pl” pointed information out like the following.
Comparison to NULL could be written ...
Thus fix the affected source code places.
Signed-off-by: Markus Elfring
Reviewed-by: Yan, Zheng
Signed-off-by: Ilya Dryomov
-
When a user requests SEEK_HOLE or SEEK_DATA with a negative offset
ceph_llseek should return -ENXIO. Currently -EINVAL is being returned for
SEEK_DATA and 0 for SEEK_HOLE.
Signed-off-by: Luis Henriques
Reviewed-by: Ilya Dryomov
Signed-off-by: Ilya Dryomov
-
An inode can be moved between snap realms. It's possible for an inode to be
moved into a snap realm whose seq number is smaller than the old snap realm's.
So there is no guarantee that the seq number of an inode's snap context always
increases.
Signed-off-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov
-
Signed-off-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov
-
Need to drop the cap reference before retrying. Besides, it's better to
redo file write checks for each retry because we re-lock the inode.
Signed-off-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov
-
startsync is a no-op, has been for years. Remove it.
Link: http://tracker.ceph.com/issues/20604
Signed-off-by: Yanhu Cao
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov
-
The OSD has a configurable limit on the max write size and returns an
error if a write request is larger than that limit. For now,
set max write size to CEPH_MSG_MAX_DATA_LEN. It should be small
enough.
Signed-off-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov
-
libceph returns -EIO when read size > CEPH_MSG_MAX_DATA_LEN.
Link: http://tracker.ceph.com/issues/20528
Signed-off-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov
07 Jul, 2017
1 commit
-
The old 'approaching max_size' code expects the MDS to set max_size to
'2 * reported_size'. This is no longer true. The new code reports
file size when half of the previous max_size increment has been used.
Signed-off-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov
25 May, 2017
1 commit
-
Currently the ceph client doesn't respect the rlimit in fallocate. This
means that a user can allocate a file with size > RLIMIT_FSIZE. This
patch adds the call to inode_newsize_ok() to verify filesystem limits and
ulimits. This should make ceph successfully run xfstest generic/228.
Signed-off-by: Luis Henriques
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov
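As a rough illustration of the kind of check this commit adds to fallocate, here is a simplified user-space sketch. It does not reproduce inode_newsize_ok(); the function newsize_ok() and its parameters are hypothetical, and UINT64_MAX stands in for "no limit".

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical, simplified size check applied before growing a file.
 * `fsize_limit` plays the role of RLIMIT_FSIZE and `fs_max_bytes` the
 * role of the filesystem's maximum file size; UINT64_MAX means unlimited.
 * Returns 0 if the new size is acceptable, -EFBIG otherwise.
 */
static int newsize_ok(uint64_t cur_size, uint64_t new_size,
                      uint64_t fsize_limit, uint64_t fs_max_bytes)
{
    if (new_size <= cur_size)
        return 0;               /* not growing: limits do not apply */
    if (new_size > fsize_limit)
        return -EFBIG;          /* would exceed the per-process limit */
    if (new_size > fs_max_bytes)
        return -EFBIG;          /* would exceed the filesystem limit */
    return 0;
}
```

A fallocate-style path would run such a check on offset + length before allocating anything and return the error to the caller unchanged.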
10 May, 2017
1 commit
-
Pull ceph updates from Ilya Dryomov:
"The two main items are support for disabling automatic rbd exclusive
lock transfers from myself and the long awaited -ENOSPC handling
series from Jeff.

The former will allow rbd users to take advantage of exclusive lock's
built-in blacklist/break-lock functionality while staying in control
of who owns the lock. With the latter in place, we will abort
filesystem writes on -ENOSPC instead of having them block
indefinitely.

Beyond that we've got the usual pile of filesystem fixes from Zheng,
some refcount_t conversion patches from Elena and a patch for an
ancient open() flags handling bug from Alexander"

* tag 'ceph-for-4.12-rc1' of git://github.com/ceph/ceph-client: (31 commits)
ceph: fix memory leak in __ceph_setxattr()
ceph: fix file open flags on ppc64
ceph: choose readdir frag based on previous readdir reply
rbd: exclusive map option
rbd: return ResponseMessage result from rbd_handle_request_lock()
rbd: kill rbd_is_lock_supported()
rbd: support updating the lock cookie without releasing the lock
rbd: store lock cookie
rbd: ignore unlock errors
rbd: fix error handling around rbd_init_disk()
rbd: move rbd_unregister_watch() call into rbd_dev_image_release()
rbd: move rbd_dev_destroy() call out of rbd_dev_image_release()
ceph: when seeing write errors on an inode, switch to sync writes
Revert "ceph: SetPageError() for writeback pages if writepages fails"
ceph: handle epoch barriers in cap messages
libceph: add an epoch_barrier field to struct ceph_osd_client
libceph: abort already submitted but abortable requests when map or pool goes full
libceph: allow requests to return immediately on full conditions if caller wishes
libceph: remove req->r_replay_version
ceph: make seeky readdir more efficient
...
09 May, 2017
1 commit
-
There are many code paths opencoding kvmalloc. Let's use the helper
instead. The main difference to kvmalloc is that those users are
usually not considering all the aspects of the memory allocator. E.g.
allocation requests
Reviewed-by: Boris Ostrovsky # Xen bits
Acked-by: Kees Cook
Acked-by: Vlastimil Babka
Acked-by: Andreas Dilger # Lustre
Acked-by: Christian Borntraeger # KVM/s390
Acked-by: Dan Williams # nvdim
Acked-by: David Sterba # btrfs
Acked-by: Ilya Dryomov # Ceph
Acked-by: Tariq Toukan # mlx4
Acked-by: Leon Romanovsky # mlx5
Cc: Martin Schwidefsky
Cc: Heiko Carstens
Cc: Herbert Xu
Cc: Anton Vorontsov
Cc: Colin Cross
Cc: Tony Luck
Cc: "Rafael J. Wysocki"
Cc: Ben Skeggs
Cc: Kent Overstreet
Cc: Santosh Raspatur
Cc: Hariprasad S
Cc: Yishai Hadas
Cc: Oleg Drokin
Cc: "Yan, Zheng"
Cc: Alexander Viro
Cc: Alexei Starovoitov
Cc: Eric Dumazet
Cc: David Miller
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
04 May, 2017
4 commits
-
The file open flags (O_foo) are platform specific and should never go
out to an interface that is not local to the system.

Unfortunately these flags have leaked out onto the wire in the cephfs
implementation. That led to bogus flags getting transmitted on ppc64.

This patch converts the kernel view of flags to the ceph view of file
open flags.
Fixes: 124e68e74 ("ceph: file operations")
Signed-off-by: Alexander Graf
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov
-
Currently, we don't have a real feedback mechanism in place for when we
start seeing buffered writeback errors. If writeback is failing, there
is nothing that prevents an application from continuing to dirty pages
that aren't being cleaned.

In the event that we're seeing write errors of any sort occur on an
inode, have the callback set a flag to force further writes to be
synchronous. When the next write succeeds, clear the flag to allow
buffered writeback to continue.

Since this is just a hint to the write submission mechanism, we only
take the i_ceph_lock when a lockless check shows that the flag needs to
be changed.
Signed-off-by: Jeff Layton
Reviewed-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov
-
Usually, when the osd map is flagged as full or the pool is at quota,
write requests just hang. This is not what we want for cephfs, where
it would be better to simply report -ENOSPC back to userland instead
of stalling.

If the caller knows that it will want an immediate error return instead
of blocking on a full or at-quota error condition then allow it to set a
flag to request that behavior.

Set that flag in ceph_osdc_new_request (since ceph.ko is the only caller),
and on any other write request from ceph.ko.

A later patch will deal with requests that were submitted before the new
map showing the full condition came in.
Signed-off-by: Jeff Layton
Reviewed-by: Ilya Dryomov
Signed-off-by: Ilya Dryomov
-
Signed-off-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov
25 Feb, 2017
2 commits
-
CEPH_OSD_FLAG_ONDISK is set in account_request().
Signed-off-by: Ilya Dryomov
Reviewed-by: Jeff Layton
Reviewed-by: Sage Weil
-
- ask for a commit reply instead of an ack reply in
__ceph_pool_perm_get()
- don't ask for both ack and commit replies in ceph_sync_write()
- since only one reply is requested now, i_unsafe_writes list
will always be empty -- kill ceph_sync_write_wait() and go back to
a standard ->evict_inode()
Signed-off-by: Ilya Dryomov
Reviewed-by: Jeff Layton
Reviewed-by: Sage Weil
20 Feb, 2017
2 commits
-
struct ceph_mds_request has an r_locked_dir pointer, which is set to
indicate the parent inode and that its i_rwsem is locked. In some
critical places, we need to be able to indicate the parent inode to the
request handling code, even when its i_rwsem may not be locked.

Most of the code that operates on r_locked_dir doesn't require that the
i_rwsem be locked. We only really need it to handle manipulation of the
dcache. The rest (filling of the inode, updating dentry leases, etc.)
already has its own locking.

Add a new r_req_flags bit that indicates whether the parent is locked
when doing the request, and rename the pointer to "r_parent". For now,
all the places that set r_parent also set this flag, but that will
change in a later patch.
Signed-off-by: Jeff Layton
Reviewed-by: Yan, Zheng
Signed-off-by: Ilya Dryomov
-
__ceph_caps_mds_wanted() ignores caps from a stale session. So the
return value of __ceph_caps_mds_wanted() can stay the same across
ceph_renew_caps(). This causes try_get_cap_refs() to keep calling
ceph_renew_caps(). The fix is to ignore the session valid check for
the try_get_cap_refs() case. If the session is stale, just let the
caps requester sleep.
Signed-off-by: Yan, Zheng
17 Dec, 2016
1 commit
-
Pull ceph updates from Ilya Dryomov:
"A varied set of changes:

- a large rework of cephx auth code to cope with CONFIG_VMAP_STACK
  (myself). Also fixed a deadlock caused by a bogus allocation on the
  writeback path and authorize reply verification.

- a fix for long stalls during fsync (Jeff Layton). The client now
  has a way to force the MDS log flush, leading to ~100x speedups in
  some synthetic tests.

- a new [no]require_active_mds mount option (Zheng Yan).
  On mount, we will now check whether any of the MDSes are available
  and bail rather than block if none are. This check can be avoided
  by specifying the "no" option.

- a couple of MDS cap handling fixes and a few assorted patches
  throughout"

* tag 'ceph-for-4.10-rc1' of git://github.com/ceph/ceph-client: (32 commits)
libceph: remove now unused finish_request() wrapper
libceph: always signal completion when done
ceph: avoid creating orphan object when checking pool permission
ceph: properly set issue_seq for cap release
ceph: add flags parameter to send_cap_msg
ceph: update cap message struct version to 10
ceph: define new argument structure for send_cap_msg
ceph: move xattr initialzation before the encoding past the ceph_mds_caps
ceph: fix minor typo in unsafe_request_wait
ceph: record truncate size/seq for snap data writeback
ceph: check availability of mds cluster on mount
ceph: fix splice read for no Fc capability case
ceph: try getting buffer capability for readahead/fadvise
ceph: fix scheduler warning due to nested blocking
ceph: fix printing wrong return variable in ceph_direct_read_write()
crush: include mapper.h in mapper.c
rbd: silence bogus -Wmaybe-uninitialized warning
libceph: no need to drop con->mutex for ->get_authorizer()
libceph: drop len argument of *verify_authorizer_reply()
libceph: verify authorize reply on connect
...
15 Dec, 2016
2 commits
-
r_safe_completion is currently, and has always been, signaled only if
on-disk ack was requested. It's there for fsync and syncfs, which wait
for in-flight writes to flush - all data write requests set ONDISK.

However, the pool perm check code introduced in 4.2 sends a write
request with only ACK set. An unfortunately timed syncfs can then hang
forever: r_safe_completion won't be signaled because only an unsafe
reply was requested.

We could patch ceph_osdc_sync() to skip !ONDISK write requests, but
that is somewhat incomplete and yet another special case. Instead,
rename this completion to r_done_completion and always signal it when
the OSD client is done with the request, whether unsafe, safe, or
error. This is a bit cleaner and helps with the cancellation code.
Reported-by: Yan, Zheng
Signed-off-by: Ilya Dryomov
13 Dec, 2016
3 commits
-
When iov_iter type is ITER_PIPE, copy_page_to_iter() increases
the page's reference and adds the page to a pipe_buffer. It also
sets the pipe_buffer's ops to page_cache_pipe_buf_ops. The confirm
callback in page_cache_pipe_buf_ops expects the page to be from the page
cache and uptodate, otherwise it returns an error.

For the ceph_sync_read() case, pages are not from the page cache. So we
can't call copy_page_to_iter() when iov_iter type is ITER_PIPE.
The fix is using iov_iter_get_pages_alloc() to allocate pages
for the pipe. (the code is similar to default_file_splice_read)
Signed-off-by: Yan, Zheng
-
For readahead/fadvise cases, the caller of ceph_readpages does not
hold the buffer capability. Pages can be added to the page cache while
there is no buffer capability. This can cause a data integrity
issue.
Signed-off-by: Yan, Zheng
-
Fix printing wrong return variable for invalidate_inode_pages2_range in
ceph_direct_read_write().
Signed-off-by: Zhi Zhang
Signed-off-by: Ilya Dryomov
11 Nov, 2016
1 commit
-
Splice read/write implementation changed recently. When using
generic_file_splice_read(), iov_iter with type == ITER_PIPE is
passed to filesystem's read_iter callback. But ceph_sync_read()
can't serve ITER_PIPE iov_iter correctly (ITER_PIPE iov_iter
expects pages from page cache).

Fixing ceph_sync_read() requires a big patch. So use default
splice read callback for now.
Signed-off-by: Yan, Zheng
Signed-off-by: Ilya Dryomov
29 Oct, 2016
1 commit
-
Signed-off-by: Al Viro
16 Oct, 2016
1 commit
-
In case __ceph_do_getattr returns an error and the retry_op in
ceph_read_iter is not READ_INLINE, then it's possible to invoke
__free_page on a page which is NULL; this naturally leads to a crash.
This can happen when, for example, a process waiting on an MDS reply
receives SIGTERM.

Fix this by explicitly checking whether the page is set or not.
Cc: stable@vger.kernel.org # 3.19+
Signed-off-by: Nikolay Borisov
Reviewed-by: Yan, Zheng
Signed-off-by: Ilya Dryomov
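The shape of this fix, releasing a page only when one was actually allocated, can be shown with a small user-space sketch. The helper release_page_if_set() is hypothetical and uses malloc/free in place of kernel pages. Note that free(NULL) is harmless in user space, unlike the __free_page() call described above; the explicit check is kept here only to mirror the pattern.

```c
#include <stdlib.h>

/*
 * Hypothetical cleanup helper: release the buffer behind `*pagep` only if
 * one was allocated, then clear the pointer so a later error path cannot
 * release it twice. Returns 1 if something was released, 0 otherwise.
 */
static int release_page_if_set(void **pagep)
{
    if (*pagep == NULL)
        return 0;       /* nothing was allocated on this path */
    free(*pagep);
    *pagep = NULL;
    return 1;
}
```

An error path that may or may not have allocated the buffer can then call release_page_if_set() unconditionally.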
11 Oct, 2016
1 commit
-
Pull more vfs updates from Al Viro:
"->rename2() work from Miklos + current_time() from Deepa"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: Replace current_fs_time() with current_time()
fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
fs: Replace CURRENT_TIME with current_time() for inode timestamps
fs: proc: Delete inode time initializations in proc_alloc_inode()
vfs: Add current_time() api
vfs: add note about i_op->rename changes to porting
fs: rename "rename2" i_op to "rename"
vfs: remove unused i_op->rename
fs: make remaining filesystems use .rename2
libfs: support RENAME_NOREPLACE in simple_rename()
fs: support RENAME_NOREPLACE for local filesystems
ncpfs: fix unused variable warning
03 Oct, 2016
1 commit
-
This call can fail if there are dirty pages. The preceding call to
filemap_write_and_wait_range() will normally remove dirty pages, but
as inode_lock() is not held over calls to ceph_direct_read_write(), it
could race with non-direct writes and pages could be dirtied
immediately after filemap_write_and_wait_range() returns.

If there are dirty pages, they will be removed by the subsequent call
to truncate_inode_pages_range(), so having them here is not a problem.

If the 'ret' value is left holding an error, then in the async IO case
(aio_req is not NULL) the loop that would normally call
ceph_osdc_start_request() will see the error in 'ret' and abort all
requests. This doesn't seem like correct behaviour.

So use separate 'ret2' instead of overloading 'ret'.
Signed-off-by: NeilBrown
Reviewed-by: Jeff Layton
Reviewed-by: Yan, Zheng
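The reasoning behind the separate 'ret2' can be sketched in user space as follows. This is an illustration under assumptions, not the kernel function: run_direct_io(), invalidate_rc, and the request loop are hypothetical stand-ins for the invalidation call and the submission loop described above.

```c
#include <errno.h>

/*
 * Hypothetical model of the submission path. `invalidate_rc` is the result
 * of an advisory cache invalidation that is allowed to fail. It is stored
 * in `ret2`, so the submission loop, which aborts when `ret` is negative,
 * is unaffected by it. `*started` receives the number of requests started.
 */
static int run_direct_io(int invalidate_rc, int nr_requests, int *started)
{
    int ret = 0;                /* real I/O status, checked by the loop */
    int ret2 = invalidate_rc;   /* advisory status, kept separate */
    int i;

    *started = 0;
    if (ret2 < 0) {
        /* A real implementation might only log this failure. */
    }

    for (i = 0; i < nr_requests; i++) {
        if (ret < 0)
            break;              /* abort remaining requests on real errors */
        (*started)++;
    }
    return ret;
}
```

Had the invalidation result been stored in 'ret' instead, a failed invalidation would have stopped the loop before any request was started.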
28 Sep, 2016
1 commit
-
current_fs_time() uses struct super_block* as an argument.
As per Linus's suggestion, this is changed to take struct
inode* as a parameter instead. This is because the function
is primarily meant for vfs inode timestamps.
Also the function was renamed as per Arnd's suggestion.

Change all calls to current_fs_time() to use the new
current_time() function instead. current_fs_time() will be
deleted.
Signed-off-by: Deepa Dinamani
Signed-off-by: Al Viro
28 Jul, 2016
5 commits
-
ceph_llseek does not correctly return NXIO errors because the 'out' path
always returns 'offset'.
Fixes: 06222e491e66 ("fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseek")
Signed-off-by: Phil Turnbull
Signed-off-by: Yan, Zheng
-
Otherwise ceph_sync_write_unsafe() may access/modify freed inode.
Signed-off-by: Yan, Zheng
-
ceph_aio_complete() can free the ceph_aio_request struct before
the code exits the while loop.
Signed-off-by: Yan, Zheng
-
Signed-off-by: Yan, Zheng
-
Define new ceph_file_layout structure and rename old ceph_file_layout
to ceph_file_layout_legacy. This is preparation for adding namespace
to ceph_file_layout structure.
Signed-off-by: Yan, Zheng