09 Oct, 2014
2 commits
-
Now that d_invalidate can no longer fail, stop returning a useless
return code. For the few callers that checked the return code,
remove the handling of d_invalidate failure.
Reviewed-by: Miklos Szeredi
Signed-off-by: "Eric W. Biederman"
Signed-off-by: Al Viro
-
Now that check_submounts_and_drop cannot fail and is called from
d_invalidate, there is no longer a need to call check_submounts_and_drop
from filesystem d_revalidate methods, so remove it.
Reviewed-by: Miklos Szeredi
Signed-off-by: "Eric W. Biederman"
Signed-off-by: Al Viro
27 Sep, 2014
1 commit
-
The third argument of fuse_get_user_pages(), "nbytesp", refers to the number of
bytes a caller asked to pack into a fuse request. This value may be less
than the capacity of the fuse request or iov_iter. So fuse_get_user_pages() must
ensure that *nbytesp won't grow.

Now that the helper iov_iter_get_pages() performs all the hard work of extracting
pages from iov_iter, this can be done by passing a properly calculated
"maxsize" to the helper.

The other caller of iov_iter_get_pages() (dio_refill_pages()) doesn't need
this capability, so pass LONG_MAX as the maxsize argument here.

Fixes: c9c37e2e6378 ("fuse: switch to iov_iter_get_pages()")
Reported-by: Werner Baumann
Tested-by: Maxim Patlasov
Signed-off-by: Miklos Szeredi
Signed-off-by: Al Viro
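A minimal userspace sketch of the "must not grow" rule described in this commit, not the kernel code itself. The names sketch_get_bytes() and sketch_pack_request() are hypothetical stand-ins: the first plays the role of a page-extraction helper that honours a maxsize argument, the second plays the role of a caller that passes the remaining budget on every call.

```c
#include <stddef.h>

/* Hypothetical stand-in for a page-extraction helper: takes up to
 * `maxsize` bytes from a source holding `*avail` bytes, at most `chunk`
 * bytes per call, and returns how many bytes it took. */
static size_t sketch_get_bytes(size_t *avail, size_t maxsize, size_t chunk)
{
    size_t n = *avail;

    if (n > chunk)
        n = chunk;
    if (n > maxsize)
        n = maxsize;
    *avail -= n;
    return n;
}

/* Packs at most *nbytesp bytes. On return *nbytesp holds the amount
 * actually packed, which never exceeds the value passed in because the
 * remaining budget is handed to the helper as maxsize. */
static void sketch_pack_request(size_t *nbytesp, size_t avail, size_t chunk)
{
    size_t nbytes = 0;

    while (nbytes < *nbytesp) {
        size_t got = sketch_get_bytes(&avail, *nbytesp - nbytes, chunk);

        if (got == 0)
            break; /* source exhausted */
        nbytes += got;
    }
    *nbytesp = nbytes;
}
```

With a 1000-byte request against a much larger source, the result stays at 1000 even though each helper call could have delivered 4096 bytes.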
08 Aug, 2014
2 commits
-
... instead of maximal size.
Signed-off-by: Al Viro
-
Christoph Hellwig suggests:
1) make vfs_rename call ->rename2 if it exists instead of ->rename
2) switch all filesystems that you're adding NOREPLACE support for to
use ->rename2
3) see how many ->rename instances we'll have left after a few
iterations of 2.
Signed-off-by: Miklos Szeredi
Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro
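Step 1 of the suggestion above can be sketched in userspace C as a dispatch over two optional callbacks. This is an illustration only: struct sketch_inode_ops, sketch_vfs_rename() and the demo callbacks are hypothetical names, and the error codes chosen for the fallback cases are assumptions of the sketch.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical operations table with an old and a new rename method. */
struct sketch_inode_ops {
    int (*rename)(const char *from, const char *to);
    int (*rename2)(const char *from, const char *to, unsigned int flags);
};

/* Prefer ->rename2 when it exists; fall back to ->rename only when no
 * flags were requested, since the old method cannot honour them. */
static int sketch_vfs_rename(const struct sketch_inode_ops *ops,
                             const char *from, const char *to,
                             unsigned int flags)
{
    if (ops->rename2)
        return ops->rename2(from, to, flags);
    if (flags)
        return -EINVAL; /* assumption: flags need ->rename2 */
    if (ops->rename)
        return ops->rename(from, to);
    return -EPERM;      /* assumption: no rename method at all */
}

/* Demo callbacks that only report which method was called. */
static int demo_rename(const char *from, const char *to)
{
    (void)from; (void)to;
    return 1;
}

static int demo_rename2(const char *from, const char *to, unsigned int flags)
{
    (void)from; (void)to; (void)flags;
    return 2;
}

static const struct sketch_inode_ops demo_old_ops  = { demo_rename, NULL };
static const struct sketch_inode_ops demo_both_ops = { demo_rename, demo_rename2 };
```

A filesystem that has been converted (step 2) would simply stop filling in the old callback, which is what makes the remaining ->rename instances easy to count in step 3.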
22 Jul, 2014
2 commits
-
Here are some additional changes to set a capability flag so that clients can
detect when it's appropriate to return -ENOSYS from open.

This amends the following commit introduced in 3.14:
7678ac50615d fuse: support clients that don't implement 'open'
However we can only add the flag to 3.15 and later since there was no
protocol version update in 3.14.
Signed-off-by: Miklos Szeredi
Cc: # v3.15+
-
Default s_time_gran is 1, don't overwrite that if userspace didn't
explicitly specify one.
Signed-off-by: Miklos Szeredi
Cc: # v3.15+
15 Jul, 2014
1 commit
-
Pull fuse fixes from Miklos Szeredi:
"This contains miscellaneous fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: replace count*size kzalloc by kcalloc
fuse: release temporary page if fuse_writepage_locked() failed
fuse: restructure ->rename2()
fuse: avoid scheduling while atomic
fuse: handle large user and group ID
fuse: inode: drop cast
fuse: ignore entry-timeout on LOOKUP_REVAL
fuse: timeout comparison fix
14 Jul, 2014
2 commits
-
kcalloc manages count*sizeof overflow.
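The overflow concern behind this change can be illustrated in userspace. The sketch below is not kcalloc(); sketch_calloc() is a hypothetical helper that shows the same idea: refuse the allocation when count * size would wrap around, instead of silently allocating a too-small buffer.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Allocates a zeroed array of `n` elements of `size` bytes each.
 * Returns NULL if n * size would overflow size_t. */
static void *sketch_calloc(size_t n, size_t size)
{
    void *p;

    if (size != 0 && n > SIZE_MAX / size)
        return NULL; /* multiplication would wrap */

    p = malloc(n * size);
    if (p)
        memset(p, 0, n * size);
    return p;
}
```

An open-coded zeroing allocation of count * size has no such check, so a large count could wrap to a small allocation.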
Signed-off-by: Fabian Frederick
Signed-off-by: Miklos Szeredi
-
tmp_page must be freed if fuse_write_file_get() returns NULL.
Signed-off-by: Maxim Patlasov
Signed-off-by: Miklos Szeredi
10 Jul, 2014
1 commit
-
Make ->rename2() universal, i.e. able to handle zero flags. This is to
make future changes to the API easier.
Signed-off-by: Miklos Szeredi
07 Jul, 2014
5 commits
-
As reported by Richard Sharpe, an attempt to use fuse_notify_inval_entry()
triggers complaints about scheduling while atomic:

  BUG: scheduling while atomic: fuse.hf/13976/0x10000001

This happens because fuse_notify_inval_entry() attempts to allocate memory
with GFP_KERNEL, holding "struct fuse_copy_state" mapped by kmap_atomic().

Introduced by commit 58bda1da4b3c "fuse/dev: use atomic maps"

Fix by moving the map/unmap to just cover the actual memcpy operation.

Original patch from Maxim Patlasov
Reported-by: Richard Sharpe
Signed-off-by: Miklos Szeredi
Cc: # v3.15+
-
If the number in "user_id=N" or "group_id=N" mount options was larger than
INT_MAX then fuse returned EINVAL.

Fix this to handle all valid uid/gid values.
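A minimal userspace sketch of the parsing issue, assuming IDs are unsigned 32-bit values. sketch_parse_id() is a hypothetical helper, not the fuse option parser; it only shows why a signed int parse rejects values above INT_MAX while an unsigned parse accepts them. Any further validity checks on the resulting ID are outside the sketch.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Parses a decimal ID string into an unsigned 32-bit value.
 * Returns 0 on success and -EINVAL on malformed or out-of-range input. */
static int sketch_parse_id(const char *s, uint32_t *out)
{
    char *end;
    unsigned long long v;

    if (*s < '0' || *s > '9')
        return -EINVAL; /* reject signs, spaces and empty strings */

    errno = 0;
    v = strtoull(s, &end, 10);
    if (errno != 0 || *end != '\0' || v > UINT32_MAX)
        return -EINVAL;

    *out = (uint32_t)v;
    return 0;
}
```

For example, "3000000000" does not fit in a signed 32-bit int but parses fine here, while "4294967296" is still rejected.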
Signed-off-by: Miklos Szeredi
Cc: stable@vger.kernel.org
-
This patch removes the cast on data of type void * as it is not needed.
The following Coccinelle semantic patch was used for making the change:

@r@
expression x;
void* e;
type T;
identifier f;
@@

(
  *((T *)e)
|
  ((T *)x)[...]
|
  ((T *)x)->f
|
- (T *)
  e
)

Signed-off-by: Himangi Saraogi
Acked-by: Julia Lawall
Signed-off-by: Miklos Szeredi
-
The following test case demonstrates the bug:

  sh# mount -t glusterfs localhost:meta-test /mnt/one
  sh# mount -t glusterfs localhost:meta-test /mnt/two
  sh# echo stuff > /mnt/one/file; rm -f /mnt/two/file; echo stuff > /mnt/one/file
  bash: /mnt/one/file: Stale file handle
  sh# echo stuff > /mnt/one/file; rm -f /mnt/two/file; sleep 1; echo stuff > /mnt/one/file

On the second open() on /mnt/one, FUSE would have used the old
nodeid (file handle) trying to re-open it. Gluster is returning
-ESTALE. The ESTALE propagates back to namei.c:filename_lookup()
where lookup is re-attempted with LOOKUP_REVAL. The right
behavior now would be for FUSE to ignore the entry-timeout
and do the up-call revalidation. Instead FUSE is ignoring
LOOKUP_REVAL, succeeding the revalidation (because entry-timeout
has not passed), and open() is again retried on the old file
handle and finally the ESTALE is going back to the application.

Fix: if revalidation is happening with LOOKUP_REVAL, then ignore
entry-timeout and always do the up-call.
Signed-off-by: Anand Avati
Reviewed-by: Niels de Vos
Signed-off-by: Miklos Szeredi
Cc: stable@vger.kernel.org
-
As suggested by checkpatch.pl, use time_before64() instead of direct
comparison of jiffies64 values.
Signed-off-by: Miklos Szeredi
Cc:
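The two entry-timeout fixes above can be sketched together in userspace C. The wrap-safe comparison mirrors the signed-difference idea behind time_before64(); the sketch_ names and the flag value SKETCH_LOOKUP_REVAL are hypothetical, chosen only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_LOOKUP_REVAL 0x1u /* hypothetical flag value */

/* True if tick count `a` is before `b`, even across counter wrap-around:
 * the unsigned difference is reinterpreted as a signed value. */
static bool sketch_time_before64(uint64_t a, uint64_t b)
{
    return (int64_t)(a - b) < 0;
}

/* Decides whether a cached entry needs an up-call to userspace:
 * either its timeout has expired, or the caller forces revalidation. */
static bool sketch_needs_upcall(uint64_t expiry, uint64_t now,
                                unsigned int flags)
{
    return sketch_time_before64(expiry, now) ||
           (flags & SKETCH_LOOKUP_REVAL) != 0;
}
```

A direct `a < b` comparison gives the wrong answer once the counter wraps, which is the class of bug the checkpatch.pl suggestion guards against; the flag check corresponds to always doing the up-call under LOOKUP_REVAL.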
13 Jun, 2014
1 commit
-
Pull vfs updates from Al Viro:
"This is the bunch that sat in -next + lock_parent() fix. This is the
minimal set; there's more pending stuff.

In particular, I really hope to get acct.c fixes merged this cycle -
we need that to deal sanely with delayed-mntput stuff. In the next
pile, hopefully - that series is fairly short and localized
(kernel/acct.c, fs/super.c and fs/namespace.c). In this pile: more
iov_iter work. Most of prereqs for ->splice_write with sane locking
order are there and Kent's dio rewrite would also fit nicely on top of
this pile"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits)
lock_parent: don't step on stale ->d_parent of all-but-freed one
kill generic_file_splice_write()
ceph: switch to iter_file_splice_write()
shmem: switch to iter_file_splice_write()
nfs: switch to iter_splice_write_file()
fs/splice.c: remove unneeded exports
ocfs2: switch to iter_file_splice_write()
->splice_write() via ->write_iter()
bio_vec-backed iov_iter
optimize copy_page_{to,from}_iter()
bury generic_file_aio_{read,write}
lustre: get rid of messing with iovecs
ceph: switch to ->write_iter()
ceph_sync_direct_write: stop poking into iov_iter guts
ceph_sync_read: stop poking into iov_iter guts
new helper: copy_page_from_iter()
fuse: switch to ->write_iter()
btrfs: switch to ->write_iter()
ocfs2: switch to ->write_iter()
xfs: switch to ->write_iter()
...
05 Jun, 2014
2 commits
-
aops->write_begin may allocate a new page and make it visible only to have
mark_page_accessed called almost immediately after. Once the page is
visible the atomic operations are necessary which is noticeable overhead
when writing to an in-memory filesystem like tmpfs but should also be
noticeable with fast storage. The objective of the patch is to initialise
the accessed information with non-atomic operations before the page is
visible.

The bulk of filesystems directly or indirectly use
grab_cache_page_write_begin or find_or_create_page for the initial
allocation of a page cache page. This patch adds an init_page_accessed()
helper which behaves like the first call to mark_page_accessed() but may be
called before the page is visible and can be done non-atomically.

The primary APIs of concern in this case are the following and are used
by most filesystems.

    find_get_page
    find_lock_page
    find_or_create_page
    grab_cache_page_nowait
    grab_cache_page_write_begin

All of them are very similar in detail so the patch creates a core helper
pagecache_get_page() which takes a flags parameter that affects its
behavior such as whether the page should be marked accessed or not. The
old API is preserved but is basically a thin wrapper around this core
function.

Each of the filesystems is then updated to avoid calling
mark_page_accessed when it is known that the VM interfaces have already
done the job. There is a slight snag in that the timing of the
mark_page_accessed() has now changed so in rare cases it's possible a page
gets to the end of the LRU as PageReferenced whereas previously it might
have been repromoted. This is expected to be rare but it's worth the
filesystem people thinking about it in case they see a problem with the
timing change. It is also the case that some filesystems may be marking
pages accessed that previously did not but it makes sense that filesystems
have consistent behaviour in this regard.

The test case used to evaluate this is a simple dd of a large file done
multiple times with the file deleted on each iteration. The size of the
file is 1/10th physical memory to avoid dirty page balancing. In the
async case it will be possible that the workload completes without even
hitting the disk and will have variable results but highlight the impact
of mark_page_accessed for async IO. The sync results are expected to be
more stable. The exception is tmpfs where the normal case is for the "IO"
to not hit the disk.

The test machine was single socket and UMA to avoid any scheduling or NUMA
artifacts. Throughput and wall times are presented for sync IO, only wall
times are shown for async as the granularity reported by dd and the
variability is unsuitable for comparison. As async results were variable
due to writeback timings, I'm only reporting the maximum figures. The sync
results were stable enough to make the mean and stddev uninteresting.

The performance results are reported based on a run with no profiling.
Profile data is based on a separate run with oprofile running.

async dd
                         3.15.0-rc3           3.15.0-rc3
                            vanilla          accessed-v2
ext3  Max elapsed    13.9900 (  0.00%)    11.5900 ( 17.16%)
tmpfs Max elapsed     0.5100 (  0.00%)     0.4900 (  3.92%)
btrfs Max elapsed    12.8100 (  0.00%)    12.7800 (  0.23%)
ext4  Max elapsed    18.6000 (  0.00%)    13.3400 ( 28.28%)
xfs   Max elapsed    12.5600 (  0.00%)     2.0900 ( 83.36%)

The XFS figure is a bit strange as it managed to avoid a worst case by
sheer luck but the average figures looked reasonable.

       samples percentage
ext3     86107     0.9783  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
ext3     23833     0.2710  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed
ext3      5036     0.0573  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
ext4     64566     0.8961  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
ext4      5322     0.0713  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed
ext4      2869     0.0384  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
xfs      62126     1.7675  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
xfs       1904     0.0554  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
xfs        103     0.0030  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed
btrfs    10655     0.1338  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
btrfs     2020     0.0273  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
btrfs      587     0.0079  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed
tmpfs    59562     3.2628  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
tmpfs     1210     0.0696  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
tmpfs       94     0.0054  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed

[akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
Signed-off-by: Mel Gorman
Cc: Johannes Weiner
Cc: Vlastimil Babka
Cc: Jan Kara
Cc: Michal Hocko
Cc: Hugh Dickins
Cc: Dave Hansen
Cc: Theodore Ts'o
Cc: "Paul E. McKenney"
Cc: Oleg Nesterov
Cc: Rik van Riel
Cc: Peter Zijlstra
Tested-by: Prabhakar Lad
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
cold is a bool, make it one. Make the likely case the "if" part of the
block instead of the else as according to the optimisation manual this is
preferred.
Signed-off-by: Mel Gorman
Acked-by: Rik van Riel
Cc: Johannes Weiner
Cc: Vlastimil Babka
Cc: Jan Kara
Cc: Michal Hocko
Cc: Hugh Dickins
Cc: Dave Hansen
Cc: Theodore Ts'o
Cc: "Paul E. McKenney"
Cc: Oleg Nesterov
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
02 Jun, 2014
1 commit
-
Currently, the fl_owner isn't set for flock locks. Some filesystems use
byte-range locks to simulate flock locks and there is a common idiom in
those that does:

    fl->fl_owner = (fl_owner_t)filp;
    fl->fl_start = 0;
    fl->fl_end = OFFSET_MAX;

Since flock locks are generally "owned" by the open file description,
move this into the common flock lock setup code. The fl_start and fl_end
fields are already set appropriately, so remove the unneeded setting of
that in flock ops in those filesystems as well.

Finally, the lease code also sets the fl_owner as if they were owned by
the process and not the open file description. This is incorrect as
leases have the same ownership semantics as flock locks. Set them the
same way. The lease code doesn't actually use the fl_owner value for
anything, so this is more for consistency's sake than a bugfix.
Reported-by: Trond Myklebust
Signed-off-by: Jeff Layton
Acked-by: Greg Kroah-Hartman (Staging portion)
Acked-by: J. Bruce Fields
07 May, 2014
13 commits
-
New variant of iov_iter - ITER_BVEC in iter->type, backed with
bio_vec array instead of iovec one. Primitives taught to deal
with such beasts, __swap_write() switched to using that kind
of iov_iter.

Note that bio_vec is just a triple - there's
nothing block-specific about it. I've left the definition where it
was, but took it from under ifdef CONFIG_BLOCK.

Next target: ->splice_write()...
Signed-off-by: Al Viro
-
Signed-off-by: Al Viro
-
Signed-off-by: Al Viro
-
Now It Can Be Done(tm) - we don't need to do iov_shorten() in
generic_file_direct_write() anymore, now that all ->direct_IO()
instances are converted to proper iov_iter methods and honour
iter->count and iter->iov_offset properly.

Get rid of count/ocount arguments of generic_file_direct_write(),
while we are at it.
Signed-off-by: Al Viro
-
counts the pages covered by iov_iter, up to given limit.
do_block_direct_io() and fuse_iter_npages() switched to it.
Signed-off-by: Al Viro
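A rough userspace sketch of what "counting the pages covered, up to a limit" can look like for an array of struct iovec. It is an assumption-laden illustration rather than the kernel helper: sketch_npages() is a hypothetical name, the page size is fixed at 4096 bytes, each segment is counted on its own, and no starting offset into the first segment is modelled.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed page size for the sketch */

/* Counts the pages touched by each segment, stopping once `maxpages`
 * (assumed non-negative) is reached. Returns at most `maxpages`. */
static int sketch_npages(const struct iovec *iov, size_t nr_segs, int maxpages)
{
    int npages = 0;

    for (size_t i = 0; i < nr_segs; i++) {
        uintptr_t start = (uintptr_t)iov[i].iov_base;
        size_t len = iov[i].iov_len;
        uintptr_t first, last;
        size_t span;

        if (len == 0)
            continue; /* empty segments cover no pages */

        first = start / SKETCH_PAGE_SIZE;
        last = (start + len - 1) / SKETCH_PAGE_SIZE;
        span = (size_t)(last - first) + 1;

        if (span >= (size_t)(maxpages - npages))
            return maxpages; /* limit reached */
        npages += (int)span;
    }
    return npages;
}
```

A 32-byte segment that straddles a page boundary counts as two pages, which is why the count is derived from the first and last byte addresses rather than from the length alone.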
-
Signed-off-by: Al Viro
-
... to fuse_direct_{read,write}(). ->direct_IO() path uses the
iov_iter passed by the caller instead.
Signed-off-by: Al Viro
-
For now, just use the same thing we pass to ->direct_IO() - it's all
iovec-based at the moment. Pass it explicitly to iov_iter_init() and
account for kvec vs. iovec in there, by the same kludge NFS ->direct_IO()
uses.
Signed-off-by: Al Viro
-
Signed-off-by: Al Viro
-
all callers have iov_length(iter->iov, iter->nr_segs) == iov_iter_count(iter)
Signed-off-by: Al Viro
-
unmodified, for now
Signed-off-by: Al Viro
-
all callers of ->aio_read() and ->aio_write() have iov/nr_segs already
checked - generic_segment_checks() done after that is just an odd way
to spell iov_length().
Signed-off-by: Al Viro
-
Signed-off-by: Al Viro
28 Apr, 2014
7 commits
-
Support RENAME_EXCHANGE and RENAME_NOREPLACE flags on the userspace ABI.
Signed-off-by: Miklos Szeredi
-
Fuse doesn't support i_version (yet).
Signed-off-by: Miklos Szeredi
-
The patch addresses two use-cases when the flag may be safely cleared:

1. fuse_do_setattr() is called with ATTR_CTIME flag set in attr->ia_valid.
In this case attr->ia_ctime bears the actual value. In-kernel fuse must send it
to the userspace server and then assign the value to inode->i_ctime.

2. fuse_do_setattr() is called with ATTR_SIZE flag set in attr->ia_valid,
whereas ATTR_CTIME is not set (truncate(2)).
In this case in-kernel fuse must send "now" to the userspace server and then
assign the value to inode->i_ctime.

In both cases we could clear I_DIRTY_SYNC, but that needs more thought.
Signed-off-by: Maxim Patlasov
Signed-off-by: Miklos Szeredi
-
Let the kernel maintain i_ctime locally: update i_ctime explicitly on
truncate, fallocate, open(O_TRUNC), setxattr, removexattr, link, rename,
unlink.

The inode flag I_DIRTY_SYNC serves as an indication that the local i_ctime should
be flushed to the server eventually. The patch sets the flag and updates
i_ctime in the course of the operations listed above.
Signed-off-by: Maxim Patlasov
Signed-off-by: Miklos Szeredi
-
This implements updating ctime as well as mtime on file_update_time().
Signed-off-by: Miklos Szeredi
-
The patch extends fuse_setattr_in, and extends the flush procedure
(fuse_flush_times()) called on ->write_inode() to send the ctime as well as
mtime.
Signed-off-by: Maxim Patlasov
Signed-off-by: Miklos Szeredi
-
Allow userspace fs to specify time granularity.
This is needed because with writeback_cache mode the kernel is responsible
for generating mtime and ctime, but if the underlying filesystem doesn't
support nanosecond granularity then the cache will contain a different
value from the one stored on the filesystem resulting in a change of times
after a cache flush.

Make the default granularity 1s.
Signed-off-by: Miklos Szeredi
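To illustrate the granularity problem described in this commit: if the kernel caches a nanosecond-precise timestamp but the backing filesystem stores coarser values, the two disagree after a flush. A minimal userspace sketch of truncating a timestamp to a given granularity follows; sketch_time_trunc() is a hypothetical helper, with the granularity expressed in nanoseconds as an assumption of the sketch.

```c
#include <time.h>

#define SKETCH_NSEC_PER_SEC 1000000000L

/* Rounds `t` down to a multiple of `gran` nanoseconds.
 * gran <= 1 keeps full precision; gran >= 1 second drops all
 * sub-second precision. */
static struct timespec sketch_time_trunc(struct timespec t, long gran)
{
    if (gran <= 1) {
        /* nanosecond granularity: nothing to do */
    } else if (gran >= SKETCH_NSEC_PER_SEC) {
        t.tv_nsec = 0;
    } else {
        t.tv_nsec -= t.tv_nsec % gran;
    }
    return t;
}
```

If cached timestamps are truncated to the granularity that the userspace filesystem reports, the cached value and the stored value agree, so times do not appear to change after a cache flush.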