08 Apr, 2020
2 commits
-
Christoph Hellwig noticed that we were doing some unnecessary
work in orangefs_flush:orangefs_flush just writes out data on every close(2) call. There is
no need to change anything about the dirty state, especially as
orangefs doesn't treat I_DIRTY_TIMES special in any way. The code
seems to come from partially open coding vfs_fsync.He sent in a patch with the above commit message and also a
patch that was a reversion of another Orangefs patch I had
sent upstream a while ago. I had to fix his reversion patch
so that it would compile which caused his "don't mess with
I_DIRTY_TIMES" patch to fail to apply. So here I have just
remade his patch and applied it after the fixed reversion patch.Signed-off-by: Mike Marshall
-
Christoph Hellwig sent in a reversion of "orangefs: remember count
when reading." because:->read_iter calls can race with each other and one or
more ->flush calls. Remove the the scheme to store the read
count in the file private data as is is completely racy and
can cause use after free or double free conditionsChristoph's reversion caused Orangefs not to work or to compile. I
added a patch that fixed that, but intel's kbuild test robot pointed
out that sending Christoph's patch followed by my patch upstream, it
would break bisection because of the failure to compile. So I have
combined the reversion plus my patch... here's the commit message
that was in my patch:Logically, optimal Orangefs "pages" are 4 megabytes. Reading
large Orangefs files 4096 bytes at a time is like trying to
kick a dead whale down the beach. Before Christoph's "Revert
orangefs: remember count when reading." I tried to give users
a knob whereby they could, for example, use "count" in
read(2) or bs with dd(1) to get whatever they considered an
appropriate amount of bytes at a time from Orangefs and fill
as many page cache pages as they could at once.Without the racy code that Christoph reverted Orangefs won't
even compile, much less work. So this replaces the logic that
used the private file data that Christoph reverted with
a static number of bytes to read from Orangefs.I ran tests like the following to determine what a
reasonable static number of bytes might be:dd if=/pvfsmnt/asdf of=/dev/null count=128 bs=4194304
dd if=/pvfsmnt/asdf of=/dev/null count=256 bs=2097152
dd if=/pvfsmnt/asdf of=/dev/null count=512 bs=1048576
.
.
.
dd if=/pvfsmnt/asdf of=/dev/null count=4194304 bs=128Reads seem faster using the static number, so my "knob code"
wasn't just racy, it wasn't even a good idea...Signed-off-by: Mike Marshall
Reported-by: kbuild test robot
04 Dec, 2019
1 commit
-
Orangefs has no open, and orangefs checks file permissions
on each file access. Posix requires that file permissions
be checked on open and nowhere else. Orangefs-through-the-kernel
needs to seem posix compliant.The VFS opens files, even if the filesystem provides no
method. We can see if a file was successfully opened for
read and or for write by looking at file->f_mode.When writes are flowing from the page cache, file is no
longer available. We can trust the VFS to have checked
file->f_mode before writing to the page cache.The mode of a file might change between when it is opened
and IO commences, or it might be created with an arbitrary mode.We'll make sure we don't hit EACCES during the IO stage by
using UID 0. Some of the time we have access without changing
to UID 0 - how to check?Signed-off-by: Mike Marshall
01 Aug, 2019
1 commit
-
There are 3 remaining files without an extension inside the fs docs
dir.Manually convert them to ReST.
In the case of the nfs/exporting.rst file, as the nfs docs
aren't ported yet, I opted to convert and add a :orphan: there,
with should be removed when it gets added into a nfs-specific
part of the fs documentation.Signed-off-by: Mauro Carvalho Chehab
Signed-off-by: Jonathan Corbet
17 Jul, 2019
1 commit
-
Pull orangefs updates from Mike Marshall:
"Two small fixes.This is just a fix for an unused value that Colin King sent me and a
related fix I added"* tag 'for-linus-5.3-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
orangefs: eliminate needless variable assignments
orangefs: remove redundant assignment to variable buffer_index
12 Jul, 2019
2 commits
-
Signed-off-by: Mike Marshall
-
The variable buffer_index is being initialized however this is never
read and later it is being reassigned to a new value. The initialization
is redundant and hence can be removed.Addresses-Coverity: ("Unused Value")
Signed-off-by: Colin Ian King
Signed-off-by: Mike Marshall
01 Jul, 2019
1 commit
-
Create a generic function to check incoming FS_IOC_SETFLAGS flag values
and later prepare the inode for updates so that we can standardize the
implementations that follow ext4's flag values.Note that the efivarfs implementation no longer fails a no-op SETFLAGS
without CAP_LINUX_IMMUTABLE since that's the behavior in ext*.Signed-off-by: Darrick J. Wong
Reviewed-by: Jan Kara
Reviewed-by: Christoph Hellwig
Acked-by: David Sterba
Reviewed-by: Bob Peterson
04 May, 2019
13 commits
-
->readpage looks in file->private_data to try and find out how the
userspace program set "count" in read(2) or with "dd bs=" or whatever.->readpage uses "count" and inode->i_size to calculate how much
data Orangefs should deposit in the Orangefs shared buffer, and
remembers which slot the data is in.After copying data from the Orangefs shared buffer slot into
"the page", readpage tries to increment through the pagecache index
and fill as many pages as it can from the extra data in the shared
buffer. Hopefully these extra pages will soon be needed by the vfs,
and they'll be in the pagecache already.Signed-off-by: Mike Marshall
Signed-off-by: Martin Brandenburg -
When userspace deposits more than a page of data into the shared buffer,
we'll need to know which slot it is in when we get back to readpage
so that we can try to use the extra data to fill some extra pages.Signed-off-by: Mike Marshall
Signed-off-by: Martin Brandenburg -
Orangefs wins when it can do IO on large (up to four meg) blocks at a time,
and looses when it has to do tiny "small io" reads and writes. Accessing
Orangefs through the pagecache with the kernel module helps with small io,
both reading and writing, a great deal. Readpage generally tries to fetch a
page (four k) at a time. We'll let users use "count" (as in read(2) or
pread(2) for example) as a knob to control how much data they get from
Orangefs at a time and we'll try to use the data to fill extra
pagecache pages when we get to ->readpage, hopefully resulting in
fewer calls to readpage and Orangefs userspace.We need a way to remember how they set count so that we can still have
it available when we get to ->readpage.- We'll use file->private_data to keep track of "count".
We'll wrap generic_file_open with orangefs_file_open and
initialize private_data to NULL there.- In ->read_iter we have access to both "count" and file, so
we'll kmalloc some space onto file->private_data and store
"count" there.- We'll kfree file->private_data each time we visit ->flush and
reinitialize it to NULL.Signed-off-by: Mike Marshall
Signed-off-by: Martin Brandenburg -
This is modeled after NFS, except our method is different. We use a
simple timer to determine whether to invalidate the page cache. This
is bound to perform.This addes a sysfs parameter cache_timeout_msecs which controls the time
between page cache invalidations.Signed-off-by: Martin Brandenburg
Signed-off-by: Mike Marshall -
Go through pages and look for a consecutive writable region. After
finding a number of consecutive writable pages or when finding that
the next page's dirty range is not contiguous and cannot be written
as one request, send the write to the server.The number of pages is determined by the client-core's buffer size.
Signed-off-by: Martin Brandenburg
Signed-off-by: Mike Marshall -
Attach the actual range of bytes written to plus the responsible uid/gid
to each dirty page. This information must be sent to the server when
the page is written out.Now write_begin, page_mkwrite, and invalidatepage keep up with this
information. There are several conditions where they must write out the
page immediately to store the new range. Two non-contiguous ranges
cannot be stored on a single page.Signed-off-by: Martin Brandenburg
Signed-off-by: Mike Marshall -
Without this, an fsync call is sent to the server even if no data
changed. This resulted in a rather severe (50%) performance regression
under certain metadata-heavy workloads.In the past, everything was direct IO. Nothing happend on a close call.
An explicit fsync call would send an fsync request to the server which
in turn fsynced the underlying file.Now there are cached writes. Then fsync began writing out dirty pages
in addition to making an fsync request to the server, and close began
calling fsync.With this commit, close only writes out dirty pages, and does not make
the fsync request.Signed-off-by: Martin Brandenburg
Signed-off-by: Mike Marshall -
direct_IO was the only caller and all direct_IO did was call it,
so there's no use in having the code spread out into so many functions.Signed-off-by: Martin Brandenburg
Signed-off-by: Mike Marshall -
Signed-off-by: Martin Brandenburg
Signed-off-by: Mike Marshall -
Now orangefs_inode_getattr fills from cache if an inode has dirty pages.
also if attr_valid and dirty pages and !flags, we spin on inode writeback
before returning if pages still dirty after: should it be other waySigned-off-by: Martin Brandenburg
Signed-off-by: Mike Marshall -
Remove orangefs_inode_read. It was used by readpage. Calling
wait_for_direct_io directly serves the purpose just as well. There is
now no check of the bufmap size in the readpage path. There are already
other places the bufmap size is assumed to be greater than PAGE_SIZE.Important to call truncate_inode_pages now in the write path so a
subsequent read sees the new data.Signed-off-by: Martin Brandenburg
Signed-off-by: Mike Marshall -
This should be a no-op now, but once inode writeback works, it'll be
necessary to have the correct attribute in the dirty inode.Previously the attribute fetch timeout was marked invalid and the server
provided the updated attribute. When the inode is dirty, the server
cannot be consulted since it does not yet know the pending setattr.Signed-off-by: Martin Brandenburg
Signed-off-by: Mike Marshall -
No need to store the received mask. It is either STATX_BASIC_STATS or
STATX_BASIC_STATS & ~STATX_SIZE. If STATX_SIZE is requested, the cache
is bypassed anyway, so the cached mask is unnecessary to decide whether
to do a real getattr.This is a change. Previously a getattr would want size and use the
cached size. All of the in-kernel callers that wanted size did not want
a cached size. Now a getattr cannot use the cached size if it wants
size at all.Signed-off-by: Martin Brandenburg
Signed-off-by: Mike Marshall
21 Feb, 2019
1 commit
-
Signed-off-by: Mike Marshall
15 Aug, 2018
1 commit
-
Use new return type vm_fault_t for fault handler. For now,
this is just documenting that the function returns a VM_FAULT
value rather than an errno. Once all instances are converted,
vm_fault_t will become a distinct type.See the following
commit 1c8f422059ae ("mm: change return type to vm_fault_t")Fixed checkpatch.pl warning.
Signed-off-by: Souptick Joarder
Signed-off-by: Mike Marshall
02 Jun, 2018
2 commits
-
Signed-off-by: Mike Marshall
-
The struct orangefs_file_vm_ops is local to the source and does not need
to be in global scope, so make it static.Cleans up sparse warning:
fs/orangefs/file.c:547:35: warning: symbol 'orangefs_file_vm_ops' was not
declared. Should it be static?Signed-off-by: Colin Ian King
Signed-off-by: Mike Marshall
04 Apr, 2018
1 commit
-
Must retrieve size before running filemap_fault so the kernel has
an up-to-date size.This should have been caught by xfstests generic/246, but it was masked
by orangefs_new_inode, which set i_size to PAGE_SIZE. When nothing
caused a getattr prior to a pagefault, i_size was still PAGE_SIZE.
Since xfstests only read 10 bytes, it did not catch this bug.When orangefs_new_inode was modified to perform a getattr instead,
i_size was set to zero, as it was a newly created file. Then
orangefs_file_write_iter did NOT set i_size. Instead it invalidated the
attribute cache, which should have caused the next caller to retrieve
i_size. But the fault handler did not know it was supposed to retrieve
i_size. So during xfstests, i_size was still zero, and filemap_fault
returned VM_FAULT_SIGBUS.Fixes xfstests generic/452.
Signed-off-by: Martin Brandenburg
Signed-off-by: Mike Marshall
02 Apr, 2018
1 commit
-
Signed-off-by: Martin Brandenburg
Signed-off-by: Mike Marshall
26 Jan, 2018
1 commit
-
After do_readv_writev, the inode cache is invalidated anyway, so i_size
will never be read. It will be fetched from the server which will also
know about updates from other machines.Fixes deadlock on 32-bit SMP.
See https://marc.info/?l=linux-fsdevel&m=151268557427760&w=2
Signed-off-by: Martin Brandenburg
Cc: Al Viro
Cc: Mike Marshall
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds
14 Nov, 2017
1 commit
-
The previous code path was to mark the inode dirty, let
orangefs_inode_dirty set a flag in our private inode, then later during
inode release call orangefs_flush_inode which notices the flag and
writes the atime out.The code path worked almost identically for mtime, ctime, and mode
except that those flags are set explicitly and not as side effects of
dirty.Now orangefs_flush_inode is removed. Marking an inode dirty does not
imply an atime update. Any place where flags were set before is now
an explicit call to orangefs_inode_setattr. Since OrangeFS does not
utilize inode writeback, the attribute change should be written out
immediately.Fixes generic/120.
In namei.c, there are several places where the directory mtime and ctime
are set, but only the mtime is sent to the server. These don't seem
right, but I've left them as is for now.Signed-off-by: Martin Brandenburg
Signed-off-by: Mike Marshall
02 Nov, 2017
1 commit
-
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.By default all files without license information are under the default
license of the kernel, which is GPL version 2.Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if
Reviewed-by: Philippe Ombredanne
Reviewed-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman
15 Sep, 2017
1 commit
-
Orangefs doesn't do buffered writes yet, so there's no point in
initiating and waiting for writeback.Signed-off-by: Jeff Layton
Signed-off-by: Mike Marshall
06 May, 2017
1 commit
-
Pull orangefs updates from Mike Marshall:
"Orangefs cleanups, fixes and statx support.Some cleanups:
- remove unused get_fsid_from_ino
- fix bounds check for listxattr
- clean up oversize xattr validation
- do not set getattr_time on orangefs_lookup
- return from orangefs_devreq_read quickly if possible
- do not wait for timeout if umounting
- handle zero size write in debugfsBug fixes:
- do not check possibly stale size on truncate
- ensure the userspace component is unmounted if mount fails
- total reimplementation of dir.cNew feature:
- implement statx
The new implementation of dir.c is kind of a big deal, all new code.
It has been posted to fs-devel during the previous rc period, we
didn't get much review or feedback from there, but it has been
reviewed very heavily here, so much so that we have two entire
versions of the reimplementation.Not only does the new implementation fix some xfstests, but it passes
all the new tests we made here that involve seeking and rewinding and
giant directories and long file names. The new dir code has three
patches itself:- skip forward to the next directory entry if seek is short
- invalidate stored directory on seek
- count directory pieces correctly"* tag 'for-linus-4.12-ofs-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
orangefs: count directory pieces correctly
orangefs: invalidate stored directory on seek
orangefs: skip forward to the next directory entry if seek is short
orangefs: handle zero size write in debugfs
orangefs: do not wait for timeout if umounting
orangefs: return from orangefs_devreq_read quickly if possible
orangefs: ensure the userspace component is unmounted if mount fails
orangefs: do not check possibly stale size on truncate
orangefs: implement statx
orangefs: remove ORANGEFS_READDIR macros
orangefs: support very large directories
orangefs: support llseek on directories
orangefs: rewrite readdir to fix several bugs
orangefs: do not set getattr_time on orangefs_lookup
orangefs: clean up oversize xattr validation
orangefs: fix bounds check for listxattr
orangefs: remove unused get_fsid_from_ino
27 Apr, 2017
1 commit
-
Fortunately OrangeFS has had a getattr request mask for a long time.
The server basically has two difficulty levels for attributes. Fetching
any attribute except size requires communicating with the metadata
server for that handle. Since all the attributes are right there, it
makes sense to return them all. Fetching the size requires
communicating with every I/O server (that the file is distributed
across). Therefore if asked for anything except size, get everything
except size, and if asked for size, get everything.Signed-off-by: Martin Brandenburg
Signed-off-by: Mike Marshall
22 Apr, 2017
1 commit
-
Signed-off-by: Al Viro
05 Dec, 2016
1 commit
-
Signed-off-by: Al Viro
25 Oct, 2016
1 commit
-
Replace wrong use of file->f_path.dentry->d_inode with file_inode(file).
In case orangefs ever finds itself as an overelayfs layer, it would want
to get its own inode and not overlayfs's inode.DISCLAIMER: I did not test this patch because I do not know how to setup
an orangefs mountSigned-off-by: Amir Goldstein
Signed-off-by: Mike Marshall
11 Oct, 2016
2 commits
-
Pull more vfs updates from Al Viro:
">rename2() work from Miklos + current_time() from Deepa"* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: Replace current_fs_time() with current_time()
fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
fs: Replace CURRENT_TIME with current_time() for inode timestamps
fs: proc: Delete inode time initializations in proc_alloc_inode()
vfs: Add current_time() api
vfs: add note about i_op->rename changes to porting
fs: rename "rename2" i_op to "rename"
vfs: remove unused i_op->rename
fs: make remaining filesystems use .rename2
libfs: support RENAME_NOREPLACE in simple_rename()
fs: support RENAME_NOREPLACE for local filesystems
ncpfs: fix unused variable warning -
Pull misc vfs updates from Al Viro:
"Assorted misc bits and pieces.There are several single-topic branches left after this (rename2
series from Miklos, current_time series from Deepa Dinamani, xattr
series from Andreas, uaccess stuff from from me) and I'd prefer to
send those separately"* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits)
proc: switch auxv to use of __mem_open()
hpfs: support FIEMAP
cifs: get rid of unused arguments of CIFSSMBWrite()
posix_acl: uapi header split
posix_acl: xattr representation cleanups
fs/aio.c: eliminate redundant loads in put_aio_ring_file
fs/internal.h: add const to ns_dentry_operations declaration
compat: remove compat_printk()
fs/buffer.c: make __getblk_slow() static
proc: unsigned file descriptors
fs/file: more unsigned file descriptors
fs: compat: remove redundant check of nr_segs
cachefiles: Fix attempt to read i_blocks after deleting file [ver #2]
cifs: don't use memcpy() to copy struct iov_iter
get rid of separate multipage fault-in primitives
fs: Avoid premature clearing of capabilities
fs: Give dentry to inode_change_ok() instead of inode
fuse: Propagate dentry down to inode_change_ok()
ceph: Propagate dentry down to inode_change_ok()
xfs: Propagate dentry down to inode_change_ok()
...
29 Sep, 2016
1 commit
-
Pull in an OrangeFS branch containing miscellaneous improvements.
- clean up debugfs globals
- remove dead code in sysfs
- reorganize duplicated sysfs attribute structs
- consolidate sysfs show and store functions
- remove duplicated sysfs_ops structures
- describe organization of sysfs
- make devreq_mutex static
- g_orangefs_stats -> orangefs_stats for consistency
- rename most remaining global variables
28 Sep, 2016
1 commit
-
CURRENT_TIME macro is not appropriate for filesystems as it
doesn't use the right granularity for filesystem timestamps.
Use current_time() instead.CURRENT_TIME is also not y2038 safe.
This is also in preparation for the patch that transitions
vfs timestamps to use 64 bit time and hence make them
y2038 safe. As part of the effort current_time() will be
extended to do range checks. Hence, it is necessary for all
file system timestamps to use current_time(). Also,
current_time() will be transitioned along with vfs to be
y2038 safe.Note that whenever a single call to current_time() is used
to change timestamps in different inodes, it is because they
share the same time granularity.Signed-off-by: Deepa Dinamani
Reviewed-by: Arnd Bergmann
Acked-by: Felipe Balbi
Acked-by: Steven Whitehouse
Acked-by: Ryusuke Konishi
Acked-by: David Sterba
Signed-off-by: Al Viro