08 Jul, 2017
2 commits
-
Pull Writeback error handling updates from Jeff Layton:
"This pile represents the bulk of the writeback error handling fixes
that I have for this cycle. Some of the earlier patches in this pile
may look trivial but they are prerequisites for later patches in the
series.The aim of this set is to improve how we track and report writeback
errors to userland. Most applications that care about data integrity
will periodically call fsync/fdatasync/msync to ensure that their
writes have made it to the backing store.For a very long time, we have tracked writeback errors using two flags
in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a
writeback error occurs (via mapping_set_error) and are cleared as a
side-effect of filemap_check_errors (as you noted yesterday). This
model really sucks for userland.Only the first task to call fsync (or msync or fdatasync) will see the
error. Any subsequent task calling fsync on a file will get back 0
(unless another writeback error occurs in the interim). If I have
several tasks writing to a file and calling fsync to ensure that their
writes got stored, then I need to have them coordinate with one
another. That's difficult enough, but in a world of containerized
setups that coordination may even not be possible.But wait...it gets worse!
The calls to filemap_check_errors can be buried pretty far down in the
call stack, and there are internal callers of filemap_write_and_wait
and the like that also end up clearing those errors. Many of those
callers ignore the error return from that function or return it to
userland at nonsensical times (e.g. truncate() or stat()). If I get
back -EIO on a truncate, there is no reason to think that it was
because some previous writeback failed, and a subsequent fsync() will
(incorrectly) return 0.This pile aims to do three things:
1) ensure that when a writeback error occurs that that error will be
reported to userland on a subsequent fsync/fdatasync/msync call,
regardless of what internal callers are doing2) report writeback errors on all file descriptions that were open at
the time that the error occurred. This is a user-visible change,
but I think most applications are written to assume this behavior
anyway. Those that aren't are unlikely to be hurt by it.3) document what filesystems should do when there is a writeback
error. Today, there is very little consistency between them, and a
lot of cargo-cult copying. We need to make it very clear what
filesystems should do in this situation.To achieve this, the set adds a new data type (errseq_t) and then
builds new writeback error tracking infrastructure around that. Once
all of that is in place, we change the filesystems to use the new
infrastructure for reporting wb errors to userland.Note that this is just the initial foray into cleaning up this mess.
There is a lot of work remaining here:1) convert the rest of the filesystems in a similar fashion. Once the
initial set is in, then I think most other fs' will be fairly
simple to convert. Hopefully most of those can in via individual
filesystem trees.2) convert internal waiters on writeback to use errseq_t for
detecting errors instead of relying on the AS_* flags. I have some
draft patches for this for ext4, but they are not quite ready for
prime time yet.This was a discussion topic this year at LSF/MM too. If you're
interested in the gory details, LWN has some good articles about this:https://lwn.net/Articles/718734/
https://lwn.net/Articles/724307/"* tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
btrfs: minimal conversion to errseq_t writeback error reporting on fsync
xfs: minimal conversion to errseq_t writeback error reporting
ext4: use errseq_t based error handling for reporting data writeback errors
fs: convert __generic_file_fsync to use errseq_t based reporting
block: convert to errseq_t based writeback error tracking
dax: set errors in mapping when writeback fails
Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors
mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error
fs: new infrastructure for writeback error handling and reporting
lib: add errseq_t type and infrastructure for handling it
mm: don't TestClearPageError in __filemap_fdatawait_range
mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails
jbd2: don't clear and reset errors after waiting on writeback
buffer: set errors in mapping at the time that the error occurs
fs: check for writeback errors after syncing out buffers in generic_file_fsync
buffer: use mapping_set_error instead of setting the flag
mm: fix mapping_set_error call in me_pagecache_dirty -
Before commit 88ffbf3e03 "GFS2: Use resizable hash table for glocks",
glocks were freed via call_rcu to allow reading the glock hashtable
locklessly using rcu. This was then changed to free glocks immediately,
which made reading the glock hashtable unsafe. Bring back the original
code for freeing glocks via call_rcu.Signed-off-by: Andreas Gruenbacher
Signed-off-by: Bob Peterson
Cc: stable@vger.kernel.org # 4.3+
06 Jul, 2017
2 commits
-
I noticed on xfs that I could still sometimes get back an error on fsync
on a fd that was opened after the error condition had been cleared.The problem is that the buffer code sets the write_io_error flag and
then later checks that flag to set the error in the mapping. That flag
perisists for quite a while however. If the file is later opened with
O_TRUNC, the buffers will then be invalidated and the mapping's error
set such that a subsequent fsync will return error. I think this is
incorrect, as there was no writeback between the open and fsync.Add a new mark_buffer_write_io_error operation that sets the flag and
the error in the mapping at the same time. Replace all calls to
set_buffer_write_io_error with mark_buffer_write_io_error, and remove
the places that check this flag in order to set the error in the
mapping.This sets the error in the mapping earlier, at the time that it's first
detected.Signed-off-by: Jeff Layton
Reviewed-by: Jan Kara
Reviewed-by: Carlos Maiolino -
Pull GFS2 updates from Bob Peterson:
"We've got eight GFS2 patches for this merge window:- Andreas Gruenbacher has four patches related to cleaning up the
GFS2 inode evict process. This is about half of his patches
designed to fix a long-standing GFS2 hang related to the inode
shrinker: Shrinker calls gfs2 evict, evict calls DLM, DLM requires
memory and blocks on the shrinker.These four patches have been well tested. His second set of patches
are still being tested, so I plan to hold them until the next merge
window, after we have more weeks of testing. The first patch
eliminates the flush_delayed_work, which can block.- Andreas's second patch protects setting of gl_object for rgrps with
a spin_lock to prevent proven races.- His third patch introduces a centralized mechanism for queueing
glock work with better reference counting, to prevent more races.-His fourth patch retains a reference to inode glocks when an error
occurs while creating an inode. This keeps the subsequent evict
from needing to reacquire the glock, which might call into DLM and
block in low memory conditions.- Arvind Yadav has a patch to add const to attribute_group
structures.- I have a patch to detect directory entry inconsistencies and
withdraw the file system if any are found. Better that than silent
corruption.- I have a patch to remove a vestigial variable from glock
structures, saving some slab space.- I have another patch to remove a vestigial variable from the GFS2
in-core superblock structure"* tag 'gfs2-4.13.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
GFS2: constify attribute_group structures.
gfs2: gfs2_create_inode: Keep glock across iput
gfs2: Clean up glock work enqueuing
gfs2: Protect gl->gl_object by spin lock
gfs2: Get rid of flush_delayed_work in gfs2_evict_inode
GFS2: Eliminate vestigial sd_log_flush_wrapped
GFS2: Remove gl_list from glock structure
GFS2: Withdraw when directory entry inconsistencies are detected
05 Jul, 2017
5 commits
-
attribute_groups are not supposed to change at runtime. All functions
working with attribute_groups provided by work with const
attribute_group. So mark the non-const structs as const.File size before:
text data bss dec hex filename
5259 1344 8 6611 19d3 fs/gfs2/sys.oFile size After adding 'const':
text data bss dec hex filename
5371 1216 8 6595 19c3 fs/gfs2/sys.oSigned-off-by: Arvind Yadav
Signed-off-by: Bob Peterson -
On failure, keep the inode glock across the final iput of the new inode
so that gfs2_evict_inode doesn't have to re-acquire the glock. That
way, gfs2_evict_inode won't need to revalidate the block type.Signed-off-by: Andreas Gruenbacher
Signed-off-by: Bob Peterson -
This patch adds a standardized queueing mechanism for glock work
with spin_lock protection to prevent races.Signed-off-by: Andreas Gruenbacher
Signed-off-by: Bob Peterson -
Put all remaining accesses to gl->gl_object under the
gl->gl_lockref.lock spinlock to prevent races.Signed-off-by: Andreas Gruenbacher
Signed-off-by: Bob Peterson -
So far, gfs2_evict_inode clears gl->gl_object and then flushes the glock
work queue to make sure that inode glops which dereference gl->gl_object
have finished running before the inode is destroyed. However, flushing
the work queue may do more work than needed, and in particular, it may
call into DLM, which we want to avoid here. Use a bit lock
(GIF_GLOP_PENDING) to synchronize between the inode glops and
gfs2_evict_inode instead to get rid of the flushing.In addition, flush the work queues of existing glocks before reusing
them for new inodes to get those glocks into a known state: the glock
state engine currently doesn't handle glock re-appropriation correctly.
(We may be able to fix the glock state engine instead later.)Based on a patch by Steven Whitehouse .
Signed-off-by: Andreas Gruenbacher
Signed-off-by: Bob Peterson
20 Jun, 2017
1 commit
-
Superblock variable sd_log_flush_wrapped is set, but never referenced,
so this patch eliminates it.Signed-off-by: Bob Peterson
13 Jun, 2017
3 commits
-
The gl_list is no longer used nor needed in the glock structure,
so this patch eliminates it.Signed-off-by: Bob Peterson
-
This patch prints an inode consistency error and withdraws the file
system when directory entry counts are mismatched.Signed-off-by: Bob Peterson
12 Jun, 2017
1 commit
-
We've already got a few conflicts and upcoming work depends on some of the
changes that have gone into mainline as regression fixes for this series.Pull in 4.12-rc5 to resolve these conflicts and make it easier on down stream
trees to continue working on 4.13 changes.Signed-off-by: Jens Axboe
09 Jun, 2017
2 commits
-
Replace bi_error with a new bi_status to allow for a clear conversion.
Note that device mapper overloaded bi_error with a private value, which
we'll have to keep arround at least for now and thus propagate to a
proper blk_status_t value.Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
Signed-off-by: Christoph Hellwig
Reviewed-by: Bart Van Assche
Signed-off-by: Jens Axboe
05 Jun, 2017
1 commit
-
For some file systems we still memcpy into it, but in various places this
already allows us to use the proper uuid helpers. More to come..Signed-off-by: Christoph Hellwig
Reviewed-by: Amir Goldstein
Acked-by: Mimi Zohar (Changes to IMA/EVM)
Reviewed-by: Andy Shevchenko
24 May, 2017
1 commit
-
Commit b685d3d65ac7 "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...}
definitions. generic_make_request_checks() however strips REQ_FUA and
REQ_PREFLUSH flags from a bio when the storage doesn't report volatile
write cache and thus write effectively becomes asynchronous which can
lead to performance regressionsFix the problem by making sure all bios which are synchronous are
properly marked with REQ_SYNC.Fixes: b685d3d65ac791406e0dfd8779cc9b3707fea5a3
CC: Steven Whitehouse
CC: cluster-devel@redhat.com
CC: stable@vger.kernel.org
Acked-by: Bob Peterson
Signed-off-by: Jan Kara
09 May, 2017
1 commit
-
Link: http://lkml.kernel.org/r/20170420161852.0492bc3f@canb.auug.org.au
Signed-off-by: Stephen Rothwell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
06 May, 2017
2 commits
-
Pull GFS2 updates from Bob Peterson:
"We've got ten GFS2 patches for this merge window.- Andreas Gruenbacher wrote a patch to replace the deprecated call to
rhashtable_walk_init with rhashtable_walk_enter.- Andreas also wrote a patch to eliminate redundant code in two of
our debugfs sequence files.- Andreas also cleaned up the rhashtable key ugliness Linus pointed
out during this cycle, following Linus's suggestions.- Andreas also wrote a patch to take advantage of his new function
rhashtable_lookup_get_insert_fast. This makes glock lookup faster
and more bullet-proof.- Andreas also wrote a patch to revert a patch in the evict path that
caused occasional deadlocks, and is no longer needed.- Andrew Price wrote a patch to re-enable fallocate for the rindex
system file to enable gfs2_grow to grow properly on secondary file
system grow operations.- I wrote a patch to initialize an inode number field to make certain
kernel trace points more understandable.- I also wrote a patch that makes GFS2 file system "withdraw" work
more like it should by ignoring operations after a withdraw that
would formerly cause a BUG() and kernel panic.- I also reworked the entire truncate/delete algorithm, scrapping the
old recursive algorithm in favor of a new non-recursive algorithm.
This was done for performance: This way, GFS2 no longer needs to
lock multiple resource groups while doing truncates and deletes of
files that cross multiple resource group boundaries, allowing for
better parallelism. It also solves a problem whereby deleting large
files would request a large chunk of kernel memory, which resulted
in a get_page_from_freelist warning.- Due to a regression found during testing, I added a new patch to
correct 'GFS2: Prevent BUG from occurring when normal Withdraws
occur'."* tag 'gfs2-4.12.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
GFS2: Allow glocks to be unlocked after withdraw
GFS2: Non-recursive delete
gfs2: Re-enable fallocate for the rindex
Revert "GFS2: Wait for iopen glock dequeues"
gfs2: Switch to rhashtable_lookup_get_insert_fast
GFS2: Temporarily zero i_no_addr when creating a dinode
gfs2: Don't pack struct lm_lockname
gfs2: Deduplicate gfs2_{glocks,glstats}_open
gfs2: Replace rhashtable_walk_init with rhashtable_walk_enter
GFS2: Prevent BUG from occurring when normal Withdraws occur -
This bug fixes a regression introduced by patch 0d1c7ae9d8.
The intent of the patch was to stop promoting glocks after a
file system is withdrawn due to a variety of errors, because doing
so results in a BUG(). (You should be able to unmount after a
withdraw rather than having the kernel panic.)Unfortunately, it also stopped demotions, so glocks could not be
unlocked after withdraw, which means the unmount would hang.This patch allows function do_xmote to demote locks to an
unlocked state after a withdraw, but not promote them.Signed-off-by: Bob Peterson
21 Apr, 2017
2 commits
-
Now that all bdi structures filesystems use are properly refcounted, we
can remove the SB_I_DYNBDI flag.Reviewed-by: Christoph Hellwig
Signed-off-by: Jan Kara
Signed-off-by: Jens Axboe -
Similarly to set_bdev_super() GFS2 just used block device reference to
bdi. Convert it to properly getting bdi reference. The reference will
get automatically dropped on superblock destruction.CC: Steven Whitehouse
CC: Bob Peterson
CC: cluster-devel@redhat.com
Reviewed-by: Christoph Hellwig
Signed-off-by: Jan Kara
Signed-off-by: Jens Axboe
19 Apr, 2017
1 commit
-
Implement truncate/delete as a non-recursive algorithm. The older
algorithm was implemented with recursion to strip off each layer
at a time (going by height, starting with the maximum height.
This version tries to do the same thing but without recursion,
and without needing to allocate new structures or lists in memory.For example, say you want to truncate a very large file to 1 byte,
and its end-of-file metapath is: 0.505.463.428. The starting
metapath would be 0.0.0.0. Since it's a truncate to non-zero, it
needs to preserve that byte, and all metadata pointing to it.
So it would start at 0.0.0.0, look up all its metadata buffers,
then free all data blocks pointed to at the highest level.
After that buffer is "swept", it moves on to 0.0.0.1, then
0.0.0.2, etc., reading in buffers and sweeping them clean.
When it gets to the end of the 0.0.0 metadata buffer (for 4K
blocks the last valid one is 0.0.0.508), it backs up to the
previous height and starts working on 0.0.1.0, then 0.0.1.1,
and so forth. After it reaches the end and sweeps 0.0.1.508,
it continues with 0.0.2.0, and so on. When that height is
exhausted, and it reaches 0.0.508.508 it backs up another level,
to 0.1.0.0, then 0.1.0.1, through 0.1.0.508. So it has to keep
marching backwards and forwards through the metadata until it's
all swept clean. Once it has all the data blocks freed, it
lowers the strip height, and begins the process all over again,
but with one less height. This time it sweeps 0.0.0 through
0.505.463. When that's clean, it lowers the strip height again
and works to free 0.505. Eventually it strips the lowest height, 0.
For a delete or truncate to 0, all metadata for all heights of
0.0.0.0 would be freed. For a truncate to 1 byte, 0.0.0.0 would
be preserved.This isn't much different from normal integer incrementing,
where an integer gets incremented from 0000 (0.0.0.0) to 3021
(3.0.2.1). So 0000 gets increments to 0001, 0002, up to 0009,
then on to 0010, 0011 up to 0099, then 0100 and so forth. It's
just that each "digit" goes from 0 to 508 (for a total of 509
pointers) rather than from 0 to 9.Note that the dinode will only have 483 pointers due to the
dinode structure itself.Also note: this is just an example. These numbers (509 and 483)
are based on a standard 4K block size. Smaller block sizes will
yield smaller numbers of indirect pointers accordingly.The truncation process is accomplished with the help of two
major functions and a few helper functions.Functions do_strip and recursive_scan are obsolete, so removed.
New function sweep_bh_for_rgrps cleans a buffer_head pointed to
by the given metapath and height. By cleaning, I mean it frees
all blocks starting at the offset passed in metapath. It starts
at the first block in the buffer pointed to by the metapath and
identifies its resource group (rgrp). From there it frees all
subsequent block pointers that lie within that rgrp. If it's
already inside a transaction, it stays within it as long as it
can. In other words, it doesn't close a transaction until it knows
it's freed what it can from the resource group. In this way,
multiple buffers may be cleaned in a single transaction, as long
as those blocks in the buffer all lie within the same rgrp.If it's not in a transaction, it starts one. If the buffer_head
has references to blocks within multiple rgrps, it frees all the
blocks inside the first rgrp it finds, then closes the
transaction. Then it repeats the cycle: identifies the next
unfreed block, uses it to find its rgrp, then starts a new
transaction for that set. It repeats this process repeatedly
until the buffer_head contains no more references to any blocks
past the given metapath.Function trunc_dealloc has been reworked into a finite state
automaton. It has basically 3 active states:
DEALLOC_MP_FULL, DEALLOC_MP_LOWER, and DEALLOC_FILL_MP:The DEALLOC_MP_FULL state implies the metapath has a full set
of buffers out to the "shrink height", and therefore, it can
call function sweep_bh_for_rgrps to free the blocks within the
highest height of the metapath. If it's just swept the lowest
level (or an error has occurred) the state machine is ended.
Otherwise it proceeds to the DEALLOC_MP_LOWER state.The DEALLOC_MP_LOWER state implies we are finished with a given
buffer_head, which may now be released, and therefore we are
then missing some buffer information from the metapath. So we
need to find more buffers to read in. In most cases, this is
just a matter of releasing the buffer_head and moving to the
next pointer from the previous height, so it may be read in and
swept as well. If it can't find another non-null pointer to
process, it checks whether it's reached the end of a height
and needs to lower the strip height, or whether it still needs
move forward through the previous height's metadata. In this
state, all zero-pointers are skipped. From this state, it can
only loop around (once more backing up another height) or,
once a valid metapath is found (one that has non-zero
pointers), proceed to state DEALLOC_FILL_MP.The DEALLOC_FILL_MP state implies that we have a metapath
but not all its buffers are read in. So we must proceed to read
in buffer_heads until the metapath has a valid buffer for every
height. If the previous state backed us up 3 heights, we may
need to read in a buffer, increment the height, then repeat the
process until buffers have been read in for all required heights.
If it's successful reading a buffer, and it's at the highest
height we need, it proceeds back to the DEALLOC_MP_FULL state.
If it's unable to fill in a buffer, (encounters a hole, etc.)
it tries to find another non-zero block pointer. If they're all
zero, it lowers the height and returns to the DEALLOC_MP_LOWER
state. If it finds a good non-null pointer, it loops around and
reads it in, while keeping the metapath in lock-step with the
pointers it examines.The state machine runs until the truncation request is
satisfied. Then any transactions are ended, the quota and
statfs data are updated, and the function is complete.Helper function metaptr1 was introduced to be an easy way to
determine the start of a buffer_head's indirect pointers.Helper function lookup_mp_height was introduced to find a
metapath index and read in the buffer that corresponds to it.
In this way, function lookup_metapath becomes a simple loop to
call it for every height.Helper function fillup_metapath is similar to lookup_metapath
except it can do partial lookups. If the state machine
backed up multiple levels (like 2999 wrapping to 3000) it
needs to find out the next starting point and start issuing
metadata reads at that point.Helper function hptrs is a shortcut to determine how many
pointers should be expected in a buffer. Height 0 is the dinode
which has fewer pointers than the others.Signed-off-by: Bob Peterson
05 Apr, 2017
1 commit
-
Commit 86066914edff2316cbed63aac8a87d5001441a16 "gfs2: Don't support
fallocate on jdata files" removed the ability of gfs2_grow to reserve
space at the end of the rindex, which could prevent a second gfs2_grow
from succeeding if the fs is full. Allow fallocate to work on the rindex
once again.Signed-off-by: Andrew Price
Signed-off-by: Bob Peterson
03 Apr, 2017
2 commits
-
Revert commit 86d067a797d4e8546a7c92b985f31e8cd3ec39ad: it turns out
that waiting for iopen glock dequeues here isn't needed anymore because
the bugs that commit was meant to fix have been fixed otherwise.In addition, we want to avoid waiting on glocks in gfs2_evict_inode in
shrinker context because the shrinker may be invoked on behalf of DLM,
in which case calling into DLM again would deadlock. This commit makes
the described scenario less likely without completely avoiding it; it's
still a step in the right direction, though.Signed-off-by: Andreas Gruenbacher
Signed-off-by: Bob Peterson -
Switch from rhashtable_lookup_insert_fast + rhashtable_lookup_fast to
rhashtable_lookup_get_insert_fast, which is cleaner and avoids an extra
rhashtable lookup.At the same time, turn the retry loop in gfs2_glock_get into an infinite
loop. The lookup or insert will eventually succeed, usually very fast,
but there is no reason to give up trying at a fixed number of
iterations.Signed-off-by: Andreas Gruenbacher
Signed-off-by: Bob Peterson
17 Mar, 2017
1 commit
-
Before this patch i_no_addr was not initialized until after the
return from allocating its block. That meant the i_no_addr was
temporarily uninitialized storage. Ordinarily that's not a concern,
but if inplace_reserve can't find space, it can call try_rgrp_unlink
which references i_no_addr as a block to avoid. That can result in
unpredictable behavior. More importantly, the trace point in
gfs2_alloc_blocks references ip->i_no_addr before it is set, which
is misleading when reading the kernel traces. This patch makes it
look like the new dinode block was assigned in the name of inode 0
rather than a random inode that's completely unrelated.Signed-off-by: Bob Peterson
16 Mar, 2017
4 commits
-
As per a suggestion by Linus, don't pack struct lm_lockname: we did that
because the struct is used as a rhashtable key, but packing tells the
compiler that the 64-bit fields in the struct may be unaligned, causing
it to generate worse code on some architectures. Instead, rearrange the
fields in the struct so that there is no padding between fields, and
exclude any tail padding from the hash key size.Signed-off-by: Andreas Gruenbacher
Signed-off-by: Bob Peterson -
Both functions are identical except for the seq_operations used.
Signed-off-by: Andreas Gruenbacher
Signed-off-by: Bob Peterson -
Function rhashtable_walk_init is deprecated.
Signed-off-by: Andreas Gruenbacher
Signed-off-by: Bob Peterson -
When the GFS2 file system withdraws due to metadata corruption, it
often has outstanding transactions in the journal and delayed work
queued for its glocks. This patch adds some new checks for a
withdrawn file system before proceeding with operations that would
obviously cause a BUG() to be triggered. That allows GFS2 to be
safely unmounted rather than cause the system to go down.Signed-off-by: Bob Peterson
15 Mar, 2017
1 commit
-
Commit 88ffbf3e03 switches to using rhashtables for glocks, hashing over
the entire struct lm_lockname instead of its individual fields. On some
architectures, struct lm_lockname contains a hole of uninitialized
memory due to alignment rules, which now leads to incorrect hash values.
Get rid of that hole.Signed-off-by: Andreas Gruenbacher
Signed-off-by: Bob Peterson
CC: #v4.3+
04 Mar, 2017
1 commit
-
Pull vfs 'statx()' update from Al Viro.
This adds the new extended stat() interface that internally subsumes our
previous stat interfaces, and allows user mode to specify in more detail
what kind of information it wants.It also allows for some explicit synchronization information to be
passed to the filesystem, which can be relevant for network filesystems:
is the cached value ok, or do you need open/close consistency, or what?From David Howells.
Andreas Dilger points out that the first version of the extended statx
interface was posted June 29, 2010:https://www.spinics.net/lists/linux-fsdevel/msg33831.html
* 'rebased-statx' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
statx: Add a system call to make enhanced file info available
03 Mar, 2017
1 commit
-
Add a system call to make extended file information available, including
file creation and some attribute flags where available through the
underlying filesystem.The getattr inode operation is altered to take two additional arguments: a
u32 request_mask and an unsigned int flags that indicate the
synchronisation mode. This change is propagated to the vfs_getattr*()
function.Functions like vfs_stat() are now inline wrappers around new functions
vfs_statx() and vfs_statx_fd() to reduce stack usage.========
OVERVIEW
========The idea was initially proposed as a set of xattrs that could be retrieved
with getxattr(), but the general preference proved to be for a new syscall
with an extended stat structure.A number of requests were gathered for features to be included. The
following have been included:(1) Make the fields a consistent size on all arches and make them large.
(2) Spare space, request flags and information flags are provided for
future expansion.(3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
__s64).(4) Creation time: The SMB protocol carries the creation time, which could
be exported by Samba, which will in turn help CIFS make use of
FS-Cache as that can be used for coherency data (stx_btime).This is also specified in NFSv4 as a recommended attribute and could
be exported by NFSD [Steve French].(5) Lightweight stat: Ask for just those details of interest, and allow a
netfs (such as NFS) to approximate anything not of interest, possibly
without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
Dilger] (AT_STATX_DONT_SYNC).(6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
its cached attributes are up to date [Trond Myklebust]
(AT_STATX_FORCE_SYNC).And the following have been left out for future extension:
(7) Data version number: Could be used by userspace NFS servers [Aneesh
Kumar].Can also be used to modify fill_post_wcc() in NFSD which retrieves
i_version directly, but has just called vfs_getattr(). It could get
it from the kstat struct if it used vfs_xgetattr() instead.(There's disagreement on the exact semantics of a single field, since
not all filesystems do this the same way).(8) BSD stat compatibility: Including more fields from the BSD stat such
as creation time (st_btime) and inode generation number (st_gen)
[Jeremy Allison, Bernd Schubert].(9) Inode generation number: Useful for FUSE and userspace NFS servers
[Bernd Schubert].(This was asked for but later deemed unnecessary with the
open-by-handle capability available and caused disagreement as to
whether it's a security hole or not).(10) Extra coherency data may be useful in making backups [Andreas Dilger].
(No particular data were offered, but things like last backup
timestamp, the data version number and the DOS archive bit would come
into this category).(11) Allow the filesystem to indicate what it can/cannot provide: A
filesystem can now say it doesn't support a standard stat feature if
that isn't available, so if, for instance, inode numbers or UIDs don't
exist or are fabricated locally...(This requires a separate system call - I have an fsinfo() call idea
for this).(12) Store a 16-byte volume ID in the superblock that can be returned in
struct xstat [Steve French].(Deferred to fsinfo).
(13) Include granularity fields in the time data to indicate the
granularity of each of the times (NFSv4 time_delta) [Steve French].(Deferred to fsinfo).
(14) FS_IOC_GETFLAGS value. These could be translated to BSD's st_flags.
Note that the Linux IOC flags are a mess and filesystems such as Ext4
define flags that aren't in linux/fs.h, so translation in the kernel
may be a necessity (or, possibly, we provide the filesystem type too).(Some attributes are made available in stx_attributes, but the general
feeling was that the IOC flags were to ext[234]-specific and shouldn't
be exposed through statx this way).(15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
Michael Kerrisk].(Deferred, probably to fsinfo. Finding out if there's an ACL or
seclabal might require extra filesystem operations).(16) Femtosecond-resolution timestamps [Dave Chinner].
(A __reserved field has been left in the statx_timestamp struct for
this - if there proves to be a need).(17) A set multiple attributes syscall to go with this.
===============
NEW SYSTEM CALL
===============The new system call is:
int ret = statx(int dfd,
const char *filename,
unsigned int flags,
unsigned int mask,
struct statx *buffer);The dfd, filename and flags parameters indicate the file to query, in a
similar way to fstatat(). There is no equivalent of lstat() as that can be
emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags. There is
also no equivalent of fstat() as that can be emulated by passing a NULL
filename to statx() with the fd of interest in dfd.Whether or not statx() synchronises the attributes with the backing store
can be controlled by OR'ing a value into the flags argument (this typically
only affects network filesystems):(1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
respect.(2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
its attributes with the server - which might require data writeback to
occur to get the timestamps correct.(3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
network filesystem. The resulting values should be considered
approximate.mask is a bitmask indicating the fields in struct statx that are of
interest to the caller. The user should set this to STATX_BASIC_STATS to
get the basic set returned by stat(). It should be noted that asking for
more information may entail extra I/O operations.buffer points to the destination for the data. This must be 256 bytes in
size.======================
MAIN ATTRIBUTES RECORD
======================The following structures are defined in which to return the main attribute
set:struct statx_timestamp {
__s64 tv_sec;
__s32 tv_nsec;
__s32 __reserved;
};struct statx {
__u32 stx_mask;
__u32 stx_blksize;
__u64 stx_attributes;
__u32 stx_nlink;
__u32 stx_uid;
__u32 stx_gid;
__u16 stx_mode;
__u16 __spare0[1];
__u64 stx_ino;
__u64 stx_size;
__u64 stx_blocks;
__u64 __spare1[1];
struct statx_timestamp stx_atime;
struct statx_timestamp stx_btime;
struct statx_timestamp stx_ctime;
struct statx_timestamp stx_mtime;
__u32 stx_rdev_major;
__u32 stx_rdev_minor;
__u32 stx_dev_major;
__u32 stx_dev_minor;
__u64 __spare2[14];
};The defined bits in request_mask and stx_mask are:
STATX_TYPE Want/got stx_mode & S_IFMT
STATX_MODE Want/got stx_mode & ~S_IFMT
STATX_NLINK Want/got stx_nlink
STATX_UID Want/got stx_uid
STATX_GID Want/got stx_gid
STATX_ATIME Want/got stx_atime{,_ns}
STATX_MTIME Want/got stx_mtime{,_ns}
STATX_CTIME Want/got stx_ctime{,_ns}
STATX_INO Want/got stx_ino
STATX_SIZE Want/got stx_size
STATX_BLOCKS Want/got stx_blocks
STATX_BASIC_STATS [The stuff in the normal stat struct]
STATX_BTIME Want/got stx_btime{,_ns}
STATX_ALL [All currently available stuff]stx_btime is the file creation time, stx_mask is a bitmask indicating the
data provided and __spares*[] are where as-yet undefined fields can be
placed.Time fields are structures with separate seconds and nanoseconds fields
plus a reserved field in case we want to add even finer resolution. Note
that times will be negative if before 1970; in such a case, the nanosecond
fields will also be negative if not zero.The bits defined in the stx_attributes field convey information about a
file, how it is accessed, where it is and what it does. The following
attributes map to FS_*_FL flags and are the same numerical value:STATX_ATTR_COMPRESSED File is compressed by the fs
STATX_ATTR_IMMUTABLE File is marked immutable
STATX_ATTR_APPEND File is append-only
STATX_ATTR_NODUMP File is not to be dumped
STATX_ATTR_ENCRYPTED File requires key to decrypt in fsWithin the kernel, the supported flags are listed by:
KSTAT_ATTR_FS_IOC_FLAGS
[Are any other IOC flags of sufficient general interest to be exposed
through this interface?]New flags include:
STATX_ATTR_AUTOMOUNT Object is an automount trigger
These are for the use of GUI tools that might want to mark files specially,
depending on what they are.Fields in struct statx come in a number of classes:
(0) stx_dev_*, stx_blksize.
These are local system information and are always available.
(1) stx_mode, stx_nlinks, stx_uid, stx_gid, stx_[amc]time, stx_ino,
stx_size, stx_blocks.These will be returned whether the caller asks for them or not. The
corresponding bits in stx_mask will be set to indicate whether they
actually have valid values.If the caller didn't ask for them, then they may be approximated. For
example, NFS won't waste any time updating them from the server,
unless as a byproduct of updating something requested.If the values don't actually exist for the underlying object (such as
UID or GID on a DOS file), then the bit won't be set in the stx_mask,
even if the caller asked for the value. In such a case, the returned
value will be a fabrication.Note that there are instances where the type might not be valid, for
instance Windows reparse points.(2) stx_rdev_*.
This will be set only if stx_mode indicates we're looking at a
blockdev or a chardev, otherwise will be 0.(3) stx_btime.
Similar to (1), except this will be set to 0 if it doesn't exist.
=======
TESTING
=======The following test program can be used to test the statx system call:
samples/statx/test-statx.c
Just compile and run, passing it paths to the files you want to examine.
The file is built automatically if CONFIG_SAMPLES is enabled.Here's some example output. Firstly, an NFS directory that crosses to
another FSID. Note that the AUTOMOUNT attribute is set because transiting
this directory will cause d_automount to be invoked by the VFS.[root@andromeda ~]# /tmp/test-statx -A /warthog/data
statx(/warthog/data) = 0
results=7ff
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:26 Inode: 1703937 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000
Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)Secondly, the result of automounting on that directory.
[root@andromeda ~]# /tmp/test-statx /warthog/data
statx(/warthog/data) = 0
results=7ff
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:27 Inode: 2 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000Signed-off-by: David Howells
Signed-off-by: Al Viro
02 Mar, 2017
2 commits
-
…hed.h> into <linux/sched/signal.h>
Fix up affected files that include this signal functionality via sched.h.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org> -
Add #include dependencies to all .c files rely on sched.h
doing that for them.Note that even if the count where we need to add extra headers seems high,
it's still a net win, because is included in over
2,200 files ...Acked-by: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar
25 Feb, 2017
1 commit
-
->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to
take a vma and vmf parameter when the vma already resides in vmf.Remove the vma parameter to simplify things.
[arnd@arndb.de: fix ARM build]
Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Dave Jiang
Signed-off-by: Arnd Bergmann
Reviewed-by: Ross Zwisler
Cc: Theodore Ts'o
Cc: Darrick J. Wong
Cc: Matthew Wilcox
Cc: Dave Hansen
Cc: Christoph Hellwig
Cc: Jan Kara
Cc: Dan Williams
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
24 Feb, 2017
1 commit
-
Pull GFS2 fix from Bob Peterson:
"This is an addendum for the 4.11 merge window.Andy Price wrote this patch to close a nasty race condition that
allows access to glocks that are being destroyed. Without this patch,
GFS2 is vulnerable to random corruption and kernel panic"* tag 'gfs2-4.11.addendum' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Add missing rcu locking for glock lookup
23 Feb, 2017
1 commit
-
We must hold the rcu read lock across looking up glocks and trying to
bump their refcount to prevent the glocks from being freed in between.Cc: # 4.3+
Signed-off-by: Andrew Price
Signed-off-by: Andreas Gruenbacher
Signed-off-by: Bob Peterson