08 Jul, 2017

2 commits

  • Pull Writeback error handling updates from Jeff Layton:
    "This pile represents the bulk of the writeback error handling fixes
    that I have for this cycle. Some of the earlier patches in this pile
    may look trivial but they are prerequisites for later patches in the
    series.

    The aim of this set is to improve how we track and report writeback
    errors to userland. Most applications that care about data integrity
    will periodically call fsync/fdatasync/msync to ensure that their
    writes have made it to the backing store.

    For a very long time, we have tracked writeback errors using two flags
    in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a
    writeback error occurs (via mapping_set_error) and are cleared as a
    side-effect of filemap_check_errors (as you noted yesterday). This
    model really sucks for userland.

    Only the first task to call fsync (or msync or fdatasync) will see the
    error. Any subsequent task calling fsync on a file will get back 0
    (unless another writeback error occurs in the interim). If I have
    several tasks writing to a file and calling fsync to ensure that their
    writes got stored, then I need to have them coordinate with one
    another. That's difficult enough, but in a world of containerized
    setups that coordination may not even be possible.

    But wait...it gets worse!

    The calls to filemap_check_errors can be buried pretty far down in the
    call stack, and there are internal callers of filemap_write_and_wait
    and the like that also end up clearing those errors. Many of those
    callers ignore the error return from that function or return it to
    userland at nonsensical times (e.g. truncate() or stat()). If I get
    back -EIO on a truncate, there is no reason to think that it was
    because some previous writeback failed, and a subsequent fsync() will
    (incorrectly) return 0.

    This pile aims to do three things:

    1) ensure that when a writeback error occurs, that error will be
    reported to userland on a subsequent fsync/fdatasync/msync call,
    regardless of what internal callers are doing

    2) report writeback errors on all file descriptions that were open at
    the time that the error occurred. This is a user-visible change,
    but I think most applications are written to assume this behavior
    anyway. Those that aren't are unlikely to be hurt by it.

    3) document what filesystems should do when there is a writeback
    error. Today, there is very little consistency between them, and a
    lot of cargo-cult copying. We need to make it very clear what
    filesystems should do in this situation.

    To achieve this, the set adds a new data type (errseq_t) and then
    builds new writeback error tracking infrastructure around that. Once
    all of that is in place, we change the filesystems to use the new
    infrastructure for reporting wb errors to userland.

    Note that this is just the initial foray into cleaning up this mess.
    There is a lot of work remaining here:

    1) convert the rest of the filesystems in a similar fashion. Once the
    initial set is in, then I think most other fs' will be fairly
    simple to convert. Hopefully most of those can go in via individual
    filesystem trees.

    2) convert internal waiters on writeback to use errseq_t for
    detecting errors instead of relying on the AS_* flags. I have some
    draft patches for this for ext4, but they are not quite ready for
    prime time yet.

    This was a discussion topic this year at LSF/MM too. If you're
    interested in the gory details, LWN has some good articles about this:

    https://lwn.net/Articles/718734/
    https://lwn.net/Articles/724307/"

    * tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
    btrfs: minimal conversion to errseq_t writeback error reporting on fsync
    xfs: minimal conversion to errseq_t writeback error reporting
    ext4: use errseq_t based error handling for reporting data writeback errors
    fs: convert __generic_file_fsync to use errseq_t based reporting
    block: convert to errseq_t based writeback error tracking
    dax: set errors in mapping when writeback fails
    Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors
    mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error
    fs: new infrastructure for writeback error handling and reporting
    lib: add errseq_t type and infrastructure for handling it
    mm: don't TestClearPageError in __filemap_fdatawait_range
    mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails
    jbd2: don't clear and reset errors after waiting on writeback
    buffer: set errors in mapping at the time that the error occurs
    fs: check for writeback errors after syncing out buffers in generic_file_fsync
    buffer: use mapping_set_error instead of setting the flag
    mm: fix mapping_set_error call in me_pagecache_dirty
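
    The reporting model described above (every file description that was
    open when the error occurred sees it exactly once) can be sketched in
    userland C. This is a simplified illustration, not the kernel's
    errseq_t: the struct layouts and the wb_* names are hypothetical, and
    the real type packs the error and a counter into a single word.

```c
#include <errno.h>

/* Hypothetical per-mapping state: last error plus a change counter. */
struct wb_err {
    int err;            /* last writeback error, e.g. -EIO */
    unsigned int seq;   /* bumped every time an error is recorded */
};

/* Hypothetical per-open-file cursor: the counter value last observed. */
struct wb_cursor {
    unsigned int seen;
};

/* Record a writeback error in the mapping. */
static void wb_record_error(struct wb_err *m, int err)
{
    m->err = err;
    m->seq++;
}

/* Sample the current state at open time, so older errors are not reported. */
static void wb_sample(const struct wb_err *m, struct wb_cursor *c)
{
    c->seen = m->seq;
}

/*
 * Called from an fsync-like path: report the error if one was recorded
 * since this cursor last looked, then advance the cursor. Other cursors
 * are not affected, so each open file still gets its own report.
 */
static int wb_check_and_advance(const struct wb_err *m, struct wb_cursor *c)
{
    if (c->seen == m->seq)
        return 0;
    c->seen = m->seq;
    return m->err;
}
```

    With two cursors sampled before an error is recorded, both get the
    error on their next check, and each gets 0 on the check after that.
    A cursor sampled after the error gets 0.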

    Linus Torvalds
     
  • Before commit 88ffbf3e03 "GFS2: Use resizable hash table for glocks",
    glocks were freed via call_rcu to allow reading the glock hashtable
    locklessly using rcu. This was then changed to free glocks immediately,
    which made reading the glock hashtable unsafe. Bring back the original
    code for freeing glocks via call_rcu.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Bob Peterson
    Cc: stable@vger.kernel.org # 4.3+

    Andreas Gruenbacher
     

06 Jul, 2017

2 commits

  • I noticed on xfs that I could still sometimes get back an error on fsync
    on a fd that was opened after the error condition had been cleared.

    The problem is that the buffer code sets the write_io_error flag and
    then later checks that flag to set the error in the mapping. That flag
    persists for quite a while, however. If the file is later opened with
    O_TRUNC, the buffers will then be invalidated and the mapping's error
    set such that a subsequent fsync will return error. I think this is
    incorrect, as there was no writeback between the open and fsync.

    Add a new mark_buffer_write_io_error operation that sets the flag and
    the error in the mapping at the same time. Replace all calls to
    set_buffer_write_io_error with mark_buffer_write_io_error, and remove
    the places that check this flag in order to set the error in the
    mapping.

    This sets the error in the mapping earlier, at the time that it's first
    detected.

    Signed-off-by: Jeff Layton
    Reviewed-by: Jan Kara
    Reviewed-by: Carlos Maiolino

    Jeff Layton
     
  • Pull GFS2 updates from Bob Peterson:
    "We've got eight GFS2 patches for this merge window:

    - Andreas Gruenbacher has four patches related to cleaning up the
    GFS2 inode evict process. This is about half of his patches
    designed to fix a long-standing GFS2 hang related to the inode
    shrinker: Shrinker calls gfs2 evict, evict calls DLM, DLM requires
    memory and blocks on the shrinker.

    These four patches have been well tested. His second set of patches
    are still being tested, so I plan to hold them until the next merge
    window, after we have more weeks of testing. The first patch
    eliminates the flush_delayed_work, which can block.

    - Andreas's second patch protects setting of gl_object for rgrps with
    a spin_lock to prevent proven races.

    - His third patch introduces a centralized mechanism for queueing
    glock work with better reference counting, to prevent more races.

    - His fourth patch retains a reference to inode glocks when an error
    occurs while creating an inode. This keeps the subsequent evict
    from needing to reacquire the glock, which might call into DLM and
    block in low memory conditions.

    - Arvind Yadav has a patch to add const to attribute_group
    structures.

    - I have a patch to detect directory entry inconsistencies and
    withdraw the file system if any are found. Better that than silent
    corruption.

    - I have a patch to remove a vestigial variable from glock
    structures, saving some slab space.

    - I have another patch to remove a vestigial variable from the GFS2
    in-core superblock structure"

    * tag 'gfs2-4.13.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
    GFS2: constify attribute_group structures.
    gfs2: gfs2_create_inode: Keep glock across iput
    gfs2: Clean up glock work enqueuing
    gfs2: Protect gl->gl_object by spin lock
    gfs2: Get rid of flush_delayed_work in gfs2_evict_inode
    GFS2: Eliminate vestigial sd_log_flush_wrapped
    GFS2: Remove gl_list from glock structure
    GFS2: Withdraw when directory entry inconsistencies are detected

    Linus Torvalds
     

05 Jul, 2017

5 commits

  • attribute_groups are not supposed to change at runtime. All functions
    working with attribute_groups work with const
    attribute_group. So mark the non-const structs as const.

    File size before:
       text    data     bss     dec     hex filename
       5259    1344       8    6611    19d3 fs/gfs2/sys.o

    File size after adding 'const':
       text    data     bss     dec     hex filename
       5371    1216       8    6595    19c3 fs/gfs2/sys.o

    Signed-off-by: Arvind Yadav
    Signed-off-by: Bob Peterson

    Arvind Yadav
     
  • On failure, keep the inode glock across the final iput of the new inode
    so that gfs2_evict_inode doesn't have to re-acquire the glock. That
    way, gfs2_evict_inode won't need to revalidate the block type.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Bob Peterson

    Andreas Gruenbacher
     
  • This patch adds a standardized queueing mechanism for glock work
    with spin_lock protection to prevent races.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Bob Peterson

    Andreas Gruenbacher
     
  • Put all remaining accesses to gl->gl_object under the
    gl->gl_lockref.lock spinlock to prevent races.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Bob Peterson

    Andreas Gruenbacher
     
  • So far, gfs2_evict_inode clears gl->gl_object and then flushes the glock
    work queue to make sure that inode glops which dereference gl->gl_object
    have finished running before the inode is destroyed. However, flushing
    the work queue may do more work than needed, and in particular, it may
    call into DLM, which we want to avoid here. Use a bit lock
    (GIF_GLOP_PENDING) to synchronize between the inode glops and
    gfs2_evict_inode instead to get rid of the flushing.

    In addition, flush the work queues of existing glocks before reusing
    them for new inodes to get those glocks into a known state: the glock
    state engine currently doesn't handle glock re-appropriation correctly.
    (We may be able to fix the glock state engine instead later.)
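
    The bit-lock idea can be sketched with C11 atomics. This is only an
    illustration of the synchronization pattern, assuming a single flag
    word per inode. The toy_inode type, the bit value and the function
    names are hypothetical, and the kernel code sleeps on the bit instead
    of polling it.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_GLOP_PENDING 0x1u   /* hypothetical bit value */

struct toy_inode {
    atomic_uint flags;
};

/*
 * A glock operation that wants to dereference the inode first takes the
 * bit. Returns false if another operation already holds it.
 */
static bool toy_glop_try_begin(struct toy_inode *ip)
{
    unsigned int old = atomic_fetch_or(&ip->flags, TOY_GLOP_PENDING);

    return !(old & TOY_GLOP_PENDING);
}

/* The glock operation is done with the inode: release the bit. */
static void toy_glop_end(struct toy_inode *ip)
{
    atomic_fetch_and(&ip->flags, ~TOY_GLOP_PENDING);
}

/* The evict path may only destroy the inode once the bit is clear. */
static bool toy_evict_may_destroy(struct toy_inode *ip)
{
    return !(atomic_load(&ip->flags) & TOY_GLOP_PENDING);
}
```

    Compared with flushing a whole work queue, the evict path here only
    waits for the one operation that actually touches this inode.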

    Based on a patch by Steven Whitehouse.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Bob Peterson

    Andreas Gruenbacher
     

20 Jun, 2017

1 commit


13 Jun, 2017

3 commits


12 Jun, 2017

1 commit

  • We've already got a few conflicts and upcoming work depends on some of the
    changes that have gone into mainline as regression fixes for this series.

    Pull in 4.12-rc5 to resolve these conflicts and make it easier on downstream
    trees to continue working on 4.13 changes.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

09 Jun, 2017

2 commits


05 Jun, 2017

1 commit

  • For some file systems we still memcpy into it, but in various places this
    already allows us to use the proper uuid helpers. More to come..

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Amir Goldstein
    Acked-by: Mimi Zohar (Changes to IMA/EVM)
    Reviewed-by: Andy Shevchenko

    Christoph Hellwig
     

24 May, 2017

1 commit

  • Commit b685d3d65ac7 "block: treat REQ_FUA and REQ_PREFLUSH as
    synchronous" removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...}
    definitions. generic_make_request_checks(), however, strips the REQ_FUA and
    REQ_PREFLUSH flags from a bio when the storage doesn't report a volatile
    write cache, so the write effectively becomes asynchronous, which can
    lead to performance regressions.

    Fix the problem by making sure all bios which are synchronous are
    properly marked with REQ_SYNC.
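
    The effect can be sketched with a toy flag model. The X_* bit values
    and helper names below are hypothetical stand-ins, not the kernel's
    definitions. The sketch assumes a request counts as synchronous when
    any of the three flags is set.

```c
#include <stdbool.h>

/* Hypothetical bit values for illustration only. */
#define X_REQ_SYNC     (1u << 0)
#define X_REQ_FUA      (1u << 1)
#define X_REQ_PREFLUSH (1u << 2)

/* A request is treated as synchronous if any of the three flags is set. */
static bool x_is_sync(unsigned int flags)
{
    return (flags & (X_REQ_SYNC | X_REQ_FUA | X_REQ_PREFLUSH)) != 0;
}

/*
 * Model of the stripping step: without a volatile write cache the
 * flush-related flags are meaningless, so they are removed.
 */
static unsigned int x_strip_flush_flags(unsigned int flags,
                                        bool has_volatile_cache)
{
    if (!has_volatile_cache)
        flags &= ~(X_REQ_FUA | X_REQ_PREFLUSH);
    return flags;
}
```

    A write submitted with only the FUA and PREFLUSH bits stops looking
    synchronous after stripping, while one that also carries the SYNC bit
    keeps its synchronous status.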

    Fixes: b685d3d65ac791406e0dfd8779cc9b3707fea5a3
    CC: Steven Whitehouse
    CC: cluster-devel@redhat.com
    CC: stable@vger.kernel.org
    Acked-by: Bob Peterson
    Signed-off-by: Jan Kara

    Jan Kara
     

09 May, 2017

1 commit


06 May, 2017

2 commits

  • Pull GFS2 updates from Bob Peterson:
    "We've got ten GFS2 patches for this merge window.

    - Andreas Gruenbacher wrote a patch to replace the deprecated call to
    rhashtable_walk_init with rhashtable_walk_enter.

    - Andreas also wrote a patch to eliminate redundant code in two of
    our debugfs sequence files.

    - Andreas also cleaned up the rhashtable key ugliness Linus pointed
    out during this cycle, following Linus's suggestions.

    - Andreas also wrote a patch to take advantage of his new function
    rhashtable_lookup_get_insert_fast. This makes glock lookup faster
    and more bullet-proof.

    - Andreas also wrote a patch to revert a patch in the evict path that
    caused occasional deadlocks, and is no longer needed.

    - Andrew Price wrote a patch to re-enable fallocate for the rindex
    system file to enable gfs2_grow to grow properly on secondary file
    system grow operations.

    - I wrote a patch to initialize an inode number field to make certain
    kernel trace points more understandable.

    - I also wrote a patch that makes GFS2 file system "withdraw" work
    more like it should by ignoring operations after a withdraw that
    would formerly cause a BUG() and kernel panic.

    - I also reworked the entire truncate/delete algorithm, scrapping the
    old recursive algorithm in favor of a new non-recursive algorithm.
    This was done for performance: This way, GFS2 no longer needs to
    lock multiple resource groups while doing truncates and deletes of
    files that cross multiple resource group boundaries, allowing for
    better parallelism. It also solves a problem whereby deleting large
    files would request a large chunk of kernel memory, which resulted
    in a get_page_from_freelist warning.

    - Due to a regression found during testing, I added a new patch to
    correct 'GFS2: Prevent BUG from occurring when normal Withdraws
    occur'."

    * tag 'gfs2-4.12.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
    GFS2: Allow glocks to be unlocked after withdraw
    GFS2: Non-recursive delete
    gfs2: Re-enable fallocate for the rindex
    Revert "GFS2: Wait for iopen glock dequeues"
    gfs2: Switch to rhashtable_lookup_get_insert_fast
    GFS2: Temporarily zero i_no_addr when creating a dinode
    gfs2: Don't pack struct lm_lockname
    gfs2: Deduplicate gfs2_{glocks,glstats}_open
    gfs2: Replace rhashtable_walk_init with rhashtable_walk_enter
    GFS2: Prevent BUG from occurring when normal Withdraws occur

    Linus Torvalds
     
  • This patch fixes a regression introduced by patch 0d1c7ae9d8.

    The intent of the patch was to stop promoting glocks after a
    file system is withdrawn due to a variety of errors, because doing
    so results in a BUG(). (You should be able to unmount after a
    withdraw rather than having the kernel panic.)

    Unfortunately, it also stopped demotions, so glocks could not be
    unlocked after withdraw, which means the unmount would hang.

    This patch allows function do_xmote to demote locks to an
    unlocked state after a withdraw, but not promote them.

    Signed-off-by: Bob Peterson

    Bob Peterson
     

21 Apr, 2017

2 commits

  • Now that all bdi structures filesystems use are properly refcounted, we
    can remove the SB_I_DYNBDI flag.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • Similarly to set_bdev_super() GFS2 just used block device reference to
    bdi. Convert it to properly getting bdi reference. The reference will
    get automatically dropped on superblock destruction.

    CC: Steven Whitehouse
    CC: Bob Peterson
    CC: cluster-devel@redhat.com
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

19 Apr, 2017

1 commit

  • Implement truncate/delete as a non-recursive algorithm. The older
    algorithm was implemented with recursion to strip off each layer
    at a time (going by height, starting with the maximum height).
    This version tries to do the same thing but without recursion,
    and without needing to allocate new structures or lists in memory.

    For example, say you want to truncate a very large file to 1 byte,
    and its end-of-file metapath is: 0.505.463.428. The starting
    metapath would be 0.0.0.0. Since it's a truncate to non-zero, it
    needs to preserve that byte, and all metadata pointing to it.
    So it would start at 0.0.0.0, look up all its metadata buffers,
    then free all data blocks pointed to at the highest level.
    After that buffer is "swept", it moves on to 0.0.0.1, then
    0.0.0.2, etc., reading in buffers and sweeping them clean.
    When it gets to the end of the 0.0.0 metadata buffer (for 4K
    blocks the last valid one is 0.0.0.508), it backs up to the
    previous height and starts working on 0.0.1.0, then 0.0.1.1,
    and so forth. After it reaches the end and sweeps 0.0.1.508,
    it continues with 0.0.2.0, and so on. When that height is
    exhausted, and it reaches 0.0.508.508 it backs up another level,
    to 0.1.0.0, then 0.1.0.1, through 0.1.0.508. So it has to keep
    marching backwards and forwards through the metadata until it's
    all swept clean. Once it has all the data blocks freed, it
    lowers the strip height, and begins the process all over again,
    but with one less height. This time it sweeps 0.0.0 through
    0.505.463. When that's clean, it lowers the strip height again
    and works to free 0.505. Eventually it strips the lowest height, 0.
    For a delete or truncate to 0, all metadata for all heights of
    0.0.0.0 would be freed. For a truncate to 1 byte, 0.0.0.0 would
    be preserved.

    This isn't much different from normal integer incrementing,
    where an integer gets incremented from 0000 (0.0.0.0) to 3021
    (3.0.2.1). So 0000 gets incremented to 0001, 0002, up to 0009,
    then on to 0010, 0011 up to 0099, then 0100 and so forth. It's
    just that each "digit" goes from 0 to 508 (for a total of 509
    pointers) rather than from 0 to 9.

    Note that the dinode will only have 483 pointers due to the
    dinode structure itself.

    Also note: this is just an example. These numbers (509 and 483)
    are based on a standard 4K block size. Smaller block sizes will
    yield smaller numbers of indirect pointers accordingly.
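
    The counting analogy can be written down as a small odometer-style
    increment. This is only a sketch of the digit arithmetic for 4K
    blocks, with hypothetical helper names. It does not read buffers,
    skip null pointers or free anything, all of which the real code has
    to do.

```c
/*
 * Number of pointers at a given height, assuming 4K blocks:
 * height 0 is the dinode (483 pointers), other heights have 509.
 */
static unsigned int toy_ptrs_at(unsigned int height)
{
    return height == 0 ? 483u : 509u;
}

/*
 * Advance a metapath to the next position, carrying into the previous
 * height when a "digit" overflows. mp[0] is the dinode index and
 * mp[heights - 1] is the deepest level.
 * Returns 1 if a next position exists, 0 if the metapath wrapped around.
 */
static int toy_metapath_next(unsigned int mp[], unsigned int heights)
{
    for (int h = (int)heights - 1; h >= 0; h--) {
        if (++mp[h] < toy_ptrs_at((unsigned int)h))
            return 1;
        mp[h] = 0;  /* overflow: reset this digit and carry left */
    }
    return 0;
}
```

    Starting from 0.0.0.508 this yields 0.0.1.0, and from 0.0.508.508 it
    yields 0.1.0.0, matching the walk described above.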

    The truncation process is accomplished with the help of two
    major functions and a few helper functions.

    Functions do_strip and recursive_scan are obsolete, so removed.

    New function sweep_bh_for_rgrps cleans a buffer_head pointed to
    by the given metapath and height. By cleaning, I mean it frees
    all blocks starting at the offset passed in metapath. It starts
    at the first block in the buffer pointed to by the metapath and
    identifies its resource group (rgrp). From there it frees all
    subsequent block pointers that lie within that rgrp. If it's
    already inside a transaction, it stays within it as long as it
    can. In other words, it doesn't close a transaction until it knows
    it's freed what it can from the resource group. In this way,
    multiple buffers may be cleaned in a single transaction, as long
    as those blocks in the buffer all lie within the same rgrp.

    If it's not in a transaction, it starts one. If the buffer_head
    has references to blocks within multiple rgrps, it frees all the
    blocks inside the first rgrp it finds, then closes the
    transaction. Then it repeats the cycle: identifies the next
    unfreed block, uses it to find its rgrp, then starts a new
    transaction for that set. It repeats this process
    until the buffer_head contains no more references to any blocks
    past the given metapath.

    Function trunc_dealloc has been reworked into a finite state
    automaton. It has basically 3 active states:
    DEALLOC_MP_FULL, DEALLOC_MP_LOWER, and DEALLOC_FILL_MP:

    The DEALLOC_MP_FULL state implies the metapath has a full set
    of buffers out to the "shrink height", and therefore, it can
    call function sweep_bh_for_rgrps to free the blocks within the
    highest height of the metapath. If it's just swept the lowest
    level (or an error has occurred) the state machine is ended.
    Otherwise it proceeds to the DEALLOC_MP_LOWER state.

    The DEALLOC_MP_LOWER state implies we are finished with a given
    buffer_head, which may now be released, and therefore we are
    then missing some buffer information from the metapath. So we
    need to find more buffers to read in. In most cases, this is
    just a matter of releasing the buffer_head and moving to the
    next pointer from the previous height, so it may be read in and
    swept as well. If it can't find another non-null pointer to
    process, it checks whether it's reached the end of a height
    and needs to lower the strip height, or whether it still needs
    to move forward through the previous height's metadata. In this
    state, all zero-pointers are skipped. From this state, it can
    only loop around (once more backing up another height) or,
    once a valid metapath is found (one that has non-zero
    pointers), proceed to state DEALLOC_FILL_MP.

    The DEALLOC_FILL_MP state implies that we have a metapath
    but not all its buffers are read in. So we must proceed to read
    in buffer_heads until the metapath has a valid buffer for every
    height. If the previous state backed us up 3 heights, we may
    need to read in a buffer, increment the height, then repeat the
    process until buffers have been read in for all required heights.
    If it's successful reading a buffer, and it's at the highest
    height we need, it proceeds back to the DEALLOC_MP_FULL state.
    If it's unable to fill in a buffer (encounters a hole, etc.),
    it tries to find another non-zero block pointer. If they're all
    zero, it lowers the height and returns to the DEALLOC_MP_LOWER
    state. If it finds a good non-null pointer, it loops around and
    reads it in, while keeping the metapath in lock-step with the
    pointers it examines.

    The state machine runs until the truncation request is
    satisfied. Then any transactions are ended, the quota and
    statfs data are updated, and the function is complete.

    Helper function metaptr1 was introduced to be an easy way to
    determine the start of a buffer_head's indirect pointers.

    Helper function lookup_mp_height was introduced to find a
    metapath index and read in the buffer that corresponds to it.
    In this way, function lookup_metapath becomes a simple loop to
    call it for every height.

    Helper function fillup_metapath is similar to lookup_metapath
    except it can do partial lookups. If the state machine
    backed up multiple levels (like 2999 wrapping to 3000) it
    needs to find out the next starting point and start issuing
    metadata reads at that point.

    Helper function hptrs is a shortcut to determine how many
    pointers should be expected in a buffer. Height 0 is the dinode
    which has fewer pointers than the others.

    Signed-off-by: Bob Peterson

    Bob Peterson
     

05 Apr, 2017

1 commit

  • Commit 86066914edff2316cbed63aac8a87d5001441a16 "gfs2: Don't support
    fallocate on jdata files" removed the ability of gfs2_grow to reserve
    space at the end of the rindex, which could prevent a second gfs2_grow
    from succeeding if the fs is full. Allow fallocate to work on the rindex
    once again.

    Signed-off-by: Andrew Price
    Signed-off-by: Bob Peterson

    Andrew Price
     

03 Apr, 2017

2 commits

  • Revert commit 86d067a797d4e8546a7c92b985f31e8cd3ec39ad: it turns out
    that waiting for iopen glock dequeues here isn't needed anymore because
    the bugs that commit was meant to fix have been fixed otherwise.

    In addition, we want to avoid waiting on glocks in gfs2_evict_inode in
    shrinker context because the shrinker may be invoked on behalf of DLM,
    in which case calling into DLM again would deadlock. This commit makes
    the described scenario less likely without completely avoiding it; it's
    still a step in the right direction, though.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Bob Peterson

    Andreas Gruenbacher
     
  • Switch from rhashtable_lookup_insert_fast + rhashtable_lookup_fast to
    rhashtable_lookup_get_insert_fast, which is cleaner and avoids an extra
    rhashtable lookup.

    At the same time, turn the retry loop in gfs2_glock_get into an infinite
    loop. The lookup or insert will eventually succeed, usually very fast,
    but there is no reason to give up trying at a fixed number of
    iterations.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Bob Peterson

    Andreas Gruenbacher
     

17 Mar, 2017

1 commit

  • Before this patch i_no_addr was not initialized until after the
    return from allocating its block. That meant the i_no_addr was
    temporarily uninitialized storage. Ordinarily that's not a concern,
    but if inplace_reserve can't find space, it can call try_rgrp_unlink
    which references i_no_addr as a block to avoid. That can result in
    unpredictable behavior. More importantly, the trace point in
    gfs2_alloc_blocks references ip->i_no_addr before it is set, which
    is misleading when reading the kernel traces. This patch makes it
    look like the new dinode block was assigned in the name of inode 0
    rather than a random inode that's completely unrelated.

    Signed-off-by: Bob Peterson

    Bob Peterson
     

16 Mar, 2017

4 commits


15 Mar, 2017

1 commit

  • Commit 88ffbf3e03 switches to using rhashtables for glocks, hashing over
    the entire struct lm_lockname instead of its individual fields. On some
    architectures, struct lm_lockname contains a hole of uninitialized
    memory due to alignment rules, which now leads to incorrect hash values.
    Get rid of that hole.
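
    Why a hole matters when a key is treated as raw bytes can be shown
    with a toy struct. The layout below is hypothetical and is not the
    real struct lm_lockname. Whether it has a hole depends on the ABI
    (typically 4 bytes on x86-64, none on some 32-bit targets). The
    sketch compares bytes with memcmp, and a hash computed over the same
    bytes has the same problem.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical key layout: many ABIs pad between type and number. */
struct toy_name {
    uint32_t type;
    uint64_t number;
};

/* Bytes of the struct that belong to no field. */
static size_t toy_name_padding(void)
{
    return sizeof(struct toy_name) - sizeof(uint32_t) - sizeof(uint64_t);
}

/*
 * Build the raw bytes of a key on top of "stale" memory: only the two
 * fields are written, so any hole keeps whatever was there before.
 */
static void toy_name_fill(unsigned char *raw, unsigned char stale,
                          uint32_t type, uint64_t number)
{
    memset(raw, stale, sizeof(struct toy_name));
    memcpy(raw + offsetof(struct toy_name, type), &type, sizeof(type));
    memcpy(raw + offsetof(struct toy_name, number), &number, sizeof(number));
}

/* Do two keys with identical fields also have identical bytes? */
static int toy_names_match_bytewise(unsigned char stale_a,
                                    unsigned char stale_b,
                                    uint32_t type, uint64_t number)
{
    unsigned char a[sizeof(struct toy_name)];
    unsigned char b[sizeof(struct toy_name)];

    toy_name_fill(a, stale_a, type, number);
    toy_name_fill(b, stale_b, type, number);
    return memcmp(a, b, sizeof(a)) == 0;
}
```

    On an ABI where toy_name_padding() is non-zero, two keys with equal
    fields but different stale bytes do not match, which is the kind of
    mismatch that removing the hole avoids.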

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Bob Peterson
    CC: #v4.3+

    Andreas Gruenbacher
     

04 Mar, 2017

1 commit

  • Pull vfs 'statx()' update from Al Viro.

    This adds the new extended stat() interface that internally subsumes our
    previous stat interfaces, and allows user mode to specify in more detail
    what kind of information it wants.

    It also allows for some explicit synchronization information to be
    passed to the filesystem, which can be relevant for network filesystems:
    is the cached value ok, or do you need open/close consistency, or what?

    From David Howells.

    Andreas Dilger points out that the first version of the extended statx
    interface was posted June 29, 2010:

    https://www.spinics.net/lists/linux-fsdevel/msg33831.html

    * 'rebased-statx' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    statx: Add a system call to make enhanced file info available

    Linus Torvalds
     

03 Mar, 2017

1 commit

  • Add a system call to make extended file information available, including
    file creation and some attribute flags where available through the
    underlying filesystem.

    The getattr inode operation is altered to take two additional arguments: a
    u32 request_mask and an unsigned int flags that indicate the
    synchronisation mode. This change is propagated to the vfs_getattr*()
    function.

    Functions like vfs_stat() are now inline wrappers around new functions
    vfs_statx() and vfs_statx_fd() to reduce stack usage.

    ========
    OVERVIEW
    ========

    The idea was initially proposed as a set of xattrs that could be retrieved
    with getxattr(), but the general preference proved to be for a new syscall
    with an extended stat structure.

    A number of requests were gathered for features to be included. The
    following have been included:

    (1) Make the fields a consistent size on all arches and make them large.

    (2) Spare space, request flags and information flags are provided for
    future expansion.

    (3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
    __s64).

    (4) Creation time: The SMB protocol carries the creation time, which could
    be exported by Samba, which will in turn help CIFS make use of
    FS-Cache as that can be used for coherency data (stx_btime).

    This is also specified in NFSv4 as a recommended attribute and could
    be exported by NFSD [Steve French].

    (5) Lightweight stat: Ask for just those details of interest, and allow a
    netfs (such as NFS) to approximate anything not of interest, possibly
    without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
    Dilger] (AT_STATX_DONT_SYNC).

    (6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
    its cached attributes are up to date [Trond Myklebust]
    (AT_STATX_FORCE_SYNC).

    And the following have been left out for future extension:

    (7) Data version number: Could be used by userspace NFS servers [Aneesh
    Kumar].

    Can also be used to modify fill_post_wcc() in NFSD which retrieves
    i_version directly, but has just called vfs_getattr(). It could get
    it from the kstat struct if it used vfs_xgetattr() instead.

    (There's disagreement on the exact semantics of a single field, since
    not all filesystems do this the same way).

    (8) BSD stat compatibility: Including more fields from the BSD stat such
    as creation time (st_btime) and inode generation number (st_gen)
    [Jeremy Allison, Bernd Schubert].

    (9) Inode generation number: Useful for FUSE and userspace NFS servers
    [Bernd Schubert].

    (This was asked for but later deemed unnecessary with the
    open-by-handle capability available and caused disagreement as to
    whether it's a security hole or not).

    (10) Extra coherency data may be useful in making backups [Andreas Dilger].

    (No particular data were offered, but things like last backup
    timestamp, the data version number and the DOS archive bit would come
    into this category).

    (11) Allow the filesystem to indicate what it can/cannot provide: A
    filesystem can now say it doesn't support a standard stat feature if
    that isn't available, so if, for instance, inode numbers or UIDs don't
    exist or are fabricated locally...

    (This requires a separate system call - I have an fsinfo() call idea
    for this).

    (12) Store a 16-byte volume ID in the superblock that can be returned in
    struct xstat [Steve French].

    (Deferred to fsinfo).

    (13) Include granularity fields in the time data to indicate the
    granularity of each of the times (NFSv4 time_delta) [Steve French].

    (Deferred to fsinfo).

    (14) FS_IOC_GETFLAGS value. These could be translated to BSD's st_flags.
    Note that the Linux IOC flags are a mess and filesystems such as Ext4
    define flags that aren't in linux/fs.h, so translation in the kernel
    may be a necessity (or, possibly, we provide the filesystem type too).

    (Some attributes are made available in stx_attributes, but the general
    feeling was that the IOC flags were too ext[234]-specific and shouldn't
    be exposed through statx this way).

    (15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
    Michael Kerrisk].

    (Deferred, probably to fsinfo. Finding out if there's an ACL or
    seclabel might require extra filesystem operations).

    (16) Femtosecond-resolution timestamps [Dave Chinner].

    (A __reserved field has been left in the statx_timestamp struct for
    this - if there proves to be a need).

    (17) A set multiple attributes syscall to go with this.

    ===============
    NEW SYSTEM CALL
    ===============

    The new system call is:

    int ret = statx(int dfd,
                    const char *filename,
                    unsigned int flags,
                    unsigned int mask,
                    struct statx *buffer);

    The dfd, filename and flags parameters indicate the file to query, in a
    similar way to fstatat(). There is no equivalent of lstat() as that can be
    emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags. There is
    also no equivalent of fstat() as that can be emulated by passing a NULL
    filename to statx() with the fd of interest in dfd.

    Whether or not statx() synchronises the attributes with the backing store
    can be controlled by OR'ing a value into the flags argument (this typically
    only affects network filesystems):

    (1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
    respect.

    (2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
    its attributes with the server - which might require data writeback to
    occur to get the timestamps correct.

    (3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
    network filesystem. The resulting values should be considered
    approximate.

    mask is a bitmask indicating the fields in struct statx that are of
    interest to the caller. The user should set this to STATX_BASIC_STATS to
    get the basic set returned by stat(). It should be noted that asking for
    more information may entail extra I/O operations.

    buffer points to the destination for the data. This must be 256 bytes in
    size.

    ======================
    MAIN ATTRIBUTES RECORD
    ======================

    The following structures are defined in which to return the main attribute
    set:

    struct statx_timestamp {
        __s64 tv_sec;
        __s32 tv_nsec;
        __s32 __reserved;
    };

    struct statx {
        __u32 stx_mask;
        __u32 stx_blksize;
        __u64 stx_attributes;
        __u32 stx_nlink;
        __u32 stx_uid;
        __u32 stx_gid;
        __u16 stx_mode;
        __u16 __spare0[1];
        __u64 stx_ino;
        __u64 stx_size;
        __u64 stx_blocks;
        __u64 __spare1[1];
        struct statx_timestamp stx_atime;
        struct statx_timestamp stx_btime;
        struct statx_timestamp stx_ctime;
        struct statx_timestamp stx_mtime;
        __u32 stx_rdev_major;
        __u32 stx_rdev_minor;
        __u32 stx_dev_major;
        __u32 stx_dev_minor;
        __u64 __spare2[14];
    };
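
    The 256-byte buffer size stated earlier can be checked against this
    layout. The sketch below mirrors the two structures with fixed-width
    userspace types and verifies the size at compile time. The sketch_
    prefix and the spare/reserved member names without leading underscores
    are assumptions made here to avoid clashing with the real headers.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

/* Local mirror of struct statx_timestamp as listed above: 16 bytes. */
struct sketch_timestamp {
    int64_t tv_sec;
    int32_t tv_nsec;
    int32_t reserved;
};

/* Local mirror of struct statx as listed above. */
struct sketch_statx {
    uint32_t stx_mask;
    uint32_t stx_blksize;
    uint64_t stx_attributes;
    uint32_t stx_nlink;
    uint32_t stx_uid;
    uint32_t stx_gid;
    uint16_t stx_mode;
    uint16_t spare0[1];
    uint64_t stx_ino;
    uint64_t stx_size;
    uint64_t stx_blocks;
    uint64_t spare1[1];
    struct sketch_timestamp stx_atime;
    struct sketch_timestamp stx_btime;
    struct sketch_timestamp stx_ctime;
    struct sketch_timestamp stx_mtime;
    uint32_t stx_rdev_major;
    uint32_t stx_rdev_minor;
    uint32_t stx_dev_major;
    uint32_t stx_dev_minor;
    uint64_t spare2[14];
};

/* 32 + 32 + 4*16 + 16 + 14*8 = 256 bytes, with no implicit padding. */
static_assert(sizeof(struct sketch_timestamp) == 16, "timestamp is 16 bytes");
static_assert(sizeof(struct sketch_statx) == 256, "statx buffer is 256 bytes");
```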

    The defined bits in the mask argument and stx_mask are:

    STATX_TYPE          Want/got stx_mode & S_IFMT
    STATX_MODE          Want/got stx_mode & ~S_IFMT
    STATX_NLINK         Want/got stx_nlink
    STATX_UID           Want/got stx_uid
    STATX_GID           Want/got stx_gid
    STATX_ATIME         Want/got stx_atime{,_ns}
    STATX_MTIME         Want/got stx_mtime{,_ns}
    STATX_CTIME         Want/got stx_ctime{,_ns}
    STATX_INO           Want/got stx_ino
    STATX_SIZE          Want/got stx_size
    STATX_BLOCKS        Want/got stx_blocks
    STATX_BASIC_STATS   [The stuff in the normal stat struct]
    STATX_BTIME         Want/got stx_btime{,_ns}
    STATX_ALL           [All currently available stuff]

    stx_btime is the file creation time, stx_mask is a bitmask indicating the
    data provided and __spare*[] are where as-yet undefined fields can be
    placed.

    Time fields are structures with separate seconds and nanoseconds fields
    plus a reserved field in case we want to add even finer resolution. Note
    that times will be negative if before 1970; in such a case, the nanosecond
    fields will also be negative if not zero.
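
    Because both fields carry the sign for pre-1970 times, code that
    expects the usual timespec convention (nanoseconds in the range 0 to
    999999999) has to renormalise. The following is a minimal sketch of
    that conversion, assuming the signed representation described here.
    The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Local mirror of the timestamp layout described in this message. */
struct sketch_timestamp {
    int64_t tv_sec;
    int32_t tv_nsec;
    int32_t reserved;
};

/*
 * Convert a timestamp whose nanoseconds share the sign of the seconds
 * (e.g. -1s, -500000000ns for 1.5s before the epoch) into the form with
 * 0 <= tv_nsec < 1000000000 (here: -2s, +500000000ns).
 */
static struct sketch_timestamp sketch_normalise(struct sketch_timestamp t)
{
    assert(t.tv_nsec > -1000000000 && t.tv_nsec < 1000000000);

    if (t.tv_nsec < 0) {
        t.tv_sec -= 1;
        t.tv_nsec += 1000000000;
    }
    return t;
}
```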

    The bits defined in the stx_attributes field convey information about a
    file, how it is accessed, where it is and what it does. The following
    attributes map to FS_*_FL flags and are the same numerical value:

    STATX_ATTR_COMPRESSED   File is compressed by the fs
    STATX_ATTR_IMMUTABLE    File is marked immutable
    STATX_ATTR_APPEND       File is append-only
    STATX_ATTR_NODUMP       File is not to be dumped
    STATX_ATTR_ENCRYPTED    File requires key to decrypt in fs

    Within the kernel, the supported flags are listed by:

    KSTAT_ATTR_FS_IOC_FLAGS

    [Are any other IOC flags of sufficient general interest to be exposed
    through this interface?]

    New flags include:

    STATX_ATTR_AUTOMOUNT Object is an automount trigger

    These are for the use of GUI tools that might want to mark files specially,
    depending on what they are.
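
    A tool of that kind might decode stx_attributes along the following
    lines. This is only a sketch: the helper name and the lowercase labels
    are hypothetical, and the STATX_ATTR_* constants are assumed to come
    from the installed linux/stat.h.

```c
#include <assert.h>
#include <linux/stat.h>  /* STATX_ATTR_* */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: write a space-separated list of the attribute
 * names set in attrs into buf (always NUL-terminated) and return how
 * many names were written. Names that do not fit are dropped.
 */
static int sketch_describe_attrs(uint64_t attrs, char *buf, size_t len)
{
    static const struct { uint64_t bit; const char *name; } table[] = {
        { STATX_ATTR_COMPRESSED, "compressed" },
        { STATX_ATTR_IMMUTABLE,  "immutable"  },
        { STATX_ATTR_APPEND,     "append"     },
        { STATX_ATTR_NODUMP,     "nodump"     },
        { STATX_ATTR_ENCRYPTED,  "encrypted"  },
        { STATX_ATTR_AUTOMOUNT,  "automount"  },
    };
    size_t used = 0;
    int count = 0;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';

    for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
        if (!(attrs & table[i].bit))
            continue;
        size_t n = strlen(table[i].name);
        size_t need = n + (used ? 1 : 0);   /* name plus separator */
        if (used + need >= len)             /* keep room for the NUL */
            break;
        if (used)
            buf[used++] = ' ';
        memcpy(buf + used, table[i].name, n);
        used += n;
        buf[used] = '\0';
        count++;
    }
    return count;
}
```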

    Fields in struct statx come in a number of classes:

    (0) stx_dev_*, stx_blksize.

    These are local system information and are always available.

    (1) stx_mode, stx_nlink, stx_uid, stx_gid, stx_[amc]time, stx_ino,
    stx_size, stx_blocks.

    These will be returned whether the caller asks for them or not. The
    corresponding bits in stx_mask will be set to indicate whether they
    actually have valid values.

    If the caller didn't ask for them, then they may be approximated. For
    example, NFS won't waste any time updating them from the server,
    unless as a byproduct of updating something requested.

    If the values don't actually exist for the underlying object (such as
    UID or GID on a DOS file), then the bit won't be set in the stx_mask,
    even if the caller asked for the value. In such a case, the returned
    value will be a fabrication.

    Note that there are instances where the type might not be valid, for
    instance Windows reparse points.

    (2) stx_rdev_*.

    This will be set only if stx_mode indicates we're looking at a
    blockdev or a chardev, otherwise will be 0.

    (3) stx_btime.

    Similar to (1), except this will be set to 0 if it doesn't exist.
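
    In practice this means a caller should test stx_mask before relying on
    a field, since a value may be present but approximated or fabricated.
    A minimal sketch of such a check follows. The helper name is
    hypothetical, and the STATX_* constants are assumed to come from the
    installed linux/stat.h.

```c
#include <assert.h>
#include <linux/stat.h>  /* STATX_* mask bits */
#include <stdint.h>

/*
 * Hypothetical helper: return nonzero only if every bit in 'want' is
 * set in the stx_mask returned by statx(), i.e. the corresponding
 * fields hold real values rather than approximations or fabrications.
 */
static int sketch_fields_valid(uint32_t stx_mask, uint32_t want)
{
    assert(want != 0);
    return (stx_mask & want) == want;
}
```

    For instance, with a returned mask of 0x7ff (the basic stats only), a
    check for STATX_SIZE succeeds while a check for STATX_BTIME fails, so
    stx_btime should not be displayed as a real creation time.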

    =======
    TESTING
    =======

    The following test program can be used to test the statx system call:

    samples/statx/test-statx.c

    Just compile and run, passing it paths to the files you want to examine.
    The file is built automatically if CONFIG_SAMPLES is enabled.

    Here's some example output. Firstly, an NFS directory that crosses to
    another FSID. Note that the AUTOMOUNT attribute is set because transiting
    this directory will cause d_automount to be invoked by the VFS.

    [root@andromeda ~]# /tmp/test-statx -A /warthog/data
    statx(/warthog/data) = 0
    results=7ff
    Size: 4096 Blocks: 8 IO Block: 1048576 directory
    Device: 00:26 Inode: 1703937 Links: 125
    Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
    Access: 2016-11-24 09:02:12.219699527+0000
    Modify: 2016-11-17 10:44:36.225653653+0000
    Change: 2016-11-17 10:44:36.225653653+0000
    Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)

    Secondly, the result of automounting on that directory.

    [root@andromeda ~]# /tmp/test-statx /warthog/data
    statx(/warthog/data) = 0
    results=7ff
    Size: 4096 Blocks: 8 IO Block: 1048576 directory
    Device: 00:27 Inode: 2 Links: 125
    Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
    Access: 2016-11-24 09:02:12.219699527+0000
    Modify: 2016-11-17 10:44:36.225653653+0000
    Change: 2016-11-17 10:44:36.225653653+0000

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     

02 Mar, 2017

2 commits


25 Feb, 2017

1 commit

  • ->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to
    take a vma and vmf parameter when the vma already resides in vmf.

    Remove the vma parameter to simplify things.
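
    The shape of the change can be sketched in userspace with mock types.
    The two structs below are simplified stand-ins carrying only the
    members this example needs, and sketch_fault_offset is a hypothetical
    handler, not code from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Mock stand-ins for the kernel structures (heavily simplified). */
struct vm_area_struct {
    unsigned long vm_start;
};

struct vm_fault {
    struct vm_area_struct *vma;   /* the vma already lives in vmf */
    unsigned long address;
};

/*
 * Old handler style took both arguments:
 *     handler(struct vm_area_struct *vma, struct vm_fault *vmf)
 * New style takes only vmf and reaches the vma through it.
 */
static unsigned long sketch_fault_offset(struct vm_fault *vmf)
{
    assert(vmf != NULL && vmf->vma != NULL);
    return vmf->address - vmf->vma->vm_start;
}
```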

    [arnd@arndb.de: fix ARM build]
    Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.com
    Signed-off-by: Dave Jiang
    Signed-off-by: Arnd Bergmann
    Reviewed-by: Ross Zwisler
    Cc: Theodore Ts'o
    Cc: Darrick J. Wong
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Christoph Hellwig
    Cc: Jan Kara
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jiang
     

24 Feb, 2017

1 commit


23 Feb, 2017

1 commit