13 Apr, 2020

3 commits

  • If a dentry's version is somewhere between invalid_before and the current
    directory version, we should be setting it forward to the current version,
    not backwards to the invalid_before version. Note that we're only doing
    this at all because dentry::d_fsdata isn't large enough on a 32-bit system.

    Fix this by using a separate variable for invalid_before so that we don't
    accidentally clobber the current dir version.

    Fixes: a4ff7401fbfa ("afs: Keep track of invalid-before version for dentry coherency")
    Signed-off-by: David Howells

    David Howells
     
  • AFS directories are retained locally as a structured file, with lookup
    being effected by a local search of the file contents. When a modification
    (such as mkdir) happens, the dir file content is modified locally rather
    than redownloading the directory.

    The directory contents are accessed in a number of ways, with a number of
    different locks schemes:

    (1) Download of contents - dvnode->validate_lock/write in afs_read_dir().

    (2) Lookup and readdir - dvnode->validate_lock/read in afs_dir_iterate(),
    downgrading from (1) if necessary.

    (3) d_revalidate of child dentry - dvnode->validate_lock/read in
    afs_do_lookup_one() downgrading from (1) if necessary.

    (4) Edit of dir after modification - page locks on individual dir pages.

    Unfortunately, because (4) uses different locking scheme to (1) - (3),
    nothing protects against the page being scanned whilst the edit is
    underway. Even download is not safe as it doesn't lock the pages - relying
    instead on the validate_lock to serialise as a whole (the theory being that
    directory contents are treated as a block and always downloaded as a
    block).

    Fix this by write-locking dvnode->validate_lock around the edits. Care
    must be taken in the rename case as there may be two different dirs - but
    they need not be locked at the same time. In any case, once the lock is
    taken, the directory version must be rechecked, and the edit skipped if a
    later version has been downloaded by revalidation (there can't have been
    any local changes because the VFS holds the inode lock, but there can have
    been remote changes).

    Fixes: 63a4681ff39c ("afs: Locally edit directory data for mkdir/create/unlink/...")
    Signed-off-by: David Howells

    David Howells
     
  • The afs_deliver_fs_rename() and yfs_deliver_fs_rename() functions both only
    decode the second file status returned unless the parent directories are
    different - unfortunately, this means that the xdr pointer isn't advanced
    and the volsync record will be read incorrectly in such an instance.

    Fix this by always decoding the second status into the second
    status/callback block which wasn't being used if the dirs were the same.

    The afs_update_dentry_version() calls that update the directory data
    version numbers on the dentries can then unconditionally use the second
    status record as this will always reflect the state of the destination dir
    (the two records will be identical if the destination dir is the same as
    the source dir)

    Fixes: 260a980317da ("[AFS]: Add "directory write" support.")
    Fixes: 30062bd13e36 ("afs: Implement YFS support in the fs client")
    Signed-off-by: David Howells

    David Howells
     

15 Jan, 2020

2 commits

  • Fix afs_lookup() to not clobber the version set on a new dentry by
    afs_do_lookup() - especially as it's using the wrong version of the
    version (we need to use the one given to us by whatever op the dir
    contents correspond to rather than what's in the afs_vnode).

    Fixes: 9dd0b82ef530 ("afs: Fix missing dentry data version updating")
    Signed-off-by: David Howells
    Signed-off-by: Linus Torvalds

    David Howells
     
  • afs_lookup() has a tracepoint to indicate the outcome of
    d_splice_alias(), passing it the inode to retrieve the fid from.
    However, the function gave up its ref on that inode when it called
    d_splice_alias(), which may have failed and dropped the inode.

    Fix this by caching the fid.

    Fixes: 80548b03991f ("afs: Add more tracepoints")
    Reported-by: Al Viro
    Signed-off-by: David Howells
    Signed-off-by: Linus Torvalds

    David Howells
     

16 Nov, 2019

1 commit

  • When a lookup is done, the afs filesystem will perform a bulk status-fetch
    operation on the requested vnode (file) plus the next 49 other vnodes from
    the directory list (in AFS, directory contents are downloaded as blobs and
    parsed locally). When the results are received, it will speculatively
    populate the inode cache from the extra data.

    However, if the lookup races with another lookup on the same directory, but
    for a different file - one that's in the 49 extra fetches, then if the bulk
    status-fetch operation finishes first, it will try and update the inode
    from the other lookup.

    If this other inode is still in the throes of being created, however, this
    will cause an assertion failure in afs_apply_status():

    BUG_ON(test_bit(AFS_VNODE_UNSET, &vnode->flags));

    on or about fs/afs/inode.c:175 because it expects data to be there already
    that it can compare to.

    Fix this by skipping the update if the inode is being created as the
    creator will presumably set up the inode with the same information.

    Fixes: 39db9815da48 ("afs: Fix application of the results of a inline bulk status fetch")
    Signed-off-by: David Howells
    Reviewed-by: Marc Dionne
    Signed-off-by: Linus Torvalds

    David Howells
     

02 Sep, 2019

1 commit

  • Make afs_permission() and afs_d_revalidate() do initial checks in RCU-mode
    pathwalk to reduce latency in pathwalk elements that get done multiple
    times. We don't need to query the server unless we've received a
    notification from it that something has changed or the callback has
    expired.

    This requires that we can request a key and check permits under RCU
    conditions if we need to.

    Signed-off-by: David Howells

    David Howells
     

22 Aug, 2019

1 commit

  • The afs_lookup trace event can cause the following:

    [ 216.576777] BUG: kernel NULL pointer dereference, address: 000000000000023b
    [ 216.576803] #PF: supervisor read access in kernel mode
    [ 216.576813] #PF: error_code(0x0000) - not-present page
    ...
    [ 216.576913] RIP: 0010:trace_event_raw_event_afs_lookup+0x9e/0x1c0 [kafs]

    If the inode from afs_do_lookup() is an error other than ENOENT, or if it
    is ENOENT and afs_try_auto_mntpt() returns an error, the trace event will
    try to dereference the error pointer as a valid pointer.

    Use IS_ERR_OR_NULL to only pass a valid pointer for the trace, or NULL.

    Ideally the trace would include the error value, but for now just avoid
    the oops.

    Fixes: 80548b03991f ("afs: Add more tracepoints")
    Signed-off-by: Marc Dionne
    Signed-off-by: David Howells

    Marc Dionne
     

30 Jul, 2019

3 commits

  • In the in-kernel afs filesystem, the d_fsdata dentry field is used to hold
    the data version of the parent directory when it was created or when
    d_revalidate() last caused it to be updated. This is compared to the
    ->invalid_before field in the directory inode, rather than the actual data
    version number, thereby allowing changes due to local edits to be ignored.
    Only if the server data version gets bumped unexpectedly (eg. by a
    competing client), do we need to revalidate stuff.

    However, the d_fsdata field should also be updated if an rpc op is
    performed that modifies that particular dentry. Such ops return the
    revised data version of the directory(ies) involved, so we should use that.

    This is particularly problematic for rename, since a dentry from one
    directory may be moved directly into another directory (ie. mv a/x b/x).
    It would then be sporting the wrong data version - and if this is in the
    future, for the destination directory, revalidations would be missed,
    leading to foreign renames and hard-link deletion being missed.

    Fix this by the following means:

    (1) Return the data version number from operations that read the directory
    contents - if they issue the read. This starts in afs_dir_iterate()
    and is used, ignored or passed back by its callers.

    (2) In afs_lookup*(), set the dentry version to the version returned by
    (1) before d_splice_alias() is called and the dentry published.

    (3) In afs_d_revalidate(), set the dentry version to that returned from
    (1) if an rpc call was issued. This means that if a parallel
    procedure, such as mkdir(), modifies the directory, we won't
    accidentally use the data version from that.

    (4) In afs_{mkdir,create,link,symlink}(), set the new dentry's version to
    the directory data version before d_instantiate() is called.

    (5) In afs_{rmdir,unlink}, update the target dentry's version to the
    directory data version as soon as we've updated the directory inode.

    (6) In afs_rename(), we need to unhash the old dentry before we start so
    that we don't get afs_d_revalidate() reverting the version change in
    cross-directory renames.

    We then need to set both the old and the new dentry versions the data
    version of the new directory before we call d_move() as d_move() will
    rehash them.

    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Signed-off-by: David Howells

    David Howells
     
  • In the in-kernel afs filesystem, d_fsdata is set with the data version of
    the parent directory. afs_d_revalidate() will update this to the current
    directory version, but it shouldn't do this if it the value it read from
    d_fsdata is the same as no lock is held and cmpxchg() is not used.

    Fix the code to only change the value if it is different from the current
    directory version.

    Fixes: 260a980317da ("[AFS]: Add "directory write" support.")
    Signed-off-by: David Howells

    David Howells
     
  • When afs_rename() calculates the expected data version of the target
    directory in a cross-directory rename, it doesn't increment it as it
    should, so it always thinks that the target inode is unexpectedly modified
    on the server.

    Fixes: a58823ac4589 ("afs: Fix application of status and callback to be under same lock")
    Signed-off-by: David Howells

    David Howells
     

11 Jul, 2019

1 commit

  • Pull afs updates from David Howells:
    "A set of minor changes for AFS:

    - Remove an unnecessary check in afs_unlink()

    - Add a tracepoint for tracking callback management

    - Add a tracepoint for afs_server object usage

    - Use struct_size()

    - Add mappings for AFS UAE abort codes to Linux error codes, using
    symbolic names rather than hex numbers in the .c file"

    * tag 'afs-next-20190628' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
    afs: Add support for the UAE error table
    fs/afs: use struct_size() in kzalloc()
    afs: Trace afs_server usage
    afs: Add some callback management tracepoints
    afs: afs_unlink() doesn't need to check dentry->d_inode

    Linus Torvalds
     

21 Jun, 2019

3 commits

  • As Gustavo said in other patches doing the same replace, we can now
    use the new struct_size() helper to avoid leaving these open-coded and
    prone to type mistake.

    Signed-off-by: Zhengyuan Liu
    Signed-off-by: David Howells

    Zhengyuan Liu
     
  • Add a couple of tracepoints to track callback management:

    (1) afs_cb_miss - Logs when we were unable to apply a callback, either due
    to the inode being discarded or due to a competing thread applying a
    callback first.

    (2) afs_cb_break - Logs when we attempted to clear the noted callback
    promise, either due to the server explicitly breaking the callback,
    the callback promise lapsing or a local event obsoleting it.

    Signed-off-by: David Howells

    David Howells
     
  • Don't check that dentry->d_inode is valid in afs_unlink(). We should be
    able to take that as given.

    This caused Smatch to issue the following warning:

    fs/afs/dir.c:1392 afs_unlink() error: we previously assumed 'vnode' could be null (see line 1375)

    Reported-by: kbuild test robot
    Reported-by: Dan Carpenter
    Signed-off-by: David Howells

    David Howells
     

31 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 3029 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

17 May, 2019

4 commits

  • Fix afs_do_lookup() such that when it does an inline bulk status fetch op,
    it will update inodes that are already extant (something that afs_iget()
    doesn't do) and to cache permits for each inode created (thereby avoiding a
    follow up FS.FetchStatus call to determine this).

    Extant inodes need looking up in advance so that their cb_break counters
    before and after the operation can be compared. To this end, the inode
    pointers are cached so that they don't need looking up again after the op.

    Fixes: 5cf9dd55a0ec ("afs: Prospectively look up extra files when doing a single lookup")
    Signed-off-by: David Howells

    David Howells
     
  • Pass the server and volume break counts from before the status fetch
    operation that queried the attributes of a file into afs_iget5_set() so
    that the new vnode's break counters can be initialised appropriately.

    This allows detection of a volume or server break that happened whilst we
    were fetching the status or setting up the vnode.

    Fixes: c435ee34551e ("afs: Overhaul the callback handling")
    Signed-off-by: David Howells

    David Howells
     
  • Make use of the status update for the target file that the YFS.RemoveFile2
    RPC op returns to correctly update the vnode as to whether the file was
    actually deleted or just had nlink reduced.

    Fixes: 30062bd13e36 ("afs: Implement YFS support in the fs client")
    Signed-off-by: David Howells

    David Howells
     
  • Use RCU-based freeing for afs_cb_interest struct objects and use RCU on
    vnode->cb_interest. Use that change to allow afs_check_validity() to use
    read_seqbegin_or_lock() instead of read_seqlock_excl().

    This also requires the caller of afs_check_validity() to hold the RCU read
    lock across the call.

    Signed-off-by: David Howells

    David Howells
     

16 May, 2019

3 commits

  • When applying the status and callback in the response of an operation,
    apply them in the same critical section so that there's no race between
    checking the callback state and checking status-dependent state (such as
    the data version).

    Fix this by:

    (1) Allocating a joint {status,callback} record (afs_status_cb) before
    calling the RPC function for each vnode for which the RPC reply
    contains a status or a status plus a callback. A flag is set in the
    record to indicate if a callback was actually received.

    (2) These records are passed into the RPC functions to be filled in. The
    afs_decode_status() and yfs_decode_status() functions are removed and
    the cb_lock is no longer taken.

    (3) xdr_decode_AFSFetchStatus() and xdr_decode_YFSFetchStatus() no longer
    update the vnode.

    (4) xdr_decode_AFSCallBack() and xdr_decode_YFSCallBack() no longer update
    the vnode.

    (5) vnodes, expected data-version numbers and callback break counters
    (cb_break) no longer need to be passed to the reply delivery
    functions.

    Note that, for the moment, the file locking functions still need
    access to both the call and the vnode at the same time.

    (6) afs_vnode_commit_status() is now given the cb_break value and the
    expected data_version and the task of applying the status and the
    callback to the vnode are now done here.

    This is done under a single taking of vnode->cb_lock.

    (7) afs_pages_written_back() is now called by afs_store_data() rather than
    by the reply delivery function.

    afs_pages_written_back() has been moved to before the call point and
    is now given the first and last page numbers rather than a pointer to
    the call.

    (8) The indicator from YFS.RemoveFile2 as to whether the target file
    actually got removed (status.abort_code == VNOVNODE) rather than
    merely dropping a link is now checked in afs_unlink rather than in
    xdr_decode_YFSFetchStatus().

    Supplementary fixes:

    (*) afs_cache_permit() now gets the caller_access mask from the
    afs_status_cb object rather than picking it out of the vnode's status
    record. afs_fetch_status() returns caller_access through its argument
    list for this purpose also.

    (*) afs_inode_init_from_status() now uses a write lock on cb_lock rather
    than a read lock and now sets the callback inside the same critical
    section.

    Fixes: c435ee34551e ("afs: Overhaul the callback handling")
    Signed-off-by: David Howells

    David Howells
     
  • afs_do_lookup() will do an order-1 allocation to allocate status records if
    there are more than 39 vnodes to stat.

    Fix this by allocating an array of {status,callback} records for each vnode
    we want to examine using vmalloc() if larger than a page.

    This not only gets rid of the order-1 allocation, but makes it easier to
    grow beyond 50 records for YFS servers. It also allows us to move to
    {status,callback} tuples for other calls too and makes it easier to lock
    across the application of the status and the callback to the vnode.

    Fixes: 5cf9dd55a0ec ("afs: Prospectively look up extra files when doing a single lookup")
    Signed-off-by: David Howells

    David Howells
     
  • Make certain RPC operations non-interruptible, including:

    (*) Set attributes
    (*) Store data

    We don't want to get interrupted during a flush on close, flush on
    unlock, writeback or an inode update, leaving us in a state where we
    still need to do the writeback or update.

    (*) Extend lock
    (*) Release lock

    We don't want to get lock extension interrupted as the file locks on
    the server are time-limited. Interruption during lock release is less
    of an issue since the lock is time-limited, but it's better to
    complete the release to avoid a several-minute wait to recover it.

    *Setting* the lock isn't a problem if it's interrupted since we can
    just return to the user and tell them they were interrupted - at
    which point they can elect to retry.

    (*) Silly unlink

    We want to remove silly unlink files if we can, rather than leaving
    them for the salvager to clear up.

    Note that whilst these calls are no longer interruptible, they do have
    timeouts on them, so if the server stops responding the call will fail with
    something like ETIME or ECONNRESET.

    Without this, the following:

    kAFS: Unexpected error from FS.StoreData -512

    appears in dmesg when a pending store data gets interrupted and some
    processes may just hang.

    Additionally, make the code that checks/updates the server record ignore
    failure due to interruption if the main call is uninterruptible and if the
    server has an address list. The next op will check it again since the
    expiration time on the old list has past.

    Fixes: d2ddc776a458 ("afs: Overhaul volume and server record caching and fileserver rotation")
    Reported-by: Jonathan Billings
    Reported-by: Marc Dionne
    Signed-off-by: David Howells

    David Howells
     

07 May, 2019

1 commit


25 Apr, 2019

4 commits

  • Add four more tracepoints:

    (1) afs_make_fs_call1 - Split from afs_make_fs_call but takes a filename
    to log also.

    (2) afs_make_fs_call2 - Like the above but takes two filenames to log.

    (3) afs_lookup - Log the result of doing a successful lookup, including a
    negative result (fid 0:0).

    (4) afs_get_tree - Log the set up of a volume for mounting.

    It also extends the name buffer on the afs_edit_dir tracepoint to 24 chars
    and puts quotes around the filename in the text representation.

    Signed-off-by: David Howells

    David Howells
     
  • Implement sillyrename for AFS unlink and rename, using the NFS variant
    implementation as a basis.

    Note that the asynchronous file locking extender/releaser has to be
    notified with a state change to stop it complaining if there's a race
    between that and the actual file deletion.

    A tracepoint, afs_silly_rename, is also added to note the silly rename and
    the cleanup. The afs_edit_dir tracepoint is given some extra reason
    indicators and the afs_flock_ev tracepoint is given a silly-delete file
    lock cancellation indicator.

    Signed-off-by: David Howells

    David Howells
     
  • Add a tracepoint (afs_reload_dir) to indicate when a directory is being
    reloaded.

    Signed-off-by: David Howells

    David Howells
     
  • Improve the content of directory check failure reports from:

    kAFS: afs_dir_check_page(6d57): bad magic 1/2 is 0000

    to dump more information about the individual blocks in a directory page.

    Signed-off-by: David Howells

    David Howells
     

30 Nov, 2018

1 commit


24 Oct, 2018

5 commits


06 Aug, 2018

2 commits


14 May, 2018

2 commits

  • It's possible for an AFS file server to issue a whole-volume notification
    that callbacks on all the vnodes in the file have been broken. This is
    done for R/O and backup volumes (which don't have per-file callbacks) and
    for things like a volume being taken offline.

    Fix callback handling to detect whole-volume notifications, to track it
    across operations and to check it during inode validation.

    Fixes: c435ee34551e ("afs: Overhaul the callback handling")
    Signed-off-by: David Howells

    David Howells
     
  • The afs directory loading code (primarily afs_read_dir()) locks all the
    pages that hold a directory's content blob to defend against
    getdents/getdents races and getdents/lookup races where the competitors
    issue conflicting reads on the same data. As the reads will complete
    consecutively, they may retrieve different versions of the data and
    one may overwrite the data that the other is busy parsing.

    Fix this by not locking the pages at all, but rather by turning the
    validation lock into an rwsem and getting an exclusive lock on it whilst
    reading the data or validating the attributes and a shared lock whilst
    parsing the data. Sharing the attribute validation lock should be fine as
    the data fetch will retrieve the attributes also.

    The individual page locks aren't needed at all as the only place they're
    being used is to serialise data loading.

    Without this patch, the:

    if (!test_bit(AFS_VNODE_DIR_VALID, &dvnode->flags)) {
    ...
    }

    part of afs_read_dir() may be skipped, leaving the pages unlocked when we
    hit the success: clause - in which case we try to unlock the not-locked
    pages, leading to the following oops:

    page:ffffe38b405b4300 count:3 mapcount:0 mapping:ffff98156c83a978 index:0x0
    flags: 0xfffe000001004(referenced|private)
    raw: 000fffe000001004 ffff98156c83a978 0000000000000000 00000003ffffffff
    raw: dead000000000100 dead000000000200 0000000000000001 ffff98156b27c000
    page dumped because: VM_BUG_ON_PAGE(!PageLocked(page))
    page->mem_cgroup:ffff98156b27c000
    ------------[ cut here ]------------
    kernel BUG at mm/filemap.c:1205!
    ...
    RIP: 0010:unlock_page+0x43/0x50
    ...
    Call Trace:
    afs_dir_iterate+0x789/0x8f0 [kafs]
    ? _cond_resched+0x15/0x30
    ? kmem_cache_alloc_trace+0x166/0x1d0
    ? afs_do_lookup+0x69/0x490 [kafs]
    ? afs_do_lookup+0x101/0x490 [kafs]
    ? key_default_cmp+0x20/0x20
    ? request_key+0x3c/0x80
    ? afs_lookup+0xf1/0x340 [kafs]
    ? __lookup_slow+0x97/0x150
    ? lookup_slow+0x35/0x50
    ? walk_component+0x1bf/0x490
    ? path_lookupat.isra.52+0x75/0x200
    ? filename_lookup.part.66+0xa0/0x170
    ? afs_end_vnode_operation+0x41/0x60 [kafs]
    ? __check_object_size+0x9c/0x171
    ? strncpy_from_user+0x4a/0x170
    ? vfs_statx+0x73/0xe0
    ? __do_sys_newlstat+0x39/0x70
    ? __x64_sys_getdents+0xc9/0x140
    ? __x64_sys_getdents+0x140/0x140
    ? do_syscall_64+0x5b/0x160
    ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Fixes: f3ddee8dc4e2 ("afs: Fix directory handling")
    Reported-by: Marc Dionne
    Signed-off-by: David Howells

    David Howells
     

10 Apr, 2018

2 commits

  • Processes like ld that do lots of small writes that aren't necessarily
    contiguous result in a lot of small StoreData operations to the server, the
    idea being that if someone else changes the data on the server, we only
    write our changes over that and not the space between. Further, we don't
    want to write back empty space if we can avoid it to make it easier for the
    server to do sparse files.

    However, making lots of tiny RPC ops is a lot less efficient for the server
    than one big one because each op requires allocation of resources and the
    taking of locks, so we want to compromise a bit.

    Reduce the load by the following:

    (1) If a file is just created locally or has just been truncated with
    O_TRUNC locally, allow subsequent writes to the file to be merged with
    intervening space if that space doesn't cross an entire intervening
    page.

    (2) Don't flush the file on ->flush() but rather on ->release() if the
    file was open for writing.

    Just linking vmlinux.o, without this patch, looking in /proc/fs/afs/stats:

    file-wr : n=441 nb=513581204

    and after the patch:

    file-wr : n=62 nb=513668555

    there were 379 fewer StoreData RPC operations at the expense of an extra
    87K being written.

    Signed-off-by: David Howells

    David Howells
     
  • Locally edit the contents of an AFS directory upon a successful inode
    operation that modifies that directory (such as mkdir, create and unlink)
    so that we can avoid the current practice of re-downloading the directory
    after each change.

    This is viable provided that the directory version number we get back from
    the modifying RPC op is exactly incremented by 1 from what we had
    previously. The data in the directory contents is in a defined format that
    we have to parse locally to perform lookups and readdir, so modifying isn't
    a problem.

    If the edit fails, we just clear the VALID flag on the directory and it
    will be reloaded next time it is needed.

    Signed-off-by: David Howells

    David Howells