04 Jun, 2020

3 commits

  • Whilst it shouldn't happen, it is possible for multiple fileservers to
    share a UUID, particularly if an entire cell has been duplicated, UUIDs and
    all. In such a case, it's not necessarily possible to map the effect of
    the CB.InitCallBackState3 incoming RPC to a specific server unambiguously
    by UUID and thus to a specific cell.

    Indeed, there's a problem whereby multiple server records may need to
    occupy the same spot in the rb_tree rooted in the afs_net struct.

    Fix this by allowing servers to form a list, with the head of the list in
    the tree. When the front entry in the list is removed, the second in the
    list just replaces it. afs_init_callback_state() then just goes down the
    line, poking each server in the list.

    This means that some servers will be unnecessarily poked, unfortunately.
    An alternative would be to route by call parameters.
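
    A minimal userspace sketch of the arrangement described above, assuming
    a single tree slot stands in for one position in the rb_tree. The names
    (struct server, struct uuid_slot, slot_add(), slot_remove(),
    slot_poke_all()) are hypothetical and are not the kernel's:

```c
#include <stddef.h>

/* Hypothetical stand-in for a server record; not the kernel's afs_server. */
struct server {
    struct server *uuid_next;   /* next server sharing the same UUID */
    unsigned int poked;         /* number of times this record was poked */
};

/* Stand-in for one position in the UUID tree: it holds only the list head. */
struct uuid_slot {
    struct server *head;
};

/* Append a server that shares the slot's UUID to the end of the list. */
static void slot_add(struct uuid_slot *slot, struct server *s)
{
    struct server **pp = &slot->head;

    while (*pp)
        pp = &(*pp)->uuid_next;
    s->uuid_next = NULL;
    *pp = s;
}

/* Remove a server; if it was the head, the second entry replaces it. */
static void slot_remove(struct uuid_slot *slot, struct server *s)
{
    struct server **pp = &slot->head;

    while (*pp && *pp != s)
        pp = &(*pp)->uuid_next;
    if (*pp)
        *pp = s->uuid_next;
    s->uuid_next = NULL;
}

/* Poke every server in the list; returns how many were poked. */
static unsigned int slot_poke_all(struct uuid_slot *slot)
{
    unsigned int n = 0;

    for (struct server *s = slot->head; s; s = s->uuid_next) {
        s->poked++;
        n++;
    }
    return n;
}
```

    Because the slot only ever points at the current head, removing the
    front entry needs no tree rebalancing in this sketch; the cost is that
    every server sharing the UUID gets poked.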

    Reported-by: Jeffrey Altman
    Signed-off-by: David Howells
    Fixes: d2ddc776a458 ("afs: Overhaul volume and server record caching and fileserver rotation")

    David Howells
     
  • Reorganise afs_volume objects such that they're in a tree keyed on volume
    ID, rooted on an afs_cell object rather than being in multiple trees,
    each of which is rooted on an afs_server object.

    afs_server structs become per-cell and acquire a pointer to the cell.

    The process of breaking a callback then starts with finding the server by
    its network address, following that to the cell and then looking up each
    volume ID in the volume tree.

    This is simpler than the afs_vol_interest/afs_cb_interest N:M mapping web
    and allows those structs and the code for maintaining them to be simplified
    or removed.

    It does make a couple of things a bit more tricky, though:

    (1) Operations now start with a volume, not a server, so there can be more
    than one answer as to whether or not the server we'll end up using
    supports the FS.InlineBulkStatus RPC.

    (2) CB RPC operations that specify the server UUID. There's still a tree
    of servers by UUID on the afs_net struct, but the UUIDs in it aren't
    guaranteed unique.

    Signed-off-by: David Howells

    David Howells
     
  • Turn the afs_operation struct into the main way that most fileserver
    operations are managed. Various things are added to the struct, including
    the following:

    (1) All the parameters and results of the relevant operations are moved
    into it, removing corresponding fields from the afs_call struct.
    afs_call gets a pointer to the op.

    (2) The target volume is made the main focus of the operation, rather than
    the target vnode(s), and a bunch of op->vnode->volume are made
    op->volume instead.

    (3) Two vnode records are defined (op->file[]) for the vnode(s) involved
    in most operations. The vnode record (struct afs_vnode_param)
    contains:

    - The vnode pointer.

    - The fid of the vnode to be included in the parameters or that was
    returned in the reply (eg. FS.MakeDir).

    - The status and callback information that may be returned in the
    reply about the vnode.

    - Callback break and data version tracking for detecting
    simultaneous third-party changes.

    (4) Pointers to dentries to be updated with new inodes.

    (5) An operations table pointer. The table includes pointers to functions
    for issuing AFS and YFS-variant RPCs, handling the success and abort
    of an operation and handling post-I/O-lock local editing of a
    directory.

    To make this work, the following function restructuring is made:

    (A) The rotation loop that issues calls to fileservers that can be found
    in each function that wants to issue an RPC (such as afs_mkdir()) is
    extracted out into common code, in a new file called fs_operation.c.

    (B) The rotation loops, such as the one in afs_mkdir(), are replaced with
    a much smaller piece of code that allocates an operation, sets the
    parameters and then calls out to the common code to do the actual
    work.

    (C) The code for handling the success and failure of an operation is
    moved into operation functions (as (5) above) and these are called
    from the core code at appropriate times.

    (D) The pseudo inode getting stuff used by the dynamic root code is moved
    over into dynroot.c.

    (E) struct afs_iget_data is absorbed into the operation struct and
    afs_iget() expects to be given an op pointer and a vnode record.

    (F) Point (E) doesn't work for the root dir of a volume, but we know the
    FID in advance (it's always vnode 1, unique 1), so a separate inode
    getter, afs_root_iget(), is provided to special-case that.

    (G) The inode status init/update functions now also take an op and a vnode
    record.

    (H) The RPC marshalling functions now, for the most part, just take an
    afs_operation struct as their only argument. All the data they need
    is held there. The result delivery functions write their answers
    there as well.

    (I) The call is attached to the operation and then the operation core does
    the waiting.

    And then the new operation code is, for the moment, made to just initialise
    the operation, get the appropriate vnode I/O locks and do the same rotation
    loop as before.
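
    The operations-table idea in point (5) can be sketched in userspace C
    as follows. This is an illustration only: struct operation,
    struct operation_ops, run_operation() and the mkdir_* handlers are
    hypothetical names, and the server rotation and locking are left out:

```c
#include <stdbool.h>
#include <stddef.h>

struct operation;   /* hypothetical stand-in for afs_operation */

/* Per-operation-type table of handlers, in the spirit of point (5). */
struct operation_ops {
    void (*issue_afs_rpc)(struct operation *op);
    void (*issue_yfs_rpc)(struct operation *op);
    void (*success)(struct operation *op);
    void (*aborted)(struct operation *op);
};

struct operation {
    const struct operation_ops *ops;
    bool server_is_yfs;   /* which RPC variant the chosen server speaks */
    int error;            /* 0 on success, negative on failure */
    int log;              /* records which handlers ran, for illustration */
};

/* Common core: pick the RPC variant, then run the completion handler. */
static int run_operation(struct operation *op)
{
    if (op->server_is_yfs)
        op->ops->issue_yfs_rpc(op);
    else
        op->ops->issue_afs_rpc(op);

    if (op->error == 0 && op->ops->success)
        op->ops->success(op);
    else if (op->error != 0 && op->ops->aborted)
        op->ops->aborted(op);
    return op->error;
}

/* Example handlers for a hypothetical "mkdir" operation. */
static void mkdir_afs(struct operation *op)     { op->log |= 1; }
static void mkdir_yfs(struct operation *op)     { op->log |= 2; }
static void mkdir_success(struct operation *op) { op->log |= 4; }

static const struct operation_ops mkdir_ops = {
    .issue_afs_rpc = mkdir_afs,
    .issue_yfs_rpc = mkdir_yfs,
    .success       = mkdir_success,
};
```

    A caller would fill in a struct operation, point it at mkdir_ops and
    hand it to run_operation(), which is the shape that (B) describes.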

    This lays the foundation for the following changes in the future:

    (*) Overhauling the rotation (again).

    (*) Support for asynchronous I/O, where the fileserver rotation must be
    done asynchronously also.

    Signed-off-by: David Howells

    David Howells
     

31 May, 2020

2 commits

  • afs_vol_interest objects represent the volume IDs currently being accessed
    from a fileserver. These hold lists of afs_cb_interest objects that
    represent the superblocks using that volume ID on that server.

    When a callback notification from the server telling of a modification by
    another client arrives, the volume ID specified in the notification is
    looked up in the server's afs_vol_interest list. Through the
    afs_cb_interest list, the relevant superblocks can be iterated over and the
    specific inode looked up and marked in each one.

    Make the following efficiency improvements:

    (1) Hold rcu_read_lock() over the entire processing rather than locking it
    each time.

    (2) Do all the callbacks for each vid together rather than individually.
    Each volume then only needs to be looked up once.

    (3) afs_vol_interest objects are now stored in an rb_tree rather than a
    flat list to reduce the lookup step count.

    (4) afs_vol_interest lookup is now done with RCU, but because it's in an
    rb_tree which may rotate under us, a seqlock is used so that if it
    changes during the walk, we repeat the walk with a lock held.

    With this and the preceding patch which adds RCU-based lookups in the inode
    cache, target volumes/vnodes can be taken without the need to take any
    locks, except on the target itself.

    Signed-off-by: David Howells

    David Howells
     
  • Make the inode hash table RCU searchable so that searches that want to
    access or modify an inode without taking a ref on that inode can do so
    without taking the inode hash table lock.

    The main thing this requires is some RCU annotation on the list
    manipulation operations. Inodes are already freed by RCU in most cases.

    Users of this interface must take care as the inode may be still under
    construction or may be being torn down around them.

    There are at least three instances where this can be of use:

    (1) Testing whether the inode number iunique() is going to return is
    currently unique (the iunique_lock is still held).

    (2) Ext4 date stamp updating.

    (3) AFS callback breaking.

    Signed-off-by: David Howells
    Acked-by: Konstantin Khlebnikov
    cc: linux-ext4@vger.kernel.org
    cc: linux-afs@lists.infradead.org

    David Howells
     

23 Nov, 2019

1 commit

  • Servers sending callback breaks to the YFS_CM_SERVICE service may
    send up to YFSCBMAX (1024) fids in a single RPC. Anything over
    AFSCBMAX (50) will cause the assert in afs_break_callbacks to trigger.

    Remove the assert, as the count has already been checked against
    the appropriate max values in afs_deliver_cb_callback and
    afs_deliver_yfs_cb_callback.
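
    The per-service limit check that makes the assert redundant might look
    roughly like this. Only the two limits come from the text above; the
    enum and cb_count_ok() are hypothetical names for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* Limits quoted in the commit message. */
#define AFSCBMAX 50
#define YFSCBMAX 1024

/* Hypothetical service selector; not a kernel type. */
enum cm_service { CM_SERVICE_AFS, CM_SERVICE_YFS };

/* Validate the fid count at delivery time against the per-service limit. */
static bool cb_count_ok(enum cm_service svc, uint32_t count)
{
    uint32_t max = (svc == CM_SERVICE_YFS) ? YFSCBMAX : AFSCBMAX;

    return count <= max;
}
```

    Once the count has passed a check of this kind at delivery, a second
    check against the smaller AFS limit further down the path can only
    reject valid YFS requests.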

    Fixes: 35dbfba3111a ("afs: Implement the YFS cache manager service")
    Signed-off-by: Marc Dionne
    Signed-off-by: David Howells
    Signed-off-by: Linus Torvalds

    Marc Dionne
     

21 Jun, 2019

2 commits

  • Add a tracepoint (afs_server) to track the afs_server object usage count.

    Signed-off-by: David Howells

    David Howells
     
  • Add a couple of tracepoints to track callback management:

    (1) afs_cb_miss - Logs when we were unable to apply a callback, either due
    to the inode being discarded or due to a competing thread applying a
    callback first.

    (2) afs_cb_break - Logs when we attempted to clear the noted callback
    promise, either due to the server explicitly breaking the callback,
    the callback promise lapsing or a local event obsoleting it.

    Signed-off-by: David Howells

    David Howells
     

20 Jun, 2019

1 commit

  • Fix the cb_break_lock spinlock in afs_volume struct by initialising it when
    the volume record is allocated.

    Also rename the lock to cb_v_break_lock to distinguish it from the lock of
    the same name in the afs_server struct.

    Without this, the following trace may be observed when a volume-break
    callback is received:

    INFO: trying to register non-static key.
    the code is fine but needs lockdep annotation.
    turning off the locking correctness validator.
    CPU: 2 PID: 50 Comm: kworker/2:1 Not tainted 5.2.0-rc1-fscache+ #3045
    Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
    Workqueue: afs SRXAFSCB_CallBack
    Call Trace:
    dump_stack+0x67/0x8e
    register_lock_class+0x23b/0x421
    ? check_usage_forwards+0x13c/0x13c
    __lock_acquire+0x89/0xf73
    lock_acquire+0x13b/0x166
    ? afs_break_callbacks+0x1b2/0x3dd
    _raw_write_lock+0x2c/0x36
    ? afs_break_callbacks+0x1b2/0x3dd
    afs_break_callbacks+0x1b2/0x3dd
    ? trace_event_raw_event_afs_server+0x61/0xac
    SRXAFSCB_CallBack+0x11f/0x16c
    process_one_work+0x2c5/0x4ee
    ? worker_thread+0x234/0x2ac
    worker_thread+0x1d8/0x2ac
    ? cancel_delayed_work_sync+0xf/0xf
    kthread+0x11f/0x127
    ? kthread_park+0x76/0x76
    ret_from_fork+0x24/0x30

    Fixes: 68251f0a6818 ("afs: Fix whole-volume callback handling")
    Signed-off-by: David Howells

    David Howells
     

17 May, 2019

1 commit

  • Use RCU-based freeing for afs_cb_interest struct objects and use RCU on
    vnode->cb_interest. Use that change to allow afs_check_validity() to use
    read_seqbegin_or_lock() instead of read_seqlock_excl().

    This also requires the caller of afs_check_validity() to hold the RCU read
    lock across the call.

    Signed-off-by: David Howells

    David Howells
     

16 May, 2019

1 commit

  • __afs_break_callback() holds vnode->lock around its call of
    afs_lock_may_be_available() - which also takes that lock.

    Fix this by not taking the lock in __afs_break_callback().

    Also, there's no point checking the granted_locks and pending_locks queues;
    it's sufficient to check lock_state, so move that check out of
    afs_lock_may_be_available() into __afs_break_callback() to replace the
    queue checks.

    Fixes: e8d6c554126b ("AFS: implement file locking")
    Signed-off-by: David Howells

    David Howells
     

13 Apr, 2019

1 commit

  • The in-kernel afs filesystem client counts the number of server-level
    callback invalidation events (CB.InitCallBackState* RPC operations) that it
    receives from the server. This is stored in cb_s_break in various
    structures, including afs_server and afs_vnode.

    If an inode is examined by afs_validate(), say, the afs_server copy is
    compared, along with other break counters, to those in afs_vnode, and if
    one or more of the counters do not match, it is considered that the
    server's callback promise is broken. At points where this happens,
    AFS_VNODE_CB_PROMISED is cleared to indicate that the status must be
    refetched from the server.

    afs_validate() issues an FS.FetchStatus operation to get updated metadata -
    and based on the updated data_version may invalidate the pagecache too.

    However, the break counters are also used to determine whether to note a
    new callback in the vnode (which would set the AFS_VNODE_CB_PROMISED flag)
    and whether to cache the permit data included in the YFSFetchStatus record
    by the server.

    The problem comes when the server sends us a CB.InitCallBackState op. The
    first such instance doesn't cause cb_s_break to be incremented, but rather
    causes AFS_SERVER_FL_NEW to be cleared - but thereafter, say some hours
    after last use and all the volumes have been automatically unmounted and
    the server has forgotten about the client[*], this *will* likely cause an
    increment.

    [*] There are other circumstances too, such as the server restarting or
    needing to make space in its callback table.

    Note that the server won't send us a CB.InitCallBackState op until we talk
    to it again.

    So what happens is:

    (1) A mount for a new volume is attempted, an inode is created for the root
    vnode and vnode->cb_s_break and AFS_VNODE_CB_PROMISED aren't set
    immediately, as we don't have a nominated server to talk to yet - and
    we may iterate through a few to find one.

    (2) Before the operation happens, afs_fetch_status(), say, notes in the
    cursor (fc.cb_break) the break counter sum from the vnode, volume and
    server counters, but the server->cb_s_break is currently 0.

    (3) We send FS.FetchStatus to the server. The server sends us back
    CB.InitCallBackState. We increment server->cb_s_break.

    (4) Our FS.FetchStatus completes. The reply includes a callback record.

    (5) xdr_decode_AFSCallBack()/xdr_decode_YFSCallBack() check to see whether
    the callback promise was broken by checking the break counter sum from
    step (2) against the current sum.

    This fails because of step (3), so we don't set the callback record
    and, importantly, don't set AFS_VNODE_CB_PROMISED on the vnode.

    This does not preclude the syscall from progressing, and we don't loop here
    rechecking the status, but rather assume it's good enough for one round
    only and will need to be rechecked next time.

    (6) afs_validate() is triggered on the vnode, probably called from
    d_revalidate() checking the parent directory.

    (7) afs_validate() notes that AFS_VNODE_CB_PROMISED isn't set, so doesn't
    update vnode->cb_s_break and assumes the vnode to be invalid.

    (8) afs_validate() needs to call afs_fetch_status(). Go back to step (2)
    and repeat, every time the vnode is validated.

    This primarily affects volume root dir vnodes. Everything subsequent to
    those inherits an already incremented cb_s_break upon mounting.

    The issue is that we assume that the callback record and the cached permit
    information in a reply from the server can't be trusted after getting a
    server break - but this is wrong since the server makes sure things are
    done in the right order, holding up our ops if necessary[*].

    [*] There is an extremely unlikely scenario where a reply from before the
    CB.InitCallBackState could get its delivery deferred till after - at
    which point we think we have a promise when we don't. This, however,
    requires unlucky mass packet loss to one call.

    AFS_SERVER_FL_NEW tries to paper over the cracks for the initial mount from
    a server we've never contacted before, but this should be unnecessary.
    It's also further insulated from the problem on an initial mount by
    querying the server first with FS.GetCapabilities, which triggers the
    CB.InitCallBackState.

    Fix this by

    (1) Remove AFS_SERVER_FL_NEW.

    (2) In afs_calc_vnode_cb_break(), don't include cb_s_break in the
    calculation.

    (3) In afs_cb_is_broken(), don't include cb_s_break in the check.
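
    The effect of leaving cb_s_break out of the sum can be modelled in a
    few lines of userspace C. This is a simplified sketch: struct counters
    and the include_server flag are hypothetical, and the two helper names
    only echo the functions named above rather than reproduce them:

```c
#include <stdbool.h>

/* Hypothetical bundle of the three break counters discussed above. */
struct counters {
    unsigned int vnode_cb_break;     /* per-vnode break count */
    unsigned int volume_cb_v_break;  /* per-volume break count */
    unsigned int server_cb_s_break;  /* per-server InitCallBackState count */
};

/* Sum noted before an RPC is issued; include_server selects old behaviour. */
static unsigned int calc_cb_break(const struct counters *c, bool include_server)
{
    unsigned int sum = c->vnode_cb_break + c->volume_cb_v_break;

    if (include_server)
        sum += c->server_cb_s_break;
    return sum;
}

/* The reply's callback is discarded if the sum changed during the RPC. */
static bool cb_is_broken(unsigned int noted, const struct counters *c,
                         bool include_server)
{
    return noted != calc_cb_break(c, include_server);
}
```

    With include_server set, a CB.InitCallBackState arriving between the
    note and the reply makes cb_is_broken() return true, which is step (5)
    of the loop; with it cleared, the reply's callback is accepted.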

    Signed-off-by: David Howells

    David Howells
     

24 Oct, 2018

3 commits

  • Implement support for talking to YFS-variant fileservers in the cache
    manager and the filesystem client. These implement upgraded services on
    the same port as their AFS services.

    YFS fileservers provide expanded capabilities over AFS.

    Signed-off-by: David Howells

    David Howells
     
  • Remove unnecessary details of a broken callback, such as version, expiry
    and type, from the afs_callback_break struct as they're not actually used
    and make the list take more memory.

    Signed-off-by: David Howells

    David Howells
     
  • Increase the sizes of the volume ID to 64 bits and the vnode ID (inode
    number equivalent) to 96 bits to allow the support of YFS.

    This requires the iget comparator to check the vnode->fid rather than i_ino
    and i_generation as i_ino is not sufficiently capacious. It also requires
    this data to be placed into the vnode cache key for fscache.

    For the moment, just discard the top 32 bits of the vnode ID when returning
    it through stat.
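
    A rough illustration of why the comparator has to look at the whole
    fid: two vnodes that differ only in the top 32 bits collapse to the
    same 64-bit inode number. The struct layout and function names below
    are assumptions made for this sketch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical fid layout: 64-bit volume ID plus a 96-bit vnode ID
 * split into a low 64-bit part and a high 32-bit part. */
struct fid {
    uint64_t vid;
    uint64_t vnode;      /* low 64 bits of the vnode ID */
    uint32_t vnode_hi;   /* top 32 bits of the vnode ID */
    uint32_t unique;
};

/* Inode number as reported through stat: the top 32 bits are discarded. */
static uint64_t fid_to_stat_ino(const struct fid *f)
{
    return f->vnode;
}

/* Full comparison, as an iget-style comparator would need to perform. */
static bool fid_equal(const struct fid *a, const struct fid *b)
{
    return a->vid == b->vid &&
           a->vnode == b->vnode &&
           a->vnode_hi == b->vnode_hi &&
           a->unique == b->unique;
}
```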

    Signed-off-by: David Howells

    David Howells
     

15 Jun, 2018

1 commit

  • At the moment, afs_break_callbacks calls afs_break_one_callback() for each
    separate FID it was given, and the latter looks up the volume individually
    for each one.

    However, this is inefficient if two or more FIDs have the same vid as we
    could reuse the volume. This is complicated by cell aliasing whereby we
    may have multiple cells sharing a volume and can therefore have multiple
    callback interests for any particular volume ID.

    At the moment afs_break_one_callback() scans the entire list of volumes
    we're getting from a server and breaks the appropriate callback in every
    matching volume, regardless of cell. This scan is done for every FID.

    Optimise callback breaking by the following means:

    (1) Sort the FID list by vid so that all FIDs belonging to the same volume
    are clumped together.

    This is done through the use of an indirection table as we cannot do
    an insertion sort on the afs_callback_break array as we decode FIDs
    into it as we subsequently also have to decode callback info into it
    that corresponds by array index only.

    We also don't really want to bubblesort afterwards if we can avoid it.

    (2) Sort the server->cb_interests array by vid so that all the matching
    volumes are grouped together. This permits the scan to stop after
    finding a record that has a higher vid.

    (3) When breaking FIDs, we try to keep server->cb_break_lock as long as
    possible, caching the start point in the array for that volume group
    as long as possible.

    It might make sense to add another layer in that list and have a
    refcounted volume ID anchor that has the matching interests attached
    to it rather than being in the list. This would allow the lock to be
    dropped without losing the cursor.
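
    The indirection-table idea in (1) can be sketched as below. Note that
    this sketch sorts the table with qsort() for brevity, which is not
    necessarily how the patch orders it; struct cb_break and
    build_sorted_index() are hypothetical names:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical decoded break record; entries must stay at their index. */
struct cb_break {
    uint64_t vid;
    uint64_t vnode;
};

/* Compare two table entries by vid, then by array position for stability. */
static int cmp_by_vid(const void *a, const void *b)
{
    const struct cb_break *x = *(const struct cb_break *const *)a;
    const struct cb_break *y = *(const struct cb_break *const *)b;

    if (x->vid < y->vid)
        return -1;
    if (x->vid > y->vid)
        return 1;
    return (x > y) - (x < y);
}

/* Fill index[] with pointers into breaks[] and sort only the pointers. */
static void build_sorted_index(struct cb_break *breaks,
                               struct cb_break **index, size_t n)
{
    for (size_t i = 0; i < n; i++)
        index[i] = &breaks[i];
    qsort(index, n, sizeof(index[0]), cmp_by_vid);
}
```

    The breaks[] array itself is never reordered, so anything later
    decoded into it by array index still lines up, while a walk over
    index[] sees all FIDs for one volume together.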

    Signed-off-by: David Howells

    David Howells
     

14 May, 2018

2 commits

  • It's possible for an AFS file server to issue a whole-volume notification
    that callbacks on all the vnodes in the volume have been broken. This is
    done for R/O and backup volumes (which don't have per-file callbacks) and
    for things like a volume being taken offline.

    Fix callback handling to detect whole-volume notifications, to track it
    across operations and to check it during inode validation.

    Fixes: c435ee34551e ("afs: Overhaul the callback handling")
    Signed-off-by: David Howells

    David Howells
     
  • The refcounting on afs_cb_interest struct objects in
    afs_register_server_cb_interest() is wrong as it uses the server list
    entry's call back interest pointer without regard for the fact that it
    might be replaced at any time and the object thrown away.

    Fix this by:

    (1) Put a lock on the afs_server_list struct that can be used to
    mediate access to the callback interest pointers in the servers array.

    (2) Keep a ref on the callback interest that we get from the entry.

    (3) Drop the old reference held by vnode->cb_interest if we replace
    the pointer.

    Fixes: c435ee34551e ("afs: Overhaul the callback handling")
    Signed-off-by: David Howells

    David Howells
     

10 Apr, 2018

3 commits

  • Processes like ld that do lots of small writes that aren't necessarily
    contiguous result in a lot of small StoreData operations to the server, the
    idea being that if someone else changes the data on the server, we only
    write our changes over that and not the space between. Further, we don't
    want to write back empty space if we can avoid it to make it easier for the
    server to do sparse files.

    However, making lots of tiny RPC ops is a lot less efficient for the server
    than one big one because each op requires allocation of resources and the
    taking of locks, so we want to compromise a bit.

    Reduce the load by the following:

    (1) If a file is just created locally or has just been truncated with
    O_TRUNC locally, allow subsequent writes to the file to be merged with
    intervening space if that space doesn't cross an entire intervening
    page.

    (2) Don't flush the file on ->flush() but rather on ->release() if the
    file was open for writing.

    Just linking vmlinux.o, without this patch, looking in /proc/fs/afs/stats:

    file-wr : n=441 nb=513581204

    and after the patch:

    file-wr : n=62 nb=513668555

    there were 379 fewer StoreData RPC operations at the expense of an extra
    87K being written.

    Signed-off-by: David Howells

    David Howells
     
  • When afs_lookup() is called, prospectively look up the next 50 uncached
    fids also from that same directory and cache the results, rather than just
    looking up the one file requested.

    This allows us to use the FS.InlineBulkStatus RPC op to increase efficiency
    by fetching up to 50 file statuses at a time.

    Signed-off-by: David Howells

    David Howells
     
  • Fix warnings raised by checker, including:

    (*) Warnings raised by unequal comparison for the purposes of sorting,
    where the endianness doesn't matter:

    fs/afs/addr_list.c:246:21: warning: restricted __be16 degrades to integer
    fs/afs/addr_list.c:246:30: warning: restricted __be16 degrades to integer
    fs/afs/addr_list.c:248:21: warning: restricted __be32 degrades to integer
    fs/afs/addr_list.c:248:49: warning: restricted __be32 degrades to integer
    fs/afs/addr_list.c:283:21: warning: restricted __be16 degrades to integer
    fs/afs/addr_list.c:283:30: warning: restricted __be16 degrades to integer

    (*) afs_set_cb_interest() is not actually used and can be removed.

    (*) afs_cell_gc_delay() should be provided with a sysctl.

    (*) afs_cell_destroy() needs to use rcu_access_pointer() to read
    cell->vl_addrs.

    (*) afs_init_fs_cursor() should be static.

    (*) struct afs_vnode::permit_cache needs to be marked __rcu.

    (*) afs_server_rcu() needs to use rcu_access_pointer().

    (*) afs_destroy_server() should use rcu_access_pointer() on
    server->addresses as the server object is no longer accessible.

    (*) afs_find_server() casts __be16/__be32 values to int in order to
    directly compare them for the purpose of finding a match in a list,
    but it should also annotate the cast with __force to avoid checker
    warnings.

    (*) afs_check_permit() accesses vnode->permit_cache outside of the RCU
    readlock, though it doesn't then access the value; the extraneous
    access is deleted.

    False positives:

    (*) Conditional locking around the code in xdr_decode_AFSFetchStatus. This
    can be dealt with in a separate patch.

    fs/afs/fsclient.c:148:9: warning: context imbalance in 'xdr_decode_AFSFetchStatus' - different lock contexts for basic block

    (*) Incorrect handling of seq-retry lock context balance:

    fs/afs/inode.c:455:38: warning: context imbalance in 'afs_getattr' - different lock contexts for basic block
    fs/afs/server.c:52:17: warning: context imbalance in 'afs_find_server' - different lock contexts for basic block
    fs/afs/server.c:128:17: warning: context imbalance in 'afs_find_server_by_uuid' - different lock contexts for basic block

    Errors:

    (*) afs_lookup_cell_rcu() needs to break out of the seq-retry loop, not go
    round again if it successfully found the workstation cell.

    (*) Fix UUID decode in afs_deliver_cb_probe_uuid().

    (*) afs_cache_permit() has a missing rcu_read_unlock() before one of the
    jumps to the someone_else_changed_it label. Move the unlock to after
    the label.

    (*) afs_vl_get_addrs_u() is using ntohl() rather than htonl() when
    encoding to XDR.

    (*) afs_deliver_yfsvl_get_endpoints() is using htonl() rather than ntohl()
    when decoding from XDR.
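
    For the last two items, the intended direction of each conversion is
    shown in this small userspace sketch. The helper names
    xdr_encode_u32() and xdr_decode_u32() are made up for illustration;
    htonl() and ntohl() are the standard functions from <arpa/inet.h>:

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Encode: host order to network (big-endian) order, so use htonl(). */
static uint32_t *xdr_encode_u32(uint32_t *bp, uint32_t host_val)
{
    *bp = htonl(host_val);
    return bp + 1;
}

/* Decode: network order back to host order, so use ntohl(). */
static const uint32_t *xdr_decode_u32(const uint32_t *bp, uint32_t *host_val)
{
    *host_val = ntohl(*bp);
    return bp + 1;
}
```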

    Signed-off-by: David Howells

    David Howells
     

13 Nov, 2017

3 commits

  • The current code assumes that volumes and servers are per-cell and are
    never shared, but this is not enforced, and, indeed, public cells do exist
    that are aliases of each other. Further, an organisation can, say, set up
    a public cell and a private cell with overlapping, but not identical, sets
    of servers. The difference is purely in the database attached to the VL
    servers.

    The current code will malfunction if it sees a server in two cells as it
    assumes global address -> server record mappings and that each server is in
    just one cell.

    Further, each server may have multiple addresses - and may have addresses
    of different families (IPv4 and IPv6, say).

    To this end, the following structural changes are made:

    (1) Server record management is overhauled:

    (a) Server records are made independent of cell. The namespace keeps
    track of them, volume records have lists of them and each vnode
    has a server on which its callback interest currently resides.

    (b) The cell record no longer keeps a list of servers known to be in
    that cell.

    (c) The server records are now kept in a flat list because there's no
    single address to sort on.

    (d) Server records are now keyed by their UUID within the namespace.

    (e) The addresses for a server are obtained with the VL.GetAddrsU
    rather than with VL.GetEntryByName, using the server's UUID as a
    parameter.

    (f) Cached server records are garbage collected after a period of
    non-use and are counted out of existence before purging is allowed
    to complete. This protects the work functions against rmmod.

    (g) The servers list is now in /proc/fs/afs/servers.

    (2) Volume record management is overhauled:

    (a) An RCU-replaceable server list is introduced. This tracks both
    servers and their corresponding callback interests.

    (b) The superblock is now keyed on cell record and numeric volume ID.

    (c) The volume record is now tied to the superblock which mounts it,
    and is activated when mounted and deactivated when unmounted.
    This makes it easier to handle the cache cookie without causing a
    double-use in fscache.

    (d) The volume record is loaded from the VLDB using VL.GetEntryByNameU
    to get the server UUID list.

    (e) The volume name is updated if it is seen to have changed when the
    volume is updated (the update is keyed on the volume ID).

    (3) The vlocation record is got rid of and VLDB records are no longer
    cached. Sufficient information is stored in the volume record, though
    an update to a volume record is now no longer shared between related
    volumes (volumes come in bundles of three: R/W, R/O and backup).

    and the following procedural changes are made:

    (1) The fileserver cursor introduced previously is now fleshed out and
    used to iterate over fileservers and their addresses.

    (2) Volume status is checked during iteration, and the server list is
    replaced if a change is detected.

    (3) Server status is checked during iteration, and the address list is
    replaced if a change is detected.

    (4) The abort code is saved into the address list cursor and -ECONNABORTED
    returned in afs_make_call() if a remote abort happened rather than
    translating the abort into an error message. This allows actions to
    be taken depending on the abort code more easily.

    (a) If a VMOVED abort is seen then this is handled by rechecking the
    volume and restarting the iteration.

    (b) If a VBUSY, VRESTARTING or VSALVAGING abort is seen then this is
    handled by sleeping for a short period and retrying and/or trying
    other servers that might serve that volume. A message is also
    displayed once until the condition has cleared.

    (c) If a VOFFLINE abort is seen, then this is handled as VBUSY for the
    moment.

    (d) If a VNOVOL abort is seen, the volume is rechecked in the VLDB to
    see if it has been deleted; if not, the fileserver is probably
    indicating that the volume couldn't be attached and needs
    salvaging.

    (e) If statfs() sees one of these aborts, it does not sleep, but
    rather returns an error, so as not to block the umount program.

    (5) The fileserver iteration functions in vnode.c are now merged into
    their callers and more heavily macroised around the cursor. vnode.c
    is removed.

    (6) Operations on a particular vnode are serialised on that vnode because
    the server will lock that vnode whilst it operates on it, so a second
    op sent will just have to wait.

    (7) Fileservers are probed with FS.GetCapabilities before being used.
    This is where service upgrade will be done.

    (8) A callback interest on a fileserver is set up before an FS operation
    is performed and passed through to afs_make_call() so that it can be
    set on the vnode if the operation returns a callback. The callback
    interest is passed through to afs_iget() also so that it can be set
    there too.
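
    The abort handling in point (4) above amounts to a small decision
    table, sketched here in userspace C. The enumerators are symbolic
    only (their values are not the wire abort codes), the action names
    are hypothetical, and applying the statfs() rule to the sleeping
    cases is this sketch's reading of (4)(e):

```c
#include <stdbool.h>

/* Symbolic abort names from the text; values here are arbitrary and are
 * NOT the on-the-wire abort codes. */
enum vol_abort {
    AB_VMOVED, AB_VBUSY, AB_VRESTARTING, AB_VSALVAGING,
    AB_VOFFLINE, AB_VNOVOL, AB_OTHER,
};

/* Hypothetical actions the iteration loop could take. */
enum action {
    ACT_RECHECK_VOLUME_AND_RESTART,
    ACT_SLEEP_AND_RETRY,
    ACT_RECHECK_VLDB,
    ACT_RETURN_ERROR,
};

/* Map an abort to an action; statfs callers never sleep. */
static enum action classify_abort(enum vol_abort code, bool is_statfs)
{
    switch (code) {
    case AB_VMOVED:
        return ACT_RECHECK_VOLUME_AND_RESTART;
    case AB_VBUSY:
    case AB_VRESTARTING:
    case AB_VSALVAGING:
    case AB_VOFFLINE:   /* handled as VBUSY for the moment */
        return is_statfs ? ACT_RETURN_ERROR : ACT_SLEEP_AND_RETRY;
    case AB_VNOVOL:
        return ACT_RECHECK_VLDB;
    default:
        return ACT_RETURN_ERROR;
    }
}
```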

    In general, record updating is done on an as-needed basis when we try to
    access servers, volumes or vnodes rather than offloading it to work items
    and special threads.

    Notes:

    (1) Pre AFS-3.4 servers are no longer supported, though this can be added
    back if necessary (AFS-3.4 was released in 1998).

    (2) VBUSY is retried forever for the moment at intervals of 1s.

    (3) /proc/fs/afs//servers no longer exists.

    Signed-off-by: David Howells

    David Howells
     
  • Overhaul the AFS callback handling by the following means:

    (1) Don't give up callback promises on vnodes that we are no longer using,
    rather let them just expire on the server or let the server break
    them. This is actually more efficient for the server as the callback
    lookup is expensive if there are lots of extant callbacks.

    (2) Only give up the callback promises we have from a server when the
    server record is destroyed. Then we can just give up *all* the
    callback promises on it in one go.

    (3) Servers can end up being shared between cells if cells are aliased, so
    don't add all the vnodes being backed by a particular server into a
    big FID-indexed tree on that server as there may be duplicates.

    Instead have each volume instance (~= superblock) register an interest
    in a server as it starts to make use of it and use this to allow the
    processor for callbacks from the server to find the superblock and
    thence the inode corresponding to the FID being broken by means of
    ilookup_nowait().

    (4) Rather than iterating over the entire callback list when a mass-break
    comes in from the server, maintain a counter of mass-breaks in
    afs_server (cb_seq) and make afs_validate() check it against the copy
    in afs_vnode.

    It would be nice not to have to take a read_lock whilst doing this,
    but that's tricky without using RCU.

    (5) Save a ref on the fileserver we're using for a call in the afs_call
    struct so that we can access its cb_s_break during call decoding.

    (6) Write-lock around callback and status storage in a vnode and read-lock
    around getattr so that we don't see the status mid-update.

    This has the following consequences:

    (1) Data invalidation isn't seen until someone calls afs_validate() on a
    vnode. Unfortunately, we need to use a key to query the server, but
    getting one from a background thread is tricky without caching loads
    of keys all over the place.

    (2) Mass invalidation isn't seen until someone calls afs_validate().

    (3) Callback breaking is going to hit the inode_hash_lock quite a bit.
    Could this be replaced with rcu_read_lock(), since inodes are destroyed
    under RCU conditions?
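
    The mass-break counter from change (4) and its lazy detection from
    consequence (2) can be illustrated with a small userspace model. This is
    only a sketch: cb_seq is the one name taken from the text above, every
    other struct, field and function name is hypothetical, and the locking
    the commit mentions is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structs.  Only cb_seq is named in
 * the commit message; everything else here is hypothetical. */
struct server_model {
    unsigned int cb_seq;         /* bumped once per mass-break */
};

struct vnode_model {
    const struct server_model *server;
    unsigned int cb_seq_seen;    /* copy of server->cb_seq at store time */
    bool cb_promised;            /* do we believe we hold a promise? */
};

/* Record a fresh callback promise and snapshot the server's counter. */
static void store_callback(struct vnode_model *v, const struct server_model *s)
{
    assert(v != NULL && s != NULL);
    v->server = s;
    v->cb_seq_seen = s->cb_seq;
    v->cb_promised = true;
}

/* A mass-break is O(1): no per-vnode callback list is walked. */
static void mass_break(struct server_model *s)
{
    s->cb_seq++;
}

/* Lazy check performed when the vnode is next validated.  Returns true if
 * the cached promise can still be trusted. */
static bool validate(struct vnode_model *v)
{
    if (v->cb_promised && v->cb_seq_seen != v->server->cb_seq)
        v->cb_promised = false;  /* a mass-break happened in between */
    return v->cb_promised;
}
```

    The trade-off is the one the commit message states: the break itself is
    cheap, but a vnode only notices it the next time validate() runs on it.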

    Signed-off-by: David Howells

    David Howells
     
  • Lay the groundwork for supporting network namespaces (netns) in the AFS
    filesystem by moving various global features to a network-namespace struct
    (afs_net) and providing an instance of this as a temporary global variable
    that everything uses via accessor functions for the moment.

    The following changes have been made:

    (1) Store the netns in the superblock info. This will be obtained from
    the mounter's nsproxy on a manual mount and inherited from the parent
    superblock on an automount.

    (2) The cell list is made per-netns. It can be viewed through
    /proc/net/afs/cells and also be modified by writing commands to that
    file.

    (3) The local workstation cell is set per-ns in /proc/net/afs/rootcell.
    This is unset by default.

    (4) The 'rootcell' module parameter, which sets a cell and VL server list,
    modifies the init net namespace, thereby theoretically allowing an AFS
    root fs to be used.

    (5) The volume location lists and the file lock manager are made
    per-netns.

    (6) The AF_RXRPC socket and associated I/O bits are made per-ns.

    The various workqueues remain global for the moment.

    Changes still to be made:

    (1) /proc/fs/afs/ should be moved to /proc/net/afs/ and a symlink emplaced
    from the old name.

    (2) A per-netns subsys needs to be registered for AFS into which it can
    store its per-netns data.

    (3) Rather than the AF_RXRPC socket being opened on module init, it needs
    to be opened on the creation of a superblock in that netns.

    (4) The socket needs to be closed when the last superblock using it is
    destroyed and all outstanding client calls on it have been completed.
    This prevents a reference loop on the namespace.

    (5) It is possible that several namespaces will want to use AFS, in which
    case each one will need its own UDP port. These can either be set
    through /proc/net/afs/cm_port or the kernel can pick one at random.
    The init_ns gets 7001 by default.

    Other issues that need resolving:

    (1) The DNS keyring needs net-namespacing.

    (2) Where do upcalls go (eg. DNS request-key upcall)?

    (3) Need something like open_socket_in_file_ns() syscall so that AFS
    command line tools attempting to operate on an AFS file/volume have
    their RPC calls go to the right place.

    Signed-off-by: David Howells

    David Howells
     

17 Mar, 2017

1 commit

  • get_seconds() returns real wall-clock seconds. On 32-bit systems
    this value will overflow in year 2038 and beyond. This patch changes
    afs's vlocation record to use ktime_get_real_seconds() instead, for the
    fields time_of_death and update_at.
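
    As a rough illustration of the limit involved, assuming the seconds
    value ends up in a signed 32-bit counter: INT32_MAX seconds after the
    epoch falls on 19 January 2038. The sketch below keeps the two fields
    named above as 64-bit values, the width ktime_get_real_seconds()
    returns. The struct and helper names are hypothetical, not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record holding the two timestamps as 64-bit seconds. */
struct vlocation_times {
    int64_t time_of_death;
    int64_t update_at;
};

/* True if a wall-clock seconds value is still representable in a signed
 * 32-bit counter. */
static bool fits_in_s32(int64_t secs)
{
    return secs >= INT32_MIN && secs <= INT32_MAX;
}

/* Schedule the next update 'delay' seconds after 'now'.  With 64-bit
 * fields the sum does not wrap when 'now' passes INT32_MAX. */
static void schedule_update(struct vlocation_times *t, int64_t now,
                            int64_t delay)
{
    assert(delay >= 0);
    t->update_at = now + delay;
}
```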

    Signed-off-by: Tina Ruchandani
    Signed-off-by: David Howells

    Tina Ruchandani
     

09 Jan, 2017

1 commit

  • The afs_wait_mode struct isn't really necessary. Client calls only use one
    of two modes (synchronous or asynchronous) and incoming calls don't use
    the wait mode at all. Replace it with a boolean parameter.
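
    The shape of that change can be sketched in userspace as follows. This
    is an illustrative model only: the names below are hypothetical and the
    real call path does far more than flip a state field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum call_state { CALL_IDLE, CALL_IN_FLIGHT, CALL_COMPLETE };

struct call_model {
    enum call_state state;
};

/* Stand-in for blocking until the reply has been received. */
static void wait_for_reply(struct call_model *call)
{
    call->state = CALL_COMPLETE;
}

/* One boolean selects between the only two behaviours a client call
 * needs, so no table of wait operations has to be passed around. */
static int make_call(struct call_model *call, bool async)
{
    assert(call != NULL);
    call->state = CALL_IN_FLIGHT;
    if (async)
        return 0;            /* completion is handled later, elsewhere */
    wait_for_reply(call);    /* synchronous: block here */
    return 0;
}
```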

    Signed-off-by: David Howells

    David Howells
     

05 Sep, 2016

1 commit

  • The workqueue "afs_callback_update_worker" queues multiple work items,
    viz. &vnode->cb_broken_work and &server->cb_break_work, which require strict
    execution ordering. Hence, an ordered dedicated workqueue has been used.

    Since the workqueue is being used on a memory reclaim path, WQ_MEM_RECLAIM
    has been set to ensure forward progress under memory pressure.

    Signed-off-by: Bhaktipriya Shridhar
    Signed-off-by: David Howells

    Bhaktipriya Shridhar
     

14 Aug, 2012

1 commit

  • Convert delayed_work users doing cancel_delayed_work() followed by
    queue_delayed_work() to mod_delayed_work().

    Most conversions are straightforward. Ones worth mentioning are:

    * drivers/edac/edac_mc.c: edac_mc_workq_setup() converted to always
    use mod_delayed_work() and cancel loop in
    edac_mc_reset_delay_period() is dropped.

    * drivers/platform/x86/thinkpad_acpi.c: No need to remember whether
    watchdog is active or not. @fan_watchdog_active and related code
    dropped.

    * drivers/power/charger-manager.c: Seemingly a lot of
    delayed_work_pending() abuse going on here.
    [delayed_]work_pending() are unsynchronized and racy when used like
    this. I converted one instance in fullbatt_handler(). Please
    convert the rest so that it invokes workqueue APIs for the intended
    target state rather than trying to game work item pending state
    transitions. e.g. if timer should be modified - call
    mod_delayed_work(), canceled - call cancel_delayed_work[_sync]().

    * drivers/thermal/thermal_sys.c: thermal_zone_device_set_polling()
    simplified. Note that round_jiffies() calls in this function are
    meaningless. round_jiffies() works on absolute jiffies, not the delta
    delay used by delayed_work.

    v2: Tomi pointed out that __cancel_delayed_work() users can't be
    safely converted to mod_delayed_work(). They could be calling it
    from irq context and if that happens while delayed_work_timer_fn()
    is running, it could deadlock. __cancel_delayed_work() users are
    dropped.
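
    The conversion pattern can be modelled in userspace as below. This is a
    single-threaded sketch with hypothetical names, not the kernel
    implementation: it has no timers or concurrency and only shows that one
    "modify" step reaches the same end state as cancel followed by queue.
    The return value mirrors mod_delayed_work(), which reports whether the
    work was already pending.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal model of a delayed work item: a pending flag and an expiry. */
struct dwork_model {
    bool pending;
    unsigned long expires;
};

/* Like cancel_delayed_work(): returns true if the item was pending. */
static bool model_cancel(struct dwork_model *dw)
{
    assert(dw != NULL);
    bool was_pending = dw->pending;
    dw->pending = false;
    return was_pending;
}

/* Like queue_delayed_work(): does nothing if the item is already pending. */
static bool model_queue(struct dwork_model *dw, unsigned long now,
                        unsigned long delay)
{
    assert(dw != NULL);
    if (dw->pending)
        return false;
    dw->pending = true;
    dw->expires = now + delay;
    return true;
}

/* Like mod_delayed_work(): (re)arms the item whether or not it was
 * pending, in one step.  Returns true if it was already pending. */
static bool model_mod(struct dwork_model *dw, unsigned long now,
                      unsigned long delay)
{
    assert(dw != NULL);
    bool was_pending = dw->pending;
    dw->pending = true;
    dw->expires = now + delay;
    return was_pending;
}
```

    In the real kernel the benefit of the single call is that there is no
    window between the cancel and the queue in which another context can
    change the item's state.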

    Signed-off-by: Tejun Heo
    Acked-by: Henrique de Moraes Holschuh
    Acked-by: Dmitry Torokhov
    Acked-by: Anton Vorontsov
    Acked-by: David Howells
    Cc: Tomi Valkeinen
    Cc: Jens Axboe
    Cc: Jiri Kosina
    Cc: Doug Thompson
    Cc: David Airlie
    Cc: Roland Dreier
    Cc: "John W. Linville"
    Cc: Zhang Rui
    Cc: Len Brown
    Cc: "J. Bruce Fields"
    Cc: Johannes Berg

    Tejun Heo
     

07 Jun, 2008

1 commit


17 Oct, 2007

1 commit

  • This patch contains the following possible cleanups:
    - make the following needlessly global functions static:
    - rxrpc.c: afs_send_pages()
    - vlocation.c: afs_vlocation_queue_for_updates()
    - write.c: afs_writepages_region()
    - make the following needlessly global variables static:
    - mntpt.c: afs_mntpt_expiry_timeout
    - proc.c: afs_vlocation_states[]
    - server.c: afs_server_timeout
    - vlocation.c: afs_vlocation_timeout
    - vlocation.c: afs_vlocation_update_timeout
    - #if 0 the following unused function:
    - cell.c: afs_get_cell_maybe()
    - #if 0 the following unused variables:
    - callback.c: afs_vnode_update_timeout
    - cmservice.c: struct afs_cm_workqueue

    Signed-off-by: Adrian Bunk
    Acked-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     

17 Jul, 2007

1 commit


22 May, 2007

1 commit

  • The first thing mm.h does is include sched.h, solely for the can_do_mlock()
    inline function, which dereferences "current". By dealing with
    can_do_mlock(), mm.h can be detached from sched.h, which is good. See below
    for why.

    This patch
    a) removes unconditional inclusion of sched.h from mm.h
    b) makes can_do_mlock() normal function in mm/mlock.c
    c) exports can_do_mlock() to not break compilation
    d) adds sched.h inclusions back to files that were getting it indirectly.
    e) adds less bloated headers to some files (asm/signal.h, jiffies.h) that were
    getting them indirectly

    Net result is:
    a) mm.h users would get less code to open, read, preprocess, parse, ... if
    they don't need sched.h
    b) sched.h stops being a dependency for a significant number of files:
    on x86_64 allmodconfig touching sched.h results in recompile of 4083 files,
    after patch it's only 3744 (-8.3%).

    Cross-compile tested on

    all arm defconfigs, all mips defconfigs, all powerpc defconfigs,
    alpha alpha-up
    arm
    i386 i386-up i386-defconfig i386-allnoconfig
    ia64 ia64-up
    m68k
    mips
    parisc parisc-up
    powerpc powerpc-up
    s390 s390-up
    sparc sparc-up
    sparc64 sparc64-up
    um-x86_64
    x86_64 x86_64-up x86_64-defconfig x86_64-allnoconfig

    as well as my two usual configs.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

10 May, 2007

1 commit

  • Make some miscellaneous changes to the AFS filesystem:

    (1) Assert RCU barriers on module exit to make sure RCU has finished with
    callbacks in this module.

    (2) Correctly handle the AFS server returning a zero-length read.

    (3) Split out data zapping calls into one function (afs_zap_data).

    (4) Rename some afs_file_*() functions to afs_*() where they apply to
    non-regular files too.

    (5) Be consistent about the presentation of volume ID:vnode ID in debugging
    output.

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     

03 May, 2007

1 commit


27 Apr, 2007

4 commits

  • Add support for the create, link, symlink, unlink, mkdir, rmdir and
    rename VFS operations to the in-kernel AFS filesystem.

    Also:

    (1) Fix dentry and inode revalidation. d_revalidate should only look at
    the state of the dentry. Revalidation of the contents of an inode pointed
    to by a dentry is now separate.

    (2) Fix afs_lookup() to hash negative dentries as well as positive ones.

    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     
  • Add security support to the AFS filesystem. Kerberos IV tickets are added as
    RxRPC keys to the session keyring with the klog program. open() and
    other VFS operations then find this ticket with request_key() and either use
    it immediately (eg: mkdir, unlink) or attach it to a file descriptor (open).

    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     
  • Make the in-kernel AFS filesystem use AF_RXRPC instead of the old RxRPC code.

    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     
  • Clean up the AFS sources.

    Also remove references to AFS keys. RxRPC keys are used instead.

    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     

08 Nov, 2005

1 commit


17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds