21 Aug, 2020

1 commit


11 May, 2020

1 commit

  • rxrpc currently uses a fixed 4s retransmission timeout until the RTT is
    sufficiently sampled. This can cause problems with some fileservers, with
    calls to the cache manager in the afs filesystem being dropped by the
    fileserver because a packet goes missing and the retransmission timeout is
    greater than the call expiry timeout.

    Fix this by:

    (1) Copying the RTT/RTO calculation code from Linux's TCP implementation
    and altering it to fit rxrpc.

    (2) Altering the various users of the RTT to make use of the new SRTT
    value.

    (3) Replacing the use of rxrpc_resend_timeout with the calculated RTO
    value (which is needed in jiffies), along with a backoff.

    Notes:

    (1) rxrpc provides RTT samples by matching the serial numbers on outgoing
    DATA packets that have the RXRPC_REQUEST_ACK set and PING ACK packets
    against the reference serial number in incoming REQUESTED ACK and
    PING-RESPONSE ACK packets.

    (2) Each packet that is transmitted on an rxrpc connection gets a new
    per-connection serial number, even for retransmissions, so an ACK can
    be cross-referenced to a specific trigger packet. This allows RTT
    information to be drawn from retransmitted DATA packets also.

    (3) rxrpc maintains the RTT/RTO state on the rxrpc_peer record rather than
    on an rxrpc_call because many RPC calls won't live long enough to
    generate more than one sample.

    (4) The calculated SRTT value is in units of 8ths of a microsecond rather
    than nanoseconds.

    The (S)RTT and RTO values are displayed in /proc/net/rxrpc/peers.
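
    As an illustration of the kind of calculation borrowed from TCP, here is a
    minimal userspace sketch of a smoothed-RTT estimator that keeps SRTT in
    8ths of a microsecond. The struct, field and function names are
    hypothetical, and the sketch omits the clamping, variance tracking and
    backoff that a real implementation would need.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-peer RTT state; names are illustrative only. */
struct rtt_state {
	int64_t srtt_us;	/* smoothed RTT, stored as 8 * microseconds */
	int64_t mdev_us;	/* mean deviation, stored as 4 * microseconds */
	int64_t rto_us;		/* retransmission timeout in microseconds */
};

/* Fold one RTT sample (in microseconds) into the smoothed state. */
static void rtt_add_sample(struct rtt_state *s, int64_t sample_us)
{
	int64_t err;

	assert(s && sample_us > 0);

	if (s->srtt_us == 0) {
		/* First sample: SRTT = sample, deviation = sample / 2. */
		s->srtt_us = sample_us * 8;
		s->mdev_us = sample_us * 2;
	} else {
		/* SRTT = 7/8 SRTT + 1/8 sample, applied to the scaled value. */
		err = sample_us - s->srtt_us / 8;
		s->srtt_us += err;
		/* MDEV = 3/4 MDEV + 1/4 |err|, applied to the scaled value. */
		if (err < 0)
			err = -err;
		s->mdev_us += err - s->mdev_us / 4;
	}

	/* RTO = SRTT + 4 * MDEV; mdev_us already carries the factor of 4. */
	s->rto_us = s->srtt_us / 8 + s->mdev_us;
}
```

    Keeping the state on a per-peer record, as note (3) describes, means that
    short calls to the same peer all feed the same estimator.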

    Fixes: 17926a79320a ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both")
    Signed-off-by: David Howells

    David Howells
     

21 Oct, 2019

1 commit


07 Oct, 2019

2 commits

  • The rxrpc_peer record needs to hold a reference on the rxrpc_local record
    it points to, as the peer is used as a base to access information in the
    rxrpc_local record.

    Not holding one can cause problems in __rxrpc_put_peer(), where we need the network
    namespace pointer, and in rxrpc_send_keepalive(), where we need to access
    the UDP socket, leading to symptoms like:

    BUG: KASAN: use-after-free in __rxrpc_put_peer net/rxrpc/peer_object.c:411
    [inline]
    BUG: KASAN: use-after-free in rxrpc_put_peer+0x685/0x6a0
    net/rxrpc/peer_object.c:435
    Read of size 8 at addr ffff888097ec0058 by task syz-executor823/24216

    Fix this by taking a ref on the local record for the peer record.
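
    The ownership rule can be sketched in userspace C as follows. This is a
    single-threaded illustration with plain integer counters and hypothetical
    names, not the kernel's refcounting code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the local endpoint and peer records. */
struct local { int usage; };
struct peer  { int usage; struct local *local; };

/* Create a peer that pins the local record it points to. */
static struct peer *peer_alloc(struct local *local)
{
	struct peer *peer;

	assert(local && local->usage > 0);

	peer = calloc(1, sizeof(*peer));
	if (!peer)
		return NULL;
	peer->usage = 1;
	peer->local = local;
	local->usage++;		/* held for the whole lifetime of the peer */
	return peer;
}

/* Drop a peer ref; the pin on local is released only with the last one. */
static void peer_put(struct peer *peer)
{
	assert(peer && peer->usage > 0);

	if (--peer->usage == 0) {
		struct local *local = peer->local;

		/* Any last use of local-derived state would happen here. */
		free(peer);
		local->usage--;
	}
}
```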

    Fixes: ace45bec6d77 ("rxrpc: Fix firewall route keepalive")
    Fixes: 2baec2c3f854 ("rxrpc: Support network namespacing")
    Reported-by: syzbot+b9be979c55f2bea8ed30@syzkaller.appspotmail.com
    Signed-off-by: David Howells

    David Howells
     
  • rxrpc_put_peer() calls trace_rxrpc_peer(), which looks at the debug_id in
    the peer record, after it has decremented the refcount. But unless the
    refcount was reduced to zero, we no longer have the right to look in the
    record and, indeed, it may be deleted by some other thread.

    Fix this by getting the debug_id out before decrementing the refcount and
    then passing that into the tracepoint.

    This can cause the following symptoms:

    BUG: KASAN: use-after-free in __rxrpc_put_peer net/rxrpc/peer_object.c:411
    [inline]
    BUG: KASAN: use-after-free in rxrpc_put_peer+0x685/0x6a0
    net/rxrpc/peer_object.c:435
    Read of size 8 at addr ffff888097ec0058 by task syz-executor823/24216
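
    The ordering described in the fix can be sketched with C11 atomics. The
    trace function here is a hypothetical stand-in that just records its
    arguments; the point is only that it is handed a copy of the ID and never
    dereferences the peer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct peer {
	atomic_int usage;
	unsigned int debug_id;
};

/* Stand-in for the tracepoint: records what it was given. */
static unsigned int last_traced_id;
static int last_traced_usage;

static void trace_peer(unsigned int debug_id, int usage)
{
	last_traced_id = debug_id;
	last_traced_usage = usage;
}

static void peer_put(struct peer *peer)
{
	unsigned int debug_id;
	int n;

	assert(peer);

	debug_id = peer->debug_id;	/* read while our ref is still held */
	n = atomic_fetch_sub(&peer->usage, 1) - 1;
	trace_peer(debug_id, n);	/* uses the copy, not *peer */
	if (n == 0)
		free(peer);		/* only the last putter may touch it */
}
```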

    Fixes: 1159d4b496f5 ("rxrpc: Add a tracepoint to track rxrpc_peer refcounting")
    Reported-by: syzbot+b9be979c55f2bea8ed30@syzkaller.appspotmail.com
    Signed-off-by: David Howells

    David Howells
     

05 Oct, 2019

1 commit


30 Jul, 2019

1 commit

  • There is a potential deadlock in rxrpc_peer_keepalive_dispatch() whereby
    rxrpc_put_peer() is called with the peer_hash_lock held, but if it reduces
    the peer's refcount to 0, rxrpc_put_peer() calls __rxrpc_put_peer() - which
    then tries to take the already held lock.

    Fix this by providing a version of rxrpc_put_peer() that can be called in
    situations where the lock is already held.

    The bug may produce the following lockdep report:

    ============================================
    WARNING: possible recursive locking detected
    5.2.0-next-20190718 #41 Not tainted
    --------------------------------------------
    kworker/0:3/21678 is trying to acquire lock:
    00000000aa5eecdf (&(&rxnet->peer_hash_lock)->rlock){+.-.}, at: spin_lock_bh
    /./include/linux/spinlock.h:343 [inline]
    00000000aa5eecdf (&(&rxnet->peer_hash_lock)->rlock){+.-.}, at:
    __rxrpc_put_peer /net/rxrpc/peer_object.c:415 [inline]
    00000000aa5eecdf (&(&rxnet->peer_hash_lock)->rlock){+.-.}, at:
    rxrpc_put_peer+0x2d3/0x6a0 /net/rxrpc/peer_object.c:435

    but task is already holding lock:
    00000000aa5eecdf (&(&rxnet->peer_hash_lock)->rlock){+.-.}, at: spin_lock_bh
    /./include/linux/spinlock.h:343 [inline]
    00000000aa5eecdf (&(&rxnet->peer_hash_lock)->rlock){+.-.}, at:
    rxrpc_peer_keepalive_dispatch /net/rxrpc/peer_event.c:378 [inline]
    00000000aa5eecdf (&(&rxnet->peer_hash_lock)->rlock){+.-.}, at:
    rxrpc_peer_keepalive_worker+0x6b3/0xd02 /net/rxrpc/peer_event.c:430
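
    The shape of the fix, a second put function for callers that already hold
    the lock, can be sketched as below. The toy lock simply asserts on
    recursive acquisition, standing in for what lockdep reports; all names are
    hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy non-recursive lock: asserts instead of deadlocking. */
struct toy_lock { bool held; };

static void toy_lock_acquire(struct toy_lock *l) { assert(!l->held); l->held = true; }
static void toy_lock_release(struct toy_lock *l) { assert(l->held); l->held = false; }

struct peer { int usage; bool hashed; };

static struct toy_lock peer_hash_lock;

/* Remove the peer from the hash; the caller must hold peer_hash_lock. */
static void peer_unhash_locked(struct peer *peer)
{
	assert(peer_hash_lock.held);
	peer->hashed = false;
}

/* For callers that do not hold the lock. */
static void peer_put(struct peer *peer)
{
	if (--peer->usage == 0) {
		toy_lock_acquire(&peer_hash_lock);
		peer_unhash_locked(peer);
		toy_lock_release(&peer_hash_lock);
	}
}

/* For callers that already hold the lock, such as a keepalive dispatcher. */
static void peer_put_locked(struct peer *peer)
{
	if (--peer->usage == 0)
		peer_unhash_locked(peer);
}
```

    Calling peer_put() with the lock held would trip the assertion in
    toy_lock_acquire(), which is the userspace analogue of the report above.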

    Fixes: 330bdcfadcee ("rxrpc: Fix the keepalive generator [ver #2]")
    Reported-by: syzbot+72af434e4b3417318f84@syzkaller.appspotmail.com
    Signed-off-by: David Howells
    Reviewed-by: Marc Dionne
    Reviewed-by: Jeffrey Altman

    David Howells
     

31 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 3029 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

09 Oct, 2018

1 commit

  • The rxrpc_input_packet() function and its call tree were built around the
    assumption that the data_ready() handler, called from UDP to inform a kernel
    service that there is data to be had, was non-reentrant. This meant that
    certain locking could be dispensed with.

    This, however, turns out not to be the case with a multi-queue network card
    that can deliver packets to multiple cpus simultaneously. Each of those
    cpus can be in the rxrpc_input_packet() function at the same time.

    Fix by adding or changing some structure members:

    (1) Add peer->rtt_input_lock to serialise access to the RTT buffer.

    (2) Make conn->service_id into a 32-bit variable so that it can be
    cmpxchg'd on all arches.

    (3) Add call->input_lock to serialise access to the Rx/Tx state. Note
    that although the Rx and Tx states are (almost) entirely separate,
    there's no point completing the separation and having separate locks
    since it's a bi-phasal RPC protocol rather than a bidirectional
    streaming protocol. Data transmission and data reception do not take
    place simultaneously on any particular call.

    and making the following functional changes:

    (1) In rxrpc_input_data(), hold call->input_lock around the core to
    prevent simultaneous producing of packets into the Rx ring and
    updating of tracking state for a particular call.

    (2) In rxrpc_input_ping_response(), only read call->ping_serial once, and
    check it before checking RXRPC_CALL_PINGING as that's a cheaper test.
    The bit test and bit clear can then be combined. No further locking
    is needed here.

    (3) In rxrpc_input_ack(), take call->input_lock after we've parsed much of
    the ACK packet. The superseded ACK check is then done both before and
    after the lock is taken.

    The handling of ackinfo data is split, parsing before the lock is taken
    and processing with it held. This is keyed on rxMTU being non-zero.

    Congestion management is also done within the locked section.

    (4) In rxrpc_input_ackall(), take call->input_lock around the Tx window
    rotation. The ACKALL packet carries no information and is only really
    useful after all packets have been transmitted since it's imprecise.

    (5) In rxrpc_input_implicit_end_call(), we use rx->incoming_lock to
    prevent calls being simultaneously implicitly ended on two cpus and
    also to prevent any races with incoming call setup.

    (6) In rxrpc_input_packet(), use cmpxchg() to effect the service upgrade
    on a connection. It is only permitted to happen once for a
    connection.

    (7) In rxrpc_new_incoming_call(), we have to recheck the routing inside
    rx->incoming_lock to see if someone else set up the call, connection
    or peer whilst we were getting there. We can't trust the values from
    the earlier routing check unless we pin refs on them - which we want
    to avoid.

    Further, we need to allow for an incoming call to have its state
    changed on another CPU between us making it live and us adjusting it
    because the conn is now in the RXRPC_CONN_SERVICE state.

    (8) In rxrpc_peer_add_rtt(), take peer->rtt_input_lock around the access
    to the RTT buffer. Don't need to lock around setting peer->rtt.
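
    The once-only service upgrade in functional change (6) can be modelled in
    userspace with a C11 compare-and-exchange. The struct layout and function
    name are hypothetical; the sketch only shows why a 32-bit atomic field
    lets concurrent CPUs agree on a single winner without a lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct conn {
	_Atomic uint32_t service_id;	/* current ID; may be upgraded once */
	uint32_t orig_service_id;	/* ID the connection was created with */
};

/* Returns true if the connection ends up using new_id. */
static bool conn_try_upgrade(struct conn *conn, uint32_t new_id)
{
	uint32_t expected;

	assert(conn);

	expected = conn->orig_service_id;
	/* Succeeds only if nobody has upgraded the connection yet. */
	if (atomic_compare_exchange_strong(&conn->service_id, &expected, new_id))
		return true;
	/* Lost the race: acceptable only if the winner chose the same ID. */
	return expected == new_id;
}
```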

    For reference, the inventory of state-accessing or state-altering functions
    used by the packet input procedure is:

    > rxrpc_input_packet()
    * PACKET CHECKING

    * ROUTING
    > rxrpc_post_packet_to_local()
    > rxrpc_find_connection_rcu() - uses RCU
    > rxrpc_lookup_peer_rcu() - uses RCU
    > rxrpc_find_service_conn_rcu() - uses RCU
    > idr_find() - uses RCU

    * CONNECTION-LEVEL PROCESSING
    - Service upgrade
    - Can only happen once per conn
    ! Changed to use cmpxchg
    > rxrpc_post_packet_to_conn()
    - Setting conn->hi_serial
    - Probably safe not using locks
    - Maybe use cmpxchg

    * CALL-LEVEL PROCESSING
    > Old-call checking
    > rxrpc_input_implicit_end_call()
    > rxrpc_call_completed()
    > rxrpc_queue_call()
    ! Need to take rx->incoming_lock
    > __rxrpc_disconnect_call()
    > rxrpc_notify_socket()
    > rxrpc_new_incoming_call()
    - Uses rx->incoming_lock for the entire process
    - Might be able to drop this earlier in favour of the call lock
    > rxrpc_incoming_call()
    ! Conflicts with rxrpc_input_implicit_end_call()
    > rxrpc_send_ping()
    - Don't need locks to check rtt state
    > rxrpc_propose_ACK

    * PACKET DISTRIBUTION
    > rxrpc_input_call_packet()
    > rxrpc_input_data()
    * QUEUE DATA PACKET ON CALL
    > rxrpc_reduce_call_timer()
    - Uses timer_reduce()
    ! Needs call->input_lock()
    > rxrpc_receiving_reply()
    ! Needs locking around ack state
    > rxrpc_rotate_tx_window()
    > rxrpc_end_tx_phase()
    > rxrpc_proto_abort()
    > rxrpc_input_dup_data()
    - Fills the Rx buffer
    - rxrpc_propose_ACK()
    - rxrpc_notify_socket()

    > rxrpc_input_ack()
    * APPLY ACK PACKET TO CALL AND DISCARD PACKET
    > rxrpc_input_ping_response()
    - Probably doesn't need any extra locking
    ! Need READ_ONCE() on call->ping_serial
    > rxrpc_input_check_for_lost_ack()
    - Takes call->lock to consult Tx buffer
    > rxrpc_peer_add_rtt()
    ! Needs to take a lock (peer->rtt_input_lock)
    ! Could perhaps manage with cmpxchg() and xadd() instead
    > rxrpc_input_requested_ack
    - Consults Tx buffer
    ! Probably needs a lock
    > rxrpc_peer_add_rtt()
    > rxrpc_propose_ack()
    > rxrpc_input_ackinfo()
    - Changes call->tx_winsize
    ! Use cmpxchg to handle change
    ! Should perhaps track serial number
    - Uses peer->lock to record MTU specification changes
    > rxrpc_proto_abort()
    ! Need to take call->input_lock
    > rxrpc_rotate_tx_window()
    > rxrpc_end_tx_phase()
    > rxrpc_input_soft_acks()
    - Consults the Tx buffer
    > rxrpc_congestion_management()
    - Modifies the Tx annotations
    ! Needs call->input_lock()
    > rxrpc_queue_call()

    > rxrpc_input_abort()
    * APPLY ABORT PACKET TO CALL AND DISCARD PACKET
    > rxrpc_set_call_completion()
    > rxrpc_notify_socket()

    > rxrpc_input_ackall()
    * APPLY ACKALL PACKET TO CALL AND DISCARD PACKET
    ! Need to take call->input_lock
    > rxrpc_rotate_tx_window()
    > rxrpc_end_tx_phase()

    > rxrpc_reject_packet()

    There are some functions used by the above that queue the packet, after
    which the procedure is terminated:

    - rxrpc_post_packet_to_local()
    - local->event_queue is an sk_buff_head
    - local->processor is a work_struct
    - rxrpc_post_packet_to_conn()
    - conn->rx_queue is an sk_buff_head
    - conn->processor is a work_struct
    - rxrpc_reject_packet()
    - local->reject_queue is an sk_buff_head
    - local->processor is a work_struct

    And some that offload processing to process context:

    - rxrpc_notify_socket()
    - Uses RCU lock
    - Uses call->notify_lock to call call->notify_rx
    - Uses call->recvmsg_lock to queue recvmsg side
    - rxrpc_queue_call()
    - call->processor is a work_struct
    - rxrpc_propose_ACK()
    - Uses call->lock to wrap __rxrpc_propose_ACK()

    And a bunch that complete a call, all of which use call->state_lock to
    protect the call state:

    - rxrpc_call_completed()
    - rxrpc_set_call_completion()
    - rxrpc_abort_call()
    - rxrpc_proto_abort()
    - Also uses rxrpc_queue_call()

    Fixes: 17926a79320a ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both")
    Signed-off-by: David Howells

    David Howells
     

05 Oct, 2018

1 commit


28 Sep, 2018

2 commits

  • Fix error distribution by immediately delivering the errors to all the
    affected calls rather than deferring them to a worker thread. The problem
    with the latter is that retries and other activity can happen in the
    meantime, when we want to stop them sooner.

    To this end:

    (1) Stop the error distributor from removing calls from the error_targets
    list so that peer->lock isn't needed to synchronise against other adds
    and removals.

    (2) Require the peer's error_targets list to be accessed with RCU, thereby
    avoiding the need to take peer->lock over distribution.

    (3) Don't attempt to affect a call's state if it is already marked complete.

    Signed-off-by: David Howells

    David Howells
     
  • Make the following changes to improve the robustness of the code that sets
    up a new service call:

    (1) Cache the rxrpc_sock struct obtained in rxrpc_data_ready() to do a
    service ID check and pass that along to rxrpc_new_incoming_call().
    This means that I can remove the check from rxrpc_new_incoming_call()
    without the need to worry about the socket attached to the local
    endpoint getting replaced - which would invalidate the check.

    (2) Cache the rxrpc_peer struct, thereby allowing the peer search to be
    done once. The peer is passed to rxrpc_new_incoming_call(), thereby
    saving the need to repeat the search.

    This also reduces the possibility of rxrpc_publish_service_conn()
    BUG()'ing due to the detection of a duplicate connection, despite the
    initial search done by rxrpc_find_connection_rcu() having turned up
    nothing.

    This BUG() shouldn't ever get hit since rxrpc_data_ready() *should* be
    non-reentrant and the result of the initial search should still hold
    true, but it has proven possible to hit.

    I *think* this may be due to __rxrpc_lookup_peer_rcu() cutting short
    the iteration over the hash table if it finds a matching peer with a
    zero usage count, but I don't know for sure since it's only ever been
    hit once that I know of.

    Another possibility is that a bug in rxrpc_data_ready() that checked
    the wrong byte in the header for the RXRPC_CLIENT_INITIATED flag
    might've let through a packet that caused a spurious and invalid call
    to be set up. That is addressed in another patch.

    (3) Fix __rxrpc_lookup_peer_rcu() to skip peer records that have a zero
    usage count rather than stopping and returning not found, just in case
    there's another peer record behind it in the bucket.

    (4) Don't search the peer records in rxrpc_alloc_incoming_call(), but
    rather either use the peer cached in (2) or, if one wasn't found,
    preemptively install a new one.
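
    The lookup fix in (3) comes down to continuing past a dying record rather
    than giving up. A minimal sketch over a singly linked bucket, with a
    hypothetical record layout and an integer standing in for the hash key:

```c
#include <stddef.h>

struct peer {
	int usage;		/* zero means the record is being torn down */
	int key;		/* stand-in for the address-based hash key */
	struct peer *next;	/* next record in the same bucket */
};

/* Find a live peer matching key, skipping records with a zero usage count. */
static struct peer *bucket_lookup(struct peer *head, int key)
{
	struct peer *p;

	for (p = head; p; p = p->next) {
		if (p->key != key)
			continue;
		if (p->usage == 0)
			continue;	/* keep looking: a live twin may follow */
		return p;
	}
	return NULL;
}
```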

    Fixes: 8496af50eb38 ("rxrpc: Use RCU to access a peer's service connection tree")
    Signed-off-by: David Howells

    David Howells
     

14 Aug, 2018

1 commit

  • Pull locking/atomics update from Thomas Gleixner:
    "The locking, atomics and memory model brains delivered:

    - A larger update to the atomics code which reworks the ordering
    barriers, consolidates the atomic primitives, provides the new
    atomic64_fetch_add_unless() primitive and cleans up the include
    hell.

    - Simplify cmpxchg() instrumentation and add instrumentation for
    xchg() and cmpxchg_double().

    - Updates to the memory model and documentation"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
    locking/atomics: Rework ordering barriers
    locking/atomics: Instrument cmpxchg_double*()
    locking/atomics: Instrument xchg()
    locking/atomics: Simplify cmpxchg() instrumentation
    locking/atomics/x86: Reduce arch_cmpxchg64*() instrumentation
    tools/memory-model: Rename litmus tests to comply to norm7
    tools/memory-model/Documentation: Fix typo, smb->smp
    sched/Documentation: Update wake_up() & co. memory-barrier guarantees
    locking/spinlock, sched/core: Clarify requirements for smp_mb__after_spinlock()
    sched/core: Use smp_mb() in wake_woken_function()
    tools/memory-model: Add informal LKMM documentation to MAINTAINERS
    locking/atomics/Documentation: Describe atomic_set() as a write operation
    tools/memory-model: Make scripts executable
    tools/memory-model: Remove ACCESS_ONCE() from model
    tools/memory-model: Remove ACCESS_ONCE() from recipes
    locking/memory-barriers.txt/kokr: Update Korean translation to fix broken DMA vs. MMIO ordering example
    MAINTAINERS: Add Daniel Lustig as an LKMM reviewer
    tools/memory-model: Fix ISA2+pooncelock+pooncelock+pombonce name
    tools/memory-model: Add litmus test for full multicopy atomicity
    locking/refcount: Always allow checked forms
    ...

    Linus Torvalds
     

09 Aug, 2018

1 commit

  • AF_RXRPC has a keepalive message generator that generates a message for a
    peer ~20s after the last transmission to that peer to keep firewall ports
    open. The implementation is incorrect in the following ways:

    (1) It mixes up ktime_t and time64_t types.

    (2) It uses ktime_get_real(), the output of which may jump forward or
    backward due to adjustments to the time of day.

    (3) If the current time jumps forward too much or jumps backwards, the
    generator function will crank the base of the time ring round one slot
    at a time (ie. a 1s period) until it catches up, spewing out VERSION
    packets as it goes.

    Fix the problem by:

    (1) Only using time64_t. There's no need for sub-second resolution.

    (2) Use ktime_get_seconds() rather than ktime_get_real() so that time
    isn't perceived to go backwards.

    (3) Simplifying rxrpc_peer_keepalive_worker() by splitting it into two
    parts:

    (a) The "worker" function that manages the buckets and the timer.

    (b) The "dispatch" function that takes the pending peers and
    potentially transmits a keepalive packet before putting them back
    in the ring into the slot appropriate to the revised last-Tx time.

    (4) Taking everything that's pending out of the ring and splicing it into
    a temporary collector list for processing.

    In the case that there's been a significant jump forward, the ring
    gets entirely emptied and then the time base can be warped forward
    before the peers are processed.

    The warping can't happen if the ring isn't empty because the slot a
    peer is in is keepalive-time dependent, relative to the base time.

    (5) Limit the number of iterations of the bucket array when scanning it.

    (6) Set the timer to skip any empty slots as there's no point waking up if
    there's nothing to do yet.
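
    To illustrate why a peer's slot depends on the base time, here is a
    hypothetical slot calculation for a ring with one slot per second. The
    ring size, the clamping policy and all names are assumptions made for the
    sketch; only the 20s keepalive period comes from the description above.

```c
#include <stdint.h>

#define KEEPALIVE_TIME	20	/* seconds between keepalives */
#define RING_SIZE	32	/* assumed: power of two above KEEPALIVE_TIME */

/*
 * Pick the ring slot for a peer whose next keepalive is due at keepalive_at,
 * given the ring's base time (both in whole seconds) and the index of the
 * slot that corresponds to the base time.
 */
static unsigned int keepalive_slot(int64_t base, unsigned int cursor,
				   int64_t keepalive_at)
{
	int64_t slot = keepalive_at - base;

	/* Overdue, or implausibly far ahead after a clock warp: clamp. */
	if (slot < 0)
		slot = 0;
	if (slot > KEEPALIVE_TIME)
		slot = KEEPALIVE_TIME;
	return (cursor + (unsigned int)slot) & (RING_SIZE - 1);
}
```

    Because the result is relative to base, moving the base while peers are
    still filed in the ring would leave them in the wrong slots, which is why
    the ring has to be emptied before the time base is warped.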

    This can be triggered by an incoming call from a server after a reboot with
    AF_RXRPC and AFS built into the kernel causing a peer record to be set up
    before userspace is started. The system clock is then adjusted by
    userspace, thereby potentially causing the keepalive generator to have a
    meltdown - which leads to a message like:

    watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/0:1:23]
    ...
    Workqueue: krxrpcd rxrpc_peer_keepalive_worker
    EIP: lock_acquire+0x69/0x80
    ...
    Call Trace:
    ? rxrpc_peer_keepalive_worker+0x5e/0x350
    ? _raw_spin_lock_bh+0x29/0x60
    ? rxrpc_peer_keepalive_worker+0x5e/0x350
    ? rxrpc_peer_keepalive_worker+0x5e/0x350
    ? __lock_acquire+0x3d3/0x870
    ? process_one_work+0x110/0x340
    ? process_one_work+0x166/0x340
    ? process_one_work+0x110/0x340
    ? worker_thread+0x39/0x3c0
    ? kthread+0xdb/0x110
    ? cancel_delayed_work+0x90/0x90
    ? kthread_stop+0x70/0x70
    ? ret_from_fork+0x19/0x24

    Fixes: ace45bec6d77 ("rxrpc: Fix firewall route keepalive")
    Reported-by: kernel test robot
    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     

21 Jun, 2018

1 commit

  • While __atomic_add_unless() was originally intended as a building-block
    for atomic_add_unless(), it's now used in a number of places around the
    kernel. It's the only common atomic operation named __atomic*(), rather
    than atomic_*(), and for consistency it would be better named
    atomic_fetch_add_unless().

    This lack of consistency is slightly confusing, and gets in the way of
    scripting atomics. Given that, let's clean things up and promote it to
    an official part of the atomics API, in the form of
    atomic_fetch_add_unless().

    This patch converts definitions and invocations over to the new name,
    including the instrumented version, using the following script:

    ----
    git grep -w __atomic_add_unless | while read line; do
    sed -i '{s/\<__atomic_add_unless\>/atomic_fetch_add_unless/}' "${line%%:*}";
    done
    git grep -w __arch_atomic_add_unless | while read line; do
    sed -i '{s/\<__arch_atomic_add_unless\>/arch_atomic_fetch_add_unless/}' "${line%%:*}";
    done
    ----

    Note that we do not have atomic{64,_long}_fetch_add_unless(), which will
    be introduced by later patches.

    There should be no functional change as a result of this patch.
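
    For readers unfamiliar with the primitive, its semantics can be modelled
    in userspace with C11 atomics: add a to the variable unless it currently
    equals u, and return the old value either way. This is a model of the
    behaviour, not the kernel implementation, and the helper names are
    hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Add a to *v unless *v == u; return the value *v held beforehand. */
static int fetch_add_unless(atomic_int *v, int a, int u)
{
	int old = atomic_load(v);

	do {
		if (old == u)
			break;
		/* On failure, old is refreshed with the current value. */
	} while (!atomic_compare_exchange_weak(v, &old, old + a));

	return old;
}

/* Typical use: take a reference only if the count has not reached zero. */
static bool get_ref_maybe(atomic_int *usage)
{
	return fetch_add_unless(usage, 1, 0) != 0;
}
```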

    Signed-off-by: Mark Rutland
    Reviewed-by: Will Deacon
    Acked-by: Geert Uytterhoeven
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Palmer Dabbelt
    Cc: Boqun Feng
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Link: https://lore.kernel.org/lkml/20180621121321.4761-2-mark.rutland@arm.com
    Signed-off-by: Ingo Molnar

    Mark Rutland
     

31 Mar, 2018

3 commits

  • When a new client call is requested, an rxrpc_conn_parameters struct object
    is passed in with a bunch of parameters set, such as the local endpoint to
    use. A pointer to the target peer record is also placed in there by
    rxrpc_get_client_conn() - and this is removed if and only if a new
    connection object is allocated. Thus it leaks if a new connection object
    isn't allocated.

    Fix this by putting any peer object attached to the rxrpc_conn_parameters
    object in the function that allocated it.

    Fixes: 19ffa01c9c45 ("rxrpc: Use structs to hold connection params and protocol info")
    Signed-off-by: David Howells

    David Howells
     
  • Add a tracepoint to track reference counting on the rxrpc_peer struct.

    Signed-off-by: David Howells

    David Howells
     
  • Fix the firewall route keepalive part of AF_RXRPC which is currently
    functioning incorrectly by replying to VERSION REPLY packets from the server
    with VERSION REQUEST packets.

    Instead, send VERSION REPLY packets to the peers of service connections to
    act as keep-alives 20s after the latest packet was transmitted to that
    peer.

    Also, just discard VERSION REPLY packets rather than replying to them.

    Signed-off-by: David Howells

    David Howells
     

18 Oct, 2017

1 commit

  • Provide a couple of functions to allow cleaner handling of signals in a
    kernel service. They are:

    (1) rxrpc_kernel_get_rtt()

    This allows the kernel service to find out the RTT time for a call, so
    as to better judge how large a timeout to employ.

    Note, though, that whilst this returns a value in nanoseconds, the
    timeouts can only actually be in jiffies.

    (2) rxrpc_kernel_check_life()

    This returns a number that is updated when ACKs are received from the
    peer (notably including PING RESPONSE ACKs which we can elicit by
    sending PING ACKs to see if the call still exists on the server).

    The caller should compare the numbers returned by two successive
    invocations to see if the call is still alive.

    These can be used to provide an extending timeout rather than returning
    immediately in the case that a signal occurs that would otherwise abort an
    RPC operation. The timeout would be extended if the server is still
    responsive and the call is still apparently alive on the server.

    For most operations this isn't that necessary - but for FS.StoreData it is:
    OpenAFS writes the data to storage as it comes in without making a backup,
    so if we immediately abort it when partially complete on a CTRL+C, say, we
    have no idea of the state of the file after the abort.
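
    The comparison of successive life values might be used along these lines.
    The fake call structure and both functions are hypothetical stand-ins
    written for this sketch; they do not reproduce the signature of
    rxrpc_kernel_check_life().

```c
#include <stdbool.h>

/* Stand-in for a call whose life counter advances when ACKs arrive. */
struct fake_call { unsigned int life; };

static unsigned int check_life(const struct fake_call *call)
{
	return call->life;
}

/*
 * Called each time a wait times out after a signal: returns true if the
 * counter moved since the previous check, meaning the peer has ACK'd
 * something and the timeout may reasonably be extended.
 */
static bool call_still_alive(const struct fake_call *call,
			     unsigned int *last_life)
{
	unsigned int life = check_life(call);
	bool alive = (life != *last_life);

	*last_life = life;
	return alive;
}
```

    A waiter would seed last_life once, then abort the operation only when
    call_still_alive() returns false for a whole timeout period.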

    Signed-off-by: David Howells

    David Howells
     

15 Jun, 2017

1 commit

  • Cache the congestion window setting that was determined during a call's
    transmission phase when it finishes so that it can be used by the next call
    to the same peer, thereby shortcutting the slow-start algorithm.

    The value is stored in the rxrpc_peer struct and is accessed without
    locking. Each call takes the value that happens to be there when it starts
    and just overwrites the value when it finishes.

    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     

26 May, 2017

1 commit

  • Support network namespacing in AF_RXRPC with the following changes:

    (1) All the local endpoint, peer and call lists, locks, counters, etc. are
    moved into the per-namespace record.

    (2) All the connection tracking is moved into the per-namespace record
    with the exception of the client connection ID tree, which is kept
    global so that connection IDs are kept unique per-machine.

    (3) Each namespace gets its own epoch. This allows each network namespace
    to pretend to be a separate client machine.

    (4) The /proc/net/rxrpc_xxx files are now called /proc/net/rxrpc/xxx and
    the contents reflect the namespace.

    fs/afs/ should be okay with this patch as it explicitly requires the current
    net namespace to be init_net to permit a mount to proceed at the moment. It
    will, however, need updating so that cells, IP addresses and DNS records are
    per-namespace also.

    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     

13 Oct, 2016

1 commit


22 Sep, 2016

1 commit

  • Reduce the number of ACK-Requests we set on DATA packets that we're sending
    to reduce network traffic. We set the flag on odd-numbered DATA packets to
    start off the RTT cache until we have at least three entries in it and then
    probe once per second thereafter to keep it topped up.

    This could be made tunable in future.

    Note that from this point, the RXRPC_REQUEST_ACK flag is set on DATA
    packets as we transmit them and not stored statically in the sk_buff.
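
    The policy described above can be written as a small predicate. The
    function name and the millisecond timestamps are assumptions made for the
    sketch, and a real sender would have further reasons to request an ACK
    that are not modelled here.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether to set the ACK-request flag on the DATA packet with the
 * given sequence number, based only on the state of the RTT cache.
 */
static bool want_request_ack(uint32_t seq, unsigned int rtt_samples,
			     int64_t now_ms, int64_t last_request_ms)
{
	/* Prime the RTT cache: ask on odd-numbered packets. */
	if (rtt_samples < 3)
		return (seq & 1) != 0;

	/* Cache is warm: probe about once per second to keep it fresh. */
	return now_ms - last_request_ms >= 1000;
}
```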

    Signed-off-by: David Howells

    David Howells
     

17 Sep, 2016

1 commit

  • Add CONFIG_AF_RXRPC_IPV6 and make the IPv6 support code conditional on it.
    This is then made conditional on CONFIG_IPV6.

    Without this, the following can be seen:

    net/built-in.o: In function `rxrpc_init_peer':
    >> peer_object.c:(.text+0x18c3c8): undefined reference to `ip6_route_output_flags'

    Reported-by: kbuild test robot
    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     

14 Sep, 2016

2 commits

  • Add IPv6 support to AF_RXRPC. With this, AF_RXRPC sockets can be created:

    service = socket(AF_RXRPC, SOCK_DGRAM, PF_INET6);

    instead of:

    service = socket(AF_RXRPC, SOCK_DGRAM, PF_INET);

    The AFS filesystem doesn't support IPv6 at the moment, though, since that
    requires upgrades to some of the RPC calls.

    Note that a good portion of this patch is replacing "%pI4:%u" in print
    statements with "%pISpc" which is able to handle both protocols and print
    the port.

    Signed-off-by: David Howells

    David Howells
     
  • Peer records created for incoming connections weren't getting their hash
    key set. This meant that incoming calls wouldn't see more than one DATA
    packet - which is not a problem for AFS CM calls with small request data
    blobs.

    Signed-off-by: David Howells

    David Howells
     

08 Sep, 2016

1 commit

  • Rewrite the data and ack handling code such that:

    (1) Parsing of received ACK and ABORT packets and the distribution and the
    filing of DATA packets happens entirely within the data_ready context
    called from the UDP socket. This allows us to process and discard ACK
    and ABORT packets much more quickly (they're no longer stashed on a
    queue for a background thread to process).

    (2) We avoid calling skb_clone(), pskb_pull() and pskb_trim(). We instead
    keep track of the offset and length of the content of each packet in
    the sk_buff metadata. This means we don't do any allocation in the
    receive path.

    (3) Jumbo DATA packet parsing is now done in data_ready context. Rather
    than cloning the packet once for each subpacket and pulling/trimming
    it, we file the packet multiple times with an annotation for each
    indicating which subpacket is there. From that we can directly
    calculate the offset and length.

    (4) A call's receive queue can be accessed without taking locks (memory
    barriers do have to be used, though).

    (5) Incoming calls are set up from preallocated resources and immediately
    made live. They can then have packets queued upon them and ACKs
    generated. If insufficient resources exist, DATA packet #1 is given a
    BUSY reply and other DATA packets are discarded.

    (6) sk_buffs no longer take a ref on their parent call.

    To make this work, the following changes are made:

    (1) Each call's receive buffer is now a circular buffer of sk_buff
    pointers (rxtx_buffer) rather than a number of sk_buff_heads spread
    between the call and the socket. This permits each sk_buff to be in
    the buffer multiple times. The receive buffer is reused for the
    transmit buffer.

    (2) A circular buffer of annotations (rxtx_annotations) is kept parallel
    to the data buffer. Transmission phase annotations indicate whether a
    buffered packet has been ACK'd or not and whether it needs
    retransmission.

    Receive phase annotations indicate whether a slot holds a whole packet
    or a jumbo subpacket and, if the latter, which subpacket. They also
    note whether the packet has been decrypted in place.

    (3) DATA packet window tracking is much simplified. Each phase has just
    two numbers representing the window (rx_hard_ack/rx_top and
    tx_hard_ack/tx_top).

    The hard_ack number is the sequence number before the base of the window,
    representing the last packet the other side says it has consumed.
    hard_ack starts from 0 and the first packet is sequence number 1.

    The top number is the sequence number of the highest-numbered packet
    residing in the buffer. Packets between hard_ack+1 and top are
    soft-ACK'd to indicate they've been received, but not yet consumed.

    Four macros, before(), before_eq(), after() and after_eq() are added
    to compare sequence numbers within the window. This allows for the
    top of the window to wrap when the hard-ack sequence number gets close
    to the limit.

    Two flags, RXRPC_CALL_RX_LAST and RXRPC_CALL_TX_LAST, are added also
    to indicate when rx_top and tx_top point at the packets with the
    LAST_PACKET bit set, indicating the end of the phase.

    (4) Calls are queued on the socket 'receive queue' rather than packets.
    This means that we don't have to invent dummy packets to queue to
    indicate abnormal/terminal states and we don't have to keep metadata
    packets (such as ABORTs) around.

    (5) The offset and length of a (sub)packet's content are now passed to
    the verify_packet security op. This is currently expected to decrypt
    the packet in place and validate it.

    However, there's now nowhere to store the revised offset and length of
    the actual data within the decrypted blob (there may be a header and
    padding to skip) because an sk_buff may represent multiple packets, so
    a locate_data security op is added to retrieve these details from the
    sk_buff content when needed.

    (6) recvmsg() now has to handle jumbo subpackets, where each subpacket is
    individually secured and needs to be individually decrypted. The code
    to do this is broken out into rxrpc_recvmsg_data() and shared with the
    kernel API. It now iterates over the call's receive buffer rather
    than walking the socket receive queue.
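
    The window tracking in (3) and the circular buffer in (1) can be
    sketched as below. This is an illustrative approximation rather than the
    patch's code: the helpers are written as inline functions instead of
    macros, and the buffer size of 64 slots, the seq_t type and the
    rx_window struct are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t seq_t;

/*
 * Wrap-safe comparisons: the unsigned difference is reinterpreted as
 * signed, so ordering holds as long as the two numbers are less than
 * 2^31 apart (relies on the usual two's-complement conversion).
 */
static inline bool before(seq_t a, seq_t b)    { return (int32_t)(a - b) < 0; }
static inline bool before_eq(seq_t a, seq_t b) { return (int32_t)(a - b) <= 0; }
static inline bool after(seq_t a, seq_t b)     { return (int32_t)(a - b) > 0; }
static inline bool after_eq(seq_t a, seq_t b)  { return (int32_t)(a - b) >= 0; }

/* Assumed buffer geometry: a power of two so a mask gives the slot. */
#define RXTX_BUFF_SIZE 64u
#define RXTX_BUFF_MASK (RXTX_BUFF_SIZE - 1u)
static_assert((RXTX_BUFF_SIZE & RXTX_BUFF_MASK) == 0, "size must be a power of two");

/* Hypothetical holder for one phase's window. */
struct rx_window {
	seq_t hard_ack;  /* last packet consumed; window base is hard_ack + 1 */
	seq_t top;       /* highest-numbered packet residing in the buffer */
};

/* Slot in the circular buffer (and parallel annotation buffer) for seq. */
static inline unsigned int slot_for(seq_t seq)
{
	return seq & RXTX_BUFF_MASK;
}

/* Number of packets held in the buffer: received but not yet consumed. */
static inline uint32_t window_fill(const struct rx_window *w)
{
	return w->top - w->hard_ack;
}

/* True if seq lies in hard_ack+1 .. top inclusive. */
static inline bool seq_in_window(const struct rx_window *w, seq_t seq)
{
	return after(seq, w->hard_ack) && before_eq(seq, w->top);
}
```

    With hard_ack at 0xfffffffe and top at 2, the window holds four packets
    (0xffffffff, 0, 1 and 2), which a plain '<' comparison would get wrong.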

    Additional changes:

    (1) The timers are condensed to a single timer that is set for the soonest
    of three timeouts (delayed ACK generation, DATA retransmission and
    call lifespan).

    (2) Transmission of ACK and ABORT packets is effected immediately from
    process-context socket ops/kernel API calls that cause them instead of
    them being punted off to a background work item. The data_ready
    handler still has to defer to the background, though.

    (3) A shutdown op is added to the AF_RXRPC socket so that the AFS
    filesystem can shut down the socket and flush its own work items
    before closing the socket to deal with any in-progress service calls.
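
    The single-timer arrangement in (1) amounts to arming one timer for the
    earliest of the three deadlines and, when it fires, working out which
    deadlines have passed. A minimal sketch follows, assuming a hypothetical
    call_timeouts struct and a free-running 64-bit tick counter; none of
    these names come from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t tick_t;  /* assumed free-running tick counter */

/* Hypothetical per-call deadlines, one for each former timer. */
struct call_timeouts {
	tick_t ack_at;     /* delayed ACK generation */
	tick_t resend_at;  /* DATA retransmission */
	tick_t expire_at;  /* call lifespan */
};

#define EV_ACK     0x1u
#define EV_RESEND  0x2u
#define EV_EXPIRE  0x4u

/* Wrap-safe "a is earlier than b". */
static inline bool tick_before(tick_t a, tick_t b)
{
	return (int64_t)(a - b) < 0;
}

/* The one timer is set to the soonest of the three deadlines. */
static inline tick_t soonest_timeout(const struct call_timeouts *t)
{
	tick_t soonest;

	assert(t != NULL);
	soonest = t->ack_at;
	if (tick_before(t->resend_at, soonest))
		soonest = t->resend_at;
	if (tick_before(t->expire_at, soonest))
		soonest = t->expire_at;
	return soonest;
}

/* When the timer fires, report every deadline that has been reached. */
static inline unsigned int due_events(const struct call_timeouts *t, tick_t now)
{
	unsigned int events = 0;

	assert(t != NULL);
	if (!tick_before(now, t->ack_at))
		events |= EV_ACK;
	if (!tick_before(now, t->resend_at))
		events |= EV_RESEND;
	if (!tick_before(now, t->expire_at))
		events |= EV_EXPIRE;
	return events;
}
```

    After handling the due events, the caller would update the affected
    deadlines and re-arm the timer with soonest_timeout() again.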

    Future additional changes that will need to be considered:

    (1) Make sure that a call doesn't hog the front of the queue by receiving
    data from the network as fast as userspace is consuming it to the
    exclusion of other calls.

    (2) Transmit delayed ACKs from within recvmsg() when we've consumed
    sufficiently more packets to avoid the background work item needing to
    run.

    Signed-off-by: David Howells

    David Howells
     

30 Aug, 2016

1 commit

  • Provide a function so that kernel users, such as AFS, can ask for the peer
    address of a call:

    void rxrpc_kernel_get_peer(struct rxrpc_call *call,
    struct sockaddr_rxrpc *_srx);

    In the future the kernel service won't get sk_buffs to look inside.
    Further, this allows us to hide any canonicalisation inside AF_RXRPC for
    when IPv6 support is added.

    Also propagate this through to afs_find_server() and issue a warning if we
    can't handle the address family yet.

    Signed-off-by: David Howells

    David Howells
     

06 Jul, 2016

1 commit

  • Move to using RCU access to a peer's service connection tree when routing
    an incoming packet. This is done using a seqlock to trigger retrying of
    the tree walk if a change happened.

    Further, we no longer get a ref on the connection looked up in the
    data_ready handler unless we queue the connection's work item - and then
    only if the refcount > 0.
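
    Taking a ref "only if the refcount > 0" is the usual
    increment-unless-zero pattern: an object whose count has already hit
    zero is being torn down and must not be revived. The sketch below shows
    the idea with C11 atomics in userspace; the conn struct and function
    name are hypothetical and the kernel uses its own atomic primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical connection object with a usage count. */
struct conn {
	atomic_int usage;
};

/*
 * Try to take a reference. Fails if the count is already zero, in which
 * case the object is on its way to destruction and must be left alone.
 */
static inline bool conn_get_unless_zero(struct conn *c)
{
	int old;

	assert(c != NULL);
	old = atomic_load_explicit(&c->usage, memory_order_relaxed);
	while (old > 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak_explicit(&c->usage, &old, old + 1,
							  memory_order_acquire,
							  memory_order_relaxed))
			return true;
	}
	return false;
}
```

    A lookup done under RCU can therefore find a dying connection in the
    tree and simply skip queuing work for it.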

    Note that I'm avoiding the use of a hash table for service connections
    because each service connection is addressed by a 62-bit number
    (constructed from epoch and connection ID >> 2) that would allow the client
    to engage in bucket stuffing, given knowledge of the hash algorithm.
    Peers, however, are hashed as the network address is less controllable by
    the client. The total number of peers will also be limited in a future
    commit.
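
    To make the 62-bit figure concrete: a 32-bit epoch plus a 32-bit
    connection ID with its low two (channel) bits shifted out gives
    32 + 30 = 62 bits. The packing below is only one possible layout, chosen
    for illustration; the function names are hypothetical and the kernel's
    actual key representation may differ.

```c
#include <assert.h>
#include <stdint.h>

#define CID_SHIFT 2u  /* low two bits of the connection ID select the channel */

/* Pack epoch and (cid >> 2) into a single ordered key (illustrative layout). */
static inline uint64_t service_conn_key(uint32_t epoch, uint32_t cid)
{
	uint64_t key = ((uint64_t)epoch << 30) | (cid >> CID_SHIFT);

	assert(key < (UINT64_C(1) << 62));
	return key;
}

/* Three-way comparison, as a tree walk would use to pick a branch. */
static inline int service_conn_key_cmp(uint64_t a, uint64_t b)
{
	return (a > b) - (a < b);
}
```

    Since the client chooses both inputs, a hash table keyed on them could
    be pushed into one bucket; an ordered tree only depends on comparisons.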

    Signed-off-by: David Howells

    David Howells
     

22 Jun, 2016

2 commits

  • The rxrpc_transport struct is now redundant, given that the rxrpc_peer
    struct is now per peer port rather than per peer host, so get rid of it.

    Service connection lists are transferred to the rxrpc_peer struct, as is
    the conn_lock. Previous patches moved the client connection handling out
    of the rxrpc_transport struct and discarded the connection bundling code.

    Signed-off-by: David Howells

    David Howells
     
  • Hashing the peer key was introduced for AF_INET, but gcc
    warns about the rxrpc_peer_hash_key function returning uninitialized
    data for any other value of srx->transport.family:

    net/rxrpc/peer_object.c: In function 'rxrpc_peer_hash_key':
    net/rxrpc/peer_object.c:57:15: error: 'p' may be used uninitialized in this function [-Werror=maybe-uninitialized]

    Assuming that nothing else can be set here, this changes the
    function to just return zero in case of an unknown address
    family.
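
    The shape of the fix can be sketched in userspace as follows. Everything
    here is hypothetical (the struct, the family constant and the mixing
    step); it only illustrates returning zero from the default case so that
    the byte pointer is never read uninitialised.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FAMILY_INET 2  /* illustrative constant, not taken from a system header */

/* Hypothetical, simplified stand-in for the transport address. */
struct peer_addr {
	int family;
	uint16_t port;
	uint8_t addr[4];
};

static inline unsigned long peer_hash_key(unsigned long local_id,
					  const struct peer_addr *srx)
{
	unsigned long hash_key = local_id;
	const uint8_t *p;
	size_t size, i;

	assert(srx != NULL);
	switch (srx->family) {
	case FAMILY_INET:
		hash_key += srx->port;
		p = srx->addr;
		size = sizeof(srx->addr);
		break;
	default:
		/* Unknown family: there is nothing to hash, so return a
		 * fixed value instead of falling through with p unset. */
		return 0;
	}

	/* Simple multiplicative mix over the address bytes. */
	for (i = 0; i < size; i++)
		hash_key = hash_key * 31 + p[i];
	return hash_key;
}
```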

    Fixes: be6e6707f6ee ("rxrpc: Rework peer object handling to use hash table and RCU")
    Signed-off-by: Arnd Bergmann
    Signed-off-by: David Howells

    Arnd Bergmann
     

15 Jun, 2016

2 commits

  • Use the peer record to distribute network errors rather than the transport
    object (which I want to get rid of). An error from a particular peer
    terminates all calls on that peer.
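
    In outline, distributing an error through the peer record means walking
    the calls attached to that peer and completing each live one with the
    error. The sketch below uses a hypothetical singly linked list and
    struct names; locking and notification of waiters are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical call record, linked onto its peer. */
struct call {
	int error;                  /* 0 until the call fails */
	bool completed;             /* set once the call has terminated */
	struct call *next_on_peer;
};

/* Hypothetical peer record holding the list of its calls. */
struct peer {
	struct call *calls;
};

/* Terminate every live call on the peer; returns how many were hit. */
static inline unsigned int peer_distribute_error(struct peer *peer, int error)
{
	unsigned int count = 0;
	struct call *c;

	assert(peer != NULL);
	for (c = peer->calls; c != NULL; c = c->next_on_peer) {
		if (c->completed)
			continue;  /* already finished; keep its own result */
		c->error = error;
		c->completed = true;
		count++;
	}
	return count;
}
```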

    For future consideration:

    (1) For ICMP-induced errors it might be worth trying to extract the RxRPC
    header from the offending packet, if one is returned attached to the
    ICMP packet, to better direct the error.

    This may be overkill, though, since an ICMP packet would be expected
    to be relating to the destination port, machine or network. RxRPC
    ABORT and BUSY packets give notice at RxRPC level.

    (2) To also abort connection-level communications (such as CHALLENGE
    packets) where indicated by an error - but that requires some revamping
    of the connection event handling first.

    Signed-off-by: David Howells

    David Howells
     
  • Rework peer object handling to use a hash table instead of a flat list and
    to use RCU. Peer objects are no longer destroyed by passing them to a
    workqueue to process, but rather are just passed to the RCU garbage
    collector as kfree'able objects.

    The hash function uses the local endpoint plus all the components of the
    remote address, except for the RxRPC service ID. Peers thus represent a
    UDP port on the remote machine as contacted by a UDP port on this machine.
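
    Put another way, two addresses that differ only in RxRPC service ID
    resolve to the same peer. A minimal sketch of such a comparison, using
    hypothetical simplified structs (IPv4 only) rather than the kernel's
    types:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified remote address. */
struct rx_addr {
	uint16_t service;  /* RxRPC service ID: deliberately not compared */
	uint16_t port;     /* remote UDP port */
	uint32_t ipv4;     /* remote host address */
};

/* Hypothetical peer record: a local endpoint plus a remote UDP port. */
struct peer_rec {
	unsigned int local_id;  /* stands in for the local endpoint */
	struct rx_addr remote;
};

/* True if the peer represents this local endpoint / remote port pair. */
static inline bool peer_matches(const struct peer_rec *peer,
				unsigned int local_id,
				const struct rx_addr *addr)
{
	assert(peer != NULL && addr != NULL);
	return peer->local_id == local_id &&
	       peer->remote.port == addr->port &&
	       peer->remote.ipv4 == addr->ipv4;
}
```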

    The RCU read lock is used to handle non-creating lookups so that they can
    be called from bottom half context in the sk_error_report handler without
    having to lock the hash table against modification.
    rxrpc_lookup_peer_rcu() *does* take a reference on the peer object
    because, in the future, the peer will be passed to a work item for error
    distribution in the error_report path and this function will cease being
    used in the data_ready path.

    Creating lookups are done under spinlock rather than mutex as they might be
    set up due to an external stimulus if the local endpoint is a server.

    Captured network error messages (ICMP) are handled with respect to this
    struct and MTU size and RTT are cached here.

    Signed-off-by: David Howells

    David Howells
     

13 Jun, 2016

1 commit

  • Rename files matching net/rxrpc/ar-*.c to get rid of the "ar-" prefix.
    This will aid splitting those files by making it easier to come up with new
    names.

    Note that not all files are simply renamed from ar-X.c to X.c. The
    following exceptions are made:

    (*) ar-call.c -> call_object.c
    ar-ack.c -> call_event.c

    call_object.c is going to contain the core of the call object
    handling. Call event handling is all going to be in call_event.c.

    (*) ar-accept.c -> call_accept.c

    Incoming call handling is going to be here.

    (*) ar-connection.c -> conn_object.c
    ar-connevent.c -> conn_event.c

    The former file is going to have the basic connection object handling,
    but there will likely be some differentiation between client
    connections and service connections in additional files later. The
    latter file will have all the connection-level event handling.

    (*) ar-local.c -> local_object.c

    This will have the local endpoint object handling code. The local
    endpoint event handling code will later be split out into
    local_event.c.

    (*) ar-peer.c -> peer_object.c

    This will have the peer endpoint object handling code. Peer event
    handling code will be placed in peer_event.c (for the moment, there is
    none).

    (*) ar-error.c -> peer_event.c

    This will become the peer event handling code, though for the moment
    it's actually driven from the local endpoint's perspective.

    Note that I haven't renamed ar-transport.c to transport_object.c as the
    intention is to delete it when the rxrpc_transport struct is excised.

    The only file that actually has its contents changed is net/rxrpc/Makefile.

    net/rxrpc/ar-internal.h will need its section marker comments updating, but
    I'll do that in a separate patch to make it easier for git to follow the
    history across the rename. I may also want to rename ar-internal.h at some
    point - but that would mean updating all the #includes and I'd rather do
    that in a separate step.

    Signed-off-by: David Howells <dhowells@redhat.com>

    David Howells