30 May, 2018

1 commit

  • [ Upstream commit 57b0c9d49b94bbeb53649b7fbd264603c1ebd585 ]

    If a call-level abort is received for the previous call to complete on a
    connection channel, then that abort is queued for the connection processor
    to handle. Unfortunately, the connection processor then assumes without
    checking that the abort is connection-level (ie. callNumber is 0) and
    distributes it over all active calls on that connection, thereby
    incorrectly aborting them.

    Fix this by discarding aborts aimed at a completed call.

    Further, discard all packets aimed at a call that's complete if there's
    currently an active call on a channel, since the DATA packets associated
    with the new call automatically terminate the old call.
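
    The discard rule described above can be sketched in userspace C as
    follows. This is only an illustration of the decision, not the kernel
    code: the struct, enum and function names are hypothetical, and the
    channel is assumed to remember the number of its most recently completed
    call.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical verdicts for a packet that arrives on a connection channel. */
enum sketch_verdict { SKETCH_PROCESS, SKETCH_DISCARD };

/* Hypothetical, simplified view of one connection channel. */
struct sketch_channel {
	uint32_t last_call;	/* number of the most recently completed call */
	bool has_active_call;	/* a newer call is currently live on the channel */
};

/*
 * Decide what to do with a packet whose callNumber may refer to the call
 * that has already completed on this channel.
 */
static enum sketch_verdict
sketch_completed_call_packet(const struct sketch_channel *chan,
			     uint32_t call_number, bool is_abort)
{
	if (call_number != chan->last_call)
		return SKETCH_PROCESS;	/* not aimed at the completed call */
	if (is_abort)
		return SKETCH_DISCARD;	/* never treat it as connection-level */
	if (chan->has_active_call)
		return SKETCH_DISCARD;	/* the new call already ended the old one */
	return SKETCH_PROCESS;		/* e.g. terminal ACK/ABORT retransmission */
}
```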

    Fixes: 18bfeba50dfd ("rxrpc: Perform terminal call ACK/ABORT retransmission from conn processor")
    Reported-by: Marc Dionne
    Signed-off-by: David Howells
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    David Howells
     

05 Jun, 2017

1 commit

  • Make it possible for a client to use AuriStor's service upgrade facility.

    The client does this by adding an RXRPC_UPGRADE_SERVICE control message to
    the first sendmsg() of a call. This takes no parameters.

    When recvmsg() starts returning data from the call, the service ID field in
    the returned msg_name will reflect the result of the upgrade attempt. If
    the upgrade was ignored, srx_service will match what was set in the
    sendmsg(); if the upgrade happened the srx_service will be altered to
    indicate the service the server upgraded to.

    Note that:

    (1) The choice of upgrade service is up to the server

    (2) Further client calls to the same server that would share a connection
    are blocked if an upgrade probe is in progress.

    (3) This should only be used to probe the service. Clients should then
    use the returned service ID in all subsequent communications with that
    server (and not set the upgrade). Note that the kernel will not
    retain this information should the connection expire from its cache.

    (4) If a server that supports upgrading is replaced by one that doesn't,
    whilst a connection is live, and if the replacement is running, say,
    OpenAFS 1.6.4 or older or an older IBM AFS, then the replacement
    server will not respond to packets sent to the upgraded connection.

    At this point, calls will time out and the server must be reprobed.
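
    As a rough illustration of the client side, the sketch below attaches a
    zero-length RXRPC_UPGRADE_SERVICE control message to a struct msghdr.
    The helper name is hypothetical. The numeric values are assumptions
    copied from the kernel's UAPI headers as best I recall them and should be
    checked against linux/rxrpc.h and linux/socket.h on the target system. A
    real first sendmsg() would normally carry other control messages too
    (such as the user call ID), which are omitted here.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SOL_RXRPC
#define SOL_RXRPC 272			/* assumed value; see linux/socket.h */
#endif
/* Assumed value of RXRPC_UPGRADE_SERVICE; verify against linux/rxrpc.h. */
#define SKETCH_RXRPC_UPGRADE_SERVICE 11

/*
 * Point msg at buf and fill in a single parameterless upgrade-request
 * control message.  Returns the control length used, or 0 if buf is too
 * small.  buf must be suitably aligned for struct cmsghdr.
 */
static size_t sketch_add_upgrade_cmsg(struct msghdr *msg, void *buf,
				      size_t buflen)
{
	struct cmsghdr *cmsg;
	size_t need = CMSG_SPACE(0);

	if (buflen < need)
		return 0;

	memset(buf, 0, need);
	msg->msg_control = buf;
	msg->msg_controllen = need;

	cmsg = CMSG_FIRSTHDR(msg);
	cmsg->cmsg_level = SOL_RXRPC;
	cmsg->cmsg_type = SKETCH_RXRPC_UPGRADE_SERVICE;
	cmsg->cmsg_len = CMSG_LEN(0);	/* the message takes no parameters */
	return need;
}
```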

    Signed-off-by: David Howells

    David Howells
     

06 Apr, 2017

4 commits

  • Add a tracepoint (rxrpc_rx_rwind_change) to log changes in a call's receive
    window size as imposed by the peer through an ACK packet.

    Signed-off-by: David Howells

    David Howells
     
  • Add a tracepoint (rxrpc_rx_abort) to record received aborts.

    Signed-off-by: David Howells

    David Howells
     
  • Add a tracepoint (rxrpc_rx_proto) to record protocol errors in received
    packets. The following changes are made:

    (1) Add a function, __rxrpc_abort_eproto(), to note a protocol error on a
    call and mark the call aborted. This is wrapped by
    rxrpc_abort_eproto() that makes the why string usable in trace.

    (2) Add trace_rxrpc_rx_proto() or rxrpc_abort_eproto() to protocol error
    generation points, replacing rxrpc_abort_call() with the latter.

    (3) Only send an abort packet in rxkad_verify_packet*() if we actually
    managed to abort the call.

    Note that a trace event is also emitted if a kernel user (e.g. afs) tries
    to send data through a call when it's not in the transmission phase, though
    it's not technically a receive event.

    Signed-off-by: David Howells

    David Howells
     
  • Use negative error codes in struct rxrpc_call::error because that's what
    the kernel normally deals with and to make the code consistent. We only
    turn them positive when transcribing into a cmsg for userspace recvmsg.

    Signed-off-by: David Howells

    David Howells
     

11 Mar, 2017

1 commit

  • The RxRPC ACK packet may contain an extension that includes the peer's
    current Rx window size for this call. We adjust the local Tx window size
    to match. However, the transmitter can stall if the receive window is
    reduced to 0 by the peer and then reopened.

    This is because the normal way that the transmitter is re-energised is by
    dropping something out of our Tx queue and thus making space. When a
    single gap is made, the transmitter is woken up. However, because there's
    nothing in the Tx queue at this point, this doesn't happen.

    To fix this, perform a wake_up() any time we see the peer's Rx window size
    increasing.

    The observable symptom is that calls start failing on ETIMEDOUT and the
    following:

    kAFS: SERVER DEAD state=-62

    appears in dmesg.
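
    A minimal userspace sketch of the fix, assuming a simplified call
    structure: the field and function names are hypothetical, and a boolean
    flag stands in for the kernel's wake_up() on the transmitter's wait
    queue.

```c
#include <stdbool.h>

/* Hypothetical, simplified per-call transmit state. */
struct sketch_call {
	unsigned int tx_winsize;	/* local Tx window, mirrors the peer's Rx window */
	bool tx_woken;			/* stands in for wake_up() on the Tx waitqueue */
};

/* Apply the Rx window size advertised by the peer in an ACK packet. */
static void sketch_update_rwind(struct sketch_call *call, unsigned int rwind)
{
	bool grew = rwind > call->tx_winsize;

	call->tx_winsize = rwind;

	/*
	 * Wake the transmitter whenever the window grows.  Without this, a
	 * window reopened from 0 with an empty Tx queue never wakes it,
	 * because nothing is being dropped from the queue to make space.
	 */
	if (grew)
		call->tx_woken = true;
}
```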

    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     

08 Mar, 2017

1 commit


02 Mar, 2017

1 commit

  • All the routines by which rxrpc is accessed from the outside are serialised
    by means of the socket lock (sendmsg, recvmsg, bind,
    rxrpc_kernel_begin_call(), ...) and this presents a problem:

    (1) If a number of calls on the same socket are in the process of
    connecting to the same peer, a maximum of four concurrent live calls
    are permitted before further calls need to wait for a slot.

    (2) If a call is waiting for a slot, it is deep inside sendmsg() or
    rxrpc_kernel_begin_call() and the entry function is holding the socket
    lock.

    (3) sendmsg() and recvmsg() or the in-kernel equivalents are prevented
    from servicing the other calls as they need to take the socket lock to
    do so.

    (4) The socket is stuck until a call is aborted and makes its slot
    available to the waiter.

    Fix this by:

    (1) Provide each call with a mutex ('user_mutex') that arbitrates access
    by the users of rxrpc separately for each specific call.

    (2) Make rxrpc_sendmsg() and rxrpc_recvmsg() unlock the socket as soon as
    they've got a call and taken its mutex.

    Note that I'm returning EWOULDBLOCK from recvmsg() if MSG_DONTWAIT is
    set but someone else has the lock. Should I instead only return
    EWOULDBLOCK if there's nothing currently to be done on a socket, and
    sleep in this particular instance because there is something to be
    done, but we appear to be blocked by the interrupt handler doing its
    ping?

    (3) Make rxrpc_new_client_call() unlock the socket after allocating a new
    call, locking its user mutex and adding it to the socket's call tree.
    The call is returned locked so that sendmsg() can add data to it
    immediately.

    From the moment the call is in the socket tree, it is subject to
    access by sendmsg() and recvmsg() - even if it isn't connected yet.

    (4) Lock new service calls in the UDP data_ready handler (in
    rxrpc_new_incoming_call()) because they may already be in the socket's
    tree and the data_ready handler makes them live immediately if a user
    ID has already been preassigned.

    Note that the new call is locked before any notifications are sent
    that it is live, so doing mutex_trylock() *ought* to always succeed.
    Userspace is prevented from doing sendmsg() on calls that are in a
    too-early state in rxrpc_do_sendmsg().

    (5) Make rxrpc_new_incoming_call() return the call with the user mutex
    held so that a ping can be scheduled immediately under it.

    Note that it might be worth moving the ping call into
    rxrpc_new_incoming_call() and then we can drop the mutex there.

    (6) Make rxrpc_accept_call() take the lock on the call it is accepting and
    release the socket after adding the call to the socket's tree. This
    is slightly tricky as we've dequeued the call by that point and have
    to requeue it.

    Note that requeuing emits a trace event.

    (7) Make rxrpc_kernel_send_data() and rxrpc_kernel_recv_data() take the
    new mutex immediately and not bother with the socket mutex at all.

    This patch has the nice bonus that calls on the same socket are now to some
    extent parallelisable.

    Note that we might want to move rxrpc_service_prealloc() calls out from the
    socket lock and give it its own lock, so that we don't hang progress in
    other calls because we're waiting for the allocator.

    We probably also want to avoid calling rxrpc_notify_socket() from within
    the socket lock (rxrpc_accept_call()).
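
    The lock hand-off in fix (2) can be sketched with POSIX mutexes. This is
    a userspace analogy only: the structures and the lookup are hypothetical
    simplifications, with a pthread mutex standing in for both the socket
    lock and the per-call user_mutex.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical, simplified call and socket. */
struct sketch_call {
	pthread_mutex_t user_mutex;	/* arbitrates user access to this call */
	int id;
};

struct sketch_sock {
	pthread_mutex_t lock;		/* stands in for the socket lock */
	struct sketch_call *calls;
	size_t nr_calls;
};

/*
 * Look the call up under the socket lock, take the call's own mutex, then
 * drop the socket lock at once so that other calls on the same socket can
 * still be serviced.  The call is returned locked, or NULL if not found.
 */
static struct sketch_call *sketch_get_call_locked(struct sketch_sock *sk, int id)
{
	struct sketch_call *call = NULL;

	pthread_mutex_lock(&sk->lock);
	for (size_t i = 0; i < sk->nr_calls; i++) {
		if (sk->calls[i].id == id) {
			call = &sk->calls[i];
			break;
		}
	}
	if (call)
		pthread_mutex_lock(&call->user_mutex);
	pthread_mutex_unlock(&sk->lock);
	return call;
}
```

    A caller would do its sendmsg()-style work on the returned call and then
    unlock call->user_mutex. The socket lock is no longer held during that
    work.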

    Signed-off-by: David Howells
    Tested-by: Marc Dionne
    Signed-off-by: David S. Miller

    David Howells
     

05 Jan, 2017

2 commits

  • Add the following extra tracing information:

    (1) Modify the rxrpc_transmit tracepoint to record the Tx window size as
    this is varied by the slow-start algorithm.

    (2) Modify the rxrpc_rx_ack tracepoint to record more information from
    received ACK packets.

    (3) Add an rxrpc_rx_data tracepoint to record the information in DATA
    packets.

    (4) Add an rxrpc_disconnect_call tracepoint to record call disconnection,
    including the reason the call was disconnected.

    (5) Add an rxrpc_improper_term tracepoint to record implicit termination
    of a call by a client that starts a new call on a particular
    connection channel without first transmitting the final ACK for the
    previous call.

    Signed-off-by: David Howells

    David Howells
     
  • Fix the way enum values are translated into strings in AF_RXRPC
    tracepoints. The problem with just doing a lookup in a normal flat array
    of strings or chars is that external tracing infrastructure can't find it.
    Rather, TRACE_DEFINE_ENUM must be used.

    Also sort the enums and string tables to make it easier to keep them in
    order so that a future patch to __print_symbolic() can be optimised to try
    a direct lookup into the table first before iterating over it.

    A couple of _proto() macro calls are removed because they referred to tables
    that got moved to the tracing infrastructure. The relevant data can be
    found by way of tracing.

    Signed-off-by: David Howells

    David Howells
     

08 Nov, 2016

1 commit

  • A new argument is added to __skb_recv_datagram() to provide an explicit
    skb destructor, invoked under the receive queue lock.

    The UDP protocol uses this argument to perform memory reclaiming on
    dequeue, so that it no longer sets skb->destructor. Instead, explicit
    memory reclaiming is performed at close() time and when skbs are removed
    from the receive queue.

    In-kernel UDP protocol users now need to call a skb_recv_udp() variant
    instead of skb_recv_datagram() to properly perform memory accounting on
    dequeue.

    Overall, this allows the receive queue lock to be acquired only once on
    dequeue.

    Tested using pktgen with a random src port and 64-byte packets at wire
    speed on a 10G link as the sender, and udp_sink as the receiver, using an
    l4 tuple rxhash to stress the contention, and one or more udp_sink
    instances with reuseport.

    nr sinks    vanilla    patched
           1        440        560
           3       2150       2300
           6       3650       3800
           9       4450       4600
          12       6250       6450

    v1 -> v2:
    - do rmem and allocated memory scheduling under the receive lock
    - do bulk scheduling in first_packet_length() and in udp_destruct_sock()
    - avoid the typedef for the dequeue callback

    Suggested-by: Eric Dumazet
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: Paolo Abeni
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Paolo Abeni
     

06 Oct, 2016

3 commits

  • OpenAFS doesn't always correctly terminate client calls that it makes -
    this includes calls the OpenAFS servers make to the cache manager service.
    It should end the client call with either:

    (1) An ACK that has firstPacket set to one greater than the seq number of
    the reply DATA packet with the LAST_PACKET flag set (thereby
    hard-ACK'ing all packets). nAcks should be 0 and acks[] should be
    empty (ie. no soft-ACKs).

    (2) An ACKALL packet.

    OpenAFS, though, may send an ACK packet with firstPacket set to the last
    seq number or less and soft-ACKs listed for all packets up to and including
    the last DATA packet.

    The transmitter, however, is obliged to keep the call live and the
    soft-ACK'd DATA packets around until they're hard-ACK'd as the receiver is
    permitted to drop any merely soft-ACK'd packet and request retransmission
    by sending an ACK packet with a NACK in it.

    Further, OpenAFS will also terminate a client call by beginning the next
    client call on the same connection channel. This implicitly completes the
    previous call.

    This patch handles implicit ACK of a call on a channel by the reception of
    the first packet of the next call on that channel.

    If another call doesn't come along to implicitly ACK a call, then we have
    to time the call out. There are some bugs there that will be addressed in
    subsequent patches.

    Signed-off-by: David Howells

    David Howells
     
  • Separate the output of PING ACKs from the output of other sorts of ACK so
    that if we receive a PING ACK and schedule transmission of a PING RESPONSE
    ACK, the response doesn't get cancelled by a PING ACK we happen to be
    scheduling transmission of at the same time.

    If a PING RESPONSE gets lost, the other side might just sit there waiting
    for it and refuse to proceed otherwise.

    Signed-off-by: David Howells

    David Howells
     
  • When a reply is deemed lost, we send a ping to find out whether the other
    end received all the request data packets we sent. This should be limited
    to client calls; we shouldn't do it on service calls.

    Signed-off-by: David Howells

    David Howells
     

30 Sep, 2016

5 commits


25 Sep, 2016

5 commits

  • Implement RxRPC slow-start, which is similar to RFC 5681 for TCP. A
    tracepoint is added to log the state of the congestion management algorithm
    and the decisions it makes.

    Notes:

    (1) Since we send fixed-size DATA packets (apart from the final packet in
    each phase), counters and calculations are in terms of packets rather
    than bytes.

    (2) The ACK packet carries the equivalent of TCP SACK.

    (3) The FLIGHT_SIZE calculation in RFC 5681 doesn't seem particularly
    suited to SACK of a small number of packets. It seems that, almost
    inevitably, by the time three 'duplicate' ACKs have been seen, we have
    narrowed the loss down to one or two missing packets, and the
    FLIGHT_SIZE calculation ends up as 2.

    (4) In rxrpc_resend(), if there was no data that apparently needed
    retransmission, we transmit a PING ACK to ask the peer to tell us what
    its Rx window state is.
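
    To make note (1) concrete, here is a simplified RFC 5681-style window
    update counted in packets rather than bytes. It is a sketch of the
    general technique, not the rxrpc implementation: the structure, the
    function names and the exact update rules are assumptions chosen for
    illustration.

```c
/* Hypothetical congestion state, all counted in packets. */
struct sketch_cong {
	unsigned int cwnd;		/* congestion window */
	unsigned int ssthresh;		/* slow-start threshold */
	unsigned int cumul_acks;	/* newly ACK'd packets since last cwnd bump */
};

/* Called for each ACK packet that acknowledges nr_new_acks new packets. */
static void sketch_on_ack(struct sketch_cong *c, unsigned int nr_new_acks)
{
	if (nr_new_acks == 0)
		return;

	if (c->cwnd < c->ssthresh) {
		/* Slow start: grow by at most one packet per ACK. */
		c->cwnd++;
		return;
	}

	/* Congestion avoidance: grow by one packet per window's worth of ACKs. */
	c->cumul_acks += nr_new_acks;
	if (c->cumul_acks >= c->cwnd) {
		c->cumul_acks = 0;
		c->cwnd++;
	}
}

/* Called when loss is inferred from duplicate ACKs (fast retransmit). */
static void sketch_on_fast_retransmit(struct sketch_cong *c,
				      unsigned int flight_size)
{
	unsigned int half = flight_size / 2;

	c->ssthresh = half > 2 ? half : 2;	/* max(FlightSize / 2, 2) */
	c->cwnd = c->ssthresh + 3;		/* inflate by the three dup ACKs */
	c->cumul_acks = 0;
}
```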

    Signed-off-by: David Howells

    David Howells
     
  • If we've sent all the request data in a client call but haven't seen any
    sign of the reply data yet, schedule an ACK to be sent to the server to
    find out if the reply data got lost.

    If the server hasn't yet hard-ACK'd the request data, we send a PING ACK to
    demand a response to find out whether we need to retransmit.

    If the server says it has received all of the data, we send an IDLE ACK to
    tell the server that we haven't received anything in the receive phase as
    yet.

    To make this work, a non-immediate PING ACK must carry a delay. I've chosen
    the same as the IDLE ACK for the moment.

    Signed-off-by: David Howells

    David Howells
     
  • Generate a summary of the Tx buffer packet state when an ACK is received
    for use in a later patch that does congestion management.

    Signed-off-by: David Howells

    David Howells
     
  • Clear the ACK reason, ACK timer and resend timer when entering the client
    reply phase when the first DATA packet is received. New ACKs will be
    proposed once the data is queued.

    The resend timer is no longer relevant and we need to cancel ACKs scheduled
    to probe for a lost reply.

    Signed-off-by: David Howells

    David Howells
     
  • Send an immediate ACK if we fill in a hole in the buffer left by an
    out-of-sequence packet. This may allow the congestion management in the peer
    to avoid a retransmission if packets got reordered on the wire.

    Signed-off-by: David Howells

    David Howells
     

23 Sep, 2016

5 commits

  • Add a tracepoint to log proposed ACKs, including whether the proposal is
    used to update a pending ACK or is discarded in favour of an earlier,
    higher priority ACK.

    Whilst we're at it, get rid of the rxrpc_acks() function and access the
    name array directly. We do, however, need to validate the ACK reason
    number given to trace_rxrpc_rx_ack() to make sure we don't overrun the
    array.

    Signed-off-by: David Howells

    David Howells
     
  • Add a tracepoint to log received packets that get discarded due to Rx
    packet loss.

    Signed-off-by: David Howells

    David Howells
     
  • When the last packet of data to be transmitted on a call is queued, tx_top
    is set and then the RXRPC_CALL_TX_LAST flag is set. Unfortunately, this
    leaves a race in the ACK processing side of things because the flag affects
    the interpretation of tx_top and also allows us to start receiving reply
    data before we've finished transmitting.

    To fix this, make the following changes:

    (1) rxrpc_queue_packet() now sets a marker in the annotation buffer
    instead of setting the RXRPC_CALL_TX_LAST flag.

    (2) rxrpc_rotate_tx_window() detects the marker and sets the flag in the
    same context as the routines that use it.

    (3) rxrpc_end_tx_phase() is simplified to just shift the call state.
    The Tx window must have been rotated before calling to discard the
    last packet.

    (4) rxrpc_receiving_reply() is added to handle the arrival of the first
    DATA packet of a reply to a client call (which is an implicit ACK of
    the Tx phase).

    (5) The last part of rxrpc_input_ack() is reordered to perform Tx
    rotation, then soft-ACK application and then to end the phase if we've
    rotated the last packet. In the event of a terminal ACK, the soft-ACK
    application will be skipped as nAcks should be 0.

    (6) rxrpc_input_ackall() now has to rotate as well as ending the phase.

    In addition:

    (7) Alter the transmit tracepoint to log the rotation of the last packet.

    (8) Remove the no-longer relevant queue_reqack tracepoint note. The
    ACK-REQUESTED packet header flag is now set as needed when we actually
    transmit the packet and may vary by retransmission.

    Signed-off-by: David Howells

    David Howells
     
  • When rxrpc_input_soft_acks() is parsing the soft-ACKs from an ACK packet,
    it updates the Tx packet annotations in the annotation buffer. If a
    soft-ACK is an ACK, then we overwrite unack'd, nak'd or to-be-retransmitted
    states and that is fine; but if the soft-ACK is a NACK, we overwrite the
    to-be-retransmitted state with a nak, which is not fine.

    Instead, we need to let any scheduled retransmission stand if the packet
    was NAK'd.

    Note that we don't reissue a resend if the annotation is in the
    to-be-retransmitted state because someone else must've scheduled the
    resend already.
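
    The intended annotation update can be sketched as below. The enum and
    function names are hypothetical stand-ins for the kernel's annotation
    values, and the choice to report a resend only when a slot newly becomes
    NAK'd is an assumption made for this illustration.

```c
#include <stdbool.h>

/* Hypothetical Tx annotation states for one packet slot. */
enum sketch_anno {
	SKETCH_UNACK,
	SKETCH_ACK,
	SKETCH_NAK,
	SKETCH_RETRANS,		/* to-be-retransmitted */
};

/*
 * Apply one soft-ACK entry to a slot.  Returns true if the caller should
 * schedule a resend because of this entry.
 */
static bool sketch_apply_soft_ack(enum sketch_anno *slot, bool is_ack)
{
	if (is_ack) {
		*slot = SKETCH_ACK;	/* an ACK may overwrite any state */
		return false;
	}

	if (*slot == SKETCH_RETRANS)
		return false;		/* let the scheduled resend stand */
	if (*slot == SKETCH_NAK)
		return false;		/* already NAK'd; nothing new to do */

	*slot = SKETCH_NAK;
	return true;
}
```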

    Signed-off-by: David Howells

    David Howells
     
  • before_eq() and friends should be used to compare serial numbers (when not
    checking for (non)equality) rather than casting to int, subtracting and
    checking the result.
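
    For readers unfamiliar with these helpers, the following userspace
    stand-ins show the wrap-safe comparison they encapsulate. The sketch_
    names are hypothetical; the technique is the usual one of interpreting
    the unsigned 32-bit difference as a signed value.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if serial a comes strictly before serial b, allowing for wraparound. */
static inline bool sketch_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/* True if serial a comes before or is equal to serial b. */
static inline bool sketch_before_eq(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) <= 0;
}

/* True if serial a comes strictly after serial b. */
static inline bool sketch_after(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) > 0;
}
```

    A plain `a < b` gives the wrong answer once the counter wraps: 0xfffffff0
    is numerically larger than 0x10, yet it is the earlier serial number.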

    Signed-off-by: David Howells

    David Howells
     

22 Sep, 2016

4 commits

  • We don't want to send a PING ACK for every new incoming call as that just
    adds to the network traffic. Instead, we send a PING ACK to the first
    three that we receive and then once per second thereafter.

    This could probably be made adjustable in future.
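
    The policy can be sketched as a small rate limiter. The structure, the
    names and the choice to keep the counters per peer are assumptions for
    illustration; times are in nanoseconds from an arbitrary monotonic clock.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_FREE_PINGS	3		/* pings sent unconditionally */
#define SKETCH_PING_GAP_NS	1000000000ULL	/* then at most one per second */

/* Hypothetical ping bookkeeping. */
struct sketch_peer {
	unsigned int pings_sent;	/* pings sent for new incoming calls */
	uint64_t last_ping_ns;		/* when the last such ping was sent */
};

/* Decide whether a new incoming call should trigger a PING ACK. */
static bool sketch_should_ping(struct sketch_peer *peer, uint64_t now_ns)
{
	if (peer->pings_sent >= SKETCH_FREE_PINGS &&
	    now_ns - peer->last_ping_ns < SKETCH_PING_GAP_NS)
		return false;

	peer->pings_sent++;
	peer->last_ping_ns = now_ns;
	return true;
}
```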

    Signed-off-by: David Howells

    David Howells
     
  • In addition to sending a PING ACK to gain RTT data, we can set the
    RXRPC_REQUEST_ACK flag on a DATA packet and get a REQUESTED-ACK ACK. The
    ACK packet contains the serial number of the packet it is in response to,
    so we can look through the Tx buffer for a matching DATA packet.

    This requires that the data packets be stamped with the time of
    transmission as a ktime rather than having the resend_at time in jiffies.

    This further requires the resend code to do the resend determination in
    ktimes and convert to jiffies to set the timer.
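
    The serial-number lookup might look something like the sketch below. The
    slot layout and function name are hypothetical, and signed 64-bit
    nanosecond counts stand in for ktime_t.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record kept for each transmitted DATA packet. */
struct sketch_tx_slot {
	uint32_t serial;	/* serial number the packet was sent with */
	int64_t sent_ns;	/* transmission timestamp */
};

/*
 * Find the DATA packet that a REQUESTED-ACK ACK is responding to and return
 * the round-trip time in nanoseconds, or -1 if no packet matches.
 */
static int64_t sketch_rtt_from_ack(const struct sketch_tx_slot *buf, size_t n,
				   uint32_t acked_serial, int64_t ack_rx_ns)
{
	for (size_t i = 0; i < n; i++) {
		if (buf[i].serial == acked_serial)
			return ack_rx_ns - buf[i].sent_ns;
	}
	return -1;
}
```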

    Signed-off-by: David Howells

    David Howells
     
  • Send a PING ACK packet to the peer when we get a new incoming call from a
    peer we don't have a record for. The PING RESPONSE ACK packet will tell us
    the following about the peer:

    (1) its receive window size

    (2) its MTU sizes

    (3) its support for jumbo DATA packets

    (4) if it supports slow start (similar to RFC 5681)

    (5) an estimate of the RTT

    This is necessary because the peer won't normally send us an ACK until it
    gets to the Rx phase and we send it a packet, but we would like to know
    some of this information before we start sending packets.

    A pair of tracepoints are added so that RTT determination can be observed.

    Signed-off-by: David Howells

    David Howells
     
  • Add a Tx-phase annotation for packet buffers to indicate that a buffer has
    already been retransmitted. This will be used by future congestion
    management. Re-retransmissions of a packet don't affect the congestion
    window management in the same way as initial retransmissions.

    Signed-off-by: David Howells

    David Howells
     

17 Sep, 2016

6 commits