04 Nov, 2018

1 commit

  • [ Upstream commit 647530924f47c93db472ee3cf43b7ef1425581b6 ]

    Fix connection-level abort handling to cache the abort and error codes
    properly so that a new incoming call can be properly aborted if it races
    with the parent connection being aborted by another CPU.

    The abort_code and error parameters can then be dropped from
    rxrpc_abort_calls().

    Fixes: f5c17aaeb2ae ("rxrpc: Calls should only have one terminal state")
    Signed-off-by: David Howells
    Signed-off-by: Sasha Levin

    David Howells
     

04 Feb, 2018

2 commits

  • [ Upstream commit f859ab61875978eeaa539740ff7f7d91f5d60006 ]

    Make RxRPC service endpoints expire like they're supposed to by the
    following means:

    (1) Mark dead rxrpc_net structs (with ->live) rather than twiddling the
    global service conn timeout, otherwise the first rxrpc_net struct to
    die will cause connections on all others to expire immediately from
    then on.

    (2) Mark local service endpoints for which the socket has been closed
    (->service_closed) so that the expiration timeout can be much
    shortened for service and client connections going through that
    endpoint.

    (3) rxrpc_put_service_conn() needs to schedule the reaper when the usage
    count reaches 1, not 0, as idle conns have a 1 count.

    (4) The accumulator for the earliest time we might want to schedule for
    should be initialised to jiffies + MAX_JIFFY_OFFSET, not ULONG_MAX as
    the comparison functions use signed arithmetic.

    (5) Simplify the expiration handling, adding the expiration value to the
    idle timestamp each time rather than keeping track of the time in the
    past before which the idle timestamp must go to be expired. This is
    much easier to read.

    (6) Ignore the timeouts if the net namespace is dead.

    (7) Restart the service reaper work item rather than the client reaper.
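
    Point (4) can be illustrated with a small userspace model. This is only a
    sketch: model_time_before() mirrors the signed-difference comparison that
    the kernel's time_before() performs, MODEL_MAX_JIFFY_OFFSET mirrors
    MAX_JIFFY_OFFSET, and model_earliest() is a hypothetical accumulator loop,
    not the actual reaper code.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Mirrors the kernel's MAX_JIFFY_OFFSET definition. */
#define MODEL_MAX_JIFFY_OFFSET ((unsigned long)((LONG_MAX >> 1) - 1))

/*
 * True if a is earlier than b.  The unsigned difference is reinterpreted as
 * signed (two's complement assumed), so the answer is only meaningful when
 * the two values are less than LONG_MAX apart.
 */
static bool model_time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* Hypothetical accumulator: start from init and keep the earliest deadline. */
static unsigned long model_earliest(const unsigned long *deadlines, size_t n,
				    unsigned long init)
{
	unsigned long earliest = init;
	size_t i;

	assert(deadlines != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (model_time_before(deadlines[i], earliest))
			earliest = deadlines[i];
	return earliest;
}
```

    With the current time at 1000 and deadlines of 1500 and 1200, starting
    from 1000 + MODEL_MAX_JIFFY_OFFSET yields 1200, whereas starting from
    ULONG_MAX leaves the accumulator stuck at ULONG_MAX because the signed
    comparison treats every deadline as being later than it.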

    Signed-off-by: David Howells
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    David Howells
     
  • [ Upstream commit 9faaff593404a9c4e5abc6839a641635d7b9d0cd ]

    Provide a different lockdep key for rxrpc_call::user_mutex when the call is
    made on a kernel socket, such as by the AFS filesystem.

    The problem is that lockdep registers a false positive between userspace
    calling the sendmsg syscall on a user socket, where call->user_mutex is
    held whilst userspace memory is accessed, and the AFS filesystem, which
    may perform operations with mmap_sem held by the caller.

    In such a case, the following warning is produced.

    ======================================================
    WARNING: possible circular locking dependency detected
    4.14.0-fscache+ #243 Tainted: G E
    ------------------------------------------------------
    modpost/16701 is trying to acquire lock:
    (&vnode->io_lock){+.+.}, at: [] afs_begin_vnode_operation+0x33/0x77 [kafs]

    but task is already holding lock:
    (&mm->mmap_sem){++++}, at: [] __do_page_fault+0x1ef/0x486

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #3 (&mm->mmap_sem){++++}:
    __might_fault+0x61/0x89
    _copy_from_iter_full+0x40/0x1fa
    rxrpc_send_data+0x8dc/0xff3
    rxrpc_do_sendmsg+0x62f/0x6a1
    rxrpc_sendmsg+0x166/0x1b7
    sock_sendmsg+0x2d/0x39
    ___sys_sendmsg+0x1ad/0x22b
    __sys_sendmsg+0x41/0x62
    do_syscall_64+0x89/0x1be
    return_from_SYSCALL_64+0x0/0x75

    -> #2 (&call->user_mutex){+.+.}:
    __mutex_lock+0x86/0x7d2
    rxrpc_new_client_call+0x378/0x80e
    rxrpc_kernel_begin_call+0xf3/0x154
    afs_make_call+0x195/0x454 [kafs]
    afs_vl_get_capabilities+0x193/0x198 [kafs]
    afs_vl_lookup_vldb+0x5f/0x151 [kafs]
    afs_create_volume+0x2e/0x2f4 [kafs]
    afs_mount+0x56a/0x8d7 [kafs]
    mount_fs+0x6a/0x109
    vfs_kern_mount+0x67/0x135
    do_mount+0x90b/0xb57
    SyS_mount+0x72/0x98
    do_syscall_64+0x89/0x1be
    return_from_SYSCALL_64+0x0/0x75

    -> #1 (k-sk_lock-AF_RXRPC){+.+.}:
    lock_sock_nested+0x74/0x8a
    rxrpc_kernel_begin_call+0x8a/0x154
    afs_make_call+0x195/0x454 [kafs]
    afs_fs_get_capabilities+0x17a/0x17f [kafs]
    afs_probe_fileserver+0xf7/0x2f0 [kafs]
    afs_select_fileserver+0x83f/0x903 [kafs]
    afs_fetch_status+0x89/0x11d [kafs]
    afs_iget+0x16f/0x4f8 [kafs]
    afs_mount+0x6c6/0x8d7 [kafs]
    mount_fs+0x6a/0x109
    vfs_kern_mount+0x67/0x135
    do_mount+0x90b/0xb57
    SyS_mount+0x72/0x98
    do_syscall_64+0x89/0x1be
    return_from_SYSCALL_64+0x0/0x75

    -> #0 (&vnode->io_lock){+.+.}:
    lock_acquire+0x174/0x19f
    __mutex_lock+0x86/0x7d2
    afs_begin_vnode_operation+0x33/0x77 [kafs]
    afs_fetch_data+0x80/0x12a [kafs]
    afs_readpages+0x314/0x405 [kafs]
    __do_page_cache_readahead+0x203/0x2ba
    filemap_fault+0x179/0x54d
    __do_fault+0x17/0x60
    __handle_mm_fault+0x6d7/0x95c
    handle_mm_fault+0x24e/0x2a3
    __do_page_fault+0x301/0x486
    do_page_fault+0x236/0x259
    page_fault+0x22/0x30
    __clear_user+0x3d/0x60
    padzero+0x1c/0x2b
    load_elf_binary+0x785/0xdc7
    search_binary_handler+0x81/0x1ff
    do_execveat_common.isra.14+0x600/0x888
    do_execve+0x1f/0x21
    SyS_execve+0x28/0x2f
    do_syscall_64+0x89/0x1be
    return_from_SYSCALL_64+0x0/0x75

    other info that might help us debug this:

    Chain exists of:
    &vnode->io_lock --> &call->user_mutex --> &mm->mmap_sem

    Possible unsafe locking scenario:

           CPU0                    CPU1
           ----                    ----
      lock(&mm->mmap_sem);
                                   lock(&call->user_mutex);
                                   lock(&mm->mmap_sem);
      lock(&vnode->io_lock);

    *** DEADLOCK ***

    1 lock held by modpost/16701:
    #0: (&mm->mmap_sem){++++}, at: [] __do_page_fault+0x1ef/0x486

    stack backtrace:
    CPU: 0 PID: 16701 Comm: modpost Tainted: G E 4.14.0-fscache+ #243
    Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
    Call Trace:
    dump_stack+0x67/0x8e
    print_circular_bug+0x341/0x34f
    check_prev_add+0x11f/0x5d4
    ? add_lock_to_list.isra.12+0x8b/0x8b
    ? add_lock_to_list.isra.12+0x8b/0x8b
    ? __lock_acquire+0xf77/0x10b4
    __lock_acquire+0xf77/0x10b4
    lock_acquire+0x174/0x19f
    ? afs_begin_vnode_operation+0x33/0x77 [kafs]
    __mutex_lock+0x86/0x7d2
    ? afs_begin_vnode_operation+0x33/0x77 [kafs]
    ? afs_begin_vnode_operation+0x33/0x77 [kafs]
    ? afs_begin_vnode_operation+0x33/0x77 [kafs]
    afs_begin_vnode_operation+0x33/0x77 [kafs]
    afs_fetch_data+0x80/0x12a [kafs]
    afs_readpages+0x314/0x405 [kafs]
    __do_page_cache_readahead+0x203/0x2ba
    ? filemap_fault+0x179/0x54d
    filemap_fault+0x179/0x54d
    __do_fault+0x17/0x60
    __handle_mm_fault+0x6d7/0x95c
    handle_mm_fault+0x24e/0x2a3
    __do_page_fault+0x301/0x486
    do_page_fault+0x236/0x259
    page_fault+0x22/0x30
    RIP: 0010:__clear_user+0x3d/0x60
    RSP: 0018:ffff880071e93da0 EFLAGS: 00010202
    RAX: 0000000000000000 RBX: 000000000000011c RCX: 000000000000011c
    RDX: 0000000000000000 RSI: 0000000000000008 RDI: 000000000060f720
    RBP: 000000000060f720 R08: 0000000000000001 R09: 0000000000000000
    R10: 0000000000000001 R11: ffff8800b5459b68 R12: ffff8800ce150e00
    R13: 000000000060f720 R14: 00000000006127a8 R15: 0000000000000000
    padzero+0x1c/0x2b
    load_elf_binary+0x785/0xdc7
    search_binary_handler+0x81/0x1ff
    do_execveat_common.isra.14+0x600/0x888
    do_execve+0x1f/0x21
    SyS_execve+0x28/0x2f
    do_syscall_64+0x89/0x1be
    entry_SYSCALL64_slow_path+0x25/0x25
    RIP: 0033:0x7fdb6009ee07
    RSP: 002b:00007fff566d9728 EFLAGS: 00000246 ORIG_RAX: 000000000000003b
    RAX: ffffffffffffffda RBX: 000055ba57280900 RCX: 00007fdb6009ee07
    RDX: 000055ba5727f270 RSI: 000055ba5727cac0 RDI: 000055ba57280900
    RBP: 000055ba57280900 R08: 00007fff566d9700 R09: 0000000000000000
    R10: 000055ba5727cac0 R11: 0000000000000246 R12: 0000000000000000
    R13: 000055ba5727cac0 R14: 000055ba5727f270 R15: 0000000000000000

    Signed-off-by: David Howells
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    David Howells
     

29 Aug, 2017

3 commits

  • Allow a client call that failed on network error to be retried, provided
    that the Tx queue still holds DATA packet 1. This allows an operation to
    be submitted to another server or another address for the same server
    without having to repackage and re-encrypt the data so far processed.

    Two new functions are provided:

    (1) rxrpc_kernel_check_call() - This is used to find out the completion
    state of a call to guess whether it can be retried and whether it
    should be retried.

    (2) rxrpc_kernel_retry_call() - Disconnect the call from its current
    connection, reset the state and submit it as a new client call to a
    new address. The new address need not match the previous address.

    A call may be retried even if all the data hasn't been loaded into it yet;
    a partially constructed packet will be retained at the same point it was at when
    an error condition was detected. msg_data_left() can be used to find out
    how much data was packaged before the error occurred.

    Signed-off-by: David Howells

    David Howells
     
  • Fix IPv6 support in AF_RXRPC in the following ways:

    (1) When extracting the address from a received IPv4 packet, if the local
    transport socket is open for IPv6 then fill out the sockaddr_rxrpc
    struct for an IPv4-mapped-to-IPv6 AF_INET6 transport address instead
    of an AF_INET one.

    (2) When sending CHALLENGE or RESPONSE packets, the transport length needs
    to be set from the sockaddr_rxrpc::transport_len field rather than
    sizeof() on the IPv4 transport address.

    (3) When processing an IPv4 ICMP packet received by an IPv6 socket, set up
    the address correctly before searching for the affected peer.
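
    The mapping in point (1) can be sketched in userspace with the standard
    socket address structures. sketch_map_v4_to_v6() is a hypothetical helper
    written for illustration; it is not the function used in net/rxrpc, which
    fills out a sockaddr_rxrpc rather than a bare sockaddr_in6.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

/*
 * Build the IPv4-mapped IPv6 form (::ffff:a.b.c.d) of an IPv4 transport
 * address, keeping the port unchanged.
 */
static void sketch_map_v4_to_v6(const struct sockaddr_in *in4,
				struct sockaddr_in6 *in6)
{
	assert(in4->sin_family == AF_INET);

	memset(in6, 0, sizeof(*in6));
	in6->sin6_family = AF_INET6;
	in6->sin6_port = in4->sin_port;		/* already in network order */

	/* Bytes 0-9 stay zero, bytes 10-11 are 0xff, bytes 12-15 hold IPv4. */
	in6->sin6_addr.s6_addr[10] = 0xff;
	in6->sin6_addr.s6_addr[11] = 0xff;
	memcpy(&in6->sin6_addr.s6_addr[12], &in4->sin_addr.s_addr, 4);
}
```

    The result satisfies IN6_IS_ADDR_V4MAPPED(), so an AF_INET6 socket can be
    handed a peer address that was learned from an IPv4 packet.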

    Signed-off-by: David Howells

    David Howells
     
  • The 'expiry' member of 'struct key_preparsed_payload' has been changed to
    the 'time64_t' type, which is year 2038 safe on 32-bit systems.

    In the net/rxrpc subsystem, we need to convert the 'u32' type to 'time64_t'
    when copying the ticket expiry time to 'prep->expiry', so this patch
    introduces two helper functions to convert 'u32' to 'time64_t'.

    This patch also uses ktime_get_real_seconds() to get the current time
    instead of get_seconds(), which is not year 2038 safe on 32-bit systems.
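
    As a rough userspace illustration of the widening involved, the sketch
    below uses int64_t in place of time64_t and uint32_t in place of u32. The
    helper name sketch_u32_to_time64() is an assumption made for this example
    and may not match the helpers the patch adds.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: widen an unsigned 32-bit count of seconds to 64 bits. */
static int64_t sketch_u32_to_time64(uint32_t secs)
{
	/* Zero-extension, so values beyond 2^31 - 1 remain positive. */
	int64_t wide = (int64_t)secs;

	assert(wide >= 0);
	return wide;
}
```

    An expiry of 0x80000000 seconds, which would not fit in a signed 32-bit
    time value, comes out as 2147483648 rather than wrapping negative.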

    Signed-off-by: Baolin Wang
    Signed-off-by: David Howells

    Baolin Wang
     

21 Jul, 2017

1 commit

  • Move the protocol description header file into net/rxrpc/ and rename it to
    protocol.h. It's no longer necessary to expose it as packets are no longer
    exposed to kernel services (such as AFS) that use the facility.

    The abort codes are transferred to the UAPI header instead as we pass these
    back to userspace and also to kernel services.

    Signed-off-by: David Howells

    David Howells
     

15 Jun, 2017

1 commit

  • Cache the congestion window setting that was determined during a call's
    transmission phase when it finishes so that it can be used by the next call
    to the same peer, thereby shortcutting the slow-start algorithm.

    The value is stored in the rxrpc_peer struct and is accessed without
    locking. Each call takes the value that happens to be there when it starts
    and just overwrites the value when it finishes.
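
    The lockless take-and-overwrite scheme can be modelled as below. The
    struct and function names are hypothetical stand-ins, not the rxrpc_peer
    and rxrpc_call definitions, and the model is single-threaded; it only
    shows which side reads and which side writes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the peer and call records. */
struct sketch_peer {
	unsigned int cong_cwnd;		/* last window cached for this peer */
};

struct sketch_call {
	struct sketch_peer *peer;
	unsigned int cong_cwnd;		/* window used by this call */
};

/* A new call seeds its window from whatever the peer record holds. */
static void sketch_call_start(struct sketch_call *call,
			      struct sketch_peer *peer)
{
	assert(call != NULL && peer != NULL);
	call->peer = peer;
	call->cong_cwnd = peer->cong_cwnd;	/* unlocked read */
}

/* A finishing call overwrites the cached value; the last finisher wins. */
static void sketch_call_finish(struct sketch_call *call)
{
	assert(call != NULL && call->peer != NULL);
	call->peer->cong_cwnd = call->cong_cwnd;	/* unlocked write */
}
```

    Because the value is only a starting hint for slow-start, a stale or
    overwritten entry costs some efficiency but not correctness, which is why
    no lock is taken.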

    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     

08 Jun, 2017

1 commit

  • Provide a control message that can be specified on the first sendmsg() of a
    client call or the first sendmsg() of a service response to indicate the
    total length of the data to be transmitted for that call.

    Currently, because the length of the payload of an encrypted DATA packet is
    encrypted in front of the data, the packet cannot be encrypted until we
    know how much data it will hold.

    By specifying the length at the beginning of the transmit phase, each DATA
    packet length can be set before we start loading data from userspace (where
    several sendmsg() calls may contribute to a particular packet).

    An error will be returned if too little or too much data is presented in
    the Tx phase.
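
    The length accounting described above can be modelled in a few lines.
    This is a sketch under stated assumptions, not the kernel code: struct
    sketch_tx, sketch_tx_account() and the -1 error return are hypothetical,
    and the 'more' flag stands in for the caller indicating that further data
    will follow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Declared total length minus the bytes queued so far. */
struct sketch_tx {
	int64_t remaining;
};

/*
 * Account for one chunk of len bytes.  Returns 0 on success or -1 if the
 * chunk would exceed the declared total, or if it is the final chunk and
 * the declared total has not been reached.
 */
static int sketch_tx_account(struct sketch_tx *tx, int64_t len, bool more)
{
	assert(len >= 0);

	if (len > tx->remaining)
		return -1;		/* too much data presented */
	if (!more && len < tx->remaining)
		return -1;		/* too little data presented */

	tx->remaining -= len;
	return 0;
}
```

    With a declared total of 100 bytes, a 60-byte chunk followed by a final
    40-byte chunk succeeds, while a 50-byte or a final 30-byte second chunk is
    rejected.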

    Signed-off-by: David Howells

    David Howells
     

05 Jun, 2017

4 commits

  • Make it possible for a client to use AuriStor's service upgrade facility.

    The client does this by adding an RXRPC_UPGRADE_SERVICE control message to
    the first sendmsg() of a call. This takes no parameters.

    When recvmsg() starts returning data from the call, the service ID field in
    the returned msg_name will reflect the result of the upgrade attempt. If
    the upgrade was ignored, srx_service will match what was set in the
    sendmsg(); if the upgrade happened the srx_service will be altered to
    indicate the service the server upgraded to.

    Note that:

    (1) The choice of upgrade service is up to the server

    (2) Further client calls to the same server that would share a connection
    are blocked if an upgrade probe is in progress.

    (3) This should only be used to probe the service. Clients should then
    use the returned service ID in all subsequent communications with that
    server (and not set the upgrade). Note that the kernel will not
    retain this information should the connection expire from its cache.

    (4) If a server that supports upgrading is replaced by one that doesn't,
    whilst a connection is live, and if the replacement is running, say,
    OpenAFS 1.6.4 or older or an older IBM AFS, then the replacement
    server will not respond to packets sent to the upgraded connection.

    At this point, calls will time out and the server must be reprobed.

    Signed-off-by: David Howells

    David Howells
     
  • Implement AuriStor's service upgrade facility. There are three problems
    that this is meant to deal with:

    (1) Various of the standard AFS RPC calls have IPv4 addresses in their
    requests and/or replies - but there's no room for including IPv6
    addresses.

    (2) Definition of IPv6-specific RPC operations in the standard operation
    sets has not yet been achieved.

    (3) One could envision the creation of a new service on the same port as
    the original service. The new service could implement improved
    operations - and the client could try this first, falling back to the
    original service if it's not there.

    Unfortunately, certain servers ignore packets addressed to a service
    they don't implement and don't respond in any way - not even with an
    ABORT. This means that the client must then wait for the call timeout
    to occur.

    What service upgrade does is to see if the connection is marked as being
    'upgradeable' and if so, change the service ID in the server and thus the
    request and reply formats. Note that the upgrade isn't mandatory - a
    server that supports only the original call set will ignore the upgrade
    request.

    In the protocol, the procedure is then as follows:

    (1) To request an upgrade, the first DATA packet in a new connection must
    have the userStatus set to 1 (this is normally 0). The userStatus
    value is normally ignored by the server.

    (2) If the server doesn't support upgrading, the reply packets will
    contain the same service ID as for the first request packet.

    (3) If the server does support upgrading, all future reply packets on that
    connection will contain the new service ID and the new service ID will
    be applied to *all* further calls on that connection as well.

    (4) The RPC op used to probe the upgrade must take the same request data
    as the shadow call in the upgrade set (but may return a different
    reply). GetCapability RPC ops were added to all standard sets for
    just this purpose. Ops where the request formats differ cannot be
    used for probing.

    (5) The client must wait for completion of the probe before sending any
    further RPC ops to the same destination. It should then use the
    service ID that recvmsg() reported back in all future calls.

    (6) The shadow service must have call definitions for all the operation
    IDs defined by the original service.

    To support service upgrading, a server should:

    (1) Call bind() twice on its AF_RXRPC socket before calling listen().
    Each bind() should supply a different service ID, but the transport
    addresses must be the same. This allows the server to receive
    requests with either service ID.

    (2) Enable automatic upgrading by calling setsockopt(), specifying
    RXRPC_UPGRADEABLE_SERVICE and passing in a two-member array of
    unsigned shorts as the argument:

    unsigned short optval[2];

    This specifies a pair of service IDs. They must be different and must
    match the service IDs bound to the socket. Member 0 is the service ID
    to upgrade from and member 1 is the service ID to upgrade to.
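
    The constraints on the optval pair can be expressed as a simple check.
    sketch_upgrade_pair_ok() is a hypothetical userspace model of the rules
    stated above, and the service IDs used with it are arbitrary; it does not
    reproduce the kernel's setsockopt() handler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * bound[]  - the two service IDs previously passed to bind()
 * optval[] - member 0 is the ID to upgrade from, member 1 the ID to
 *            upgrade to
 */
static bool sketch_upgrade_pair_ok(const unsigned short bound[2],
				   const unsigned short optval[2])
{
	bool from_bound, to_bound;

	assert(bound != NULL && optval != NULL);

	/* The two service IDs must be different. */
	if (optval[0] == optval[1])
		return false;

	/* Both must match a service ID bound to the socket. */
	from_bound = optval[0] == bound[0] || optval[0] == bound[1];
	to_bound = optval[1] == bound[0] || optval[1] == bound[1];
	return from_bound && to_bound;
}
```

    For a socket bound to services 100 and 200, the pairs {100, 200} and
    {200, 100} pass, while {100, 100} and {100, 300} are rejected.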

    Signed-off-by: David Howells

    David Howells
     
  • Permit bind() to be called on an AF_RXRPC socket more than once (currently
    at most twice) to bind multiple listening services to it. There are some
    restrictions:

    (1) All bind() calls involved must have a non-zero service ID.

    (2) The service IDs must all be different.

    (3) The rest of the address (notably the transport part) must be the same
    in all (a single UDP socket is shared).

    (4) This must be done before listen() or sendmsg() is called.
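
    The four restrictions can be modelled with a small state record. All of
    the names here are hypothetical, the transport address is reduced to an
    opaque string, and the model only covers service binds; it is meant to
    show the order of the checks, not the real bind() path.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SKETCH_MAX_SERVICES 2

struct sketch_sock {
	unsigned short service[SKETCH_MAX_SERVICES];
	unsigned int nr_services;
	char transport[32];	/* opaque stand-in for the transport address */
	bool active;		/* listen() or sendmsg() has been called */
};

/* Returns 0 if the bind is accepted, -1 if a restriction is violated. */
static int sketch_bind(struct sketch_sock *s, unsigned short service_id,
		       const char *transport)
{
	unsigned int i;

	assert(s != NULL && transport != NULL);
	assert(strlen(transport) < sizeof(s->transport));

	if (s->active)
		return -1;	/* (4) too late to bind */
	if (service_id == 0)
		return -1;	/* (1) service ID must be non-zero */
	if (s->nr_services == SKETCH_MAX_SERVICES)
		return -1;	/* currently at most two binds */
	for (i = 0; i < s->nr_services; i++)
		if (s->service[i] == service_id)
			return -1;	/* (2) IDs must all differ */
	if (s->nr_services > 0 && strcmp(s->transport, transport) != 0)
		return -1;	/* (3) transport part must be the same */

	strcpy(s->transport, transport);
	s->service[s->nr_services++] = service_id;
	return 0;
}
```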

    This allows someone to connect to the service socket with different service
    IDs and lays the foundation for service upgrading.

    The service ID used by an incoming call can be extracted from the msg_name
    returned by recvmsg().

    Signed-off-by: David Howells

    David Howells
     
  • Keep the rxrpc_connection struct's idea of the service ID that is exposed
    in the protocol separate from the service ID that's used as a lookup key.

    This allows the protocol service ID on a client connection to get upgraded
    without making the connection unfindable for other client calls that also
    would like to use the upgraded connection.

    The connection's actual service ID is then returned through recvmsg() by
    way of msg_name.

    Whilst we're at it, we get rid of the last_service_id field from each
    channel. The service ID is per-connection, not per-call and an entire
    connection is upgraded in one go.

    Signed-off-by: David Howells

    David Howells
     

26 May, 2017

1 commit

  • Support network namespacing in AF_RXRPC with the following changes:

    (1) All the local endpoint, peer and call lists, locks, counters, etc. are
    moved into the per-namespace record.

    (2) All the connection tracking is moved into the per-namespace record
    with the exception of the client connection ID tree, which is kept
    global so that connection IDs are kept unique per-machine.

    (3) Each namespace gets its own epoch. This allows each network namespace
    to pretend to be a separate client machine.

    (4) The /proc/net/rxrpc_xxx files are now called /proc/net/rxrpc/xxx and
    the contents reflect the namespace.

    fs/afs/ should be okay with this patch as it explicitly requires the current
    net namespace to be init_net to permit a mount to proceed at the moment. It
    will, however, need updating so that cells, IP addresses and DNS records are
    per-namespace also.

    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     

06 Apr, 2017

1 commit

  • Add a tracepoint (rxrpc_rx_proto) to record protocol errors in received
    packets. The following changes are made:

    (1) Add a function, __rxrpc_abort_eproto(), to note a protocol error on a
    call and mark the call aborted. This is wrapped by
    rxrpc_abort_eproto() that makes the why string usable in trace.

    (2) Add trace_rxrpc_rx_proto() or rxrpc_abort_eproto() to protocol error
    generation points, replacing rxrpc_abort_call() with the latter.

    (3) Only send an abort packet in rxkad_verify_packet*() if we actually
    managed to abort the call.

    Note that a trace event is also emitted if a kernel user (e.g. afs) tries
    to send data through a call when it's not in the transmission phase, though
    it's not technically a receive event.

    Signed-off-by: David Howells

    David Howells
     

02 Mar, 2017

1 commit

  • All the routines by which rxrpc is accessed from the outside are serialised
    by means of the socket lock (sendmsg, recvmsg, bind,
    rxrpc_kernel_begin_call(), ...) and this presents a problem:

    (1) If a number of calls on the same socket are in the process of
    connecting to the same peer, a maximum of four concurrent live calls
    are permitted before further calls need to wait for a slot.

    (2) If a call is waiting for a slot, it is deep inside sendmsg() or
    rxrpc_kernel_begin_call() and the entry function is holding the socket
    lock.

    (3) sendmsg() and recvmsg() or the in-kernel equivalents are prevented
    from servicing the other calls as they need to take the socket lock to
    do so.

    (4) The socket is stuck until a call is aborted and makes its slot
    available to the waiter.

    Fix this by:

    (1) Provide each call with a mutex ('user_mutex') that arbitrates access
    by the users of rxrpc separately for each specific call.

    (2) Make rxrpc_sendmsg() and rxrpc_recvmsg() unlock the socket as soon as
    they've got a call and taken its mutex.

    Note that I'm returning EWOULDBLOCK from recvmsg() if MSG_DONTWAIT is
    set but someone else has the lock. Should I instead only return
    EWOULDBLOCK if there's nothing currently to be done on a socket, and
    sleep in this particular instance because there is something to be
    done, but we appear to be blocked by the interrupt handler doing its
    ping?

    (3) Make rxrpc_new_client_call() unlock the socket after allocating a new
    call, locking its user mutex and adding it to the socket's call tree.
    The call is returned locked so that sendmsg() can add data to it
    immediately.

    From the moment the call is in the socket tree, it is subject to
    access by sendmsg() and recvmsg() - even if it isn't connected yet.

    (4) Lock new service calls in the UDP data_ready handler (in
    rxrpc_new_incoming_call()) because they may already be in the socket's
    tree and the data_ready handler makes them live immediately if a user
    ID has already been preassigned.

    Note that the new call is locked before any notifications are sent
    that it is live, so doing mutex_trylock() *ought* to always succeed.
    Userspace is prevented from doing sendmsg() on calls that are in a
    too-early state in rxrpc_do_sendmsg().

    (5) Make rxrpc_new_incoming_call() return the call with the user mutex
    held so that a ping can be scheduled immediately under it.

    Note that it might be worth moving the ping call into
    rxrpc_new_incoming_call() and then we can drop the mutex there.

    (6) Make rxrpc_accept_call() take the lock on the call it is accepting and
    release the socket after adding the call to the socket's tree. This
    is slightly tricky as we've dequeued the call by that point and have
    to requeue it.

    Note that requeuing emits a trace event.

    (7) Make rxrpc_kernel_send_data() and rxrpc_kernel_recv_data() take the
    new mutex immediately and don't bother with the socket mutex at all.

    This patch has the nice bonus that calls on the same socket are now to some
    extent parallelisable.

    Note that we might want to move rxrpc_service_prealloc() calls out from the
    socket lock and give it its own lock, so that we don't hang progress in
    other calls because we're waiting for the allocator.

    We probably also want to avoid calling rxrpc_notify_socket() from within
    the socket lock (rxrpc_accept_call()).

    Signed-off-by: David Howells
    Tested-by: Marc Dionne
    Signed-off-by: David S. Miller

    David Howells
     

09 Jan, 2017

1 commit

  • Allow listen() with a backlog of 0 to be used to disable listening on an
    AF_RXRPC socket. This also releases any preallocation, thereby making it
    easier for a kernel service to account for all allocated call structures
    when shutting down the service.

    The socket cannot thereafter have listening reenabled, but must rather be
    closed and reopened.

    Signed-off-by: David Howells

    David Howells
     

05 Jan, 2017

1 commit

  • Fix the way enum values are translated into strings in AF_RXRPC
    tracepoints. The problem with just doing a lookup in a normal flat array
    of strings or chars is that external tracing infrastructure can't find it.
    Rather, TRACE_DEFINE_ENUM must be used.

    Also sort the enums and string tables to make it easier to keep them in
    order so that a future patch to __print_symbolic() can be optimised to try
    a direct lookup into the table first before iterating over it.

    A couple of _proto() macro calls are removed because they referred to tables
    that got moved to the tracing infrastructure. The relevant data can be
    found by way of tracing.
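
    One general way to keep an enum and its string table in the same order is
    to generate both from a single list. The sketch below shows that technique
    in plain C; it is not TRACE_DEFINE_ENUM or the macros this patch adds, and
    the SKETCH_ names and three-letter labels are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Single source of truth: each entry pairs an enum symbol with its label. */
#define SKETCH_ACK_REASONS(X)		\
	X(SKETCH_ACK_DELAY, "DLY")	\
	X(SKETCH_ACK_IDLE,  "IDL")	\
	X(SKETCH_ACK_PING,  "PNG")

/* Expand the list into the enum... */
#define SKETCH_X_ENUM(sym, str) sym,
enum sketch_ack_reason { SKETCH_ACK_REASONS(SKETCH_X_ENUM) SKETCH_ACK__NR };
#undef SKETCH_X_ENUM

/* ...and into a string table indexed by the same enum values. */
#define SKETCH_X_STR(sym, str) [sym] = str,
static const char *const sketch_ack_names[SKETCH_ACK__NR] = {
	SKETCH_ACK_REASONS(SKETCH_X_STR)
};
#undef SKETCH_X_STR

/* Direct lookup with a bounds check so a bad value cannot overrun the table. */
static const char *sketch_ack_name(unsigned int reason)
{
	const char *name;

	if (reason >= SKETCH_ACK__NR)
		return "?";
	name = sketch_ack_names[reason];
	assert(name != NULL);
	return name;
}
```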

    Signed-off-by: David Howells

    David Howells
     

06 Oct, 2016

4 commits

  • We need to generate a DELAY ACK from the service end of an operation if we
    start doing the actual operation work and it takes longer than expected.
    This will hard-ACK the request data and allow the client to release its
    resources.

    To make this work:

    (1) We have to set the ack timer and propose an ACK when the call moves to
    the RXRPC_CALL_SERVER_ACK_REQUEST state, and clear the pending ACK and
    cancel the timer when we start transmitting the reply (the first DATA
    packet of the reply implicitly ACKs the request phase).

    (2) It must be possible to set the timer when the caller is holding
    call->state_lock, so split the lock-getting part of the timer function
    out.

    (3) Add trace notes for the ACK we're requesting and the timer we clear.

    Signed-off-by: David Howells

    David Howells
     
  • Separate the output of PING ACKs from the output of other sorts of ACK so
    that if we receive a PING ACK and schedule transmission of a PING RESPONSE
    ACK, the response doesn't get cancelled by a PING ACK we happen to be
    scheduling transmission of at the same time.

    If a PING RESPONSE gets lost, the other side might just sit there waiting
    for it and refuse to proceed otherwise.

    Signed-off-by: David Howells

    David Howells
     
  • Split rxrpc_send_data_packet() to separate ACK generation (which is more
    complicated) from ABORT generation. This simplifies the code a bit and
    fixes the following warning:

    In file included from ../net/rxrpc/output.c:20:0:
    net/rxrpc/output.c: In function 'rxrpc_send_call_packet':
    net/rxrpc/ar-internal.h:1187:27: error: 'top' may be used uninitialized in this function [-Werror=maybe-uninitialized]
    net/rxrpc/output.c:103:24: note: 'top' was declared here
    net/rxrpc/output.c:225:25: error: 'hard_ack' may be used uninitialized in this function [-Werror=maybe-uninitialized]

    Reported-by: Arnd Bergmann
    Signed-off-by: David Howells

    David Howells
     
  • Remove a duplicate const keyword.

    Signed-off-by: David Howells

    David Howells
     

30 Sep, 2016

5 commits

  • Keep the call timeouts as ktimes rather than jiffies so that they can be
    expressed as functions of RTT.

    Signed-off-by: David Howells

    David Howells
     
  • Remove error from struct rxrpc_skb_priv as it is no longer used.

    Signed-off-by: David Howells

    David Howells
     
  • The offset field in struct rxrpc_skb_priv is unnecessary as the value can
    always be calculated.

    Signed-off-by: David Howells

    David Howells
     
  • Reduce the rxrpc_local::services list to just a pointer as we don't permit
    multiple service endpoints to bind to a single transport endpoint (this is
    excluded by rxrpc_lookup_local()).

    The reason we don't allow this is that if you send a request to an AFS
    filesystem service, it will try to talk back to your cache manager on the
    port you sent from (this is how file change notifications are handled). To
    prevent someone from stealing your CM callbacks, we don't let AF_RXRPC
    sockets share a UDP socket if at least one of them has a service bound.

    Signed-off-by: David Howells

    David Howells
     
  • In rxrpc_send_data_packet() make the loss-injection path return through the
    same code as the transmission path so that the RTT determination is
    initiated and any future timer shuffling will be done, despite the packet
    having been binned.

    Whilst we're at it:

    (1) Add to the tx_data tracepoint an indication of whether or not we're
    retransmitting a data packet.

    (2) When we're deciding whether or not to request an ACK, rather than
    checking if we're in fast-retransmit mode check instead if we're
    retransmitting.

    (3) Don't invoke the lose_skb tracepoint when losing a Tx packet as we're
    not altering the sk_buff refcount nor are we just seeing it after
    getting it off the Tx list.

    (4) The rxrpc_skb_tx_lost note is then no longer used so remove it.

    (5) rxrpc_lose_skb() no longer needs to deal with rxrpc_skb_tx_lost.

    Signed-off-by: David Howells

    David Howells
     

25 Sep, 2016

5 commits

  • Implement RxRPC slow-start, which is similar to RFC 5681 for TCP. A
    tracepoint is added to log the state of the congestion management algorithm
    and the decisions it makes.

    Notes:

    (1) Since we send fixed-size DATA packets (apart from the final packet in
    each phase), counters and calculations are in terms of packets rather
    than bytes.

    (2) The ACK packet carries the equivalent of TCP SACK.

    (3) The FLIGHT_SIZE calculation in RFC 5681 doesn't seem particularly
    suited to SACK of a small number of packets. It seems that, almost
    inevitably, by the time three 'duplicate' ACKs have been seen, we have
    narrowed the loss down to one or two missing packets, and the
    FLIGHT_SIZE calculation ends up as 2.

    (4) In rxrpc_resend(), if there was no data that apparently needed
    retransmission, we transmit a PING ACK to ask the peer to tell us what
    its Rx window state is.
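
    Notes (1) and (3) can be made concrete with a simplified model of RFC
    5681 expressed in packets, taking SMSS as one packet. This is a sketch
    only: the names are hypothetical and the window-growth rule is the
    textbook one, not necessarily what the rxrpc code does.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_cong {
	unsigned int cwnd;		/* congestion window, in packets */
	unsigned int ssthresh;		/* slow-start threshold, in packets */
	unsigned int acked_in_window;	/* ACKs counted towards the next +1 */
};

/* RFC 5681 equation (4) in packets: max(FlightSize / 2, 2). */
static unsigned int sketch_ssthresh_on_loss(unsigned int flight_size)
{
	unsigned int half = flight_size / 2;

	return half > 2 ? half : 2;
}

/* Process one ACK that newly acknowledges a packet. */
static void sketch_on_new_ack(struct sketch_cong *c)
{
	assert(c != NULL && c->cwnd > 0);

	if (c->cwnd < c->ssthresh) {
		c->cwnd++;		/* slow start: +1 per ACK */
		return;
	}
	/* Congestion avoidance: +1 once a full window has been ACK'd. */
	if (++c->acked_in_window >= c->cwnd) {
		c->acked_in_window = 0;
		c->cwnd++;
	}
}
```

    With only one or two packets in flight, half the flight size is below the
    floor, so the threshold comes out as 2, which is the situation note (3)
    describes.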

    Signed-off-by: David Howells

    David Howells
     
  • If we've sent all the request data in a client call but haven't seen any
    sign of the reply data yet, schedule an ACK to be sent to the server to
    find out if the reply data got lost.

    If the server hasn't yet hard-ACK'd the request data, we send a PING ACK to
    demand a response to find out whether we need to retransmit.

    If the server says it has received all of the data, we send an IDLE ACK to
    tell the server that we haven't received anything in the receive phase as
    yet.

    To make this work, a non-immediate PING ACK must carry a delay. I've chosen
    the same as the IDLE ACK for the moment.

    Signed-off-by: David Howells

    David Howells
     
  • Generate a summary of the Tx buffer packet state when an ACK is received
    for use in a later patch that does congestion management.

    Signed-off-by: David Howells

    David Howells
     
  • Clear the ACK reason, ACK timer and resend timer when entering the client
    reply phase when the first DATA packet is received. New ACKs will be
    proposed once the data is queued.

    The resend timer is no longer relevant and we need to cancel ACKs scheduled
    to probe for a lost reply.

    Signed-off-by: David Howells

    David Howells
     
  • Send an ACK if we haven't sent one for the last two packets we've received.
    This keeps the other end apprised of where we've got to - which is
    important if they're doing slow-start.

    We do this in recvmsg so that we can dispatch a packet directly without the
    need to wake up the background thread.

    This should possibly be made configurable in future.
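
A minimal sketch of the "ACK every two packets" rule, assuming a per-call counter of received-but-unacknowledged packets. The struct, the function and the `ACK_EVERY_N_PACKETS` constant are hypothetical names for illustration; the constant is where a future tunable might plug in.

```c
#include <stdbool.h>

/* Hypothetical threshold; the commit message uses two packets. */
#define ACK_EVERY_N_PACKETS 2

/* Hypothetical per-call receive state. */
struct rx_ack_state {
    unsigned int unacked;  /* packets received since the last ACK we sent */
};

/*
 * Call once for each DATA packet handed to the receiver.
 * Returns true when an ACK should be sent now.
 */
static bool note_packet_received(struct rx_ack_state *s)
{
    if (++s->unacked < ACK_EVERY_N_PACKETS)
        return false;
    s->unacked = 0;  /* the caller sends the ACK, so start counting again */
    return true;
}
```

Calling this from the receive path lets the caller transmit the ACK directly, which corresponds to the point about not needing to wake a background thread.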

    Signed-off-by: David Howells

    David Howells
     

23 Sep, 2016

5 commits

  • Add a tracepoint to log proposed ACKs, including whether the proposal is
    used to update a pending ACK or is discarded in favour of an earlier,
    higher priority ACK.

    Whilst we're at it, get rid of the rxrpc_acks() function and access the
    name array directly. We do, however, need to validate the ACK reason
    number given to trace_rxrpc_rx_ack() to make sure we don't overrun the
    array.
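
The bounds check mentioned above might look roughly like the following. The table contents and names here are assumptions made for the sketch, not the kernel's actual array; the point is only that an out-of-range reason from the wire is mapped to a dedicated "invalid" slot before indexing.

```c
#include <stddef.h>

/*
 * Hypothetical name table for this sketch. The final entry is reserved
 * for reason numbers that fall outside the known range.
 */
static const char *const ack_names[] = {
    "---", "REQ", "DUP", "OOS", "WIN", "MEM", "PNG", "PNR", "DLY", "IDL",
    "-?-",
};

#define ACK_NAME_COUNT (sizeof(ack_names) / sizeof(ack_names[0]))
#define ACK_INVALID    (ACK_NAME_COUNT - 1)

/* Look up a printable name without ever reading past the end of the array. */
static const char *ack_reason_name(unsigned int reason)
{
    if (reason >= ACK_INVALID)
        reason = ACK_INVALID;  /* clamp untrusted values to the invalid slot */
    return ack_names[reason];
}
```

The clamp matters because the reason byte comes from a received packet and cannot be trusted to be in range.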

    Signed-off-by: David Howells

    David Howells
     
  • Add a tracepoint to log call timer initiation, setting and expiry.

    Signed-off-by: David Howells

    David Howells
     
  • When the last packet of data to be transmitted on a call is queued, tx_top
    is set and then the RXRPC_CALL_TX_LAST flag is set. Unfortunately, this
    leaves a race in the ACK processing side of things because the flag affects
    the interpretation of tx_top and also allows us to start receiving reply
    data before we've finished transmitting.

    To fix this, make the following changes:

    (1) rxrpc_queue_packet() now sets a marker in the annotation buffer
    instead of setting the RXRPC_CALL_TX_LAST flag.

    (2) rxrpc_rotate_tx_window() detects the marker and sets the flag in the
    same context as the routines that use it.

    (3) rxrpc_end_tx_phase() is simplified to just shift the call state.
    The Tx window must have been rotated before calling to discard the
    last packet.

    (4) rxrpc_receiving_reply() is added to handle the arrival of the first
    DATA packet of a reply to a client call (which is an implicit ACK of
    the Tx phase).

    (5) The last part of rxrpc_input_ack() is reordered to perform Tx
    rotation, then soft-ACK application and then to end the phase if we've
    rotated the last packet. In the event of a terminal ACK, the soft-ACK
    application will be skipped as nAcks should be 0.

    (6) rxrpc_input_ackall() now has to rotate as well as ending the phase.

    In addition:

    (7) Alter the transmit tracepoint to log the rotation of the last packet.

    (8) Remove the no-longer relevant queue_reqack tracepoint note. The
    ACK-REQUESTED packet header flag is now set as needed when we actually
    transmit the packet and may vary by retransmission.
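
Points (1) and (2) above can be illustrated with a toy ring buffer. This is a sketch under assumptions: the structure, the ring size and the `ANNOT_LAST` bit are hypothetical, there is no locking, and there is no check for a full window. It shows only that the queueing side records a marker while the rotation side is the one that raises the "last packet" flag.

```c
#include <stdbool.h>
#include <stdint.h>

#define TX_RING_SIZE 16u                 /* hypothetical, power of two */
#define TX_RING_MASK (TX_RING_SIZE - 1)
#define ANNOT_LAST   0x80u               /* hypothetical "last packet" marker */

/* Hypothetical transmit state for one call. */
struct tx_call {
    uint8_t  annot[TX_RING_SIZE];  /* per-slot annotation byte */
    uint32_t tx_top;               /* highest sequence number queued */
    uint32_t tx_hard_ack;          /* highest sequence number rotated out */
    bool     tx_last;              /* stands in for RXRPC_CALL_TX_LAST */
};

/* Queueing side: record the marker only; do not touch tx_last. */
static void queue_packet(struct tx_call *c, bool last)
{
    uint32_t seq = c->tx_top + 1;

    c->annot[seq & TX_RING_MASK] = last ? ANNOT_LAST : 0;
    c->tx_top = seq;
}

/*
 * ACK-processing side: rotate the window up to 'to' and set tx_last when
 * the marker is seen. Returns true once the last packet has been rotated.
 */
static bool rotate_tx_window(struct tx_call *c, uint32_t to)
{
    while (c->tx_hard_ack < to && c->tx_hard_ack < c->tx_top) {
        uint32_t seq = ++c->tx_hard_ack;

        if (c->annot[seq & TX_RING_MASK] & ANNOT_LAST)
            c->tx_last = true;
    }
    return c->tx_last;
}
```

Because the flag is only ever written by the rotation routine, the code that interprets `tx_top` sees the flag change in its own context, which is the property the fix is after.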

    Signed-off-by: David Howells

    David Howells
     
  • When a DATA packet has its initial transmission, we may need to start or
    adjust the resend timer. Without this we end up relying on being sent a
    NACK to initiate the resend.
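
One way to picture "start or adjust the resend timer" is the helper below. It is a sketch only: the timer struct, the nanosecond clock and the function name are hypothetical, and the real code works with kernel timers rather than plain integers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a call's resend timer. */
struct resend_timer {
    bool    armed;       /* timer currently pending */
    int64_t expires_ns;  /* absolute expiry time when armed */
};

/*
 * Called when a DATA packet is first transmitted at time now_ns.
 * Arms the timer if it is idle, or pulls it earlier if this packet's
 * resend deadline precedes the current expiry. Returns true if changed.
 */
static bool arm_resend_on_first_tx(struct resend_timer *t,
                                   int64_t now_ns, int64_t timeout_ns)
{
    int64_t want = now_ns + timeout_ns;

    if (t->armed && t->expires_ns <= want)
        return false;  /* an earlier deadline is already pending */

    t->armed = true;
    t->expires_ns = want;
    return true;
}
```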

    Signed-off-by: David Howells

    David Howells
     
  • Make sure that sendmsg() gets woken up if the call it is waiting for
    completes abnormally.

    Signed-off-by: David Howells

    David Howells
     

22 Sep, 2016

3 commits

  • Reduce the number of ACK-Requests we set on DATA packets that we're sending
    to reduce network traffic. We set the flag on odd-numbered DATA packets to
    start off the RTT cache until we have at least three entries in it and then
    probe once per second thereafter to keep it topped up.

    This could be made tunable in future.

    Note that from this point, the RXRPC_REQUEST_ACK flag is set on DATA
    packets as we transmit them and not stored statically in the sk_buff.
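
The policy above can be sketched as a per-transmission decision. The names, the nanosecond clock and the two constants are assumptions chosen for the example; only the rule itself (odd-numbered packets until three RTT samples exist, then roughly one probe per second) comes from the text.

```c
#include <stdbool.h>
#include <stdint.h>

#define RTT_CACHE_MIN     3             /* samples wanted before slowing down */
#define PROBE_INTERVAL_NS 1000000000LL  /* one second between later probes */

/* Hypothetical per-peer probing state. */
struct rtt_probe_state {
    unsigned int rtt_samples;    /* entries currently in the RTT cache */
    int64_t      last_probe_ns;  /* when we last requested an ACK */
};

/*
 * Decide, at transmission time, whether this DATA packet should carry
 * the request-ACK flag. 'seq' is the packet's sequence number.
 */
static bool want_request_ack(struct rtt_probe_state *s,
                             uint32_t seq, int64_t now_ns)
{
    bool ask;

    if (s->rtt_samples < RTT_CACHE_MIN)
        ask = (seq & 1) != 0;  /* priming: odd-numbered packets only */
    else
        ask = now_ns - s->last_probe_ns >= PROBE_INTERVAL_NS;

    if (ask)
        s->last_probe_ns = now_ns;
    return ask;
}
```

Evaluating this at each transmission, rather than once at queueing time, is what allows the flag to differ between the first send and a retransmission of the same packet.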

    Signed-off-by: David Howells

    David Howells
     
  • In addition to sending a PING ACK to gain RTT data, we can set the
    RXRPC_REQUEST_ACK flag on a DATA packet and get a REQUESTED-ACK ACK. The
    ACK packet contains the serial number of the packet it is in response to,
    so we can look through the Tx buffer for a matching DATA packet.

    This requires that the data packets be stamped with the time of
    transmission as a ktime rather than having the resend_at time in jiffies.

    This further requires the resend code to do the resend determination in
    ktimes and convert to jiffies to set the timer.
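
As a rough illustration of the serial-number matching and the time conversion, the sketch below uses plain 64-bit nanosecond values in place of ktime and a caller-supplied tick rate in place of HZ. All names are hypothetical, and the tick rate is assumed to be non-zero.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one DATA packet in the Tx buffer. */
struct tx_slot {
    uint32_t serial;   /* serial number used for the latest transmission */
    int64_t  sent_ns;  /* timestamp of that transmission, in nanoseconds */
};

/*
 * Find the packet that an ACK is responding to and return the RTT in
 * nanoseconds, or -1 if no packet in the buffer carries that serial.
 */
static int64_t rtt_from_ack(const struct tx_slot *slots, size_t n,
                            uint32_t acked_serial, int64_t ack_rx_ns)
{
    for (size_t i = 0; i < n; i++) {
        if (slots[i].serial == acked_serial)
            return ack_rx_ns - slots[i].sent_ns;
    }
    return -1;
}

/* Convert a nanosecond interval to timer ticks, rounding up. */
static unsigned long ns_to_timer_ticks(int64_t ns, unsigned int hz)
{
    int64_t ns_per_tick = 1000000000LL / hz;

    if (ns <= 0)
        return 0;
    return (unsigned long)((ns + ns_per_tick - 1) / ns_per_tick);
}
```

Rounding up in the conversion keeps a timer from firing before the nanosecond deadline it was derived from.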

    Signed-off-by: David Howells

    David Howells
     
  • Send a PING ACK packet to the peer when we get a new incoming call from a
    peer we don't have a record for. The PING RESPONSE ACK packet will tell us
    the following about the peer:

    (1) its receive window size

    (2) its MTU sizes

    (3) its support for jumbo DATA packets

    (4) its support for slow start (similar to RFC 5681)

    (5) an estimate of the RTT

    This is necessary because the peer won't normally send us an ACK until it
    gets to the Rx phase and we send it a packet, but we would like to know
    some of this information before we start sending packets.

    A pair of tracepoints is added so that RTT determination can be observed.

    Signed-off-by: David Howells

    David Howells