09 Sep, 2020

1 commit

  • Rewrite the rxrpc client connection manager so that it can support multiple
    connections for a given security key to a peer. The following changes are
    made:

    (1) For each open socket, the code currently maintains an rbtree with the
    connections placed into it, keyed by communications parameters. This
    is tricky to maintain as connections can be culled from the tree or
    replaced within it. Connections can require replacement for a number
    of reasons, e.g. their IDs span too great a range for the IDR data
    type to represent efficiently, the call ID numbers on that conn would
    overflow or the conn got aborted.

    This is changed so that there's now a connection bundle object placed
    in the tree, keyed on the same parameters. The bundle, however, does
    not need to be replaced.

    (2) An rxrpc_bundle object can now manage the available channels for a set
    of parallel connections. The lock that manages this is moved there
    from the rxrpc_connection struct (channel_lock).

    (3) There's a dummy bundle for all incoming connections to share so that
    they have a channel_lock too. It might be better to give each
    incoming connection its own bundle. This bundle is not needed to
    manage which channels incoming calls are made on because that's
    solely at the whim of the client.

    (4) The restrictions on how many client connections are around are
    removed. Instead, a previous patch limits the number of client calls
    that can be allocated. Ordinarily, client connections are reaped
    after 2 minutes on the idle queue, but when more than a certain number
    of connections are in existence, the reaper starts reaping them after
    2s of idleness instead to get the numbers back down.

    It could also be made such that new call allocations are forced to
    wait until the number of outstanding connections subsides.

    Signed-off-by: David Howells

    David Howells
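
The bundle idea in points (1) and (2) of the commit above can be sketched roughly as follows. This is a minimal user-space illustration, not the kernel code: the names `demo_bundle`, `demo_add_conn`, `demo_alloc_channel` and `demo_drop_conn` are hypothetical, the figure of four channels per connection and the cap on parallel connections are assumptions made for the sketch, and the channel_lock is left out.

```c
#include <stdbool.h>

#define DEMO_CHANS_PER_CONN 4   /* assumption: four call channels per connection */
#define DEMO_MAX_CONNS      4   /* assumption: arbitrary cap for this sketch */

/* Hypothetical stand-in for a connection bundle: it outlives the individual
 * connections and tracks which channel slots are free across all of them. */
struct demo_bundle {
	bool conn_live[DEMO_MAX_CONNS];   /* is there a connection in this slot? */
	unsigned int avail_chans;         /* bit n set => channel n is free */
};

/* Put a connection into the first empty slot and mark its channels free.
 * Returns the slot number, or -1 if the bundle is full. */
static int demo_add_conn(struct demo_bundle *b)
{
	for (int i = 0; i < DEMO_MAX_CONNS; i++) {
		if (!b->conn_live[i]) {
			b->conn_live[i] = true;
			b->avail_chans |= 0xfu << (i * DEMO_CHANS_PER_CONN);
			return i;
		}
	}
	return -1;
}

/* Claim the lowest free channel. The result encodes both the connection
 * slot (n / 4) and the channel on it (n % 4); -1 means none is free. */
static int demo_alloc_channel(struct demo_bundle *b)
{
	for (int n = 0; n < DEMO_MAX_CONNS * DEMO_CHANS_PER_CONN; n++) {
		if (b->avail_chans & (1u << n)) {
			b->avail_chans &= ~(1u << n);
			return n;
		}
	}
	return -1;
}

/* Retire a connection (e.g. aborted or out of call IDs). Only the slot is
 * cleared; the bundle itself stays where it is. */
static void demo_drop_conn(struct demo_bundle *b, int slot)
{
	b->conn_live[slot] = false;
	b->avail_chans &= ~(0xfu << (slot * DEMO_CHANS_PER_CONN));
}
```

The point of the design is visible in `demo_drop_conn()`: a connection can be retired or replaced without touching whatever lookup structure holds the bundle.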
     

31 Jul, 2020

1 commit

  • There's a race between rxrpc_sendmsg setting up a call, but then failing to
    send anything on it due to an error, and recvmsg() seeing the call
    completion occur and trying to return the state to the user.

    An assertion fails in rxrpc_recvmsg() because the call has already been
    released from the socket and is about to be released again as recvmsg deals
    with it. (The recvmsg_q queue on the socket holds a ref, so there's no
    problem with use-after-free.)

    We also have to be careful not to end up reporting an error twice, in such
    a way that both returns indicate to userspace that the user ID supplied
    with the call is no longer in use - which could cause the client to
    malfunction if it recycles the user ID fast enough.

    Fix this by the following means:

    (1) When sendmsg() creates a call after the point that the call has been
    successfully added to the socket, don't return any errors through
    sendmsg(), but rather complete the call and let recvmsg() retrieve
    them. Make sendmsg() return 0 at this point. Further calls to
    sendmsg() for that call will fail with ESHUTDOWN.

    Note that at this point, we haven't sent any packets yet, so the
    server doesn't yet know about the call.

    (2) If sendmsg() returns an error when it was expected to create a new
    call, it means that the user ID wasn't used.

    (3) Mark the call disconnected before marking it completed to prevent an
    oops in rxrpc_release_call().

    (4) recvmsg() will then retrieve the error and set MSG_EOR to indicate
    that the user ID is no longer known by the kernel.

    An oops like the following is produced:

    kernel BUG at net/rxrpc/recvmsg.c:605!
    ...
    RIP: 0010:rxrpc_recvmsg+0x256/0x5ae
    ...
    Call Trace:
    ? __init_waitqueue_head+0x2f/0x2f
    ____sys_recvmsg+0x8a/0x148
    ? import_iovec+0x69/0x9c
    ? copy_msghdr_from_user+0x5c/0x86
    ___sys_recvmsg+0x72/0xaa
    ? __fget_files+0x22/0x57
    ? __fget_light+0x46/0x51
    ? fdget+0x9/0x1b
    do_recvmmsg+0x15e/0x232
    ? _raw_spin_unlock+0xa/0xb
    ? vtime_delta+0xf/0x25
    __x64_sys_recvmmsg+0x2c/0x2f
    do_syscall_64+0x4c/0x78
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Fixes: 357f5ef64628 ("rxrpc: Call rxrpc_release_call() on error in rxrpc_new_client_call()")
    Reported-by: syzbot+b54969381df354936d96@syzkaller.appspotmail.com
    Signed-off-by: David Howells
    Reviewed-by: Marc Dionne
    Signed-off-by: David S. Miller

    David Howells
     

07 Feb, 2020

1 commit

  • The recent patch that substituted a flag on an rxrpc_call for the
    connection pointer being NULL as an indication that a call was disconnected
    puts the set_bit in the wrong place for service calls. This is only a
    problem if a call is implicitly terminated by a new call coming in on the
    same connection channel instead of a terminating ACK packet.

    In such a case, rxrpc_input_implicit_end_call() calls
    __rxrpc_disconnect_call(), which is now (incorrectly) setting the
    disconnection bit, meaning that when rxrpc_release_call() is later called,
    it doesn't call rxrpc_disconnect_call() and so the call isn't removed from
    the peer's error distribution list and the list gets corrupted.

    KASAN finds the issue as an access after release on a call, but the
    position at which it occurs is confusing as it appears to be related to a
    different call (the call site is where the latter call is being removed
    from the error distribution list and either the next or pprev pointer
    points to a previously released call).

    Fix this by moving the setting of the flag from __rxrpc_disconnect_call()
    to rxrpc_disconnect_call() in the same place that the connection pointer
    was being cleared.

    Fixes: 5273a191dca6 ("rxrpc: Fix NULL pointer deref due to call->conn being cleared on disconnect")
    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     

03 Feb, 2020

1 commit

  • When a call is disconnected, the connection pointer from the call is
    cleared to make sure it isn't used again and to prevent further attempted
    transmission for the call. Unfortunately, there might be a daemon trying
    to use it at the same time to transmit a packet.

    Fix this by keeping call->conn set, but setting a flag on the call to
    indicate disconnection instead.

    Also remove the bits in the transmission functions where the conn pointer is
    checked and a ref taken under spinlock as this is now redundant.

    Fixes: 8d94aa381dab ("rxrpc: Calls shouldn't hold socket refs")
    Signed-off-by: David Howells

    David Howells
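
A rough user-space sketch of the approach in the commit above, using C11 atomics in place of the kernel's bit operations. The names `demo_call`, `demo_disconnect`, `demo_transmit` and `DEMO_CALL_DISCONNECTED` are hypothetical and chosen only for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_CALL_DISCONNECTED 0x1u   /* hypothetical flag bit */

struct demo_conn {
	int id;
};

struct demo_call {
	struct demo_conn *conn;   /* stays set for the lifetime of the call */
	atomic_uint flags;
};

/* Mark the call disconnected without clearing call->conn.
 * Returns true only for the caller that actually set the flag. */
static bool demo_disconnect(struct demo_call *call)
{
	unsigned int old = atomic_fetch_or(&call->flags, DEMO_CALL_DISCONNECTED);

	return !(old & DEMO_CALL_DISCONNECTED);
}

/* Transmission path: test the flag rather than testing conn for NULL.
 * Returns the connection ID used, or -1 if the call is disconnected. */
static int demo_transmit(struct demo_call *call)
{
	if (atomic_load(&call->flags) & DEMO_CALL_DISCONNECTED)
		return -1;
	return call->conn->id;   /* pointer is never cleared underneath us */
}
```

Because the pointer is never written after setup, a concurrent transmitter cannot observe it changing to NULL between its check and its use.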
     

07 Oct, 2019

1 commit

  • rxrpc_put_*conn() calls trace_rxrpc_conn() after they have done the
    decrement of the refcount - which looks at the debug_id in the connection
    record. But unless the refcount was reduced to zero, we no longer have the
    right to look in the record and, indeed, it may be deleted by some other
    thread.

    Fix this by getting the debug_id out before decrementing the refcount and
    then passing that into the tracepoint.

    Fixes: 363deeab6d0f ("rxrpc: Add connection tracepoint and client conn state tracepoint")
    Signed-off-by: David Howells

    David Howells
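
The ordering fix in the commit above can be illustrated with a small user-space sketch. `demo_conn`, `demo_trace` and `demo_put_conn` are hypothetical names, and the trace sink simply records its arguments; the real tracepoint is not reproduced here.

```c
#include <stdatomic.h>

struct demo_conn {
	atomic_int usage;
	unsigned int debug_id;
};

/* Hypothetical trace sink: remembers the last values it was given. */
static unsigned int demo_last_traced_id;
static int demo_last_traced_usage;

static void demo_trace(unsigned int debug_id, int usage)
{
	demo_last_traced_id = debug_id;
	demo_last_traced_usage = usage;
}

/* Drop a reference. Returns 1 if the caller dropped the last ref and is
 * therefore the one responsible for freeing the record. */
static int demo_put_conn(struct demo_conn *conn)
{
	/* Copy the ID out first: once the count has been decremented, another
	 * thread may free *conn unless we were the one to reach zero. */
	unsigned int debug_id = conn->debug_id;
	int n = atomic_fetch_sub(&conn->usage, 1) - 1;

	demo_trace(debug_id, n);   /* uses only local copies, never *conn */
	return n == 0;
}
```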
     

31 Aug, 2019

1 commit

  • When a local endpoint ceases to be in use, such as when the kafs module
    is unloaded, the kernel will emit an assertion failure if there are any
    outstanding client connections:

    rxrpc: Assertion failed
    ------------[ cut here ]------------
    kernel BUG at net/rxrpc/local_object.c:433!

    and even beyond that, will evince other oopses if there are service
    connections still present.

    Fix this by:

    (1) Removing the triggering of connection reaping when an rxrpc socket is
    released. These don't actually clean up the connections anyway - and
    further, the local endpoint may still be in use through another
    socket.

    (2) Mark the local endpoint as dead when we start the process of tearing
    it down.

    (3) When destroying a local endpoint, strip all of its client connections
    from the idle list and discard the ref on each that the list was
    holding.

    (4) When destroying a local endpoint, call the service connection reaper
    directly (rather than through a workqueue) to immediately kill off all
    outstanding service connections.

    (5) Make the service connection reaper reap connections for which the
    local endpoint is marked dead.

    Only after destroying the connections can we close the socket lest we get
    an oops in a workqueue that's looking at a connection or a peer.

    Fixes: 3d18cbb7fd0c ("rxrpc: Fix conn expiry timers")
    Signed-off-by: David Howells
    Tested-by: Marc Dionne
    Signed-off-by: David S. Miller

    David Howells
     

31 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 3029 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

04 Oct, 2018

2 commits

  • rxrpc_extract_addr_from_skb() doesn't use the argument that points to the
    local endpoint, so remove the argument.

    Signed-off-by: David Howells

    David Howells
     
  • AF_RXRPC opens an IPv6 socket through which to send and receive network
    packets, both IPv6 and IPv4. It currently turns AF_INET addresses into
    AF_INET-as-AF_INET6 addresses based on an assumption that this was
    necessary; on further inspection of the code, however, it turns out that
    the IPv6 code just farms packets aimed at AF_INET addresses out to the IPv4
    code.

    Fix AF_RXRPC to use AF_INET addresses directly when given them.

    Fixes: 7b674e390e51 ("rxrpc: Fix IPv6 support")
    Signed-off-by: David Howells

    David Howells
     

28 Sep, 2018

3 commits

  • Fix error distribution by immediately delivering the errors to all the
    affected calls rather than deferring them to a worker thread. The problem
    with the latter is that retries and things can happen in the meantime when we
    want to stop that sooner.

    To this end:

    (1) Stop the error distributor from removing calls from the error_targets
    list so that peer->lock isn't needed to synchronise against other adds
    and removals.

    (2) Require the peer's error_targets list to be accessed with RCU, thereby
    avoiding the need to take peer->lock over distribution.

    (3) Don't attempt to affect a call's state if it is already marked complete.

    Signed-off-by: David Howells

    David Howells
     
  • Make the following changes to improve the robustness of the code that sets
    up a new service call:

    (1) Cache the rxrpc_sock struct obtained in rxrpc_data_ready() to do a
    service ID check and pass that along to rxrpc_new_incoming_call().
    This means that I can remove the check from rxrpc_new_incoming_call()
    without the need to worry about the socket attached to the local
    endpoint getting replaced - which would invalidate the check.

    (2) Cache the rxrpc_peer struct, thereby allowing the peer search to be
    done once. The peer is passed to rxrpc_new_incoming_call(), thereby
    saving the need to repeat the search.

    This also reduces the possibility of rxrpc_publish_service_conn()
    BUG()'ing due to the detection of a duplicate connection, despite the
    initial search done by rxrpc_find_connection_rcu() having turned up
    nothing.

    This BUG() shouldn't ever get hit since rxrpc_data_ready() *should* be
    non-reentrant and the result of the initial search should still hold
    true, but it has proven possible to hit.

    I *think* this may be due to __rxrpc_lookup_peer_rcu() cutting short
    the iteration over the hash table if it finds a matching peer with a
    zero usage count, but I don't know for sure since it's only ever been
    hit once that I know of.

    Another possibility is that a bug in rxrpc_data_ready() that checked
    the wrong byte in the header for the RXRPC_CLIENT_INITIATED flag
    might've let through a packet that caused a spurious and invalid call
    to be set up. That is addressed in another patch.

    (3) Fix __rxrpc_lookup_peer_rcu() to skip peer records that have a zero
    usage count rather than stopping and returning not found, just in case
    there's another peer record behind it in the bucket.

    (4) Don't search the peer records in rxrpc_alloc_incoming_call(), but
    rather either use the peer cached in (2) or, if one wasn't found,
    preemptively install a new one.

    Fixes: 8496af50eb38 ("rxrpc: Use RCU to access a peer's service connection tree")
    Signed-off-by: David Howells

    David Howells
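
Point (3) of the commit above, skipping dying records instead of stopping at them, can be sketched like this. The singly linked bucket and the names `demo_peer` and `demo_lookup_peer` are assumptions for illustration; the kernel uses an RCU-protected hash table.

```c
#include <stddef.h>

struct demo_peer {
	unsigned int addr;        /* stand-in for the full transport address */
	int usage;                /* 0 means the record is being torn down */
	struct demo_peer *next;   /* next record in the same hash bucket */
};

/* Walk one hash bucket looking for a live peer with a matching address. */
static struct demo_peer *demo_lookup_peer(struct demo_peer *bucket,
					  unsigned int addr)
{
	for (struct demo_peer *p = bucket; p; p = p->next) {
		if (p->addr != addr)
			continue;
		if (p->usage == 0)
			continue;   /* dying record: skip it, do not give up */
		return p;
	}
	return NULL;
}
```

Returning "not found" at the first zero-usage match would hide a live record for the same address that sits further along the bucket.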
     
  • There's a check in rxrpc_data_ready() that's checking the CLIENT_INITIATED
    flag in the packet type field rather than in the packet flags field.

    Fix this by creating a pair of helper functions to check whether the packet
    is going to the client or to the server and use them generally.

    Fixes: 248f219cb8bc ("rxrpc: Rewrite the data and ack handling code")
    Signed-off-by: David Howells

    David Howells
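
A pair of direction helpers of the kind the commit above describes might look roughly like the following. The struct layout, the helper names and the bit value given to `DEMO_CLIENT_INITIATED` are assumptions for this sketch rather than the kernel definitions.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_CLIENT_INITIATED 0x01   /* assumed bit value for this sketch */

/* Hypothetical cut-down header: only the two fields that matter here. */
struct demo_hdr {
	uint8_t type;    /* packet type (DATA, ACK, ...) */
	uint8_t flags;   /* packet flags, including the direction bit */
};

/* Packets sent by a client carry the flag, so they are headed to a server. */
static bool demo_to_server(const struct demo_hdr *h)
{
	return h->flags & DEMO_CLIENT_INITIATED;
}

/* Everything else is headed from a server to a client. */
static bool demo_to_client(const struct demo_hdr *h)
{
	return !demo_to_server(h);
}
```

Routing every direction check through one helper makes it much harder to test the bit against the wrong header field again.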
     

27 Sep, 2018

1 commit


21 Jun, 2018

1 commit

  • While __atomic_add_unless() was originally intended as a building-block
    for atomic_add_unless(), it's now used in a number of places around the
    kernel. It's the only common atomic operation named __atomic*(), rather
    than atomic_*(), and for consistency it would be better named
    atomic_fetch_add_unless().

    This lack of consistency is slightly confusing, and gets in the way of
    scripting atomics. Given that, let's clean things up and promote it to
    an official part of the atomics API, in the form of
    atomic_fetch_add_unless().

    This patch converts definitions and invocations over to the new name,
    including the instrumented version, using the following script:

    ----
    git grep -w __atomic_add_unless | while read line; do
    sed -i '{s/\<__atomic_add_unless\>/atomic_fetch_add_unless/}' "${line%%:*}";
    done
    git grep -w __arch_atomic_add_unless | while read line; do
    sed -i '{s/\<__arch_atomic_add_unless\>/arch_atomic_fetch_add_unless/}' "${line%%:*}";
    done
    ----

    Note that we do not have atomic{64,_long}_fetch_add_unless(), which will
    be introduced by later patches.

    There should be no functional change as a result of this patch.

    Signed-off-by: Mark Rutland
    Reviewed-by: Will Deacon
    Acked-by: Geert Uytterhoeven
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Palmer Dabbelt
    Cc: Boqun Feng
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Link: https://lore.kernel.org/lkml/20180621121321.4761-2-mark.rutland@arm.com
    Signed-off-by: Ingo Molnar

    Mark Rutland
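
For readers unfamiliar with the operation being renamed, its semantics can be sketched in user space with C11 atomics: add `a` to the counter unless it currently equals `u`, and return the value seen beforehand. The `demo_` names are hypothetical; this is not the kernel implementation.

```c
#include <stdatomic.h>

/* Add a to *v unless *v == u. Returns the value observed before any add,
 * so a return value equal to u means nothing was added. */
static int demo_fetch_add_unless(atomic_int *v, int a, int u)
{
	int old = atomic_load(v);

	while (old != u) {
		/* On failure, old is refreshed with the current value and the
		 * loop condition is re-evaluated against u. */
		if (atomic_compare_exchange_weak(v, &old, old + a))
			break;
	}
	return old;
}

/* The boolean form built on top of it: true if the add happened. */
static int demo_add_unless(atomic_int *v, int a, int u)
{
	return demo_fetch_add_unless(v, a, u) != u;
}
```

A common use is taking a reference only if the count has not already dropped to zero, i.e. `demo_add_unless(&refcount, 1, 0)`.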
     

04 Apr, 2018

1 commit

  • Pull networking updates from David Miller:

    1) Support offloading wireless authentication to userspace via
    NL80211_CMD_EXTERNAL_AUTH, from Srinivas Dasari.

    2) A lot of work on network namespace setup/teardown from Kirill Tkhai.
    Setup and cleanup of namespaces now all run asynchronously and thus
    performance is significantly increased.

    3) Add rx/tx timestamping support to mv88e6xxx driver, from Brandon
    Streiff.

    4) Support zerocopy on RDS sockets, from Sowmini Varadhan.

    5) Use denser instruction encoding in x86 eBPF JIT, from Daniel
    Borkmann.

    6) Support hw offload of vlan filtering in mvpp2 driver, from Maxime
    Chevallier.

    7) Support grafting of child qdiscs in mlxsw driver, from Nogah
    Frankel.

    8) Add packet forwarding tests to selftests, from Ido Schimmel.

    9) Deal with sub-optimal GSO packets better in BBR congestion control,
    from Eric Dumazet.

    10) Support 5-tuple hashing in ipv6 multipath routing, from David Ahern.

    11) Add path MTU tests to selftests, from Stefano Brivio.

    12) Various bits of IPSEC offloading support for mlx5, from Aviad
    Yehezkel, Yossi Kuperman, and Saeed Mahameed.

    13) Support RSS spreading on ntuple filters in SFC driver, from Edward
    Cree.

    14) Lots of sockmap work from John Fastabend. Applications can use eBPF
    to filter sendmsg and sendpage operations.

    15) In-kernel receive TLS support, from Dave Watson.

    16) Add XDP support to ixgbevf, this is significant because it should
    allow optimized XDP usage in various cloud environments. From Tony
    Nguyen.

    17) Add new Intel E800 series "ice" ethernet driver, from Anirudh
    Venkataramanan et al.

    18) IP fragmentation match offload support in nfp driver, from Pieter
    Jansen van Vuuren.

    19) Support XDP redirect in i40e driver, from Björn Töpel.

    20) Add BPF_RAW_TRACEPOINT program type for accessing the arguments of
    tracepoints in their raw form, from Alexei Starovoitov.

    21) Lots of striding RQ improvements to mlx5 driver with many
    performance improvements, from Tariq Toukan.

    22) Use rhashtable for inet frag reassembly, from Eric Dumazet.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1678 commits)
    net: mvneta: improve suspend/resume
    net: mvneta: split rxq/txq init and txq deinit into SW and HW parts
    ipv6: frags: fix /proc/sys/net/ipv6/ip6frag_low_thresh
    net: bgmac: Fix endian access in bgmac_dma_tx_ring_free()
    net: bgmac: Correctly annotate register space
    route: check sysctl_fib_multipath_use_neigh earlier than hash
    fix typo in command value in drivers/net/phy/mdio-bitbang.
    sky2: Increase D3 delay to sky2 stops working after suspend
    net/mlx5e: Set EQE based as default TX interrupt moderation mode
    ibmvnic: Disable irqs before exiting reset from closed state
    net: sched: do not emit messages while holding spinlock
    vlan: also check phy_driver ts_info for vlan's real device
    Bluetooth: Mark expected switch fall-throughs
    Bluetooth: Set HCI_QUIRK_SIMULTANEOUS_DISCOVERY for BTUSB_QCA_ROME
    Bluetooth: btrsi: remove unused including
    Bluetooth: hci_bcm: Remove DMI quirk for the MINIX Z83-4
    sh_eth: kill useless check in __sh_eth_get_regs()
    sh_eth: add sh_eth_cpu_data::no_xdfar flag
    ipv6: factorize sk_wmem_alloc updates done by __ip6_append_data()
    ipv4: factorize sk_wmem_alloc updates done by __ip_append_data()
    ...

    Linus Torvalds
     

31 Mar, 2018

2 commits

  • rxrpc_local objects cannot be disposed of until all the connections that
    point to them have been RCU'd as a connection object holds refcount on the
    local endpoint it is communicating through. Currently, this can cause an
    assertion failure to occur when a network namespace is destroyed as there's
    no check that the RCU destructors for the connections have been run before
    we start trying to destroy local endpoints.

    The kernel reports:

    rxrpc: AF_RXRPC: Leaked local 0000000036a41bc1 {5}
    ------------[ cut here ]------------
    kernel BUG at ../net/rxrpc/local_object.c:439!

    Fix this by keeping a count of the live connections and waiting for it to
    go to zero at the end of rxrpc_destroy_all_connections().

    Fixes: dee46364ce6f ("rxrpc: Add RCU destruction for connections and calls")
    Signed-off-by: David Howells

    David Howells
     
  • Fix various issues detected by checker.

    Errors:

    (*) rxrpc_discard_prealloc() should be using rcu_assign_pointer to set
    call->socket.

    Warnings:

    (*) rxrpc_service_connection_reaper() should be passing NULL rather than 0 to
    trace_rxrpc_conn() as the where argument.

    (*) rxrpc_disconnect_client_call() should get its net pointer via the
    call->conn rather than call->sock to avoid a warning about accessing
    an RCU pointer without protection.

    (*) Proc seq start/stop functions need annotation as they pass locks
    between the functions.

    False positives:

    (*) Checker doesn't correctly handle the seq-retry lock context balance in
    rxrpc_find_service_conn_rcu().

    (*) Checker thinks execution may proceed past the BUG() in
    rxrpc_publish_service_conn().

    (*) Variable length array warnings from SKCIPHER_REQUEST_ON_STACK() in
    rxkad.c.

    Signed-off-by: David Howells

    David Howells
     

08 Feb, 2018

1 commit

  • AF_RXRPC is incorrectly sending back to the server any abort it receives
    for a client connection. This is due to the final-ACK offload to the
    connection event processor patch. The abort code is copied into the
    last-call information on the connection channel and then the event
    processor is set.

    Instead, the following should be done:

    (1) In the case of a final-ACK for a successful call, the ACK should be
    scheduled as before.

    (2) In the case of a locally generated ABORT, the ABORT details should be
    cached for sending in response to further packets related to that
    call and no further action scheduled at call disconnect time.

    (3) In the case of an ACK received from the peer, the call should be
    considered dead, no ABORT should be transmitted at this time. In
    response to further non-ABORT packets from the peer relating to this
    call, an RX_USER_ABORT ABORT should be transmitted.

    (4) In the case of a call killed due to network error, an RX_USER_ABORT
    ABORT should be cached for transmission in response to further
    packets, but no ABORT should be sent at this time.

    Fixes: 3136ef49a14c ("rxrpc: Delay terminal ACK transmission on a client call")
    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     

29 Nov, 2017

1 commit


24 Nov, 2017

3 commits

  • Fix the rxrpc connection expiry timers so that connections for closed
    AF_RXRPC sockets get deleted in a more timely fashion, freeing up the
    transport UDP port much more quickly.

    (1) Replace the delayed work items with work items plus timers so that
    timer_reduce() can be used to shorten them and so that the timer
    doesn't requeue the work item if the net namespace is dead.

    (2) Don't use queue_delayed_work() as that won't alter the timeout if the
    timer is already running.

    (3) Don't rearm the timers if the network namespace is dead.

    Signed-off-by: David Howells

    David Howells
     
  • Make RxRPC service endpoints expire like they're supposed to by the following
    means:

    (1) Mark dead rxrpc_net structs (with ->live) rather than twiddling the
    global service conn timeout, otherwise the first rxrpc_net struct to
    die will cause connections on all others to expire immediately from
    then on.

    (2) Mark local service endpoints for which the socket has been closed
    (->service_closed) so that the expiration timeout can be much
    shortened for service and client connections going through that
    endpoint.

    (3) rxrpc_put_service_conn() needs to schedule the reaper when the usage
    count reaches 1, not 0, as idle conns have a 1 count.

    (4) The accumulator for the earliest time we might want to schedule for
    should be initialised to jiffies + MAX_JIFFY_OFFSET, not ULONG_MAX as
    the comparison functions use signed arithmetic.

    (5) Simplify the expiration handling, adding the expiration value to the
    idle timestamp each time rather than keeping track of the time in the
    past before which the idle timestamp must go to be expired. This is
    much easier to read.

    (6) Ignore the timeouts if the net namespace is dead.

    (7) Restart the service reaper work item rather than the client reaper.

    Signed-off-by: David Howells

    David Howells
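
Point (4) of the commit above is easy to get wrong, so here is a small user-space sketch of why the starting value matters when the comparison is done with signed arithmetic. The macro mirrors the usual wrap-safe "time before" comparison; `demo_earliest` and the `DEMO_` names are hypothetical, and plain `unsigned long` values stand in for jiffies.

```c
#include <limits.h>

/* Wrap-safe comparison: true if a is earlier than b, computed on the
 * signed difference so that counter wrap-around is tolerated. */
#define demo_time_before(a, b)  ((long)((a) - (b)) < 0)

/* Largest offset that still compares as "later" under signed arithmetic. */
#define DEMO_MAX_JIFFY_OFFSET   ((unsigned long)((LONG_MAX >> 1) - 1))

/* Find the earliest expiry time, starting the accumulator at 'start'. */
static unsigned long demo_earliest(unsigned long start,
				   const unsigned long *expiry, int n)
{
	unsigned long earliest = start;

	for (int i = 0; i < n; i++)
		if (demo_time_before(expiry[i], earliest))
			earliest = expiry[i];
	return earliest;
}
```

With `start` set to `ULONG_MAX`, the signed difference against any near-term expiry is a small positive number, so nothing ever compares as earlier and the accumulator is never updated; starting from now plus `DEMO_MAX_JIFFY_OFFSET` avoids that.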
     
  • Delay terminal ACK transmission on a client call by deferring it to the
    connection processor. This allows it to be skipped if we can send the next
    call instead, the first DATA packet of which will implicitly ack this call.

    Signed-off-by: David Howells

    David Howells
     

29 Aug, 2017

1 commit

  • Fix IPv6 support in AF_RXRPC in the following ways:

    (1) When extracting the address from a received IPv4 packet, if the local
    transport socket is open for IPv6 then fill out the sockaddr_rxrpc
    struct for an IPv4-mapped-to-IPv6 AF_INET6 transport address instead
    of an AF_INET one.

    (2) When sending CHALLENGE or RESPONSE packets, the transport length needs
    to be set from the sockaddr_rxrpc::transport_len field rather than
    sizeof() on the IPv4 transport address.

    (3) When processing an IPv4 ICMP packet received by an IPv6 socket, set up
    the address correctly before searching for the affected peer.

    Signed-off-by: David Howells

    David Howells
     

15 Jun, 2017

1 commit

  • Cache the congestion window setting that was determined during a call's
    transmission phase when it finishes so that it can be used by the next call
    to the same peer, thereby shortcutting the slow-start algorithm.

    The value is stored in the rxrpc_peer struct and is accessed without
    locking. Each call takes the value that happens to be there when it starts
    and just overwrites the value when it finishes.

    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
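
The caching scheme in the commit above amounts to one field that each call reads at the start and overwrites at the end. A minimal sketch, with hypothetical names (`demo_peer`, `demo_call_start`, `demo_call_end`) and no attempt to model the slow-start algorithm itself:

```c
/* Hypothetical per-peer record holding the last congestion window seen. */
struct demo_peer {
	unsigned int cong_cwnd;
};

struct demo_call {
	struct demo_peer *peer;
	unsigned int cwnd;   /* this call's working congestion window */
};

/* Start of call: take whatever value happens to be cached on the peer. */
static void demo_call_start(struct demo_call *call, struct demo_peer *peer)
{
	call->peer = peer;
	call->cwnd = peer->cong_cwnd;
}

/* End of transmission phase: overwrite the cached value, last writer wins. */
static void demo_call_end(struct demo_call *call)
{
	call->peer->cong_cwnd = call->cwnd;
}
```

No lock is taken because any recent value is an acceptable starting point; a stale or overwritten window only costs a little efficiency on the next call.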
     

05 Jun, 2017

1 commit

  • Keep the rxrpc_connection struct's idea of the service ID that is exposed
    in the protocol separate from the service ID that's used as a lookup key.

    This allows the protocol service ID on a client connection to get upgraded
    without making the connection unfindable for other client calls that also
    would like to use the upgraded connection.

    The connection's actual service ID is then returned through recvmsg() by
    way of msg_name.

    Whilst we're at it, we get rid of the last_service_id field from each
    channel. The service ID is per-connection, not per-call and an entire
    connection is upgraded in one go.

    Signed-off-by: David Howells

    David Howells
     

26 May, 2017

1 commit

  • Support network namespacing in AF_RXRPC with the following changes:

    (1) All the local endpoint, peer and call lists, locks, counters, etc. are
    moved into the per-namespace record.

    (2) All the connection tracking is moved into the per-namespace record
    with the exception of the client connection ID tree, which is kept
    global so that connection IDs are kept unique per-machine.

    (3) Each namespace gets its own epoch. This allows each network namespace
    to pretend to be a separate client machine.

    (4) The /proc/net/rxrpc_xxx files are now called /proc/net/rxrpc/xxx and
    the contents reflect the namespace.

    fs/afs/ should be okay with this patch as it explicitly requires the current
    net namespace to be init_net to permit a mount to proceed at the moment. It
    will, however, need updating so that cells, IP addresses and DNS records are
    per-namespace also.

    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     

05 Jan, 2017

1 commit

  • Add the following extra tracing information:

    (1) Modify the rxrpc_transmit tracepoint to record the Tx window size as
    this is varied by the slow-start algorithm.

    (2) Modify the rxrpc_rx_ack tracepoint to record more information from
    received ACK packets.

    (3) Add an rxrpc_rx_data tracepoint to record the information in DATA
    packets.

    (4) Add an rxrpc_disconnect_call tracepoint to record call disconnection,
    including the reason the call was disconnected.

    (5) Add an rxrpc_improper_term tracepoint to record implicit termination
    of a call by a client starting a new call on a particular
    connection channel without first transmitting the final ACK for the
    previous call.

    Signed-off-by: David Howells

    David Howells
     

22 Sep, 2016

1 commit

  • Don't store the rxrpc protocol header in sk_buffs on the transmit queue,
    but rather generate it on the fly and pass it to kernel_sendmsg() as a
    separate iov. This reduces the amount of storage required.

    Note that the security header is still stored in the sk_buff as it may get
    encrypted along with the data (and doesn't change with each transmission).

    Signed-off-by: David Howells

    David Howells
     

17 Sep, 2016

2 commits


14 Sep, 2016

1 commit

  • Add IPv6 support to AF_RXRPC. With this, AF_RXRPC sockets can be created:

    service = socket(AF_RXRPC, SOCK_DGRAM, PF_INET6);

    instead of:

    service = socket(AF_RXRPC, SOCK_DGRAM, PF_INET);

    The AFS filesystem doesn't support IPv6 at the moment, though, since that
    requires upgrades to some of the RPC calls.

    Note that a good portion of this patch is replacing "%pI4:%u" in print
    statements with "%pISpc" which is able to handle both protocols and print
    the port.

    Signed-off-by: David Howells

    David Howells
     

08 Sep, 2016

2 commits

  • Rewrite the data and ack handling code such that:

    (1) Parsing of received ACK and ABORT packets and the distribution and the
    filing of DATA packets happens entirely within the data_ready context
    called from the UDP socket. This allows us to process and discard ACK
    and ABORT packets much more quickly (they're no longer stashed on a
    queue for a background thread to process).

    (2) We avoid calling skb_clone(), pskb_pull() and pskb_trim(). We instead
    keep track of the offset and length of the content of each packet in
    the sk_buff metadata. This means we don't do any allocation in the
    receive path.

    (3) Jumbo DATA packet parsing is now done in data_ready context. Rather
    than cloning the packet once for each subpacket and pulling/trimming
    it, we file the packet multiple times with an annotation for each
    indicating which subpacket is there. From that we can directly
    calculate the offset and length.

    (4) A call's receive queue can be accessed without taking locks (memory
    barriers do have to be used, though).

    (5) Incoming calls are set up from preallocated resources and immediately
    made live. They can then have packets queued upon them and ACKs
    generated. If insufficient resources exist, DATA packet #1 is given a
    BUSY reply and other DATA packets are discarded.

    (6) sk_buffs no longer take a ref on their parent call.

    To make this work, the following changes are made:

    (1) Each call's receive buffer is now a circular buffer of sk_buff
    pointers (rxtx_buffer) rather than a number of sk_buff_heads spread
    between the call and the socket. This permits each sk_buff to be in
    the buffer multiple times. The receive buffer is reused for the
    transmit buffer.

    (2) A circular buffer of annotations (rxtx_annotations) is kept parallel
    to the data buffer. Transmission phase annotations indicate whether a
    buffered packet has been ACK'd or not and whether it needs
    retransmission.

    Receive phase annotations indicate whether a slot holds a whole packet
    or a jumbo subpacket and, if the latter, which subpacket. They also
    note whether the packet has been decrypted in place.

    (3) DATA packet window tracking is much simplified. Each phase has just
    two numbers representing the window (rx_hard_ack/rx_top and
    tx_hard_ack/tx_top).

    The hard_ack number is the sequence number before the base of the window,
    representing the last packet the other side says it has consumed.
    hard_ack starts from 0 and the first packet is sequence number 1.

    The top number is the sequence number of the highest-numbered packet
    residing in the buffer. Packets between hard_ack+1 and top are
    soft-ACK'd to indicate they've been received, but not yet consumed.

    Four macros, before(), before_eq(), after() and after_eq(), are added
    to compare sequence numbers within the window. This allows for the
    top of the window to wrap when the hard-ack sequence number gets close
    to the limit.

    Two flags, RXRPC_CALL_RX_LAST and RXRPC_CALL_TX_LAST, are also added
    to indicate when rx_top and tx_top point at the packets with the
    LAST_PACKET bit set, indicating the end of the phase.

    (4) Calls are queued on the socket 'receive queue' rather than packets.
    This means that we don't have to invent dummy packets to queue to
    indicate abnormal/terminal states and we don't have to keep metadata
    packets (such as ABORTs) around.

    (5) The offset and length of a (sub)packet's content are now passed to
    the verify_packet security op. This is currently expected to decrypt
    the packet in place and validate it.

    However, there's now nowhere to store the revised offset and length of
    the actual data within the decrypted blob (there may be a header and
    padding to skip) because an sk_buff may represent multiple packets, so
    a locate_data security op is added to retrieve these details from the
    sk_buff content when needed.

    (6) recvmsg() now has to handle jumbo subpackets, where each subpacket is
    individually secured and needs to be individually decrypted. The code
    to do this is broken out into rxrpc_recvmsg_data() and shared with the
    kernel API. It now iterates over the call's receive buffer rather
    than walking the socket receive queue.
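
    The window tracking and wrap-safe comparisons described in items (1) to
    (3) can be sketched in userspace C as follows. This is an illustration
    only, not the kernel code: the ring size, the seq_* helper names and the
    struct layout are all assumptions, and void pointers stand in for sk_buff
    pointers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical ring size; a power of two so a mask can pick the slot. */
#define RING_SIZE 64u
#define RING_MASK (RING_SIZE - 1u)

typedef uint32_t seq_t;

/* Wrap-safe comparisons: the unsigned difference is read as signed. */
static inline bool seq_before(seq_t a, seq_t b)    { return (int32_t)(a - b) < 0; }
static inline bool seq_before_eq(seq_t a, seq_t b) { return (int32_t)(a - b) <= 0; }
static inline bool seq_after(seq_t a, seq_t b)     { return (int32_t)(a - b) > 0; }
static inline bool seq_after_eq(seq_t a, seq_t b)  { return (int32_t)(a - b) >= 0; }

struct window {
    seq_t hard_ack;                 /* last sequence number consumed */
    seq_t top;                      /* highest sequence number buffered */
    void *buffer[RING_SIZE];        /* stands in for sk_buff pointers */
    uint8_t annotations[RING_SIZE]; /* kept parallel to the data buffer */
};

/* Map a sequence number onto its slot in both parallel rings. */
static inline unsigned int seq_to_slot(seq_t seq)
{
    return seq & RING_MASK;
}

/* A packet is in the window if hard_ack < seq <= top, allowing for wrap. */
static inline bool seq_in_window(const struct window *w, seq_t seq)
{
    return seq_after(seq, w->hard_ack) && seq_before_eq(seq, w->top);
}

/* Number of packets received but not yet consumed. */
static inline unsigned int window_count(const struct window *w)
{
    return w->top - w->hard_ack;
}
```

    Because the comparison is made on the difference rather than on the raw
    values, a window that straddles the 32-bit wrap point behaves the same as
    one that does not.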

    Additional changes:

    (1) The timers are condensed to a single timer that is set for the soonest
    of three timeouts (delayed ACK generation, DATA retransmission and
    call lifespan).

    (2) Transmission of ACK and ABORT packets is effected immediately from
    process-context socket ops/kernel API calls that cause them instead of
    them being punted off to a background work item. The data_ready
    handler still has to defer to the background, though.

    (3) A shutdown op is added to the AF_RXRPC socket so that the AFS
    filesystem can shut down the socket and flush its own work items
    before closing the socket to deal with any in-progress service calls.
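
    Condensing three timeouts into one timer, as in item (1), amounts to
    arming the timer for the earliest deadline. A minimal sketch, assuming
    hypothetical field names and plain 64-bit tick counts that do not wrap
    (kernel code would use wrap-safe jiffies comparisons instead):

```c
#include <stdint.h>

/* Hypothetical per-call deadlines, in arbitrary tick units. */
struct call_timeouts {
    uint64_t ack_at;    /* delayed ACK generation */
    uint64_t resend_at; /* DATA retransmission */
    uint64_t expire_at; /* call lifespan */
};

/* Return the deadline the single timer should be set for. */
static inline uint64_t soonest_timeout(const struct call_timeouts *t)
{
    uint64_t soonest = t->ack_at;

    if (t->resend_at < soonest)
        soonest = t->resend_at;
    if (t->expire_at < soonest)
        soonest = t->expire_at;
    return soonest;
}
```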

    Future additional changes that will need to be considered:

    (1) Make sure that a call doesn't hog the front of the queue by receiving
    data from the network as fast as userspace is consuming it to the
    exclusion of other calls.

    (2) Transmit delayed ACKs from within recvmsg() when we've consumed
    sufficiently more packets to avoid the background work item needing to
    run.

    Signed-off-by: David Howells

    David Howells
     
  • Make it possible for the data_ready handler called from the UDP transport
    socket to completely instantiate an rxrpc_call structure and make it
    immediately live by preallocating all the memory it might need. The idea
    is to cut out the background thread usage as much as possible.

    [Note that the preallocated structs are not actually used in this patch -
    that will be done in a future patch.]

    If insufficient resources are available in the preallocation buffers, it
    will be possible to discard the DATA packet in the data_ready handler or
    schedule a BUSY packet without the need to schedule an attempt at
    allocation in a background thread.

    To this end:

    (1) Preallocate rxrpc_peer, rxrpc_connection and rxrpc_call structs, up to
    the listen backlog size for each type. The backlog size is limited to
    a maximum of 32. Only this many of each can be in the preallocation
    buffer.

    (2) For userspace sockets, the preallocation is charged initially by
    listen() and will be recharged by accepting or rejecting pending
    new incoming calls.

    (3) For kernel services {,re,dis}charging of the preallocation buffers is
    handled manually. Two notifier callbacks have to be provided before
    kernel_listen() is invoked:

    (a) An indication that a new call has been instantiated. This can be
    used to trigger background recharging.

    (b) An indication that a call is being discarded. This is used when
    the socket is being released.

    A function, rxrpc_kernel_charge_accept() is called by the kernel
    service to preallocate a single call. It should be passed the user ID
    to be used for that call and a callback to associate the rxrpc call
    with the kernel service's side of the ID.

    (4) Discard the preallocation when the socket is closed.

    (5) Temporarily bump the refcount on the call allocated in
    rxrpc_incoming_call() so that rxrpc_release_call() can ditch the
    preallocation ref on service calls unconditionally. This will no
    longer be necessary once the preallocation is used.
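
    The charge/consume behaviour of a preallocation buffer can be sketched as
    a small ring. This is a userspace illustration under assumptions: the
    prealloc_* names and the struct are hypothetical, only the limit of 32
    comes from the text above, and the locking a real data_ready path would
    need is omitted.

```c
#include <stddef.h>

#define BACKLOG_MAX 32u /* upper limit on the listen backlog */

struct prealloc_ring {
    void *slots[BACKLOG_MAX];
    unsigned int head;  /* count of objects ever charged */
    unsigned int tail;  /* count of objects ever taken */
    unsigned int limit; /* listen backlog, at most BACKLOG_MAX */
};

static inline int prealloc_init(struct prealloc_ring *r, unsigned int backlog)
{
    if (backlog == 0 || backlog > BACKLOG_MAX)
        return -1;
    r->head = 0;
    r->tail = 0;
    r->limit = backlog;
    return 0;
}

/* Add one preallocated object; fails once the backlog is fully charged. */
static inline int prealloc_charge(struct prealloc_ring *r, void *obj)
{
    if (r->head - r->tail >= r->limit)
        return -1;
    r->slots[r->head % BACKLOG_MAX] = obj;
    r->head++;
    return 0;
}

/*
 * Take one object without allocating. NULL means the buffer is empty, in
 * which case the caller would discard the packet or schedule a BUSY reply.
 */
static inline void *prealloc_take(struct prealloc_ring *r)
{
    void *obj;

    if (r->head == r->tail)
        return NULL;
    obj = r->slots[r->tail % BACKLOG_MAX];
    r->tail++;
    return obj;
}
```

    Taking an object frees a slot, so a later charge succeeds again; this
    mirrors the recharging on accept or reject described in item (2).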

    Note that this does not yet control the number of active service calls on a
    client - that will come in a later patch.

    A future development would be to provide a setsockopt() call that allows a
    userspace server to manually charge the preallocation buffer. This would
    allow user call IDs to be provided in advance and the awkward manual accept
    stage to be bypassed.

    Signed-off-by: David Howells

    David Howells
     

30 Aug, 2016

1 commit

  • Condense the terminal states of a call state machine to a single state,
    plus a separate completion type value. The value is then set, along with
    error and abort code values, only when the call is transitioned to the
    completion state.

    Helpers are provided to simplify this.
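
    A helper of the kind mentioned might look roughly like the sketch below.
    The enum values, struct and function name are hypothetical and chosen for
    illustration; the real helpers also deal with locking, which is left out
    here.

```c
#include <stdbool.h>
#include <stdint.h>

/* One terminal state replaces the several terminal states used before. */
enum call_state { CALL_IN_PROGRESS, CALL_COMPLETE };

/* How the call ended is recorded separately from the state. */
enum call_completion {
    CALL_SUCCEEDED,
    CALL_REMOTELY_ABORTED,
    CALL_LOCALLY_ABORTED,
    CALL_LOCAL_ERROR,
    CALL_NETWORK_ERROR,
};

struct call {
    enum call_state state;
    enum call_completion completion;
    uint32_t abort_code;
    int error;
};

/*
 * Record the completion type, abort code and error, then move the call to
 * the completion state. Returns false if the call had already completed,
 * in which case the first recorded outcome is preserved.
 */
static inline bool call_set_completion(struct call *call,
                                       enum call_completion how,
                                       uint32_t abort_code, int error)
{
    if (call->state == CALL_COMPLETE)
        return false;
    call->completion = how;
    call->abort_code = abort_code;
    call->error = error;
    call->state = CALL_COMPLETE;
    return true;
}
```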

    Signed-off-by: David Howells

    David Howells
     

24 Aug, 2016

2 commits

  • Improve the management and caching of client rxrpc connection objects.
    From this point, client connections will be managed separately from service
    connections because AF_RXRPC controls the creation and re-use of client
    connections but doesn't have that luxury with service connections.

    Further, there will be limits on the numbers of client connections that may
    be live on a machine. No direct restriction will be placed on the number
    of client calls, excepting that each client connection can support a
    maximum of four concurrent calls.

    Note that, for a number of reasons, we don't want to simply discard a
    client connection as soon as the last call is apparently finished:

    (1) Security is negotiated per-connection and the context is then shared
    between all calls on that connection. The context can be negotiated
    again if the connection lapses, but that involves holding up calls
    whilst at least two packets are exchanged and various crypto bits are
    performed - so we'd ideally like to cache it for a little while at
    least.

    (2) If a packet goes astray, we will need to retransmit a final ACK or
    ABORT packet. To make this work, we need to keep around the
    connection details for a little while.

    (3) The locally held structures represent some amount of setup time, to be
    weighed against their occupation of memory when idle.

    To this end, the client connection cache is managed by a state machine on
    each connection. There are five states:

    (1) INACTIVE - The connection is not held in any list and may not have
    been exposed to the world. If it has been previously exposed, it was
    discarded from the idle list after expiring.

    (2) WAITING - The connection is waiting for the number of client conns to
    drop below the maximum capacity. Calls may be in progress upon it
    from when it was active and got culled.

    The connection is on the rxrpc_waiting_client_conns list which is kept
    in to-be-granted order. Culled conns with waiters go to the back of
    the queue just like new conns.

    (3) ACTIVE - The connection has at least one call in progress upon it, it
    may freely grant available channels to new calls and calls may be
    waiting on it for channels to become available.

    The connection is on the rxrpc_active_client_conns list which is kept
    in activation order for culling purposes.

    (4) CULLED - The connection got summarily culled to try and free up
    capacity. Calls currently in progress on the connection are allowed
    to continue, but new calls will have to wait. There can be no waiters
    in this state - the conn would have to go to the WAITING state
    instead.

    (5) IDLE - The connection has no calls in progress upon it and must have
    been exposed to the world (ie. the EXPOSED flag must be set). When it
    expires, the EXPOSED flag is cleared and the connection transitions to
    the INACTIVE state.

    The connection is on the rxrpc_idle_client_conns list which is kept in
    order of how soon they'll expire.

    A connection in the ACTIVE or CULLED state must have at least one active
    call upon it; if in the WAITING state it may have active calls upon it;
    other states may not have active calls.
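
    The rule relating cache state to active calls can be written down as a
    consistency check. The sketch below is illustrative only; the enum and
    function names are assumptions, while the limit of four calls per
    connection comes from the description above.

```c
#include <stdbool.h>

#define MAX_CALLS_PER_CONN 4u /* each connection has four channels */

enum conn_cache_state {
    CONN_INACTIVE,
    CONN_WAITING,
    CONN_ACTIVE,
    CONN_CULLED,
    CONN_IDLE,
};

/* Check that the number of active calls is permitted in the given state. */
static inline bool conn_calls_valid(enum conn_cache_state state,
                                    unsigned int active_calls)
{
    if (active_calls > MAX_CALLS_PER_CONN)
        return false;

    switch (state) {
    case CONN_ACTIVE:
    case CONN_CULLED:
        return active_calls >= 1; /* must have at least one call */
    case CONN_WAITING:
        return true;              /* may or may not have calls */
    case CONN_INACTIVE:
    case CONN_IDLE:
        return active_calls == 0; /* may not have calls */
    }
    return false;
}
```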

    As long as a connection remains active and doesn't get culled, it may
    continue to process calls - even if there are connections on the wait
    queue. This simplifies things a bit and reduces the amount of checking we
    need to do.

    There are a couple of flags of relevance to the cache:

    (1) EXPOSED - The connection ID got exposed to the world. If this flag is
    set, an extra ref is added to the connection preventing it from being
    reaped when it has no calls outstanding. This flag is cleared and the
    ref dropped when a conn is discarded from the idle list.

    (2) DONT_REUSE - The connection should be discarded as soon as possible and
    should not be reused.

    This commit also provides a number of new settings:

    (*) /proc/net/rxrpc/max_client_conns

    The maximum number of live client connections. Above this number, new
    connections get added to the wait list and must wait for an active
    conn to be culled. Culled connections can be reused, but they will go
    to the back of the wait list and have to wait.

    (*) /proc/net/rxrpc/reap_client_conns

    If the number of desired connections exceeds the maximum above, the
    active connection list will be culled until there are only this many
    left in it.

    (*) /proc/net/rxrpc/idle_conn_expiry

    The normal expiry time for a client connection, provided there are
    fewer than reap_client_conns of them around.

    (*) /proc/net/rxrpc/idle_conn_fast_expiry

    The expedited expiry time, used when there are more than
    reap_client_conns of them around.
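
    The way the two expiry settings interact with reap_client_conns can be
    sketched as a simple selection. The struct and function names are
    hypothetical, the time unit is left abstract, and treating a count exactly
    equal to reap_client_conns as "normal" is an assumption, since the text
    above only covers the fewer-than and more-than cases.

```c
/* Hypothetical mirror of the settings listed above. */
struct conn_tunables {
    unsigned int reap_client_conns;
    unsigned int idle_conn_expiry;      /* normal expiry time */
    unsigned int idle_conn_fast_expiry; /* expedited expiry time */
};

/* Pick the expiry time to apply to a connection that has just gone idle. */
static inline unsigned int idle_expiry(const struct conn_tunables *t,
                                       unsigned int nr_client_conns)
{
    if (nr_client_conns > t->reap_client_conns)
        return t->idle_conn_fast_expiry;
    return t->idle_conn_expiry;
}
```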

    Note that I combined the Tx wait queue with the channel grant wait queue to
    save space as only one of these should be in use at once.

    Note also that, for the moment, the service connection cache still uses the
    old connection management code.

    Signed-off-by: David Howells

    David Howells
     
  • The main connection list is used for two independent purposes: primarily it
    is used to find connections to reap and secondarily it is used to list
    connections in procfs.

    Split the procfs list out from the reap list. This allows us to stop using
    the reap list for client connections when they acquire a separate
    management strategy from service connections.

    The client connections will not be on a single management list, and sometimes
    won't be on a management list at all. This doesn't leave them floating,
    however, as they will also be on an rb-tree rooted on the socket so that the
    socket can find them to dispatch calls.

    Signed-off-by: David Howells

    David Howells
     

23 Aug, 2016

3 commits

  • Perform terminal call ACK/ABORT retransmission in the connection processor
    rather than in the call processor. With this change, once last_call is
    set, no more incoming packets will be routed to the corresponding call or
    any earlier calls on that channel (call IDs must only increase on a channel
    on a connection).

    Further, if a packet's callNumber is before the last_call ID or a packet is
    aimed at a successfully completed service call then that packet is
    discarded.

    Signed-off-by: David Howells

    David Howells
     
  • Set the connection expiry time when a connection becomes idle rather than
    doing this in rxrpc_put_connection(). This makes the put path more
    efficient (it is likely to be called occasionally whilst a connection has
    outstanding calls because active workqueue items need to be given a ref).

    The time is also preset in the connection allocator in case the connection
    never gets used.

    Signed-off-by: David Howells

    David Howells
     
  • Drop the channel number (channel) field from the rxrpc_call struct to
    reduce the size of the call struct. The field is redundant: if the call is
    attached to a connection, the channel can be obtained from there by AND'ing
    with RXRPC_CHANNELMASK.
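
    As an illustration of the masking, assuming the channel number occupies
    the low two bits of the 32-bit connection ID (consistent with four
    channels per connection) and using a local CHANNEL_MASK constant in place
    of RXRPC_CHANNELMASK:

```c
#include <stdint.h>

/* Assumed value: four channels per connection means two low bits. */
#define CHANNEL_MASK 3u

/* Recover the channel number from a connection ID instead of storing it. */
static inline unsigned int cid_to_channel(uint32_t cid)
{
    return cid & CHANNEL_MASK;
}
```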

    Signed-off-by: David Howells

    David Howells
     

06 Jul, 2016

1 commit

  • Move to using RCU access to a peer's service connection tree when routing
    an incoming packet. This is done using a seqlock to trigger retrying of
    the tree walk if a change happened.

    Further, we no longer get a ref on the connection looked up in the
    data_ready handler unless we queue the connection's work item - and then
    only if the refcount > 0.
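
    Taking a ref only if the refcount is above zero is a compare-and-swap
    loop. The following is a userspace sketch with C11 atomics, not the kernel
    primitive; the function name is hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Increment the count only if it is currently above zero. A count of zero
 * means the object is already on its way to destruction, so a lookup done
 * without a lock must not resurrect it.
 */
static inline bool ref_get_unless_zero(atomic_int *usage)
{
    int old = atomic_load(usage);

    while (old > 0) {
        /* On failure, old is reloaded with the current value. */
        if (atomic_compare_exchange_weak(usage, &old, old + 1))
            return true;
    }
    return false;
}
```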

    Note that I'm avoiding the use of a hash table for service connections
    because each service connection is addressed by a 62-bit number
    (constructed from epoch and connection ID >> 2) that would allow the client
    to engage in bucket stuffing, given knowledge of the hash algorithm.
    Peers, however, are hashed as the network address is less controllable by
    the client. The total number of peers will also be limited in a future
    commit.
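
    One way to picture the 62-bit number is sketched below. The packing order
    (epoch in the upper 32 bits, connection ID >> 2 in the lower 30) and the
    function name are assumptions made for illustration; the text above only
    says which two values the number is constructed from.

```c
#include <stdint.h>

/*
 * Build a 62-bit lookup key: 32 bits of epoch plus the 30 bits of the
 * connection ID that remain after dropping the two channel bits.
 */
static inline uint64_t service_conn_key(uint32_t epoch, uint32_t cid)
{
    return ((uint64_t)epoch << 30) | (uint64_t)(cid >> 2);
}
```

    Both inputs are chosen by the client, which is why hashing on such a key
    would let a client aim many connections at one bucket.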

    Signed-off-by: David Howells

    David Howells