19 Mar, 2019

1 commit

  • [ Upstream commit 69ffaebb90369ce08657b5aea4896777b9d6e8fc ]

    rxrpc_get_client_conn() adds a new call to the front of the waiting_calls
    queue if the connection it's going to use already exists. This is bad as
    it allows calls to get starved out.

    Fix this by adding to the tail instead.

    Also change the other enqueue point in the same function to put it on the
    front (ie. when we have a new connection). This makes the point that in
    the case of a new connection the new call goes at the front (though it
    doesn't actually matter since the queue should be unoccupied).

    Fixes: 45025bceef17 ("rxrpc: Improve management and caching of client connection objects")
    Signed-off-by: David Howells
    Reviewed-by: Marc Dionne
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David Howells
     

13 Feb, 2019

1 commit

  • [ Upstream commit 6dce3c20ac429e7a651d728e375853370c796e8d ]

    When either "goto wait_interrupted;" or "goto wait_error;"
    paths are taken, socket lock has already been released.

    This patch fixes following syzbot splat :

    WARNING: bad unlock balance detected!
    5.0.0-rc4+ #59 Not tainted
    -------------------------------------
    syz-executor223/8256 is trying to release lock (sk_lock-AF_RXRPC) at:
    [] rxrpc_recvmsg+0x6d3/0x3099 net/rxrpc/recvmsg.c:598
    but there are no more locks to release!

    other info that might help us debug this:
    1 lock held by syz-executor223/8256:
    #0: 00000000fa9ed0f4 (slock-AF_RXRPC){+...}, at: spin_lock_bh include/linux/spinlock.h:334 [inline]
    #0: 00000000fa9ed0f4 (slock-AF_RXRPC){+...}, at: release_sock+0x20/0x1c0 net/core/sock.c:2798

    stack backtrace:
    CPU: 1 PID: 8256 Comm: syz-executor223 Not tainted 5.0.0-rc4+ #59
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x172/0x1f0 lib/dump_stack.c:113
    print_unlock_imbalance_bug kernel/locking/lockdep.c:3391 [inline]
    print_unlock_imbalance_bug.cold+0x114/0x123 kernel/locking/lockdep.c:3368
    __lock_release kernel/locking/lockdep.c:3601 [inline]
    lock_release+0x67e/0xa00 kernel/locking/lockdep.c:3860
    sock_release_ownership include/net/sock.h:1471 [inline]
    release_sock+0x183/0x1c0 net/core/sock.c:2808
    rxrpc_recvmsg+0x6d3/0x3099 net/rxrpc/recvmsg.c:598
    sock_recvmsg_nosec net/socket.c:794 [inline]
    sock_recvmsg net/socket.c:801 [inline]
    sock_recvmsg+0xd0/0x110 net/socket.c:797
    __sys_recvfrom+0x1ff/0x350 net/socket.c:1845
    __do_sys_recvfrom net/socket.c:1863 [inline]
    __se_sys_recvfrom net/socket.c:1859 [inline]
    __x64_sys_recvfrom+0xe1/0x1a0 net/socket.c:1859
    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x446379
    Code: e8 2c b3 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 0f 83 2b 09 fc ff c3 66 2e 0f 1f 84 00 00 00 00
    RSP: 002b:00007fe5da89fd98 EFLAGS: 00000246 ORIG_RAX: 000000000000002d
    RAX: ffffffffffffffda RBX: 00000000006dbc28 RCX: 0000000000446379
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
    RBP: 00000000006dbc20 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006dbc2c
    R13: 0000000000000000 R14: 0000000000000000 R15: 20c49ba5e353f7cf

    Fixes: 248f219cb8bc ("rxrpc: Rewrite the data and ack handling code")
    Signed-off-by: Eric Dumazet
    Cc: David Howells
    Reported-by: syzbot
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

23 Nov, 2018

1 commit

  • [ Upstream commit c7e86acfcee30794dc99a0759924bf7b9d43f1ca ]

    If the network becomes (partially) unavailable, say by disabling IPv6, the
    background ACK transmission routine can get itself into a tizzy by
    proposing immediate ACK retransmission. Since we're in the call event
    processor, that happens immediately without returning to the workqueue
    manager.

    The condition should clear after a while when either the network comes back
    or the call times out.

    Fix this by:

    (1) When re-proposing an ACK on failed Tx, don't schedule it immediately.
    This will allow a certain amount of time to elapse before we try
    again.

    (2) Enforce a return to the workqueue manager after a certain number of
    iterations of the call processing loop.

    (3) Add a backoff delay that increases the delay on deferred ACKs by a
    jiffy per failed transmission to a limit of HZ. The backoff delay is
    cleared on a successful return from kernel_sendmsg().

    (4) Cancel calls immediately if the opening sendmsg fails. The layer
    above can arrange retransmission or rotate to another server.

    Fixes: 248f219cb8bc ("rxrpc: Rewrite the data and ack handling code")
    Signed-off-by: David Howells
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David Howells
     

04 Nov, 2018

1 commit

  • [ Upstream commit 89ab066d4229acd32e323f1569833302544a4186 ]

    This reverts commit dd979b4df817e9976f18fb6f9d134d6bc4a3c317.

    This broke tcp_poll for SMC fallback: An AF_SMC socket establishes an
    internal TCP socket for the initial handshake with the remote peer.
    Whenever the SMC connection can not be established this TCP socket is
    used as a fallback. All socket operations on the SMC socket are then
    forwarded to the TCP socket. In case of poll, the file->private_data
    pointer references the SMC socket because the TCP socket has no file
    assigned. This causes tcp_poll to wait on the wrong socket.

    Signed-off-by: Karsten Graul
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Karsten Graul
     

16 Oct, 2018

4 commits

  • Fix a missing call to rxrpc_put_peer() on the main path through the
    rxrpc_error_report() function. This manifests itself as a ref leak
    whenever an ICMP packet or other error comes in.

    In commit f334430316e7, the hand-off of the ref to a work item was removed
    and was not replaced with a put.

    Fixes: f334430316e7 ("rxrpc: Fix error distribution")
    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     
  • The udpv6_encap_enable() function is part of the ipv6 code, and if that is
    configured as a loadable module and rxrpc is built in then a build failure
    will occur because the conditional check is wrong:

    net/rxrpc/local_object.o: In function `rxrpc_lookup_local':
    local_object.c:(.text+0x2688): undefined reference to `udpv6_encap_enable'

    Use the correct config symbol (CONFIG_AF_RXRPC_IPV6) in the conditional
    check rather than CONFIG_IPV6 as that will do the right thing.

    Fixes: 5271953cad31 ("rxrpc: Use the UDP encap_rcv hook")
    Reported-by: kbuild-all@01.org
    Reported-by: Arnd Bergmann
    Signed-off-by: David Howells
    Reviewed-by: Arnd Bergmann
    Signed-off-by: David S. Miller

    David Howells
     
  • Fixes gcc '-Wunused-but-set-variable' warning:

    net/rxrpc/output.c: In function 'rxrpc_reject_packets':
    net/rxrpc/output.c:527:11: warning:
    variable 'ioc' set but not used [-Wunused-but-set-variable]

    'ioc' is the correct kvec num when sending a BUSY (or an ABORT) response
    packet.

    Fixes: ece64fec164f ("rxrpc: Emit BUSY packets when supposed to rather than ABORTs")
    Signed-off-by: YueHaibing
    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    YueHaibing
     
  • Fix an uninitialised variable introduced by the last patch. This can cause
    a crash when a new call comes in to a local service, such as when an AFS
    fileserver calls back to the local cache manager.

    Fixes: c1e15b4944c9 ("rxrpc: Fix the packet reception routine")
    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     

09 Oct, 2018

3 commits

  • The rxrpc_input_packet() function and its call tree was built around the
    assumption that data_ready() handler called from UDP to inform a kernel
    service that there is data to be had was non-reentrant. This means that
    certain locking could be dispensed with.

    This, however, turns out not to be the case with a multi-queue network card
    that can deliver packets to multiple cpus simultaneously. Each of those
    cpus can be in the rxrpc_input_packet() function at the same time.

    Fix by adding or changing some structure members:

    (1) Add peer->rtt_input_lock to serialise access to the RTT buffer.

    (2) Make conn->service_id into a 32-bit variable so that it can be
    cmpxchg'd on all arches.

    (3) Add call->input_lock to serialise access to the Rx/Tx state. Note
    that although the Rx and Tx states are (almost) entirely separate,
    there's no point completing the separation and having separate locks
    since it's a bi-phasal RPC protocol rather than a bi-direction
    streaming protocol. Data transmission and data reception do not take
    place simultaneously on any particular call.

    and making the following functional changes:

    (1) In rxrpc_input_data(), hold call->input_lock around the core to
    prevent simultaneous producing of packets into the Rx ring and
    updating of tracking state for a particular call.

    (2) In rxrpc_input_ping_response(), only read call->ping_serial once, and
    check it before checking RXRPC_CALL_PINGING as that's a cheaper test.
    The bit test and bit clear can then be combined. No further locking
    is needed here.

    (3) In rxrpc_input_ack(), take call->input_lock after we've parsed much of
    the ACK packet. The superseded ACK check is then done both before and
    after the lock is taken.

    The handing of ackinfo data is split, parsing before the lock is taken
    and processing with it held. This is keyed on rxMTU being non-zero.

    Congestion management is also done within the locked section.

    (4) In rxrpc_input_ackall(), take call->input_lock around the Tx window
    rotation. The ACKALL packet carries no information and is only really
    useful after all packets have been transmitted since it's imprecise.

    (5) In rxrpc_input_implicit_end_call(), we use rx->incoming_lock to
    prevent calls being simultaneously implicitly ended on two cpus and
    also to prevent any races with incoming call setup.

    (6) In rxrpc_input_packet(), use cmpxchg() to effect the service upgrade
    on a connection. It is only permitted to happen once for a
    connection.

    (7) In rxrpc_new_incoming_call(), we have to recheck the routing inside
    rx->incoming_lock to see if someone else set up the call, connection
    or peer whilst we were getting there. We can't trust the values from
    the earlier routing check unless we pin refs on them - which we want
    to avoid.

    Further, we need to allow for an incoming call to have its state
    changed on another CPU between us making it live and us adjusting it
    because the conn is now in the RXRPC_CONN_SERVICE state.

    (8) In rxrpc_peer_add_rtt(), take peer->rtt_input_lock around the access
    to the RTT buffer. Don't need to lock around setting peer->rtt.

    For reference, the inventory of state-accessing or state-altering functions
    used by the packet input procedure is:

    > rxrpc_input_packet()
    * PACKET CHECKING

    * ROUTING
    > rxrpc_post_packet_to_local()
    > rxrpc_find_connection_rcu() - uses RCU
    > rxrpc_lookup_peer_rcu() - uses RCU
    > rxrpc_find_service_conn_rcu() - uses RCU
    > idr_find() - uses RCU

    * CONNECTION-LEVEL PROCESSING
    - Service upgrade
    - Can only happen once per conn
    ! Changed to use cmpxchg
    > rxrpc_post_packet_to_conn()
    - Setting conn->hi_serial
    - Probably safe not using locks
    - Maybe use cmpxchg

    * CALL-LEVEL PROCESSING
    > Old-call checking
    > rxrpc_input_implicit_end_call()
    > rxrpc_call_completed()
    > rxrpc_queue_call()
    ! Need to take rx->incoming_lock
    > __rxrpc_disconnect_call()
    > rxrpc_notify_socket()
    > rxrpc_new_incoming_call()
    - Uses rx->incoming_lock for the entire process
    - Might be able to drop this earlier in favour of the call lock
    > rxrpc_incoming_call()
    ! Conflicts with rxrpc_input_implicit_end_call()
    > rxrpc_send_ping()
    - Don't need locks to check rtt state
    > rxrpc_propose_ACK

    * PACKET DISTRIBUTION
    > rxrpc_input_call_packet()
    > rxrpc_input_data()
    * QUEUE DATA PACKET ON CALL
    > rxrpc_reduce_call_timer()
    - Uses timer_reduce()
    ! Needs call->input_lock()
    > rxrpc_receiving_reply()
    ! Needs locking around ack state
    > rxrpc_rotate_tx_window()
    > rxrpc_end_tx_phase()
    > rxrpc_proto_abort()
    > rxrpc_input_dup_data()
    - Fills the Rx buffer
    - rxrpc_propose_ACK()
    - rxrpc_notify_socket()

    > rxrpc_input_ack()
    * APPLY ACK PACKET TO CALL AND DISCARD PACKET
    > rxrpc_input_ping_response()
    - Probably doesn't need any extra locking
    ! Need READ_ONCE() on call->ping_serial
    > rxrpc_input_check_for_lost_ack()
    - Takes call->lock to consult Tx buffer
    > rxrpc_peer_add_rtt()
    ! Needs to take a lock (peer->rtt_input_lock)
    ! Could perhaps manage with cmpxchg() and xadd() instead
    > rxrpc_input_requested_ack
    - Consults Tx buffer
    ! Probably needs a lock
    > rxrpc_peer_add_rtt()
    > rxrpc_propose_ack()
    > rxrpc_input_ackinfo()
    - Changes call->tx_winsize
    ! Use cmpxchg to handle change
    ! Should perhaps track serial number
    - Uses peer->lock to record MTU specification changes
    > rxrpc_proto_abort()
    ! Need to take call->input_lock
    > rxrpc_rotate_tx_window()
    > rxrpc_end_tx_phase()
    > rxrpc_input_soft_acks()
    - Consults the Tx buffer
    > rxrpc_congestion_management()
    - Modifies the Tx annotations
    ! Needs call->input_lock()
    > rxrpc_queue_call()

    > rxrpc_input_abort()
    * APPLY ABORT PACKET TO CALL AND DISCARD PACKET
    > rxrpc_set_call_completion()
    > rxrpc_notify_socket()

    > rxrpc_input_ackall()
    * APPLY ACKALL PACKET TO CALL AND DISCARD PACKET
    ! Need to take call->input_lock
    > rxrpc_rotate_tx_window()
    > rxrpc_end_tx_phase()

    > rxrpc_reject_packet()

    There are some functions used by the above that queue the packet, after
    which the procedure is terminated:

    - rxrpc_post_packet_to_local()
    - local->event_queue is an sk_buff_head
    - local->processor is a work_struct
    - rxrpc_post_packet_to_conn()
    - conn->rx_queue is an sk_buff_head
    - conn->processor is a work_struct
    - rxrpc_reject_packet()
    - local->reject_queue is an sk_buff_head
    - local->processor is a work_struct

    And some that offload processing to process context:

    - rxrpc_notify_socket()
    - Uses RCU lock
    - Uses call->notify_lock to call call->notify_rx
    - Uses call->recvmsg_lock to queue recvmsg side
    - rxrpc_queue_call()
    - call->processor is a work_struct
    - rxrpc_propose_ACK()
    - Uses call->lock to wrap __rxrpc_propose_ACK()

    And a bunch that complete a call, all of which use call->state_lock to
    protect the call state:

    - rxrpc_call_completed()
    - rxrpc_set_call_completion()
    - rxrpc_abort_call()
    - rxrpc_proto_abort()
    - Also uses rxrpc_queue_call()

    Fixes: 17926a79320a ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both")
    Signed-off-by: David Howells

    David Howells
     
  • Fix connection-level abort handling to cache the abort and error codes
    properly so that a new incoming call can be properly aborted if it races
    with the parent connection being aborted by another CPU.

    The abort_code and error parameters can then be dropped from
    rxrpc_abort_calls().

    Fixes: f5c17aaeb2ae ("rxrpc: Calls should only have one terminal state")
    Signed-off-by: David Howells

    David Howells
     
  • Move the out-of-order and duplicate ACK packet check to before the call to
    rxrpc_input_ackinfo() so that the receive window size and MTU size are only
    checked in the latest ACK packet and don't regress.

    Fixes: 248f219cb8bc ("rxrpc: Rewrite the data and ack handling code")
    Signed-off-by: David Howells

    David Howells
     

08 Oct, 2018

4 commits

  • Carry the call state out of the locked section in rxrpc_rotate_tx_window()
    rather than sampling it afterwards. This is only used to select tracepoint
    data, but could have changed by the time we do the tracepoint.

    Signed-off-by: David Howells

    David Howells
     
  • We should only call the function to end a call's Tx phase if we rotated the
    marked-last packet out of the transmission buffer.

    Make rxrpc_rotate_tx_window() return an indication of whether it just
    rotated the packet marked as the last out of the transmit buffer, carrying
    the information out of the locked section in that function.

    We can then check the return value instead of examining RXRPC_CALL_TX_LAST.

    Fixes: 70790dbe3f66 ("rxrpc: Pass the last Tx packet marker in the annotation buffer")
    Signed-off-by: David Howells

    David Howells
     
  • We don't need to take the RCU read lock in the rxrpc packet receive
    function because it's held further up the stack in the IP input routine
    around the UDP receive routines.

    Fix this by dropping the RCU read lock calls from rxrpc_input_packet().
    This simplifies the code.

    Fixes: 70790dbe3f66 ("rxrpc: Pass the last Tx packet marker in the annotation buffer")
    Signed-off-by: David Howells

    David Howells
     
  • Use the UDP encap_rcv hook to cut the bit out of the rxrpc packet reception
    in which a packet is placed onto the UDP receive queue and then immediately
    removed again by rxrpc. Going via the queue in this manner seems like it
    should be unnecessary.

    This does, however, require the invention of a value to place in encap_type
    as that's one of the conditions to switch packets out to the encap_rcv
    hook. Possibly the value doesn't actually matter for anything other than
    sockopts on the UDP socket, which aren't accessible outside of rxrpc
    anyway.

    This seems to cut a bit of time out of the time elapsed between each
    sk_buff being timestamped and turning up in rxrpc (the final number in the
    following trace excerpts). I measured this by making the rxrpc_rx_packet
    trace point print the time elapsed between the skb being timestamped and
    the current time (in ns), e.g.:

    ... 424.278721: rxrpc_rx_packet: ... ACK 25026

    So doing a 512MiB DIO read from my test server, with an unmodified kernel:

    N min max sum mean stddev
    27605 2626 7581 7.83992e+07 2840.04 181.029

    and with the patch applied:

    N min max sum mean stddev
    27547 1895 12165 6.77461e+07 2459.29 255.02

    Signed-off-by: David Howells

    David Howells
     

05 Oct, 2018

2 commits

  • Fix the rxrpc_data_ready() function to pick up all packets and to not miss
    any. There are two problems:

    (1) The sk_data_ready pointer on the UDP socket is set *after* it is
    bound. This means that it's open for business before we're ready to
    dequeue packets and there's a tiny window exists in which a packet can
    sneak onto the receive queue, but we never know about it.

    Fix this by setting the pointers on the socket prior to binding it.

    (2) skb_recv_udp() will return an error (such as ENETUNREACH) if there was
    an error on the transmission side, even though we set the
    sk_error_report hook. Because rxrpc_data_ready() returns immediately
    in such a case, it never actually removes its packet from the receive
    queue.

    Fix this by abstracting out the UDP dequeuing and checksumming into a
    separate function that keeps hammering on skb_recv_udp() until it
    returns -EAGAIN, passing the packets extracted to the remainder of the
    function.

    and two potential problems:

    (3) It might be possible in some circumstances or in the future for
    packets to be being added to the UDP receive queue whilst rxrpc is
    running consuming them, so the data_ready() handler might get called
    less often than once per packet.

    Allow for this by fully draining the queue on each call as (2).

    (4) If a packet fails the checksum check, the code currently returns after
    discarding the packet without checking for more.

    Allow for this by fully draining the queue on each call as (2).

    Fixes: 17926a79320a ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both")
    Signed-off-by: David Howells
    Acked-by: Paolo Abeni

    David Howells
     
  • Fix some refs to init_net that should've been changed to the appropriate
    network namespace.

    Fixes: 2baec2c3f854 ("rxrpc: Support network namespacing")
    Signed-off-by: David Howells
    Acked-by: Paolo Abeni

    David Howells
     

28 Sep, 2018

7 commits

  • Fix error distribution by immediately delivering the errors to all the
    affected calls rather than deferring them to a worker thread. The problem
    with the latter is that retries and things can happen in the meantime when we
    want to stop that sooner.

    To this end:

    (1) Stop the error distributor from removing calls from the error_targets
    list so that peer->lock isn't needed to synchronise against other adds
    and removals.

    (2) Require the peer's error_targets list to be accessed with RCU, thereby
    avoiding the need to take peer->lock over distribution.

    (3) Don't attempt to affect a call's state if it is already marked complete.

    Signed-off-by: David Howells

    David Howells
     
  • It seems that enabling IPV6_RECVERR on an IPv6 socket doesn't also turn on
    IP_RECVERR, so neither local errors nor ICMP-transported remote errors from
    IPv4 peer addresses are returned to the AF_RXRPC protocol.

    Make the sockopt setting code in rxrpc_open_socket() fall through from the
    AF_INET6 case to the AF_INET case to turn on all the AF_INET options too in
    the AF_INET6 case.

    Fixes: f2aeed3a591f ("rxrpc: Fix error reception on AF_INET6 sockets")
    Signed-off-by: David Howells

    David Howells
     
  • Make the following changes to improve the robustness of the code that sets
    up a new service call:

    (1) Cache the rxrpc_sock struct obtained in rxrpc_data_ready() to do a
    service ID check and pass that along to rxrpc_new_incoming_call().
    This means that I can remove the check from rxrpc_new_incoming_call()
    without the need to worry about the socket attached to the local
    endpoint getting replaced - which would invalidate the check.

    (2) Cache the rxrpc_peer struct, thereby allowing the peer search to be
    done once. The peer is passed to rxrpc_new_incoming_call(), thereby
    saving the need to repeat the search.

    This also reduces the possibility of rxrpc_publish_service_conn()
    BUG()'ing due to the detection of a duplicate connection, despite the
    initial search done by rxrpc_find_connection_rcu() having turned up
    nothing.

    This BUG() shouldn't ever get hit since rxrpc_data_ready() *should* be
    non-reentrant and the result of the initial search should still hold
    true, but it has proven possible to hit.

    I *think* this may be due to __rxrpc_lookup_peer_rcu() cutting short
    the iteration over the hash table if it finds a matching peer with a
    zero usage count, but I don't know for sure since it's only ever been
    hit once that I know of.

    Another possibility is that a bug in rxrpc_data_ready() that checked
    the wrong byte in the header for the RXRPC_CLIENT_INITIATED flag
    might've let through a packet that caused a spurious and invalid call
    to be set up. That is addressed in another patch.

    (3) Fix __rxrpc_lookup_peer_rcu() to skip peer records that have a zero
    usage count rather than stopping and returning not found, just in case
    there's another peer record behind it in the bucket.

    (4) Don't search the peer records in rxrpc_alloc_incoming_call(), but
    rather either use the peer cached in (2) or, if one wasn't found,
    preemptively install a new one.

    Fixes: 8496af50eb38 ("rxrpc: Use RCU to access a peer's service connection tree")
    Signed-off-by: David Howells

    David Howells
     
  • Do more up-front checking on incoming packets to weed out invalid ones and
    also ones aimed at services that we don't support.

    Whilst we're at it, replace the clearing of call and skew if we don't find
    a connection with just initialising the variables to zero at the top of the
    function.

    Signed-off-by: David Howells

    David Howells
     
  • In the input path, a received sk_buff can be marked for rejection by
    setting RXRPC_SKB_MARK_* in skb->mark and, if needed, some auxiliary data
    (such as an abort code) in skb->priority. The rejection is handled by
    queueing the sk_buff up for dealing with in process context. The output
    code reads the mark and priority and, theoretically, generates an
    appropriate response packet.

    However, if RXRPC_SKB_MARK_BUSY is set, this isn't noticed and an ABORT
    message with a random abort code is generated (since skb->priority wasn't
    set to anything).

    Fix this by outputting the appropriate sort of packet.

    Also, whilst we're at it, most of the marks are no longer used, so remove
    them and rename the remaining two to something more obvious.

    Fixes: 248f219cb8bc ("rxrpc: Rewrite the data and ack handling code")
    Signed-off-by: David Howells

    David Howells
     
  • Fix RTT information gathering in AF_RXRPC by the following means:

    (1) Enable Rx timestamping on the transport socket with SO_TIMESTAMPNS.

    (2) If the sk_buff doesn't have a timestamp set when rxrpc_data_ready()
    collects it, set it at that point.

    (3) Allow ACKs to be requested on the last packet of a client call, but
    not a service call. We need to be careful lest we undo:

    bf7d620abf22c321208a4da4f435e7af52551a21
    Author: David Howells
    Date: Thu Oct 6 08:11:51 2016 +0100
    rxrpc: Don't request an ACK on the last DATA packet of a call's Tx phase

    but that only really applies to service calls that we're handling,
    since the client side gets to send the final ACK (or not).

    (4) When about to transmit an ACK or DATA packet, record the Tx timestamp
    before only; don't update the timestamp afterwards.

    (5) Switch the ordering between recording the serial and recording the
    timestamp to always set the serial number first. The serial number
    shouldn't be seen referenced by an ACK packet until we've transmitted
    the packet bearing it - so in the Rx path, we don't need the timestamp
    until we've checked the serial number.

    Fixes: cf1a6474f807 ("rxrpc: Add per-peer RTT tracker")
    Signed-off-by: David Howells

    David Howells
     
  • There's a check in rxrpc_data_ready() that's checking the CLIENT_INITIATED
    flag in the packet type field rather than in the packet flags field.

    Fix this by creating a pair of helper functions to check whether the packet
    is going to the client or to the server and use them generally.

    Fixes: 248f219cb8bc ("rxrpc: Rewrite the data and ack handling code")
    Signed-off-by: David Howells

    David Howells
     

27 Sep, 2018

1 commit


16 Aug, 2018

1 commit

  • Pull networking updates from David Miller:
    "Highlights:

    - Gustavo A. R. Silva keeps working on the implicit switch fallthru
    changes.

    - Support 802.11ax High-Efficiency wireless in cfg80211 et al, From
    Luca Coelho.

    - Re-enable ASPM in r8169, from Kai-Heng Feng.

    - Add virtual XFRM interfaces, which avoids all of the limitations of
    existing IPSEC tunnels. From Steffen Klassert.

    - Convert GRO over to use a hash table, so that when we have many
    flows active we don't traverse a long list during accumluation.

    - Many new self tests for routing, TC, tunnels, etc. Too many
    contributors to mention them all, but I'm really happy to keep
    seeing this stuff.

    - Hardware timestamping support for dpaa_eth/fsl-fman from Yangbo Lu.

    - Lots of cleanups and fixes in L2TP code from Guillaume Nault.

    - Add IPSEC offload support to netdevsim, from Shannon Nelson.

    - Add support for slotting with non-uniform distribution to netem
    packet scheduler, from Yousuk Seung.

    - Add UDP GSO support to mlx5e, from Boris Pismenny.

    - Support offloading of Team LAG in NFP, from John Hurley.

    - Allow to configure TX queue selection based upon RX queue, from
    Amritha Nambiar.

    - Support ethtool ring size configuration in aquantia, from Anton
    Mikaev.

    - Support DSCP and flowlabel per-transport in SCTP, from Xin Long.

    - Support list based batching and stack traversal of SKBs, this is
    very exciting work. From Edward Cree.

    - Busyloop optimizations in vhost_net, from Toshiaki Makita.

    - Introduce the ETF qdisc, which allows time based transmissions. IGB
    can offload this in hardware. From Vinicius Costa Gomes.

    - Add parameter support to devlink, from Moshe Shemesh.

    - Several multiplication and division optimizations for BPF JIT in
    nfp driver, from Jiong Wang.

    - Lots of prepatory work to make more of the packet scheduler layer
    lockless, when possible, from Vlad Buslov.

    - Add ACK filter and NAT awareness to sch_cake packet scheduler, from
    Toke Høiland-Jørgensen.

    - Support regions and region snapshots in devlink, from Alex Vesker.

    - Allow to attach XDP programs to both HW and SW at the same time on
    a given device, with initial support in nfp. From Jakub Kicinski.

    - Add TLS RX offload and support in mlx5, from Ilya Lesokhin.

    - Use PHYLIB in r8169 driver, from Heiner Kallweit.

    - All sorts of changes to support Spectrum 2 in mlxsw driver, from
    Ido Schimmel.

    - PTP support in mv88e6xxx DSA driver, from Andrew Lunn.

    - Make TCP_USER_TIMEOUT socket option more accurate, from Jon
    Maxwell.

    - Support for templates in packet scheduler classifier, from Jiri
    Pirko.

    - IPV6 support in RDS, from Ka-Cheong Poon.

    - Native tproxy support in nf_tables, from Máté Eckl.

    - Maintain IP fragment queue in an rbtree, but optimize properly for
    in-order frags. From Peter Oskolkov.

    - Improvde handling of ACKs on hole repairs, from Yuchung Cheng"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1996 commits)
    bpf: test: fix spelling mistake "REUSEEPORT" -> "REUSEPORT"
    hv/netvsc: Fix NULL dereference at single queue mode fallback
    net: filter: mark expected switch fall-through
    xen-netfront: fix warn message as irq device name has '/'
    cxgb4: Add new T5 PCI device ids 0x50af and 0x50b0
    net: dsa: mv88e6xxx: missing unlock on error path
    rds: fix building with IPV6=m
    inet/connection_sock: prefer _THIS_IP_ to current_text_addr
    net: dsa: mv88e6xxx: bitwise vs logical bug
    net: sock_diag: Fix spectre v1 gadget in __sock_diag_cmd()
    ieee802154: hwsim: using right kind of iteration
    net: hns3: Add vlan filter setting by ethtool command -K
    net: hns3: Set tx ring' tc info when netdev is up
    net: hns3: Remove tx ring BD len register in hns3_enet
    net: hns3: Fix desc num set to default when setting channel
    net: hns3: Fix for phy link issue when using marvell phy driver
    net: hns3: Fix for information of phydev lost problem when down/up
    net: hns3: Fix for command format parsing error in hclge_is_all_function_id_zero
    net: hns3: Add support for serdes loopback selftest
    bnxt_en: take coredump_record structure off stack
    ...

    Linus Torvalds
     

14 Aug, 2018

1 commit

  • Pull locking/atomics update from Thomas Gleixner:
    "The locking, atomics and memory model brains delivered:

    - A larger update to the atomics code which reworks the ordering
    barriers, consolidates the atomic primitives, provides the new
    atomic64_fetch_add_unless() primitive and cleans up the include
    hell.

    - Simplify cmpxchg() instrumentation and add instrumentation for
    xchg() and cmpxchg_double().

    - Updates to the memory model and documentation"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
    locking/atomics: Rework ordering barriers
    locking/atomics: Instrument cmpxchg_double*()
    locking/atomics: Instrument xchg()
    locking/atomics: Simplify cmpxchg() instrumentation
    locking/atomics/x86: Reduce arch_cmpxchg64*() instrumentation
    tools/memory-model: Rename litmus tests to comply to norm7
    tools/memory-model/Documentation: Fix typo, smb->smp
    sched/Documentation: Update wake_up() & co. memory-barrier guarantees
    locking/spinlock, sched/core: Clarify requirements for smp_mb__after_spinlock()
    sched/core: Use smp_mb() in wake_woken_function()
    tools/memory-model: Add informal LKMM documentation to MAINTAINERS
    locking/atomics/Documentation: Describe atomic_set() as a write operation
    tools/memory-model: Make scripts executable
    tools/memory-model: Remove ACCESS_ONCE() from model
    tools/memory-model: Remove ACCESS_ONCE() from recipes
    locking/memory-barriers.txt/kokr: Update Korean translation to fix broken DMA vs. MMIO ordering example
    MAINTAINERS: Add Daniel Lustig as an LKMM reviewer
    tools/memory-model: Fix ISA2+pooncelock+pooncelock+pombonce name
    tools/memory-model: Add litmus test for full multicopy atomicity
    locking/refcount: Always allow checked forms
    ...

    Linus Torvalds
     

12 Aug, 2018

1 commit

  • The static int 'zero' is defined but is never used hence it is
    redundant and can be removed. The use of this variable was removed
    with commit a158bdd3247b ("rxrpc: Fix call timeouts").

    Cleans up clang warning:
    warning: 'zero' defined but not used [-Wunused-const-variable=]

    Signed-off-by: Colin Ian King
    Signed-off-by: David S. Miller

    Colin Ian King
     

10 Aug, 2018

1 commit


09 Aug, 2018

1 commit

  • AF_RXRPC has a keepalive message generator that generates a message for a
    peer ~20s after the last transmission to that peer to keep firewall ports
    open. The implementation is incorrect in the following ways:

    (1) It mixes up ktime_t and time64_t types.

    (2) It uses ktime_get_real(), the output of which may jump forward or
    backward due to adjustments to the time of day.

    (3) If the current time jumps forward too much or jumps backwards, the
    generator function will crank the base of the time ring round one slot
    at a time (ie. a 1s period) until it catches up, spewing out VERSION
    packets as it goes.

    Fix the problem by:

    (1) Only using time64_t. There's no need for sub-second resolution.

    (2) Use ktime_get_seconds() rather than ktime_get_real() so that time
    isn't perceived to go backwards.

    (3) Simplifying rxrpc_peer_keepalive_worker() by splitting it into two
    parts:

    (a) The "worker" function that manages the buckets and the timer.

    (b) The "dispatch" function that takes the pending peers and
    potentially transmits a keepalive packet before putting them back
    in the ring into the slot appropriate to the revised last-Tx time.

    (4) Taking everything that's pending out of the ring and splicing it into
    a temporary collector list for processing.

    In the case that there's been a significant jump forward, the ring
    gets entirely emptied and then the time base can be warped forward
    before the peers are processed.

    The warping can't happen if the ring isn't empty because the slot a
    peer is in is keepalive-time dependent, relative to the base time.

    (5) Limit the number of iterations of the bucket array when scanning it.

    (6) Set the timer to skip any empty slots as there's no point waking up if
    there's nothing to do yet.

    This can be triggered by an incoming call from a server after a reboot with
    AF_RXRPC and AFS built into the kernel causing a peer record to be set up
    before userspace is started. The system clock is then adjusted by
    userspace, thereby potentially causing the keepalive generator to have a
    meltdown - which leads to a message like:

    watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/0:1:23]
    ...
    Workqueue: krxrpcd rxrpc_peer_keepalive_worker
    EIP: lock_acquire+0x69/0x80
    ...
    Call Trace:
    ? rxrpc_peer_keepalive_worker+0x5e/0x350
    ? _raw_spin_lock_bh+0x29/0x60
    ? rxrpc_peer_keepalive_worker+0x5e/0x350
    ? rxrpc_peer_keepalive_worker+0x5e/0x350
    ? __lock_acquire+0x3d3/0x870
    ? process_one_work+0x110/0x340
    ? process_one_work+0x166/0x340
    ? process_one_work+0x110/0x340
    ? worker_thread+0x39/0x3c0
    ? kthread+0xdb/0x110
    ? cancel_delayed_work+0x90/0x90
    ? kthread_stop+0x70/0x70
    ? ret_from_fork+0x19/0x24

    Fixes: ace45bec6d77 ("rxrpc: Fix firewall route keepalive")
    Reported-by: kernel test robot
    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     

04 Aug, 2018

2 commits

  • Push iov_iter up from rxrpc_kernel_recv_data() to its caller to allow
    non-contiguous iovs to be passed down, thereby permitting file reading to
    be simplified in the AFS filesystem in a future patch.

    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     
  • The use of SKCIPHER_REQUEST_ON_STACK() will trigger FRAME_WARN warnings
    (when less than 2048) once the VLA is no longer hidden from the check:

    net/rxrpc/rxkad.c:398:1: warning: the frame size of 1152 bytes is larger than 1024 bytes [-Wframe-larger-than=]
    net/rxrpc/rxkad.c:242:1: warning: the frame size of 1152 bytes is larger than 1024 bytes [-Wframe-larger-than=]

    This passes the initial SKCIPHER_REQUEST_ON_STACK allocation to the leaf
    functions for reuse. Two requests allocated on the stack is not needed
    when only one is used at a time.

    Signed-off-by: Kees Cook
    Acked-by: Arnd Bergmann
    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    Kees Cook
     

03 Aug, 2018

2 commits


02 Aug, 2018

1 commit

  • There just check the user call ID isn't already in use, hence should
    compare user_call_ID with xcall->user_call_ID, which is current
    node's user_call_ID.

    Fixes: 540b1c48c37a ("rxrpc: Fix deadlock between call creation and sendmsg/recvmsg")
    Suggested-by: David Howells
    Signed-off-by: YueHaibing
    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    YueHaibing
     

01 Aug, 2018

5 commits

  • Immediately flush any outstanding ACK on entry to rxrpc_recvmsg_data() -
    which transfers data to the target buffers - if we previously had an Rx
    underrun (ie. we returned -EAGAIN because we ran out of received data).
    This lets the server know what we've managed to receive something.

    Also flush any outstanding ACK after calling the function if it hit -EAGAIN
    to let the server know we processed some data.

    It might be better to send more ACKs, possibly on a time-based scheme, but
    that needs some more consideration.

    With this and some additional AFS patches, it is possible to get large
    unencrypted O_DIRECT reads to be almost as fast as NFS over TCP. It looks
    like it might be theoretically possible to improve performance yet more for
    a server running a single operation as investigation of packet timestamps
    indicates that the server keeps stalling.

    The issue appears to be that rxrpc runs in to trouble with ACK packets
    getting batched together (up to ~32 at a time) somewhere between the IP
    transmit queue on the client and the ethernet receive queue on the server.

    However, this case isn't too much of a worry as even a lightly loaded
    server should be receiving sufficient packet flux to flush the ACK packets
    to the UDP socket.

    Signed-off-by: David Howells

    David Howells
     
  • The final ACK that closes out an rxrpc call needs to be transmitted by the
    client unless we're going to follow up with a DATA packet for a new call on
    the same channel (which implicitly ACK's the previous call, thereby saving
    an ACK).

    Currently, we don't do that, so if no follow on call is immediately
    forthcoming, the server will resend the last DATA packet - at which point
    rxrpc_conn_retransmit_call() will be triggered and will (re)send the final
    ACK. But the server has to hold on to the last packet until the ACK is
    received, thereby holding up its resources.

    Fix the client side to propose a delayed final ACK, to be transmitted after
    a short delay, assuming the call isn't superseded by a new one.

    Signed-off-by: David Howells

    David Howells
     
  • Increase the size of a call's Rx window from 32 to 63 - ie. one less than
    the size of the ring buffer. This makes large data transfers perform
    better when the Tx window on the other side is around 64 (as is the case
    with Auristor's YFS fileserver).

    If the server window size is ~32 or smaller, this should make no
    difference.

    Signed-off-by: David Howells

    David Howells
     
  • Trace notifications from the softirq side of the socket to the
    process-context side.

    Signed-off-by: David Howells

    David Howells
     
  • Trace successful packet transmission (kernel_sendmsg() succeeded, that is)
    in AF_RXRPC. We can share the enum that defines the transmission points
    with the trace_rxrpc_tx_fail() tracepoint, so rename its constants to be
    applicable to both.

    Also, save the internal call->debug_id in the rxrpc_channel struct so that
    it can be used in retransmission trace lines.

    Signed-off-by: David Howells

    David Howells