16 Jul, 2016

3 commits

  • Use the RDS probe-ping to compute how many paths may be used with
    the peer, and to synchronously start the multiple paths. If multipath
    RDS (mprds) is supported by the transport, hash outgoing traffic to
    one of the multiple paths in rds_sendmsg().

    CC: Santosh Shilimkar
    Signed-off-by: Sowmini Varadhan
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Sowmini Varadhan
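
    A minimal sketch of the hashing idea in the entry above. The structure
    and helper names (example_conn, rds_pick_path) are hypothetical
    stand-ins, not the actual rds_sendmsg() code; only hash_32() is a real
    kernel helper.

        #include <linux/hash.h>
        #include <linux/types.h>

        struct example_conn {
                unsigned int    npaths;     /* paths agreed on via the probe-ping */
                void            *paths[16]; /* per-path state, opaque here */
        };

        /* Pick one of the negotiated paths by hashing a per-flow key. */
        static void *rds_pick_path(struct example_conn *conn, u32 flow_key)
        {
                if (conn->npaths <= 1)      /* single-path transport */
                        return conn->paths[0];

                return conn->paths[hash_32(flow_key, 8) % conn->npaths];
        }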
     
  • Some code duplication in rds_tcp_reset_callbacks() can be avoided
    by having the function call rds_tcp_restore_callbacks() and
    rds_tcp_set_callbacks().

    Acked-by: Santosh Shilimkar
    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     
  • As the existing comments in rds_tcp_listen_data_ready() indicate,
    it is possible under some race-windows to get to this function with the
    accept() socket. If that happens, we could run into a sequence whereby

    thread 1                              thread 2

                                          rds_tcp_accept_one() thread
                                          sets up new_sock via ->accept().
                                          The sk_user_data is now
                                          sock_def_readable

    data comes in for new_sock,
    ->sk_data_ready is called, and
    we land in rds_tcp_listen_data_ready

                                          rds_tcp_set_callbacks()
                                          takes the sk_callback_lock and
                                          sets up sk_user_data to be the cp

    read_lock sk_callback_lock
    ready = cp
    unlock sk_callback_lock
    page fault on ready

    In the above sequence, we end up with a panic on a bad page reference
    when trying to execute (*ready)(). Instead we need to call
    sock_def_readable() safely, which is what this patch achieves.

    Acked-by: Santosh Shilimkar
    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
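
    A sketch of the safe-fallback pattern described above, assuming the
    listen socket's original ->sk_data_ready (sock_def_readable) was saved
    when its callbacks were installed. All names are illustrative
    stand-ins, not the upstream fix verbatim.

        #include <net/sock.h>
        #include <net/tcp_states.h>

        /* Saved when the listen socket's callbacks are installed; on a
         * freshly created socket this is sock_def_readable(). */
        static void (*example_saved_data_ready)(struct sock *sk);

        static void example_handle_listen_socket(struct sock *sk)
        {
                /* normal case: kick the accept worker (omitted) */
        }

        static void example_listen_data_ready(struct sock *sk)
        {
                bool is_listener;

                read_lock_bh(&sk->sk_callback_lock);
                /* An accepted socket caught in the race window stores (or is
                 * about to store) a rds_conn_path in sk_user_data, so do not
                 * dispatch through that pointer. */
                is_listener = (sk->sk_state == TCP_LISTEN);
                read_unlock_bh(&sk->sk_callback_lock);

                if (is_listener)
                        example_handle_listen_socket(sk);
                else if (example_saved_data_ready)
                        example_saved_data_ready(sk);   /* default wake-up */
        }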
     

07 Jul, 2016

1 commit


05 Jul, 2016

1 commit

  • If register_pernet_subsys() fails, we shouldn't try to call
    unregister_pernet_subsys().

    Fixes: 467fa15356 ("RDS-TCP: Support multiple RDS-TCP listen endpoints, one per netns.")
    Cc: stable@vger.kernel.org
    Cc: Sowmini Varadhan
    Cc: David S. Miller
    Signed-off-by: Vegard Nossum
    Acked-by: Sowmini Varadhan
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Vegard Nossum
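
    A sketch of the error-unwinding rule this fix enforces, using stand-in
    names and a simplified module init: undo only the steps that actually
    succeeded, and never call unregister_pernet_subsys() when the matching
    register failed.

        #include <linux/module.h>
        #include <net/net_namespace.h>

        static struct pernet_operations example_net_ops;        /* stand-in */

        static int example_register_transport(void) { return 0; } /* stand-in */

        static int __init example_init(void)
        {
                int ret;

                ret = register_pernet_subsys(&example_net_ops);
                if (ret)
                        return ret;     /* nothing registered: no unregister */

                ret = example_register_transport();
                if (ret)
                        goto out_pernet;        /* undo only the earlier step */

                return 0;

        out_pernet:
                unregister_pernet_subsys(&example_net_ops);
                return ret;
        }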
     

02 Jul, 2016

7 commits

  • This patch adds ->conn_path_connect callbacks in the rds_transport
    that are used to set up a single connection path.

    Acked-by: Santosh Shilimkar
    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
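
    A sketch of the per-path connect hook described above. The ops
    structure is an illustrative subset (the real struct rds_transport
    carries many more members), and the loop is a hypothetical caller.

        struct rds_conn_path;           /* one path of an RDS connection */

        struct example_transport_ops {
                int (*conn_path_connect)(struct rds_conn_path *cp);
        };

        /* Bring up every path of a connection, one ->conn_path_connect()
         * call per path. */
        static int example_connect_paths(const struct example_transport_ops *ops,
                                         struct rds_conn_path **paths,
                                         unsigned int npaths)
        {
                unsigned int i;
                int ret;

                for (i = 0; i < npaths; i++) {
                        ret = ops->conn_path_connect(paths[i]);
                        if (ret)
                                return ret;
                }
                return 0;
        }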
     
  • The ->sk_user_data contains a pointer to the rds_conn_path
    for the socket. Use this consistently in the rds_tcp_data_ready
    callbacks to get the rds_conn_path for rds_recv_incoming.

    Acked-by: Santosh Shilimkar
    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     
  • The socket callbacks should all operate on a struct rds_conn_path,
    in preparation for an MP-capable RDS-TCP.

    Acked-by: Santosh Shilimkar
    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     
  • A single rds_connection may have multiple rds_conn_paths that have
    to be carefully and correctly destroyed, for both rmmod and
    netns-delete cases.

    For both cases, we extract a single rds_tcp_connection for
    each conn into a temporary list, and then invoke rds_conn_destroy()
    which iteratively dismantles every path in the rds_connection.

    For the netns deletion case, we additionally have to make sure
    that we do not leave a socket in TIME_WAIT state, as this will
    hold up the netns deletion. Thus we call rds_tcp_conn_paths_destroy()
    to reset state quickly.

    Acked-by: Santosh Shilimkar
    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
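
    A sketch of the teardown pattern described above: collect entries on a
    private list under the lock, then destroy the owning connections
    outside it. Structure, lock, and callback names are simplified
    stand-ins.

        #include <linux/list.h>
        #include <linux/spinlock.h>

        struct example_conn;                    /* stand-in for rds_connection */

        struct example_tcp_conn {
                struct list_head        t_list_item;
                struct example_conn     *t_conn;        /* owning connection */
        };

        /* Move entries onto a private list under the lock, then destroy the
         * owning connections outside it. The real code keeps only one entry
         * per rds_connection; that de-duplication is omitted here. */
        static void example_destroy_conns(struct list_head *conn_list,
                                          spinlock_t *list_lock,
                                          void (*conn_destroy)(struct example_conn *))
        {
                struct example_tcp_conn *tc, *next;
                LIST_HEAD(tmp_list);

                spin_lock_irq(list_lock);
                list_for_each_entry_safe(tc, next, conn_list, t_list_item)
                        list_move_tail(&tc->t_list_item, &tmp_list);
                spin_unlock_irq(list_lock);

                /* conn_destroy() dismantles every path of the connection */
                list_for_each_entry_safe(tc, next, &tmp_list, t_list_item)
                        conn_destroy(tc->t_conn);
        }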
     
  • The struct rds_tcp_connection is the transport-specific private
    data structure that tracks TCP information per rds_conn_path.
    Modify this structure to have a back-pointer to the rds_conn_path
    for which it is the ->cp_transport_data.

    Acked-by: Santosh Shilimkar
    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
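
    A sketch of the back-pointer, showing only an illustrative subset of
    the per-path private data.

        struct rds_conn_path;   /* one path of an RDS connection */
        struct socket;          /* kernel socket */

        struct example_tcp_connection {
                struct rds_conn_path    *t_cpath;       /* back-pointer: the path
                                                         * whose ->cp_transport_data
                                                         * this structure is */
                struct socket           *t_sock;        /* TCP socket for this path */
        };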
     
  • The c_passive bit is only intended for the IB transport and will
    never be encountered in rds-tcp, so remove the dead logic that
    predicates on this bit.

    Acked-by: Santosh Shilimkar
    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     
  • Refactor code to avoid separate indirections for single-path
    and multipath transports. All transports (both single and mp-capable)
    will get a pointer to the rds_conn_path, and can trivially derive
    the rds_connection from the ->cp_conn.

    Acked-by: Santosh Shilimkar
    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     

18 Jun, 2016

1 commit

  • Fixes the following sparse warnings:

    net/rds/tcp.c:59:5: warning:
    symbol 'rds_tcp_min_sndbuf' was not declared. Should it be static?
    net/rds/tcp.c:60:5: warning:
    symbol 'rds_tcp_min_rcvbuf' was not declared. Should it be static?

    Signed-off-by: Wei Yongjun
    Acked-by: Sowmini Varadhan
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Wei Yongjun
     

15 Jun, 2016

2 commits


08 Jun, 2016

3 commits

  • The send path needs to be quiesced before resetting callbacks from
    rds_tcp_accept_one(), and commit eb192840266f ("RDS:TCP: Synchronize
    rds_tcp_accept_one with rds_send_xmit when resetting t_sock") achieves
    this using the c_state and RDS_IN_XMIT bit, following the pattern
    used by rds_conn_shutdown(). However, this leaves the possibility
    of a race window, as shown in the sequence below:

        take t_conn_lock in rds_tcp_conn_connect
        send outgoing syn to peer
        drop t_conn_lock in rds_tcp_conn_connect
        incoming from peer triggers rds_tcp_accept_one, conn is
            marked CONNECTING
        wait for RDS_IN_XMIT to quiesce any rds_send_xmit threads
        call rds_tcp_reset_callbacks
            [.. race window where an incoming syn-ack can cause the conn
             to be marked UP from rds_tcp_state_change ..]
        lock_sock called from rds_tcp_reset_callbacks, and we set
            t_sock to null

    As soon as the conn is marked UP in the race window above, rds_send_xmit()
    threads will proceed to rds_tcp_xmit and may encounter a null-pointer
    deref on the t_sock.

    Given that rds_tcp_state_change() is invoked in softirq context, whereas
    rds_tcp_reset_callbacks() is in workq context, and testing for RDS_IN_XMIT
    after lock_sock could result in a deadlock with tcp_sendmsg, this
    commit fixes the race by using a new c_state, RDS_TCP_RESETTING, which
    will prevent a transition to RDS_CONN_UP from rds_tcp_state_change().

    Signed-off-by: Sowmini Varadhan
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Sowmini Varadhan
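
    A sketch of the state guard described above: the only transition into
    UP is from CONNECTING, so a path parked in the new resetting state
    cannot be flipped UP by a late syn-ack. State names and the transition
    helper are simplified stand-ins.

        #include <linux/atomic.h>
        #include <linux/types.h>

        enum example_state {
                EX_CONN_DOWN,
                EX_CONN_CONNECTING,
                EX_CONN_RESETTING,      /* callbacks being swapped: refuse UP */
                EX_CONN_UP,
        };

        /* Succeeds only if the state is still 'curr'. */
        static bool example_transition(atomic_t *state, int curr, int next)
        {
                return atomic_cmpxchg(state, curr, next) == curr;
        }

        /* Called when the TCP socket reports ESTABLISHED. */
        static void example_on_established(atomic_t *state)
        {
                if (!example_transition(state, EX_CONN_CONNECTING, EX_CONN_UP)) {
                        /* e.g. RESETTING: leave the state alone; the thread
                         * doing the reset restarts the connection later */
                }
        }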
     
  • When we switch a connection's sockets in rds_tcp_reset_callbacks(),
    any partially sent datagram must be retransmitted on the new
    socket so that the receiver can correctly reassemble the RDS
    datagram. Use rds_send_reset(), which is designed for this purpose.

    Signed-off-by: Sowmini Varadhan
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     
  • When rds_tcp_accept_one() has to replace the existing tcp socket
    with a newer tcp socket (duelling-syn resolution), it must lock_sock()
    to suppress the rds_tcp_data_recv() path while callbacks are being
    changed. Also, existing RDS datagram reassembly state must be reset,
    so that the next datagram on the new socket does not have corrupted
    state. Similarly, when resetting the newly accepted socket, appropriate
    locks and synchronization are needed.

    This commit ensures correct synchronization by invoking
    kernel_sock_shutdown to reset a newly accepted sock, and by taking
    appropriate lock_sock()s (for old and new sockets) when resetting
    existing callbacks.

    Signed-off-by: Sowmini Varadhan
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Sowmini Varadhan
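
    A sketch of the two synchronization points described above, with
    stand-in function names; the real rds_tcp_accept_one() and
    rds_tcp_reset_callbacks() do considerably more, and the nested locking
    order shown here is only an assumption for illustration.

        #include <linux/lockdep.h>
        #include <linux/net.h>
        #include <net/sock.h>

        /* Rejecting a newly accepted socket: shut it down so the peer sees
         * the connection go away instead of silently leaking it. */
        static void example_reject_new_sock(struct socket *new_sock)
        {
                kernel_sock_shutdown(new_sock, SHUT_RDWR);
        }

        /* Swapping callbacks between an old and a new socket: hold both
         * socket locks so ->sk_data_ready cannot run mid-change. */
        static void example_swap_callbacks(struct socket *osock, struct socket *nsock)
        {
                lock_sock(osock->sk);
                lock_sock_nested(nsock->sk, SINGLE_DEPTH_NESTING);

                /* ... move callbacks and reset reassembly state here ... */

                release_sock(nsock->sk);
                release_sock(osock->sk);
        }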
     

04 May, 2016

2 commits

  • An arbitration scheme for duelling SYNs is implemented as part of
    commit 241b271952eb ("RDS-TCP: Reset tcp callbacks if re-using an
    outgoing socket in rds_tcp_accept_one()") which ensures that both nodes
    involved will arrive at the same arbitration decision. However, this
    needs to be synchronized with an outgoing SYN to be generated by
    rds_tcp_conn_connect(). This commit achieves the synchronization
    through the t_conn_lock mutex in struct rds_tcp_connection.

    The rds_conn_state is checked in rds_tcp_conn_connect() after acquiring
    the t_conn_lock mutex. A SYN is sent out only if the RDS connection is
    not already UP (UP would indicate that rds_tcp_accept_one() has already
    completed the 3-way handshake, so no SYN needs to be generated).

    Similarly, the rds_conn_state is checked in rds_tcp_accept_one() after
    acquiring the t_conn_lock mutex. The only acceptable states (to
    allow continuation of the arbitration logic) are UP (i.e., outgoing SYN
    was SYN-ACKed by peer after it sent us the SYN) or CONNECTING (we sent
    outgoing SYN before we saw incoming SYN).

    Signed-off-by: Sowmini Varadhan
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Sowmini Varadhan
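
    A sketch of the arbitration described above: both sides take the same
    per-connection mutex and re-check the connection state before acting.
    Names and the state encoding are simplified stand-ins.

        #include <linux/mutex.h>
        #include <linux/types.h>

        enum { EX_DOWN, EX_CONNECTING, EX_UP };

        struct example_tcp_conn_state {
                struct mutex    t_conn_lock;    /* serializes connect vs accept */
                int             conn_state;
        };

        /* Active-connect side: decide under t_conn_lock whether an outgoing
         * SYN is still needed. */
        static bool example_should_send_syn(struct example_tcp_conn_state *tc)
        {
                bool send_syn;

                mutex_lock(&tc->t_conn_lock);
                /* If the accept side already completed the handshake we are
                 * UP, and no outgoing SYN must be generated. */
                send_syn = (tc->conn_state != EX_UP);
                mutex_unlock(&tc->t_conn_lock);

                return send_syn;
        }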
     
  • There is a race condition between rds_send_xmit -> rds_tcp_xmit
    and the code that deals with resolution of duelling syns added
    by commit 241b271952eb ("RDS-TCP: Reset tcp callbacks if re-using an
    outgoing socket in rds_tcp_accept_one()").

    Specifically, we may end up dereferencing a null pointer in rds_send_xmit
    if we have the interleaving sequence:

        rds_tcp_accept_one                    rds_send_xmit

                                              conn is RDS_CONN_UP, so
                                              invoke rds_tcp_xmit

                                              tc = conn->c_transport_data
        rds_tcp_restore_callbacks
            /* reset t_sock */
                                              null ptr deref from tc->t_sock

    The race condition can be avoided without adding the overhead of
    additional locking in the xmit path: have rds_tcp_accept_one wait
    for rds_tcp_xmit threads to complete before resetting callbacks.
    The synchronization can be done in the same manner as rds_conn_shutdown().
    First set the rds_conn_state to something other than RDS_CONN_UP
    (so that new threads cannot get into rds_tcp_xmit()), then wait for
    RDS_IN_XMIT to be cleared in the conn->c_flags indicating that any
    threads in rds_tcp_xmit are done.

    Fixes: 241b271952eb ("RDS-TCP: Reset tcp callbacks if re-using an outgoing socket in rds_tcp_accept_one()")
    Signed-off-by: Sowmini Varadhan
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Sowmini Varadhan
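
    A sketch of the quiescence pattern described above, with stand-in field
    and bit names: move the connection out of UP so new senders bail out,
    then wait for the in-flight sender (which holds the xmit bit) to finish.

        #include <linux/bitops.h>
        #include <linux/wait.h>

        #define EX_IN_XMIT      0       /* flag bit held by the active sender */

        struct example_path {
                unsigned long           flags;
                wait_queue_head_t       waitq;  /* woken when EX_IN_XMIT clears */
                int                     state;  /* UP / CONNECTING / ... */
        };

        /* Quiesce the send path before touching t_sock:
         *  1. leave the UP state so rds_tcp_xmit() refuses new work;
         *  2. wait for any sender already inside the xmit path to finish
         *     and clear the bit.
         */
        static void example_quiesce_xmit(struct example_path *cp)
        {
                cp->state = 1;          /* anything but UP, e.g. CONNECTING */
                wait_event(cp->waitq, !test_bit(EX_IN_XMIT, &cp->flags));
        }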
     

19 Mar, 2016

2 commits

  • RDS_TCP_DEFAULT_BUFSIZE has been unused since commit 1edd6a14d24f
    ("RDS-TCP: Do not bloat sndbuf/rcvbuf in rds_tcp_tune").

    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     
  • Add per-net sysctl tunables to set the size of sndbuf and
    rcvbuf on the kernel tcp socket.

    The tunables are added at /proc/sys/net/rds/tcp/rds_tcp_sndbuf
    and /proc/sys/net/rds/tcp/rds_tcp_rcvbuf.

    These values must be set before accept() or connect(),
    and there may be an arbitrary number of existing rds-tcp
    sockets when the tunable is modified. To make sure that all
    connections in the netns pick up the same value for the tunable,
    we reset existing rds-tcp connections in the netns, so that
    they can reconnect with the new parameters.

    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
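
    A sketch of per-net tunables of this kind, using stand-in variable
    names and a minimal handler; the real code also bounds-checks the
    values and resets existing connections when they change.

        #include <linux/errno.h>
        #include <linux/sysctl.h>
        #include <net/net_namespace.h>

        static int example_sndbuf_size;
        static int example_rcvbuf_size;

        static struct ctl_table example_sysctl_table[] = {
                {
                        .procname       = "rds_tcp_sndbuf",
                        .data           = &example_sndbuf_size,
                        .maxlen         = sizeof(int),
                        .mode           = 0644,
                        .proc_handler   = proc_dointvec,
                },
                {
                        .procname       = "rds_tcp_rcvbuf",
                        .data           = &example_rcvbuf_size,
                        .maxlen         = sizeof(int),
                        .mode           = 0644,
                        .proc_handler   = proc_dointvec,
                },
                { }
        };

        /* Registers /proc/sys/net/rds/tcp/... for one namespace. */
        static int example_register_sysctls(struct net *net)
        {
                return register_net_sysctl(net, "net/rds/tcp",
                                           example_sysctl_table) ? 0 : -ENOMEM;
        }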
     

11 Feb, 2016

1 commit


05 Oct, 2015

1 commit

  • Using the value of RDS_TCP_DEFAULT_BUFSIZE (128K)
    clobbers efficient use of TSO because it inflates the size_goal
    that is computed in tcp_sendmsg/tcp_sendpage and skews packet
    latency; the default values for these parameters actually
    result in significantly better performance.

    Request-response tests using rds-stress with a packet size of
    100K and 16 threads (test parameters -q 100000 -a 256 -t16 -d16)
    between a single pair of IP addresses achieve a throughput of
    6-8 Gbps. Without this patch, throughput maxes out at 2-3 Gbps under
    equivalent conditions on these platforms.

    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     

08 Aug, 2015

2 commits

  • Register pernet subsys init/stop functions that will set up
    and tear down per-net RDS-TCP listen endpoints. Unregister
    pernet subsys functions on 'modprobe -r' to clean up these
    end points.

    Enable keepalive on both accept and connect socket endpoints.
    The keepalive timer expiration will ensure that client socket
    endpoints will be removed as appropriate from the netns when
    an interface is removed from a namespace.

    Register a device notifier callback that will clean up all
    sockets (and thus avoid the need to wait for keepalive timeout)
    when the loopback device is unregistered from the netns indicating
    that the netns is getting deleted.

    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
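
    A sketch of the device-notifier idea described above, with stand-in
    names; the real callback and the exact event it checks may differ.

        #include <linux/netdevice.h>
        #include <linux/notifier.h>
        #include <net/net_namespace.h>

        static void example_netns_cleanup(struct net *net)
        {
                /* shut down this namespace's listen endpoint and client
                 * sockets (omitted) */
        }

        /* Unregistration of the loopback device is the tail end of netns
         * teardown, so use it as the cleanup trigger rather than waiting
         * for keepalive timers to fire. */
        static int example_dev_event(struct notifier_block *nb,
                                     unsigned long event, void *ptr)
        {
                struct net_device *dev = netdev_notifier_info_to_dev(ptr);

                if (event == NETDEV_UNREGISTER && dev->ifindex == LOOPBACK_IFINDEX)
                        example_netns_cleanup(dev_net(dev));

                return NOTIFY_DONE;
        }

        static struct notifier_block example_dev_notifier = {
                .notifier_call = example_dev_event,
        };

    The notifier would be hooked up with register_netdevice_notifier() at
    module init, alongside the pernet_operations whose ->init/->exit set up
    and tear down the per-net listen endpoint.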
     
  • Open the sockets calling sock_create_kern() with the correct struct net
    pointer, and use that struct net pointer when verifying the
    address passed to rds_bind().

    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     

01 Nov, 2011

1 commit


04 Nov, 2010

1 commit


21 Oct, 2010

1 commit


09 Sep, 2010

4 commits


30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include
    those headers directly instead of assuming availability. As this
    conversion needs to touch a large number of source files, the
    following script is used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there, i.e. if only gfp is used,
    gfp.h; if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given that I had only a couple of failures from the tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

24 Aug, 2009

2 commits

  • Now that transports can be loaded in arbitrary order,
    it is important for rds_trans_get_preferred() to look
    for them in a particular order, instead of walking the list
    until it finds a transport that works for a given address.
    Now, each transport registers for a specific transport slot,
    and these are ordered so that preferred transports come first,
    and then if they are not loaded, other transports are queried.

    Signed-off-by: Andy Grover
    Signed-off-by: David S. Miller

    Andy Grover
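
    A sketch of the slot scheme described above: transports register into
    a fixed, preference-ordered array and lookup walks the slots in order.
    The array size, ops structure, and function names are simplified
    stand-ins (the real slots are identified by constants such as
    RDS_TRANS_IB and RDS_TRANS_TCP).

        #include <linux/stddef.h>
        #include <linux/types.h>

        #define EX_TRANS_COUNT  3       /* fixed number of preference-ordered slots */

        struct example_transport {
                /* returns 0 if this transport can serve the local address */
                int (*laddr_check)(__be32 addr);
        };

        static struct example_transport *example_transports[EX_TRANS_COUNT];

        /* A transport registers into its own slot when its module loads. */
        static void example_register_transport(unsigned int slot,
                                               struct example_transport *trans)
        {
                example_transports[slot] = trans;
        }

        /* Walk the slots in preference order; unloaded transports are NULL. */
        static struct example_transport *example_get_preferred(__be32 addr)
        {
                unsigned int i;

                for (i = 0; i < EX_TRANS_COUNT; i++) {
                        struct example_transport *t = example_transports[i];

                        if (t && t->laddr_check(addr) == 0)
                                return t;
                }
                return NULL;
        }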
     
  • This code allows RDS to be tunneled over a TCP connection.

    RDMA operations are disabled when using TCP transport,
    but this frees RDS from the IB/RDMA stack dependency, and allows
    it to be used with standard Ethernet adapters, or in a VM.

    Signed-off-by: Andy Grover
    Signed-off-by: David S. Miller

    Andy Grover