25 Feb, 2018

1 commit

  • commit 681648e67d43cf269c5590ecf021ed481f4551fc upstream.

    Commit 8edc3affc077 ("rds: tcp: Take explicit refcounts on struct net")
    introduces a regression in rds-tcp netns cleanup. cleanup_net()
    (and thus the rds_tcp_dev_event notification) is only called from put_net()
    when all netns refcounts go to 0, but this can never happen if the
    rds_connection itself is holding a c_net ref that it expects to
    release in rds_tcp_kill_sock.

    Instead, the rds_tcp_kill_sock callback should make sure to
    tear down state carefully, ensuring that the socket teardown
    is only done after all data-structures and workqs that depend
    on it are quiesced.

    The original motivation for commit 8edc3affc077 ("rds: tcp: Take explicit
    refcounts on struct net") was to resolve a race condition reported by
    syzkaller where workqs for tx/rx/connect were triggered after the
    namespace was deleted. Those worker threads should have been
    cancelled/flushed before socket tear-down and indeed,
    rds_conn_path_destroy() does try to sequence this by doing
    /* cancel cp_send_w */
    /* cancel cp_recv_w */
    /* flush cp_down_w */
    /* free data structures */
    Here the "flush cp_down_w" will trigger rds_conn_shutdown and thus
    invoke rds_tcp_conn_path_shutdown() to close the tcp socket, so that
    we ought to have satisfied the requirement that "socket-close is
    done after all other dependent state is quiesced". However,
    rds_conn_shutdown has a bug: it *always* triggers the reconnect
    workq (and if the reconnect succeeds, the tx/rx workqs are restarted
    as well), so with the right timing we risk exactly the race
    conditions reported by syzkaller.

    Netns deletion is like module teardown: there is no need to restart
    a reconnect in this case. We can use the c_destroy_in_prog bit to
    avoid restarting the reconnect, as sketched below.
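
    The shape of that fix is a one-line guard (a sketch using the bit
    named above; the exact call sites in the patch may differ):

    /* sketch: in the shutdown path, only restart the reconnect
     * worker when the connection is not being torn down */
    if (!conn->c_destroy_in_prog)
            rds_queue_reconnect(cp);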

    Fixes: 8edc3affc077 ("rds: tcp: Take explicit refcounts on struct net")
    Signed-off-by: Sowmini Varadhan
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Sowmini Varadhan
     

04 Aug, 2017

1 commit

  • RDS over IB does not use multipath RDS, so the array
    of additional rds_conn_path structures is always superfluous
    in this case. Reduce the memory footprint of the rds module
    by making this a dynamic allocation predicated on whether
    the transport is mp_capable.
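
    A sketch of what that allocation looks like (identifiers follow the
    rds sources; error handling is abbreviated):

    int npaths = trans->t_mp_capable ? RDS_MPATH_WORKERS : 1;

    conn->c_path = kcalloc(npaths, sizeof(struct rds_conn_path), gfp);
    if (!conn->c_path) {
            kmem_cache_free(rds_conn_slab, conn);
            conn = ERR_PTR(-ENOMEM);
            goto out;
    }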

    Signed-off-by: Sowmini Varadhan
    Acked-by: Santosh Shilimkar
    Tested-by: Efrain Galaviz
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     

17 Jul, 2017

1 commit

    We could end up executing rds_conn_shutdown before the rds_recv_worker
    thread runs; then rds_conn_shutdown -> rds_tcp_conn_shutdown can do a
    sock_release and set sock->sk to null, which may interleave in bad
    ways with rds_recv_worker, e.g., it could result in:

    "BUG: unable to handle kernel NULL pointer dereference at 0000000000000078"
    [ffff881769f6fd70] release_sock at ffffffff815f337b
    [ffff881769f6fd90] rds_tcp_recv at ffffffffa043c888 [rds_tcp]
    [ffff881769f6fdb0] rds_recv_worker at ffffffffa04a4810 [rds]
    [ffff881769f6fde0] process_one_work at ffffffff810a14c1
    [ffff881769f6fe40] worker_thread at ffffffff810a1940
    [ffff881769f6fec0] kthread at ffffffff810a6b1e

    Also, do not enqueue any new shutdown workq items when the connection is
    shutting down (this may happen for rds-tcp in softirq mode, if a FIN
    or CLOSE is received while the module is in the middle of an unload).
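
    A plausible shape for the first half of the fix (field names follow
    rds_tcp, but treat this as a sketch rather than the literal patch):

    struct rds_tcp_connection *tc = cp->cp_transport_data;

    /* recv path: re-check the transport socket under the connection
     * lock before touching sock->sk, since rds_conn_shutdown() may
     * already have done the sock_release() */
    mutex_lock(&tc->t_conn_path_lock);
    if (!tc->t_sock) {
            mutex_unlock(&tc->t_conn_path_lock);
            return -ENOTCONN;
    }
    /* safe to use tc->t_sock->sk here */
    mutex_unlock(&tc->t_conn_path_lock);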

    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     

22 Jun, 2017

1 commit

  • If we are unloading the rds_tcp module, we can set linger to 1
    and drop pending packets to accelerate reconnect. The peer will
    end up resetting the connection based on new generation numbers
    of the new incarnation, so hanging on to unsent TCP packets via
    linger is mostly pointless in this case.
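
    The helper doing this looks roughly like the following (a sketch;
    kernel_setsockopt() was the in-kernel way to set socket options in
    kernels of this era):

    static void rds_tcp_set_linger(struct socket *sock)
    {
            struct linger no_linger = {
                    .l_onoff  = 1,
                    .l_linger = 0,  /* close() drops the queue, sends RST */
            };

            kernel_setsockopt(sock, SOL_SOCKET, SO_LINGER,
                              (char *)&no_linger, sizeof(no_linger));
    }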

    Signed-off-by: Sowmini Varadhan
    Tested-by: Jenny Xu
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     

17 Jun, 2017

1 commit

  • After commit 1a0e100fb2c9 ("RDS: TCP: Force every connection to be
    initiated by numerically smaller IP address") we no longer need
    the logic associated with cp_outgoing, so clean up usage of this
    field.

    Signed-off-by: Sowmini Varadhan
    Tested-by: Imanti Mendez
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     

03 Apr, 2017

1 commit

  • …ROR by an intervening FIN

    rds_conn_shutdown() runs in workq context, and marks the rds_connection
    as DISCONNECTING before quiescing Tx/Rx paths. However, after all I/O
    has quiesced, we may still find the rds_connection state to be
    RDS_CONN_ERROR if an intervening FIN was processed in softirq context.

    This is not a fatal error: rds_conn_shutdown() should continue the
    shutdown, and there is no need to log noisy messages about this event.
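
    A sketch of the tolerant state transition (function names follow the
    mprds-era rds sources; the details of the real patch may differ):

    /* treat ERROR like DISCONNECTING when moving the path to DOWN,
     * since a FIN handled in softirq context can legitimately have
     * moved us to RDS_CONN_ERROR while tx/rx were quiescing */
    if (!rds_conn_path_transition(cp, RDS_CONN_DISCONNECTING,
                                  RDS_CONN_DOWN) &&
        !rds_conn_path_transition(cp, RDS_CONN_ERROR, RDS_CONN_DOWN)) {
            rds_conn_path_error(cp, "shutdown called in state %d\n",
                                atomic_read(&cp->cp_state));
            return;
    }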

    Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
    Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

    Sowmini Varadhan
     

08 Mar, 2017

1 commit

  • It is incorrect for the rds_connection to piggyback on the
    sock_net() refcount for the netns because this gives rise to
    a chicken-and-egg problem during rds_conn_destroy. Instead explicitly
    take a ref on the net, and hold the netns down till the connection
    tear-down is complete.
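
    The lifetime rule reduces to a paired get/put (sketch):

    conn->c_net = get_net(net);     /* in __rds_conn_create() */
    /* ... connection and its workqs may reference the netns ... */
    put_net(conn->c_net);           /* in rds_conn_destroy() */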

    Reported-by: Dmitry Vyukov
    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     

03 Jan, 2017

1 commit


18 Nov, 2016

2 commits

  • When 2 RDS peers initiate an RDS-TCP connection simultaneously,
    there is a potential for "duelling syns" on either/both sides.
    See commit 241b271952eb ("RDS-TCP: Reset tcp callbacks if re-using an
    outgoing socket in rds_tcp_accept_one()") for a description of this
    condition, and the arbitration logic which ensures that the
    numerically large IP address in the TCP connection is bound to the
    RDS_TCP_PORT ("canonical ordering").

    The rds_connection should not be marked as RDS_CONN_UP until the
    arbitration logic has converged, for the following reason. The sender
    may start transmitting RDS datagrams as soon as RDS_CONN_UP is set,
    and the sender removes datagrams from the rds_connection's
    cp_retrans queue based on TCP acks. If a TCP ack was sent from
    a tcp socket that got reset as part of duel arbitration (but
    before the data was delivered to the receiver's RDS socket layer),
    the sender may end up prematurely freeing the datagram, and
    the datagram is no longer reliably deliverable.

    This patch remedies that condition by making sure that, upon
    receipt of 3WH completion state change notification of TCP_ESTABLISHED
    in rds_tcp_state_change, we mark the rds_connection as RDS_CONN_UP
    if, and only if, the IP addresses and ports for the connection are
    canonically ordered. In all other cases, rds_tcp_state_change will
    force an rds_conn_path_drop(), and rds_queue_reconnect() on
    both peers will restart the connection to ensure canonical ordering.

    A side-effect of enforcing this condition in rds_tcp_state_change()
    is that rds_tcp_accept_one_path() can now be refactored for simplicity.
    It is also no longer possible to encounter an RDS_CONN_UP connection in
    the arbitration logic in rds_tcp_accept_one().
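
    A sketch of the TCP_ESTABLISHED handling in rds_tcp_state_change()
    (names follow the rds sources; the real patch has more detail):

    /* inside the switch on sk->sk_state: */
    case TCP_ESTABLISHED:
            /* only a canonically ordered connection is marked up;
             * otherwise drop the path and let both peers reconnect
             * with the smaller address initiating */
            if (ntohl(conn->c_laddr) > ntohl(conn->c_faddr))
                    rds_conn_path_drop(cp);
            else
                    rds_connect_path_complete(cp, RDS_CONN_CONNECTING);
            break;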

    Signed-off-by: Sowmini Varadhan
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     
  • The RDS transport has to be able to distinguish between
    two types of failure events:
    (a) when the transport fails (e.g., TCP connection reset)
    but the RDS socket/connection layer on both sides stays
    the same
    (b) when the peer's RDS layer itself resets (e.g., due to module
    reload or machine reboot at the peer)
    In case (a) both sides must reconnect and continue the RDS messaging
    without any message loss or disruption to the message sequence numbers,
    and this is achieved by rds_send_path_reset().

    In case (b) we should reset all rds_connection state to the
    new incarnation of the peer. Examples of state that needs to
    be reset are the next expected rx sequence number from, and messages
    to be retransmitted to, the new incarnation of the peer.

    To achieve this, the RDS handshake probe added as part of
    commit 5916e2c1554f ("RDS: TCP: Enable multipath RDS for TCP")
    is enhanced so that sender and receiver of the RDS ping-probe
    will add a generation number as part of the RDS_EXTHDR_GEN_NUM
    extension header. Each peer stores local and remote generation
    numbers as part of each rds_connection. Changes in generation
    number will be detected via incoming handshake probe ping
    request or response and will allow the receiver to reset rds_connection
    state.
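
    A sketch of the receive-side check (rds_conn_flush_peer_state() is a
    hypothetical name standing in for the actual reset logic):

    /* on an incoming handshake probe, compare the peer's advertised
     * generation number with the one we last recorded */
    if (conn->c_peer_gen_num &&
        peer_gen_num != conn->c_peer_gen_num)
            rds_conn_flush_peer_state(conn);    /* hypothetical: drop the
                                                 * retransmit queue, reset
                                                 * the next expected rx
                                                 * sequence number */
    conn->c_peer_gen_num = peer_gen_num;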

    Signed-off-by: Sowmini Varadhan
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     

17 Oct, 2016

1 commit

  • This macro's last use was removed in commit d769ef81d5b59
    ("RDS: Update rds_conn_shutdown to work with rds_conn_path")
    so make the macro and the __rds_conn_error function definition
    and declaration disappear.

    Signed-off-by: Joe Perches
    Acked-by: Sowmini Varadhan
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Joe Perches
     

16 Jul, 2016

1 commit

    Use an RDS probe-ping to compute how many paths may be used with
    the peer, and to synchronously start the multiple paths. If the
    transport supports multipath RDS (mprds), hash outgoing traffic
    to one of the multiple paths in rds_sendmsg().
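
    The hashing helper in the tree is named rds_send_mprds_hash(); here
    is a simplified sketch of the idea, not its exact body:

    /* hash on the sender's bound port so a given RDS socket sticks
     * to one path and per-socket ordering is preserved */
    static int rds_send_path_hash(struct rds_sock *rs,
                                  struct rds_connection *conn)
    {
            int npaths = conn->c_npaths ? conn->c_npaths : 1;

            return jhash_1word((u32)rs->rs_bound_port, 0) % npaths;
    }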

    CC: Santosh Shilimkar
    Signed-off-by: Sowmini Varadhan
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     

02 Jul, 2016

3 commits


15 Jun, 2016

7 commits


25 Nov, 2015

1 commit

  • Sasha's found a NULL pointer dereference in the RDS connection code when
    sending a message to an apparently unbound socket. The problem is caused
    by the code checking if the socket is bound in rds_sendmsg(), which checks
    the rs_bound_addr field without taking a lock on the socket. This opens a
    race where rs_bound_addr is temporarily set but the transport is not
    yet set in rds_bind(), leading to a NULL pointer dereference when trying
    to dereference 'trans' in __rds_conn_create().

    Vegard wrote a reproducer for this issue, so kindly ask him to share if
    you're interested.

    I cannot reproduce the NULL pointer dereference using Vegard's reproducer
    with this patch, whereas I could without.

    Complete earlier incomplete fix to CVE-2015-6937:

    74e98eb08588 ("RDS: verify the underlying transport exists before creating a connection")
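
    The fix is essentially to widen the locked region in rds_sendmsg()
    (a sketch of its shape, not the verbatim diff):

    lock_sock(sk);
    if (daddr == 0 || rs->rs_bound_addr == 0) {
            release_sock(sk);
            ret = -ENOTCONN;        /* XXX not a great errno */
            goto out;
    }
    release_sock(sk);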

    Cc: David S. Miller
    Cc: stable@vger.kernel.org

    Reviewed-by: Vegard Nossum
    Reviewed-by: Sasha Levin
    Acked-by: Santosh Shilimkar
    Signed-off-by: Quentin Casasnovas
    Signed-off-by: David S. Miller

    Quentin Casasnovas
     

05 Oct, 2015

1 commit

  • Commit f711a6ae062c ("net/rds: RDS-TCP: Always create a new rds_sock
    for an incoming connection.") modified rds-tcp so that an incoming SYN
    would ignore an existing "client" TCP connection which had the local
    port set to the transient port. The motivation for ignoring the existing
    "client" connection in f711a6ae was to avoid race conditions and an
    endless duel of reconnect attempts triggered by a restart/abort of one
    of the nodes in the TCP connection.

    However, having separate sockets for active and passive sides
    is avoidable, and the simpler model of a single TCP socket for
    both send and receives of all RDS connections associated with
    that tcp socket makes for easier observability. We avoid the race
    conditions from f711a6ae by attempting reconnects in rds_conn_shutdown
    if, and only if, the (new) c_outgoing bit is set for RDS_TRANS_TCP.
    The c_outgoing bit is initialized in __rds_conn_create().

    A side-effect of re-using the client rds_connection for an incoming
    SYN is the potential of encountering duelling SYNs, i.e., we
    have an outgoing RDS_CONN_CONNECTING socket when we get the incoming
    SYN. The logic to arbitrate this criss-crossing SYN exchange in
    rds_tcp_accept_one() has been modified to emulate the BGP state
    machine: the smaller IP address should back off from the connection attempt.
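
    The arbitration rule itself is tiny; here it is as a stand-alone
    illustration (a hypothetical helper, not the kernel function):

    #include <stdbool.h>
    #include <stdint.h>

    /* duelling-SYN rule: given both addresses in host byte order,
     * should the local side abandon its own connection attempt? */
    static bool rds_tcp_should_back_off(uint32_t my_addr,
                                        uint32_t peer_addr)
    {
            return my_addr < peer_addr;   /* smaller address backs off */
    }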

    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     

10 Sep, 2015

1 commit

    There was no verification that an underlying transport exists when creating
    a connection; this would cause a NULL pointer dereference.

    It might happen on sockets that weren't properly bound before attempting to
    send a message, which will cause a NULL ptr deref:

    [135546.047719] kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
    [135546.051270] Modules linked in:
    [135546.051781] CPU: 4 PID: 15650 Comm: trinity-c4 Not tainted 4.2.0-next-20150902-sasha-00041-gbaa1222-dirty #2527
    [135546.053217] task: ffff8800835bc000 ti: ffff8800bc708000 task.ti: ffff8800bc708000
    [135546.054291] RIP: __rds_conn_create (net/rds/connection.c:194)
    [135546.055666] RSP: 0018:ffff8800bc70fab0 EFLAGS: 00010202
    [135546.056457] RAX: dffffc0000000000 RBX: 0000000000000f2c RCX: ffff8800835bc000
    [135546.057494] RDX: 0000000000000007 RSI: ffff8800835bccd8 RDI: 0000000000000038
    [135546.058530] RBP: ffff8800bc70fb18 R08: 0000000000000001 R09: 0000000000000000
    [135546.059556] R10: ffffed014d7a3a23 R11: ffffed014d7a3a21 R12: 0000000000000000
    [135546.060614] R13: 0000000000000001 R14: ffff8801ec3d0000 R15: 0000000000000000
    [135546.061668] FS: 00007faad4ffb700(0000) GS:ffff880252000000(0000) knlGS:0000000000000000
    [135546.062836] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [135546.063682] CR2: 000000000000846a CR3: 000000009d137000 CR4: 00000000000006a0
    [135546.064723] Stack:
    [135546.065048] ffffffffafe2055c ffffffffafe23fc1 ffffed00493097bf ffff8801ec3d0008
    [135546.066247] 0000000000000000 00000000000000d0 0000000000000000 ac194a24c0586342
    [135546.067438] 1ffff100178e1f78 ffff880320581b00 ffff8800bc70fdd0 ffff880320581b00
    [135546.068629] Call Trace:
    [135546.069028] ? __rds_conn_create (include/linux/rcupdate.h:856 net/rds/connection.c:134)
    [135546.069989] ? rds_message_copy_from_user (net/rds/message.c:298)
    [135546.071021] rds_conn_create_outgoing (net/rds/connection.c:278)
    [135546.071981] rds_sendmsg (net/rds/send.c:1058)
    [135546.072858] ? perf_trace_lock (include/trace/events/lock.h:38)
    [135546.073744] ? lockdep_init (kernel/locking/lockdep.c:3298)
    [135546.074577] ? rds_send_drop_to (net/rds/send.c:976)
    [135546.075508] ? __might_fault (./arch/x86/include/asm/current.h:14 mm/memory.c:3795)
    [135546.076349] ? __might_fault (mm/memory.c:3795)
    [135546.077179] ? rds_send_drop_to (net/rds/send.c:976)
    [135546.078114] sock_sendmsg (net/socket.c:611 net/socket.c:620)
    [135546.078856] SYSC_sendto (net/socket.c:1657)
    [135546.079596] ? SYSC_connect (net/socket.c:1628)
    [135546.080510] ? trace_dump_stack (kernel/trace/trace.c:1926)
    [135546.081397] ? ring_buffer_unlock_commit (kernel/trace/ring_buffer.c:2479 kernel/trace/ring_buffer.c:2558 kernel/trace/ring_buffer.c:2674)
    [135546.082390] ? trace_buffer_unlock_commit (kernel/trace/trace.c:1749)
    [135546.083410] ? trace_event_raw_event_sys_enter (include/trace/events/syscalls.h:16)
    [135546.084481] ? do_audit_syscall_entry (include/trace/events/syscalls.h:16)
    [135546.085438] ? trace_buffer_unlock_commit (kernel/trace/trace.c:1749)
    [135546.085515] rds_ib_laddr_check(): addr 36.74.25.172 ret -99 node type -1
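
    The fix in __rds_conn_create() is a straightforward guard (sketch):

    if (trans == NULL) {
            kmem_cache_free(rds_conn_slab, conn);
            conn = ERR_PTR(-ENODEV);
            goto out;
    }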

    Acked-by: Santosh Shilimkar
    Signed-off-by: Sasha Levin
    Signed-off-by: David S. Miller

    Sasha Levin
     

06 Sep, 2015

1 commit


26 Aug, 2015

1 commit

  • If we get an ENOMEM during rds_ib_recv_refill, we might never come
    back and refill again later. Patch makes sure to kick krdsd into
    helping out.

    To achieve this we add an RDS_RECV_REFILL flag and update the refill
    path based on it, so that at least some thread will keep posting
    receive buffers.

    Since krdsd and softirq both might race for refill, we decide to
    schedule on work queue based on ring_low instead of ring_empty.
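
    A sketch of the resulting refill logic (names follow the rds_ib
    sources; the real code threads a gfp/prefill context through):

    struct rds_ib_connection *ic = conn->c_transport_data;

    /* only one thread refills at a time; anyone who sees the ring
     * running low afterwards kicks krdsd to keep buffers posted */
    if (!test_and_set_bit(RDS_RECV_REFILL, &conn->c_flags)) {
            rds_ib_recv_refill(conn, 0, GFP_NOWAIT);
            clear_bit(RDS_RECV_REFILL, &conn->c_flags);
    }
    if (rds_ib_ring_low(&ic->i_recv_ring))
            queue_delayed_work(rds_wq, &conn->c_recv_w, 1);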

    Reviewed-by: Ajaykumar Hotchandani
    Signed-off-by: Santosh Shilimkar
    Signed-off-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    santosh.shilimkar@oracle.com
     

08 Aug, 2015

1 commit


10 May, 2015

2 commits

  • When the peer of an RDS-TCP connection restarts, a reconnect
    attempt should only be made from the active side of the TCP
    connection, i.e. the side that has a transient TCP port
    number. Do not add the passive side of the TCP connection
    to the c_hash_node and thus avoid triggering rds_queue_reconnect()
    for passive rds connections.

    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     
    When running RDS over TCP, the active (client) side connects to the
    listening ("passive") side at the RDS_TCP_PORT. After the connection
    is established, if the client side reboots (potentially without even
    sending a FIN) the server still has a TCP socket in the established
    state. If the server side now gets a new SYN from the client
    with a different client port, TCP will create a new socket-pair, but
    the RDS layer will incorrectly pull up the old rds_connection (which
    is still associated with the stale t_sock and RDS socket state).

    This patch corrects this behavior by having rds_tcp_accept_one()
    always create a new connection for an incoming TCP SYN.
    The rds and tcp state associated with the old socket-pair is cleaned
    up via the rds_tcp_state_change() callback which would typically be
    invoked in most cases when the client-TCP sends a FIN on TCP restart,
    triggering a transition to CLOSE_WAIT state. In the rarer event of client
    death without a FIN, TCP_KEEPALIVE probes on the socket will detect
    the stale socket, and the TCP transition to CLOSE state will trigger
    the RDS state cleanup.

    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     

09 Apr, 2015

2 commits

  • If a determined set of concurrent senders keep the send queue full,
    we can loop forever inside rds_send_xmit. This fix has two parts.

    First, we drop out of the while(1) loop after we've processed a
    large batch of messages.

    Second we add a generation number that gets bumped each time the
    xmit bit lock is acquired. If someone else has jumped in and
    made progress in the queue, we skip our goto restart.
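
    A sketch of both parts inside rds_send_xmit() (rds_next_message()
    and rds_more_queued() are hypothetical stand-ins for the real
    queue checks):

    restart:
            batch_count = 0;
            conn->c_send_gen++;             /* we hold the xmit lock */
            send_gen = conn->c_send_gen;

            while (rds_next_message(conn)) {
                    if (batch_count++ >= send_batch_count)
                            goto over_batch;        /* hand off to krdsd */
                    /* ... transmit one message ... */
            }
    over_batch:
            /* xmit lock released here; only restart if nobody else
             * made progress since we bumped the generation */
            if (rds_more_queued(conn) && send_gen == conn->c_send_gen)
                    goto restart;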

    Original patch by Chris Mason.

    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     
  • Passive connections were added for the case where one loopback IB
    connection between identical addresses needs another connection to store
    the second QP. Unfortunately, they were also created in the case where
    the addresses differ and we already have both QPs.

    This led to a message reordering bug.

    - two different IB interfaces and addresses on a machine: A B
    - traffic is sent from A to B
    - connection from A-B is created, connect request sent
    - listening accepts connect request, B-A is created
    - traffic flows, next_rx is incremented
    - unacked messages exist on the retrans list
    - connection A-B is shut down, new connect request sent
    - listen sees existing loopback B-A, creates new passive B-A
    - retrans messages are sent and delivered because of 0 next_rx

    The problem is that the second connection request saw the previously
    existing parent connection. Instead of using it, and using the existing
    next_rx_seq state for the traffic between those IPs, it mistakenly
    thought that it had to create a passive connection.

    We fix this by only using passive connections in the special case where
    laddr and faddr match. In this case we'll only ever have one parent
    sending connection requests and one passive connection created as the
    listening path sees the existing parent connection which initiated the
    request.
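
    The new rule condenses to a small predicate (a hypothetical helper
    capturing the condition, not code from the patch):

    /* a passive connection is only needed for the self-loopback
     * case, where both sides of the pair share one address */
    static bool rds_needs_passive_conn(__be32 laddr, __be32 faddr,
                                       bool is_outgoing)
    {
            return laddr == faddr && !is_outgoing;
    }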

    Original patch by Zach Brown

    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     

20 Oct, 2013

2 commits


28 Feb, 2013

1 commit

    I'm not sure why, but the hlist for-each-entry iterators were conceived
    differently from the list ones. The list iterator is

    list_for_each_entry(pos, head, member)

    The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

    Why did they need an extra pos parameter? I'm not quite sure. Not only
    do they not really need it, it also prevents the iterator from looking
    exactly like the list iterator, which is unfortunate.
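
    In rds terms the conversion looks like this (do_something() stands
    in for the loop body):

    struct rds_connection *conn;
    struct hlist_node *pos;

    /* old form: extra 'pos' scratch parameter */
    hlist_for_each_entry(conn, pos, head, c_hash_node)
            do_something(conn);

    /* new form: matches list_for_each_entry() */
    hlist_for_each_entry(conn, head, c_hash_node)
            do_something(conn);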

    Besides the semantic patch, there was some manual work required:

    - Fix up the actual hlist iterators in linux/list.h
    - Fix up the declaration of other iterators based on the hlist ones.
    - A very small amount of places were using the 'node' parameter, this
    was modified to use 'obj->member' instead.
    - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
    properly, so those had to be fixed up manually.

    The semantic patch which is mostly the work of Peter Senna Tschudin is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;

    [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
    [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: fix warnings]
    [akpm@linux-foundation.org: redo intrusive kvm changes]
    Tested-by: Peter Senna Tschudin
    Acked-by: Paul E. McKenney
    Signed-off-by: Sasha Levin
    Cc: Wu Fengguang
    Cc: Marcelo Tosatti
    Cc: Gleb Natapov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

01 Nov, 2011

1 commit


21 Oct, 2010

1 commit


09 Sep, 2010

3 commits

  • Nothing was canceling the send and receive work that might have been
    queued as a conn was being destroyed.
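
    The fix amounts to quiescing the workers in the destroy path
    (sketch; pre-mprds field names):

    cancel_delayed_work_sync(&conn->c_send_w);
    cancel_delayed_work_sync(&conn->c_recv_w);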

    Signed-off-by: Zach Brown

    Zach Brown
     
  • rds_conn_shutdown() can return before the connection is shut down when
    it encounters an existing state that it doesn't understand. This lets
    rds_conn_destroy() then start tearing down the conn from under paths
    that are still using it.

    It's more reliable to queue the shutdown work and wait for krdsd to
    complete the shutdown callback. This stopped some hangs I was seeing
    where krdsd was trying to shut down a freed conn.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • Right now there's nothing to stop the various paths that use
    rs->rs_transport from racing with rmmod and executing freed transport
    code. The simple fix is to have binding to a transport also hold a
    reference to the transport's module, removing this class of races.

    We already had an unused t_owner field which was set for the modular
    transports and which wasn't set for the built-in loop transport.
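
    A sketch of how the bind path uses it (the release-side put is
    symmetric):

    /* pin the transport module when a socket binds to it */
    if (!try_module_get(trans->t_owner))
            return -ENODEV;         /* module is on its way out */
    rs->rs_transport = trans;

    /* ... and on socket release: */
    module_put(rs->rs_transport->t_owner);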

    Signed-off-by: Zach Brown

    Zach Brown