25 Nov, 2015

1 commit

  • Sasha's found a NULL pointer dereference in the RDS connection code when
    sending a message to an apparently unbound socket. The problem is caused
    by the code checking if the socket is bound in rds_sendmsg(), which checks
    the rs_bound_addr field without taking a lock on the socket. This opens a
    race where rs_bound_addr is temporarily set but where the transport is not
    in rds_bind(), leading to a NULL pointer dereference when trying to
    dereference 'trans' in __rds_conn_create().

    Vegard wrote a reproducer for this issue, so kindly ask him to share if
    you're interested.

    I cannot reproduce the NULL pointer dereference using Vegard's reproducer
    with this patch, whereas I could without.

    Complete earlier incomplete fix to CVE-2015-6937:

    74e98eb08588 ("RDS: verify the underlying transport exists before creating a connection")

    Cc: David S. Miller
    Cc: stable@vger.kernel.org

    Reviewed-by: Vegard Nossum
    Reviewed-by: Sasha Levin
    Acked-by: Santosh Shilimkar
    Signed-off-by: Quentin Casasnovas
    Signed-off-by: David S. Miller

    Quentin Casasnovas
     

05 Oct, 2015

1 commit

  • Commit f711a6ae062c ("net/rds: RDS-TCP: Always create a new rds_sock
    for an incoming connection.") modified rds-tcp so that an incoming SYN
    would ignore an existing "client" TCP connection which had the local
    port set to the transient port. The motivation for ignoring the existing
    "client" connection in f711a6ae was to avoid race conditions and an
    endless duel of reconnect attempts triggered by a restart/abort of one
    of the nodes in the TCP connection.

    However, having separate sockets for active and passive sides
    is avoidable, and the simpler model of a single TCP socket for
    both send and receives of all RDS connections associated with
    that tcp socket makes for easier observability. We avoid the race
    conditions from f711a6ae by attempting reconnects in rds_conn_shutdown
    if, and only if, the (new) c_outgoing bit is set for RDS_TRANS_TCP.
    The c_outgoing bit is initialized in __rds_conn_create().

    A side-effect of re-using the client rds_connection for an incoming
    SYN is the potential of encountering duelling SYNs, i.e., we
    have an outgoing RDS_CONN_CONNECTING socket when we get the incoming
    SYN. The logic to arbitrate this criss-crossing SYN exchange in
    rds_tcp_accept_one() has been modified to emulate the BGP state
    machine: the smaller IP address should back off from the connection attempt.

    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     

10 Sep, 2015

1 commit

  • There was no verification that an underlying transport exists when creating
    a connection, this would cause dereferencing a NULL ptr.

    It might happen on sockets that weren't properly bound before attempting to
    send a message, which will cause a NULL ptr deref:

    [135546.047719] kasan: GPF could be caused by NULL-ptr deref or user memory accessgeneral protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
    [135546.051270] Modules linked in:
    [135546.051781] CPU: 4 PID: 15650 Comm: trinity-c4 Not tainted 4.2.0-next-20150902-sasha-00041-gbaa1222-dirty #2527
    [135546.053217] task: ffff8800835bc000 ti: ffff8800bc708000 task.ti: ffff8800bc708000
    [135546.054291] RIP: __rds_conn_create (net/rds/connection.c:194)
    [135546.055666] RSP: 0018:ffff8800bc70fab0 EFLAGS: 00010202
    [135546.056457] RAX: dffffc0000000000 RBX: 0000000000000f2c RCX: ffff8800835bc000
    [135546.057494] RDX: 0000000000000007 RSI: ffff8800835bccd8 RDI: 0000000000000038
    [135546.058530] RBP: ffff8800bc70fb18 R08: 0000000000000001 R09: 0000000000000000
    [135546.059556] R10: ffffed014d7a3a23 R11: ffffed014d7a3a21 R12: 0000000000000000
    [135546.060614] R13: 0000000000000001 R14: ffff8801ec3d0000 R15: 0000000000000000
    [135546.061668] FS: 00007faad4ffb700(0000) GS:ffff880252000000(0000) knlGS:0000000000000000
    [135546.062836] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [135546.063682] CR2: 000000000000846a CR3: 000000009d137000 CR4: 00000000000006a0
    [135546.064723] Stack:
    [135546.065048] ffffffffafe2055c ffffffffafe23fc1 ffffed00493097bf ffff8801ec3d0008
    [135546.066247] 0000000000000000 00000000000000d0 0000000000000000 ac194a24c0586342
    [135546.067438] 1ffff100178e1f78 ffff880320581b00 ffff8800bc70fdd0 ffff880320581b00
    [135546.068629] Call Trace:
    [135546.069028] ? __rds_conn_create (include/linux/rcupdate.h:856 net/rds/connection.c:134)
    [135546.069989] ? rds_message_copy_from_user (net/rds/message.c:298)
    [135546.071021] rds_conn_create_outgoing (net/rds/connection.c:278)
    [135546.071981] rds_sendmsg (net/rds/send.c:1058)
    [135546.072858] ? perf_trace_lock (include/trace/events/lock.h:38)
    [135546.073744] ? lockdep_init (kernel/locking/lockdep.c:3298)
    [135546.074577] ? rds_send_drop_to (net/rds/send.c:976)
    [135546.075508] ? __might_fault (./arch/x86/include/asm/current.h:14 mm/memory.c:3795)
    [135546.076349] ? __might_fault (mm/memory.c:3795)
    [135546.077179] ? rds_send_drop_to (net/rds/send.c:976)
    [135546.078114] sock_sendmsg (net/socket.c:611 net/socket.c:620)
    [135546.078856] SYSC_sendto (net/socket.c:1657)
    [135546.079596] ? SYSC_connect (net/socket.c:1628)
    [135546.080510] ? trace_dump_stack (kernel/trace/trace.c:1926)
    [135546.081397] ? ring_buffer_unlock_commit (kernel/trace/ring_buffer.c:2479 kernel/trace/ring_buffer.c:2558 kernel/trace/ring_buffer.c:2674)
    [135546.082390] ? trace_buffer_unlock_commit (kernel/trace/trace.c:1749)
    [135546.083410] ? trace_event_raw_event_sys_enter (include/trace/events/syscalls.h:16)
    [135546.084481] ? do_audit_syscall_entry (include/trace/events/syscalls.h:16)
    [135546.085438] ? trace_buffer_unlock_commit (kernel/trace/trace.c:1749)
    [135546.085515] rds_ib_laddr_check(): addr 36.74.25.172 ret -99 node type -1

    Acked-by: Santosh Shilimkar
    Signed-off-by: Sasha Levin
    Signed-off-by: David S. Miller

    Sasha Levin
     

06 Sep, 2015

1 commit


26 Aug, 2015

1 commit

  • If we get an ENOMEM during rds_ib_recv_refill, we might never come
    back and refill again later. Patch makes sure to kick krdsd into
    helping out.

    To achieve this we add RDS_RECV_REFILL flag and update in the refill
    path based on that so that at least some therad will keep posting
    receive buffers.

    Since krdsd and softirq both might race for refill, we decide to
    schedule on work queue based on ring_low instead of ring_empty.

    Reviewed-by: Ajaykumar Hotchandani
    Signed-off-by: Santosh Shilimkar
    Signed-off-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    santosh.shilimkar@oracle.com
     

08 Aug, 2015

1 commit


10 May, 2015

2 commits

  • When the peer of an RDS-TCP connection restarts, a reconnect
    attempt should only be made from the active side of the TCP
    connection, i.e. the side that has a transient TCP port
    number. Do not add the passive side of the TCP connection
    to the c_hash_node and thus avoid triggering rds_queue_reconnect()
    for passive rds connections.

    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     
  • When running RDS over TCP, the active (client) side connects to the
    listening ("passive") side at the RDS_TCP_PORT. After the connection
    is established, if the client side reboots (potentially without even
    sending a FIN) the server still has a TCP socket in the esablished
    state. If the server-side now gets a new SYN comes from the client
    with a different client port, TCP will create a new socket-pair, but
    the RDS layer will incorrectly pull up the old rds_connection (which
    is still associated with the stale t_sock and RDS socket state).

    This patch corrects this behavior by having rds_tcp_accept_one()
    always create a new connection for an incoming TCP SYN.
    The rds and tcp state associated with the old socket-pair is cleaned
    up via the rds_tcp_state_change() callback which would typically be
    invoked in most cases when the client-TCP sends a FIN on TCP restart,
    triggering a transition to CLOSE_WAIT state. In the rarer event of client
    death without a FIN, TCP_KEEPALIVE probes on the socket will detect
    the stale socket, and the TCP transition to CLOSE state will trigger
    the RDS state cleanup.

    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     

09 Apr, 2015

2 commits

  • If a determined set of concurrent senders keep the send queue full,
    we can loop forever inside rds_send_xmit. This fix has two parts.

    First we are dropping out of the while(1) loop after we've processed a
    large batch of messages.

    Second we add a generation number that gets bumped each time the
    xmit bit lock is acquired. If someone else has jumped in and
    made progress in the queue, we skip our goto restart.

    Original patch by Chris Mason.

    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     
  • Passive connections were added for the case where one loopback IB
    connection between identical addresses needs another connection to store
    the second QP. Unfortunately, they were also created in the case where
    the addesses differ and we already have both QPs.

    This lead to a message reordering bug.

    - two different IB interfaces and addresses on a machine: A B
    - traffic is sent from A to B
    - connection from A-B is created, connect request sent
    - listening accepts connect request, B-A is created
    - traffic flows, next_rx is incremented
    - unacked messages exist on the retrans list
    - connection A-B is shut down, new connect request sent
    - listen sees existing loopback B-A, creates new passive B-A
    - retrans messages are sent and delivered because of 0 next_rx

    The problem is that the second connection request saw the previously
    existing parent connection. Instead of using it, and using the existing
    next_rx_seq state for the traffic between those IPs, it mistakenly
    thought that it had to create a passive connection.

    We fix this by only using passive connections in the special case where
    laddr and faddr match. In this case we'll only ever have one parent
    sending connection requests and one passive connection created as the
    listening path sees the existing parent connection which initiated the
    request.

    Original patch by Zach Brown

    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     

20 Oct, 2013

2 commits


28 Feb, 2013

1 commit

  • I'm not sure why, but the hlist for each entry iterators were conceived

    list_for_each_entry(pos, head, member)

    The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

    Why did they need an extra pos parameter? I'm not quite sure. Not only
    they don't really need it, it also prevents the iterator from looking
    exactly like the list iterator, which is unfortunate.

    Besides the semantic patch, there was some manual work required:

    - Fix up the actual hlist iterators in linux/list.h
    - Fix up the declaration of other iterators based on the hlist ones.
    - A very small amount of places were using the 'node' parameter, this
    was modified to use 'obj->member' instead.
    - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
    properly, so those had to be fixed up manually.

    The semantic patch which is mostly the work of Peter Senna Tschudin is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;

    [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
    [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: fix warnings]
    [akpm@linux-foudnation.org: redo intrusive kvm changes]
    Tested-by: Peter Senna Tschudin
    Acked-by: Paul E. McKenney
    Signed-off-by: Sasha Levin
    Cc: Wu Fengguang
    Cc: Marcelo Tosatti
    Cc: Gleb Natapov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

01 Nov, 2011

1 commit


21 Oct, 2010

1 commit


09 Sep, 2010

17 commits

  • Nothing was canceling the send and receive work that might have been
    queued as a conn was being destroyed.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • rds_conn_shutdown() can return before the connection is shut down when
    it encounters an existing state that it doesn't understand. This lets
    rds_conn_destroy() then start tearing down the conn from under paths
    that are still using it.

    It's more reliable the shutdown work and wait for krdsd to complete the
    shutdown callback. This stopped some hangs I was seeing where krdsd was
    trying to shut down a freed conn.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • Right now there's nothing to stop the various paths that use
    rs->rs_transport from racing with rmmod and executing freed transport
    code. The simple fix is to have binding to a transport also hold a
    reference to the transport's module, removing this class of races.

    We already had an unused t_owner field which was set for the modular
    transports and which wasn't set for the built-in loop transport.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • rds_conn_destroy() can race with all other modifications of the
    rds_conn_count but it was modifying the count without locking.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • The trivial amount of memory saved isn't worth the cost of dealing with section
    mismatches.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • Andy Grover
     
  • rds_send_xmit() was changed to hold an interrupt masking spinlock instead of a
    mutex so that it could be called from the IB receive tasklet path. This broke
    the TCP transport because its xmit method can block and masks and unmasks
    interrupts.

    This patch serializes callers to rds_send_xmit() with a simple bit instead of
    the current spinlock or previous mutex. This enables rds_send_xmit() to be
    called from any context and to call functions which block. Getting rid of the
    c_send_lock exposes the bare c_lock acquisitions which are changed to block
    interrupts.

    A waitqueue is added so that rds_conn_shutdown() can wait for callers to leave
    rds_send_xmit() before tearing down partial send state. This lets us get rid
    of c_senders.

    rds_send_xmit() is changed to check the conn state after acquiring the
    RDS_IN_XMIT bit to resolve races with the shutdown path. Previously both
    worked with the conn state and then the lock in the same order, allowing them
    to race and execute the paths concurrently.

    rds_send_reset() isn't racing with rds_send_xmit() now that rds_conn_shutdown()
    properly ensures that rds_send_xmit() can't start once the conn state has been
    changed. We can remove its previous use of the spinlock.

    Finally, c_send_generation is redundant. Callers can race to test the c_flags
    bit by simply retrying instead of racing to test the c_send_generation atomic.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • conn->c_lock is acquired in interrupt context. rds_conn_message_info() is
    called from user context and was acquiring c_lock without blocking interrupts,
    leading to possible deadlocks.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • A few paths had the same block of code to queue a connection's connect work if
    it was in the right state. Let's move this in to a helper function.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • The connection hash was almost entirely RCU ready, this
    just makes the final couple of changes to use RCU instead
    of spinlocks for everything.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • rds_conn_destroy really needs locking while it changes the
    connection hash.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • This is the first in a long line of patches that tries to fix races
    between RDS connection shutdown and RDS traffic.

    Here we are maintaining a count of active senders to make sure
    the connection doesn't go away while they are using it.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • rds_send_xmit is required to loop around after it releases the lock
    because someone else could done a trylock, found someone working on the
    list and backed off.

    But, once we drop our lock, it is possible that someone else does come
    in and make progress on the list. We should detect this and not loop
    around if another process is actually working on the list.

    This patch adds a generation counter that is bumped every time we
    get the lock and do some send work. If the retry notices someone else
    has bumped the generation counter, it does not need to loop around and
    continue working.

    Signed-off-by: Chris Mason
    Signed-off-by: Andy Grover

    Chris Mason
     
  • This change allows us to call rds_send_xmit() from a tasklet,
    which is crucial to our new operating model.

    * Change c_send_lock to a spinlock
    * Update stats fields "sem_" to "_lock"
    * Remove unneeded rds_conn_is_sending()

    About locking between shutdown and send -- send checks if the
    connection is up. Shutdown puts the connection into
    DISCONNECTING. After this, all threads entering send will exit
    immediately. However, a thread could be *in* send_xmit(), so
    shutdown acquires the c_send_lock to ensure everyone is out
    before proceeding with connection shutdown.

    Signed-off-by: Andy Grover

    Andy Grover
     
  • RDMA is now an intrinsic part of RDS, so it's easier to just have
    a single header.

    Signed-off-by: Andy Grover

    Andy Grover
     
  • Favor "if (foo)" style over "if (foo != NULL)".

    Signed-off-by: Andy Grover

    Andy Grover
     
  • This fits better in connection.c, rather than threads.c.

    Signed-off-by: Andy Grover

    Andy Grover
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities include those
    headers directly instead of assuming availability. As this conversion
    needs to touch large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the followings.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and try to put the new include such that its order conforms
    to its surrounding. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    wildly available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build test were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable easily on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

30 Nov, 2009

1 commit


24 Aug, 2009

1 commit


20 Jul, 2009

2 commits


10 Apr, 2009

1 commit


27 Feb, 2009

1 commit

  • While arguably the fact that the underlying transport needs a
    connection to convey RDS's datagrame reliably is not important
    to rds proper, the transports implemented so far (IB and TCP)
    have both been connection-oriented, and so the connection
    state machine-related code is in the common rds code.

    This patch also includes several work items, to handle connecting,
    sending, receiving, and shutdown.

    Signed-off-by: Andy Grover
    Signed-off-by: David S. Miller

    Andy Grover