12 Sep, 2018

1 commit

  • When a rds sock is bound, it is inserted into the bind_hash_table
    which is protected by RCU. But when releasing rds sock, after it
    is removed from this hash table, it is freed immediately without
    respecting RCU grace period. This could cause some use-after-free
    as reported by syzbot.

    Mark the rds sock with SOCK_RCU_FREE before inserting it into the
    bind_hash_table, so that it would be always freed after a RCU grace
    period.

    The other problem is in rds_find_bound(), the rds sock could be
    freed in between rhashtable_lookup_fast() and rds_sock_addref(),
    so we need to extend RCU read lock protection in rds_find_bound()
    to close this race condition.

    Reported-and-tested-by: syzbot+8967084bcac563795dc6@syzkaller.appspotmail.com
    Reported-by: syzbot+93a5839deb355537440f@syzkaller.appspotmail.com
    Cc: Sowmini Varadhan
    Cc: Santosh Shilimkar
    Cc: rds-devel@oss.oracle.com
    Signed-off-by: Cong Wang
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Cong Wang
     

02 Aug, 2018

1 commit


24 Jul, 2018

2 commits

  • This patch enables RDS to use IPv6 addresses. For RDS/TCP, the
    listener is now an IPv6 endpoint which accepts both IPv4 and IPv6
    connection requests. RDS/RDMA/IB uses a private data (struct
    rds_ib_connect_private) exchange between endpoints at RDS connection
    establishment time to support RDMA. This private data exchange uses a
    32 bit integer to represent an IP address. This needs to be changed in
    order to support IPv6. A new private data struct
    rds6_ib_connect_private is introduced to handle this. To ensure
    backward compatibility, an IPv6 capable RDS stack uses another RDMA
    listener port (RDS_CM_PORT) to accept IPv6 connection. And it
    continues to use the original RDS_PORT for IPv4 RDS connections. When
    it needs to communicate with an IPv6 peer, it uses the RDS_CM_PORT to
    send the connection set up request.

    v5: Fixed syntax problem (David Miller).

    v4: Changed port history comments in rds.h (Sowmini Varadhan).

    v3: Added support to set up IPv4 connection using mapped address
    (David Miller).
    Added support to set up connection between link local and non-link
    addresses.
    Various review comments from Santosh Shilimkar and Sowmini Varadhan.

    v2: Fixed bound and peer address scope mismatched issue.
    Added back rds_connect() IPv6 changes.

    Signed-off-by: Ka-Cheong Poon
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Ka-Cheong Poon
     
  • This patch changes the internal representation of an IP address to use
    struct in6_addr. IPv4 address is stored as an IPv4 mapped address.
    All the functions which take an IP address as argument are also
    changed to use struct in6_addr. But RDS socket layer is not modified
    such that it still does not accept IPv6 address from an application.
    And RDS layer does not accept nor initiate IPv6 connections.

    v2: Fixed sparse warnings.

    Signed-off-by: Ka-Cheong Poon
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Ka-Cheong Poon
     

28 Dec, 2017

1 commit

  • If the rds_sock is not added to the bind_hash_table, we must
    reset rs_bound_addr so that rds_remove_bound will not trip on
    this rds_sock.

    rds_add_bound() does a rds_sock_put() in this failure path, so
    failing to reset rs_bound_addr will result in a socket refcount
    bug, and will trigger a WARN_ON with the stack shown below when
    the application subsequently tries to close the PF_RDS socket.

    WARNING: CPU: 20 PID: 19499 at net/rds/af_rds.c:496 \
    rds_sock_destruct+0x15/0x30 [rds]
    :
    __sk_destruct+0x21/0x190
    rds_remove_bound.part.13+0xb6/0x140 [rds]
    rds_release+0x71/0x120 [rds]
    sock_release+0x1a/0x70
    sock_close+0xe/0x20
    __fput+0xd5/0x210
    task_work_run+0x82/0xa0
    do_exit+0x2ce/0xb30
    ? syscall_trace_enter+0x1cc/0x2b0
    do_group_exit+0x39/0xa0
    SyS_exit_group+0x10/0x10
    do_syscall_64+0x61/0x1a0

    Signed-off-by: Sowmini Varadhan
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     

29 Aug, 2017

1 commit


03 Jan, 2017

1 commit


16 Jul, 2016

1 commit

  • Use RDS probe-ping to compute how many paths may be used with
    the peer, and to synchronously start the multiple paths. If mprds is
    supported, hash outgoing traffic to one of multiple paths in rds_sendmsg()
    when multipath RDS is supported by the transport.

    CC: Santosh Shilimkar
    Signed-off-by: Sowmini Varadhan
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     

03 Nov, 2015

1 commit

  • To further improve the RDS connection scalabilty on massive systems
    where number of sockets grows into tens of thousands of sockets, there
    is a need of larger bind hashtable. Pre-allocated 8K or 16K table is
    not very flexible in terms of memory utilisation. The rhashtable
    infrastructure gives us the flexibility to grow the hashtbable based
    on use and also comes up with inbuilt efficient bucket(chain) handling.

    Reviewed-by: David Miller
    Signed-off-by: Santosh Shilimkar
    Signed-off-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    santosh.shilimkar@oracle.com
     

13 Oct, 2015

1 commit


01 Oct, 2015

3 commits

  • One global lock protecting hash-tables with 1024 buckets isn't
    efficient and it shows up in a massive systems with truck
    loads of RDS sockets serving multiple databases. The
    perf data clearly highlights the contention on the rw
    lock in these massive workloads.

    When the contention gets worse, the code gets into a state where
    it decides to back off on the lock. So while it has disabled interrupts,
    it sits and backs off on this lock get. This causes the system to
    become sluggish and eventually all sorts of bad things happen.

    The simple fix is to move the lock into the hash bucket and
    use per-bucket lock to improve the scalability.

    Signed-off-by: Santosh Shilimkar
    Signed-off-by: Santosh Shilimkar

    Santosh Shilimkar
     
  • One need to take rds socket reference while using it and release it
    once done with it. rds_add_bind() code path does not do that so
    lets fix it.

    Signed-off-by: Santosh Shilimkar
    Signed-off-by: Santosh Shilimkar

    Santosh Shilimkar
     
  • RDS bind and release locking scheme is very inefficient. It
    uses RCU for maintaining the bind hash-table which is great but
    it also needs to hold spinlock for [add/remove]_bound(). So
    overall usecase, the hash-table concurrent speedup doesn't pay off.
    In fact blocking nature of synchronize_rcu() makes the RDS
    socket shutdown too slow which hurts RDS performance since
    connection shutdown and re-connect happens quite often to
    maintain the RC part of the protocol.

    So we make the locking scheme simpler and more efficient by
    replacing spin_locks with reader/writer locks and getting rid
    off rcu for bind hash-table.

    In subsequent patch, we also covert the global lock with per-bucket
    lock to reduce the global lock contention.

    Signed-off-by: Santosh Shilimkar
    Signed-off-by: Santosh Shilimkar

    Santosh Shilimkar
     

08 Aug, 2015

1 commit


01 Jun, 2015

1 commit

  • An application may deterministically attach the underlying transport for
    a PF_RDS socket by invoking setsockopt(2) with the SO_RDS_TRANSPORT
    option at the SOL_RDS level. The integer argument to setsockopt must be
    one of the RDS_TRANS_* transport types, e.g., RDS_TRANS_TCP. The option
    must be specified before invoking bind(2) on the socket, and may only
    be used once on the socket. An attempt to set the option on a bound
    socket, or to invoke the option after a successful SO_RDS_TRANSPORT
    attachment, will return EOPNOTSUPP.

    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     

15 Jan, 2014

1 commit


28 Feb, 2013

1 commit

  • I'm not sure why, but the hlist for each entry iterators were conceived

    list_for_each_entry(pos, head, member)

    The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

    Why did they need an extra pos parameter? I'm not quite sure. Not only
    they don't really need it, it also prevents the iterator from looking
    exactly like the list iterator, which is unfortunate.

    Besides the semantic patch, there was some manual work required:

    - Fix up the actual hlist iterators in linux/list.h
    - Fix up the declaration of other iterators based on the hlist ones.
    - A very small amount of places were using the 'node' parameter, this
    was modified to use 'obj->member' instead.
    - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
    properly, so those had to be fixed up manually.

    The semantic patch which is mostly the work of Peter Senna Tschudin is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;

    [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
    [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: fix warnings]
    [akpm@linux-foudnation.org: redo intrusive kvm changes]
    Tested-by: Peter Senna Tschudin
    Acked-by: Paul E. McKenney
    Signed-off-by: Sasha Levin
    Cc: Wu Fengguang
    Cc: Marcelo Tosatti
    Cc: Gleb Natapov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

17 Jun, 2011

1 commit


09 Sep, 2010

3 commits


24 Aug, 2009

1 commit


27 Feb, 2009

1 commit