25 Nov, 2015

1 commit

  • Sasha's found a NULL pointer dereference in the RDS connection code when
    sending a message to an apparently unbound socket. The problem is caused
    by the bound-socket check in rds_sendmsg(), which reads the rs_bound_addr
    field without taking a lock on the socket. This opens a race window in
    rds_bind() where rs_bound_addr has already been set but the transport has
    not, leading to a NULL pointer dereference when 'trans' is dereferenced
    in __rds_conn_create().
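
    A minimal sketch of the locking idea, assuming the fix takes the socket
    lock around the bound-address check in rds_sendmsg() (simplified):

      lock_sock(sk);
      if (daddr == 0 || rs->rs_bound_addr == 0) {
              release_sock(sk);
              ret = -ENOTCONN; /* bind is incomplete: don't create a conn */
              goto out;
      }
      release_sock(sk);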

    Vegard wrote a reproducer for this issue, so kindly ask him to share if
    you're interested.

    I cannot reproduce the NULL pointer dereference using Vegard's reproducer
    with this patch, whereas I could without.

    Complete earlier incomplete fix to CVE-2015-6937:

    74e98eb08588 ("RDS: verify the underlying transport exists before creating a connection")

    Cc: David S. Miller
    Cc: stable@vger.kernel.org

    Reviewed-by: Vegard Nossum
    Reviewed-by: Sasha Levin
    Acked-by: Santosh Shilimkar
    Signed-off-by: Quentin Casasnovas
    Signed-off-by: David S. Miller

    Quentin Casasnovas
     

08 Nov, 2015

2 commits

  • Merge second patch-bomb from Andrew Morton:

    - most of the rest of MM

    - procfs

    - lib/ updates

    - printk updates

    - bitops infrastructure tweaks

    - checkpatch updates

    - nilfs2 update

    - signals

    - various other misc bits: coredump, seqfile, kexec, pidns, zlib, ipc,
    dma-debug, dma-mapping, ...

    * emailed patches from Andrew Morton: (102 commits)
    ipc,msg: drop dst nil validation in copy_msg
    include/linux/zutil.h: fix usage example of zlib_adler32()
    panic: release stale console lock to always get the logbuf printed out
    dma-debug: check nents in dma_sync_sg*
    dma-mapping: tidy up dma_parms default handling
    pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode
    kexec: use file name as the output message prefix
    fs, seqfile: always allow oom killer
    seq_file: reuse string_escape_str()
    fs/seq_file: use seq_* helpers in seq_hex_dump()
    coredump: change zap_threads() and zap_process() to use for_each_thread()
    coredump: ensure all coredumping tasks have SIGNAL_GROUP_COREDUMP
    signal: remove jffs2_garbage_collect_thread()->allow_signal(SIGCONT)
    signal: introduce kernel_signal_stop() to fix jffs2_garbage_collect_thread()
    signal: turn dequeue_signal_lock() into kernel_dequeue_signal()
    signals: kill block_all_signals() and unblock_all_signals()
    nilfs2: fix gcc uninitialized-variable warnings in powerpc build
    nilfs2: fix gcc unused-but-set-variable warnings
    MAINTAINERS: nilfs2: add header file for tracing
    nilfs2: add tracepoints for analyzing reading and writing metadata files
    ...

    Linus Torvalds
     
  • Pull rdma updates from Doug Ledford:
    "This is my initial round of 4.4 merge window patches. There are a few
    other things I wish to get in for 4.4 that aren't in this pull, as
    this represents what has gone through merge/build/run testing and not
    what is the last few items for which testing is not yet complete.

    - "Checksum offload support in user space" enablement
    - Misc cxgb4 fixes, add T6 support
    - Misc usnic fixes
    - 32 bit build warning fixes
    - Misc ocrdma fixes
    - Multicast loopback prevention extension
    - Extend the GID cache to store and return attributes of GIDs
    - Misc iSER updates
    - iSER clustering update
    - Network NameSpace support for rdma CM
    - Work Request cleanup series
    - New Memory Registration API"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (76 commits)
    IB/core, cma: Make __attribute_const__ declarations sparse-friendly
    IB/core: Remove old fast registration API
    IB/ipath: Remove fast registration from the code
    IB/hfi1: Remove fast registration from the code
    RDMA/nes: Remove old FRWR API
    IB/qib: Remove old FRWR API
    iw_cxgb4: Remove old FRWR API
    RDMA/cxgb3: Remove old FRWR API
    RDMA/ocrdma: Remove old FRWR API
    IB/mlx4: Remove old FRWR API support
    IB/mlx5: Remove old FRWR API support
    IB/srp: Dont allocate a page vector when using fast_reg
    IB/srp: Remove srp_finish_mapping
    IB/srp: Convert to new registration API
    IB/srp: Split srp_map_sg
    RDS/IW: Convert to new memory registration API
    svcrdma: Port to new memory registration API
    xprtrdma: Port to new memory registration API
    iser-target: Port to new memory registration API
    IB/iser: Port to new fast registration API
    ...

    Linus Torvalds
     

07 Nov, 2015

1 commit

  • mm, page_alloc: distinguish between being unable to sleep, unwilling to
    sleep and avoiding waking kswapd

    __GFP_WAIT has been used to identify atomic context in callers that hold
    spinlocks or are in interrupts. They are expected to be high priority and
    have access to one of two watermarks lower than "min", which can be
    referred to as the "atomic reserve". __GFP_HIGH users get access to the
    first lower watermark and can be called the "high priority reserve".

    Over time, callers had a requirement to not block when fallback options
    were available. Some have abused __GFP_WAIT, leading to a situation where
    an optimistic allocation with a fallback option can access atomic
    reserves.

    This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
    cannot sleep and have no alternative. High priority users continue to use
    __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
    are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies
    callers that want to wake kswapd for background reclaim. __GFP_WAIT is
    redefined as a caller that is willing to enter direct reclaim and wake
    kswapd for background reclaim.

    This patch then converts a number of sites:

    o __GFP_ATOMIC is used by callers that are high priority and have memory
    pools for those requests. GFP_ATOMIC uses this flag.

    o Callers that have a limited mempool to guarantee forward progress clear
    __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
    into this category where kswapd will still be woken but atomic reserves
    are not used as there is a one-entry mempool to guarantee progress.

    o Callers that are checking if they are non-blocking should use the
    helper gfpflags_allow_blocking() where possible (see the sketch after
    this list), because checking for __GFP_WAIT as was done historically can
    now trigger false positives. Some exceptions like dm-crypt.c exist, where
    the code's intent is clearer if __GFP_DIRECT_RECLAIM is used instead of
    the helper due to flag manipulations.

    o Callers that built their own GFP flags instead of starting with GFP_KERNEL
    and friends now also need to specify __GFP_KSWAPD_RECLAIM.
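
    A minimal sketch of the helper's use, assuming a caller that branches on
    whether blocking is allowed (gfpflags_allow_blocking() simply tests
    __GFP_DIRECT_RECLAIM):

      static void *fill_buffer(size_t len, gfp_t gfp)
      {
              /* Blocking allowed: the allocation may enter direct reclaim. */
              if (gfpflags_allow_blocking(gfp))
                      return kmalloc(len, gfp);

              /* Atomic context: fail fast rather than sleep. */
              return kmalloc(len, GFP_NOWAIT);
      }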

    The first key hazard to watch out for is callers that removed __GFP_WAIT
    and were depending on access to atomic reserves for inconspicuous
    reasons. In some cases it may be appropriate for them to use __GFP_HIGH.

    The second key hazard is callers that assembled their own combination of
    GFP flags instead of starting with something like GFP_KERNEL. They may
    now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
    if it's missed in most cases as other activity will wake kswapd.

    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Vitaly Wool <vitalywool@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     

03 Nov, 2015

2 commits

  • To further improve RDS connection scalability on massive systems where
    the number of sockets grows into the tens of thousands, there is a need
    for a larger bind hashtable. A pre-allocated 8K or 16K table is not very
    flexible in terms of memory utilisation. The rhashtable infrastructure
    gives us the flexibility to grow the hashtable based on use, and also
    comes with efficient built-in bucket (chain) handling.
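
    A sketch of the rhashtable parameters such a conversion might use (the
    values and rds field names here are illustrative; the params fields
    follow the kernel's rhashtable API):

      static const struct rhashtable_params rs_bind_params = {
              .nelem_hint  = 768,
              .key_len     = sizeof(u64),    /* addr + port packed key */
              .key_offset  = offsetof(struct rds_sock, rs_bound_key),
              .head_offset = offsetof(struct rds_sock, rs_bound_node),
              .max_size    = 16384,
              .min_size    = 1024,
      };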

    Reviewed-by: David Miller
    Signed-off-by: Santosh Shilimkar
    Signed-off-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    santosh.shilimkar@oracle.com
     
  • As the result of rds_iw_flush_mr_pool() is checked nowhere, change its
    return type from int to void. Also remove the now-unused variable rc,
    as there is nothing to return.

    Signed-off-by: Saurabh Sengar
    Signed-off-by: David S. Miller

    Saurabh Sengar
     

29 Oct, 2015

3 commits

  • Get rid of fast_reg page list and its construction.
    Instead, just pass the RDS sg list to ib_map_mr_sg
    and post the new ib_reg_wr.

    This is done both for server IW RDMA_READ registration
    and the client remote key registration.

    Signed-off-by: Sagi Grimberg
    Acked-by: Christoph Hellwig
    Acked-by: Santosh Shilimkar
    Signed-off-by: Doug Ledford

    Sagi Grimberg
     
  • Doug Ledford
     
  • Add support for network namespaces in the ib_cma module. This is
    accomplished by:

    1. Adding network namespace parameter for rdma_create_id. This parameter is
    used to populate the network namespace field in rdma_id_private.
    rdma_create_id keeps a reference on the network namespace.
    2. Using the network namespace from the rdma_id instead of init_net inside
    of ib_cma, when listening on an ID and when looking for an ID for an
    incoming request.
    3. Decrementing the reference count for the appropriate network namespace
    when calling rdma_destroy_id.

    In order to preserve the current behavior init_net is passed when calling
    from other modules.
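
    A minimal sketch of the resulting call pattern, assuming the namespace
    becomes the first argument (my_event_handler and my_ctx are hypothetical;
    in-kernel callers pass init_net to preserve current behavior):

      struct rdma_cm_id *id;

      /* Create a CM ID bound to the caller's network namespace. */
      id = rdma_create_id(&init_net, my_event_handler, my_ctx,
                          RDMA_PS_TCP, IB_QPT_RC);
      if (IS_ERR(id))
              return PTR_ERR(id);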

    Signed-off-by: Guy Shapiro
    Signed-off-by: Haggai Eran
    Signed-off-by: Yotam Kenneth
    Signed-off-by: Shachar Raindel
    Signed-off-by: Doug Ledford

    Guy Shapiro
     

28 Oct, 2015

1 commit

  • Either of pskb_pull() or pskb_trim() may fail under low memory conditions.
    If rds_tcp_data_recv() ignores such failures, the application will
    receive corrupted data because the skb has not been correctly
    carved to the RDS datagram size.

    Avoid this by handling pskb_pull()/pskb_trim() failure in the same
    manner as skb_clone() failure: bail out of rds_tcp_data_recv(), and
    retry via the deferred call to rds_send_worker() that gets set up on
    ENOMEM from rds_tcp_read_sock().
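
    A hedged sketch of the check (variable names are illustrative, not the
    exact rds_tcp_data_recv() code); note that pskb_pull() fails by
    returning NULL while pskb_trim() fails by returning non-zero:

      clone = skb_clone(skb, arg->gfp);
      if (!clone || !pskb_pull(clone, offset) || pskb_trim(clone, to_copy)) {
              kfree_skb(clone);       /* safe on NULL */
              desc->error = -ENOMEM;  /* rds_tcp_read_sock() defers a retry */
              goto out;
      }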

    Signed-off-by: Sowmini Varadhan
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     

19 Oct, 2015

1 commit

  • Sowmini found a hang with rds-ping while testing RDS over TCP. It's a
    corner case and doesn't always happen. The issue is not reproducible
    with the IB transport. It's clear from the dump below why we see it
    with RDS TCP.

    [] do_tcp_setsockopt+0xb5/0x740
    [] tcp_setsockopt+0x24/0x30
    [] sock_common_setsockopt+0x14/0x20
    [] rds_tcp_xmit_prepare+0x5d/0x70 [rds_tcp]
    [] rds_send_xmit+0xd7/0x740 [rds]
    [] rds_send_pong+0x142/0x180 [rds]
    [] rds_recv_incoming+0x274/0x330 [rds]
    [] ? ttwu_queue+0x11e/0x130
    [] ? skb_copy_bits+0x6d/0x2c0
    [] rds_tcp_data_recv+0x2f0/0x3d0 [rds_tcp]
    [] tcp_read_sock+0x96/0x1c0
    [] ? rds_tcp_recv_init+0x40/0x40 [rds_tcp]
    [] ? sock_def_write_space+0xa0/0xa0
    [] rds_tcp_data_ready+0xa1/0xf0 [rds_tcp]
    [] tcp_data_queue+0x379/0x5b0
    [] ? rds_tcp_write_space+0xbb/0x110 [rds_tcp]
    [] tcp_rcv_established+0x2e2/0x6e0
    [] tcp_v4_do_rcv+0x122/0x220
    [] tcp_v4_rcv+0x867/0x880
    [] ip_local_deliver_finish+0xa3/0x220

    This happens because the rds_send_xmit() chain wants to take the
    sock_lock, which is already taken by tcp_v4_rcv() on its way to
    rds_tcp_data_ready(). Commit db6526dcb51b ("RDS: use
    rds_send_xmit() state instead of RDS_LL_SEND_FULL") was trying to
    opportunistically finish the send request in the same thread context.

    But because of the recursive lock hang above with RDS TCP, the send
    work from rds_send_pong() needs to be deferred to the worker to avoid
    the lockup. Given that an RDS ping is more of a connectivity test than
    a performance-critical path, this should be OK even for transports
    like IB.

    Reported-by: Sowmini Varadhan
    Acked-by: Sowmini Varadhan
    Signed-off-by: Santosh Shilimkar
    Signed-off-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    santosh.shilimkar@oracle.com
     

13 Oct, 2015

2 commits

  • Consider the following "duelling syn" sequence between two peers A and B:
           A                B
           SYN1      -->
                     <--    SYN2
           SYN2ACK   -->
                     <--    SYN1ACK

    Note that the SYN/ACK has already been sent out by TCP before
    rds_tcp_accept_one() gets invoked as part of callbacks.

    If the inet_addr(A) is numerically less than inet_addr(B),
    the arbitration scheme in rds_tcp_accept_one() will prefer the
    TCP connection triggered by SYN1, and will send a CLOSE for the
    SYN2 (just after the SYN2ACK was sent).

    Since B also follows the same arbitration scheme, it will send the SYN-ACK
    for SYN1 that will set up a healthy ESTABLISHED connection on both sides.
    B will also get a CLOSE for SYN2, which should result in the cleanup
    of the TCP state machine for SYN2, but it should not trigger any
    stale RDS-TCP callbacks (such as ->writespace, ->state_change etc),
    that would disrupt the progress of the SYN2 based RDS-TCP connection.

    Thus the arbitration scheme in rds_tcp_accept_one() should restore
    rds_tcp callbacks for the winner before setting them up for the
    new accept socket, and also make sure that conn->c_outgoing
    is set to 0 so that we do not trigger any reconnect attempts on the
    passive side of the tcp socket in the future, in conformance with
    commit c82ac7e69efe ("net/rds: RDS-TCP: only initiate reconnect attempt
    on outgoing TCP socket.")
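
    A hedged sketch of the arbitration rule (an illustrative helper, not
    the exact rds_tcp_accept_one() code): the duel is won by the connection
    initiated by the numerically smaller address.

      /* True if the incoming connection (from faddr) should be kept. */
      static bool rds_tcp_keep_incoming(__be32 laddr, __be32 faddr)
      {
              return ntohl(faddr) < ntohl(laddr);
      }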

    Signed-off-by: Sowmini Varadhan
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     
  • The IP address passed to rds_bind() should be vetted by the
    transport's ->laddr_check() for a previously bound transport.
    This needs to be done to avoid cases where, for example,
    the application has asked for an IB transport,
    but the IP address passed to bind is only usable on
    ethernet interfaces.

    Signed-off-by: Sowmini Varadhan
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     

08 Oct, 2015

2 commits

  • Santosh Shilimkar says:

    ====================
    RDS: connection scalability and performance improvements

    [v4]
    Re-sending the same patches from v3 again since my repost of
    patch 05/14 from v3 was whitespace damaged.

    [v3]
    Updated patch "[PATCH v2 05/14] RDS: defer the over_batch work to
    send worker" as per David Miller's comment [4] to avoid the magic
    value usage. Patch now makes use of already available but unused
    send_batch_count module parameter. Rest of the patches are same as
    earlier version v2 [3]

    [v2]:
    Dropped "[PATCH 05/15] RDS: increase size of hash-table to 8K" from
    earlier version [1]. I plan to address the hash table scalability using
    re-sizable hash tables as suggested by David Laight and David Miller [2]

    This series addresses RDS connection bottlenecks on massive workloads
    and improves RDMA performance by almost 3X. RDS TCP also gets a small
    gain of about 12%.

    RDS is being used in massive systems with high scalability, where
    several hundred thousand end points and tens of thousands of local
    processes are operating over tens of thousands of sockets. Being RC
    (reliable connection) based, socket bind and release happen very often,
    and any inefficiency in bind hash lookups hurts the overall system
    performance. The RDS bind hash-table uses a global spin-lock, which is
    the biggest bottleneck. To make matters worse, it uses RCU inside the
    global lock for the hash buckets.
    This is addressed by simply using a per-bucket rw lock, which makes the
    locking simple and very efficient. The hash table size is still an issue
    and I plan to address it by using re-sizable hash tables as suggested on
    the list.

    For the RDS RDMA improvement, the completion handling is revamped so
    that we can do batched completions. Both send and receive completion
    handlers are split logically to achieve this. With RDS 8K messages being
    one of the key use cases, the MR pool is adapted to hold 8K MRs along
    with the default 1M MRs. While doing this, a few fixes and a couple of
    bottlenecks seen with rds_sendmsg() are addressed.

    The series applies against 4.3-rc1 as well as net-next. It's tested on
    Oracle hardware with an IB fabric for both bcopy and RDMA modes. RDS TCP
    is tested with an iXGB NIC. Like last time, the iWARP transport is
    untested with these changes. The patchset is also available at the git
    repo below:

    git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux.git net/rds/4.3-v3

    As a side note, the IB HCA driver I used for testing misses at least 3
    important patches upstream needed to see the full-blown IB performance,
    and I am hoping to get those into mainline with their help.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • This patch splits up struct ib_send_wr so that all non-trivial verbs
    use their own structure which embeds struct ib_send_wr. This
    dramatically shrinks the size of a WR for most common operations:

    sizeof(struct ib_send_wr) (old): 96

    sizeof(struct ib_send_wr): 48
    sizeof(struct ib_rdma_wr): 64
    sizeof(struct ib_atomic_wr): 96
    sizeof(struct ib_ud_wr): 88
    sizeof(struct ib_fast_reg_wr): 88
    sizeof(struct ib_bind_mw_wr): 96
    sizeof(struct ib_sig_handover_wr): 80

    And with Sagi's pending MR rework the fast registration WR will also be
    down to a reasonable size:

    sizeof(struct ib_fastreg_wr): 64
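
    A sketch of the embedding pattern described above (the shape matches
    the commit's description; the full field lists live in the real
    headers):

      struct ib_rdma_wr {
              struct ib_send_wr wr;          /* generic fields come first */
              u64               remote_addr;
              u32               rkey;
      };

      /* Drivers recover the specific WR from the generic one. */
      static inline struct ib_rdma_wr *rdma_wr(struct ib_send_wr *wr)
      {
              return container_of(wr, struct ib_rdma_wr, wr);
      }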

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Bart Van Assche [srp, srpt]
    Reviewed-by: Chuck Lever [sunrpc]
    Tested-by: Haggai Eran
    Tested-by: Sagi Grimberg
    Tested-by: Steve Wise

    Christoph Hellwig
     

06 Oct, 2015

10 commits

  • 8K message sizes are a pretty important use case for current RDS
    workloads, so we make provision to have 8K MRs available from the pool.
    Based on the number of SGs in the RDS message, we pick a pool to use.

    Also, to make sure that we don't under-utilise MRs when, say, 8K
    messages are dominating, which could lead to the 8K pool being
    exhausted, we fall back to the 1M pool till the 8K pool recovers.

    This helps to at least push ~55 kB/s bidirectional data, which
    is a nice improvement.
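
    A hedged sketch of the pool selection (the pool and field names here
    are illustrative, not the exact rds_ib code):

      /* Prefer the 8K pool for small messages, unless it is exhausted. */
      if (nr_sg <= RDS_MR_8K_MSG_SIZE && !pool_exhausted(rds_ibdev->mr_8k_pool))
              pool = rds_ibdev->mr_8k_pool;
      else
              pool = rds_ibdev->mr_1m_pool;  /* fall back till 8K recovers */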

    Signed-off-by: Santosh Shilimkar
    Signed-off-by: Santosh Shilimkar

    Santosh Shilimkar
     
  • All HCA drivers seem to populate the max_mr caps, and a few of
    them populate both max_mr and max_fmr.

    Hence update the RDS code to make use of max_mr.

    Signed-off-by: Santosh Shilimkar
    Signed-off-by: Santosh Shilimkar

    Santosh Shilimkar
     
  • Fix the warning below by marking rds_ib_fmr_wq static:

    net/rds/ib_rdma.c:87:25: warning: symbol 'rds_ib_fmr_wq' was not declared. Should it be static?

    Signed-off-by: Santosh Shilimkar
    Signed-off-by: Santosh Shilimkar

    Santosh Shilimkar
     
  • rds_ib_mr already keeps the pool handle it is associated with. Let's
    use that instead of the roundabout way of fetching it from
    rds_ib_device.

    No functional change.

    Signed-off-by: Santosh Shilimkar
    Signed-off-by: Santosh Shilimkar

    Santosh Shilimkar
     
  • The RDS IB MR pool has its own workqueue, 'rds_ib_fmr_wq', so we need
    to use queue_delayed_work() on it to kick the work. This was hurting
    performance, since pool maintenance was triggered less often from the
    other path.

    Signed-off-by: Santosh Shilimkar
    Signed-off-by: Santosh Shilimkar

    Santosh Shilimkar
     
  • Just in case we are still handling the QP receive completion while the
    rds_ibdev is released, drop the connection instead of crashing the kernel.

    Signed-off-by: Santosh Shilimkar
    Signed-off-by: Santosh Shilimkar

    Santosh Shilimkar
     
  • Similar to what we did with receive CQ completion handling, we split
    the transmit completion handler so that it lets us implement batched
    work-completion handling.

    We re-use the cq_poll routine and make use of RDS_IB_SEND_OP to
    distinguish send vs receive completion event handler invocations.

    Signed-off-by: Santosh Shilimkar
    Signed-off-by: Santosh Shilimkar

    Santosh Shilimkar
     
  • For better performance, we split the receive completion IRQ handler.
    That lets us acknowledge several WCE events in one call. We also limit
    the WCs to a maximum of 32 to avoid latency. Acknowledging several
    completions in one call instead of one call each will provide better
    performance, since fewer mutual-exclusion locks are taken.

    In the next patch, send completion is also split, re-using the
    poll_cq() routine, and hence the code is moved to ib_cm.c.
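
    A minimal sketch of the batched polling, assuming the 32-entry cap
    described above (the handler name is illustrative):

      struct ib_wc wcs[32];
      int nr, i;

      /* Drain completions in batches of up to 32 per ib_poll_cq() call. */
      while ((nr = ib_poll_cq(cq, ARRAY_SIZE(wcs), wcs)) > 0)
              for (i = 0; i < nr; i++)
                      rds_ib_handle_wc(&wcs[i]);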

    Signed-off-by: Santosh Shilimkar
    Signed-off-by: Santosh Shilimkar

    Santosh Shilimkar
     
  • In the transport-independent rds_sendmsg(), we shouldn't make decisions
    based on RDS_LL_SEND_FULL, which is used to manage the ring for
    RDMA-based transports. We can safely issue rds_send_xmit() and then use
    its return value to decide on deferred work. This will also fix the
    scenario where at times we see connections stuck with the LL_SEND_FULL
    bit set and never cleared.

    We kick krdsd any time we see -ENOMEM or -EAGAIN from the ring
    allocation code.

    Signed-off-by: Santosh Shilimkar
    Signed-off-by: Santosh Shilimkar

    Santosh Shilimkar
     
  • The current process gives up if its send work is over the batch limit.
    The work queue will get kicked to finish off any other requests.
    This fixes the remainder condition from commit 443be0e5affe ("RDS: make
    sure not to loop forever inside rds_send_xmit").

    The restart condition is only for the case where we reached the
    over_batch code for some other reason, so we just retry once more
    before giving up.

    While at it, make sure we use the already-available 'send_batch_count'
    module parameter instead of a magic value. The batch count threshold
    of 1024 came via commit 443be0e5affe ("RDS: make sure not to loop
    forever inside rds_send_xmit"). The idea is to process as big a batch
    as we can, but at the same time not hold up other processes waiting to
    send. Hence back off after the send_batch_count limit (1024) to avoid
    soft lockups.
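
    A hedged sketch of the back-off point (simplified; the real
    rds_send_xmit() flow carries more state):

      /* Stop after send_batch_count messages and let the krdsd worker
       * finish the remainder instead of looping in this context. */
      if (batch_count++ >= send_batch_count) {
              ret = -EAGAIN;
              goto over_batch;
      }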

    Signed-off-by: Santosh Shilimkar
    Signed-off-by: Santosh Shilimkar

    Santosh Shilimkar
     

05 Oct, 2015

3 commits

  • For the same reasons as commit 2f5338442425 ("tcp: allow splice() to
    build full TSO packets") and commit 35f9c09fe9c7 ("tcp: tcp_sendpages()
    should call tcp_push() once"), rds_tcp_xmit may have multiple pages to
    send, so use MSG_MORE and MSG_SENDPAGE_NOTLAST as hints to
    tcp_sendpage().
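
    A sketch of the hinting pattern, assuming every page but the last gets
    both flags (variable names are illustrative):

      unsigned int flags = MSG_DONTWAIT | MSG_NOSIGNAL;

      /* More pages coming: let TCP coalesce into full TSO packets. */
      if (!last_page)
              flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;

      ret = tcp_sendpage(sk, page, offset, len, flags);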

    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     
  • Using the value of RDS_TCP_DEFAULT_BUFSIZE (128K) clobbers efficient
    use of TSO because it inflates the size_goal that is computed in
    tcp_sendmsg/tcp_sendpage and skews packet latency; the default values
    for these parameters actually result in significantly better
    performance.

    In request-response tests using rds-stress with a packet size of
    100K with 16 threads (test parameters -q 100000 -a 256 -t16 -d16)
    between a single pair of IP addresses achieves a throughput of
    6-8 Gbps. Without this patch, throughput maxes at 2-3 Gbps under
    equivalent conditions on these platforms.

    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     
  • Commit f711a6ae062c ("net/rds: RDS-TCP: Always create a new rds_sock
    for an incoming connection.") modified rds-tcp so that an incoming SYN
    would ignore an existing "client" TCP connection which had the local
    port set to the transient port. The motivation for ignoring the existing
    "client" connection in f711a6ae was to avoid race conditions and an
    endless duel of reconnect attempts triggered by a restart/abort of one
    of the nodes in the TCP connection.

    However, having separate sockets for active and passive sides
    is avoidable, and the simpler model of a single TCP socket for
    both send and receives of all RDS connections associated with
    that tcp socket makes for easier observability. We avoid the race
    conditions from f711a6ae by attempting reconnects in rds_conn_shutdown
    if, and only if, the (new) c_outgoing bit is set for RDS_TRANS_TCP.
    The c_outgoing bit is initialized in __rds_conn_create().

    A side-effect of re-using the client rds_connection for an incoming
    SYN is the potential of encountering duelling SYNs, i.e., we
    have an outgoing RDS_CONN_CONNECTING socket when we get the incoming
    SYN. The logic to arbitrate this criss-crossing SYN exchange in
    rds_tcp_accept_one() has been modified to emulate the BGP state
    machine: the smaller IP address should back off from the connection attempt.

    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     

01 Oct, 2015

4 commits

  • One global lock protecting hash-tables with 1024 buckets isn't
    efficient, and it shows up in massive systems with truckloads of RDS
    sockets serving multiple databases. The perf data clearly highlights
    the contention on the rw lock in these massive workloads.

    When the contention gets worse, the code gets into a state where it
    decides to back off on the lock. So while it has interrupts disabled,
    it sits and backs off on this lock acquisition. This causes the system
    to become sluggish, and eventually all sorts of bad things happen.

    The simple fix is to move the lock into the hash bucket and use a
    per-bucket lock to improve the scalability.
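
    An illustrative per-bucket layout (a hypothetical struct; the actual
    patch may arrange the lock and chain differently):

      #define BIND_HASH_SIZE 1024

      struct bind_hash_bucket {
              rwlock_t          lock;   /* protects only this chain */
              struct hlist_head chain;
      };

      static struct bind_hash_bucket bind_hash_table[BIND_HASH_SIZE];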

    Signed-off-by: Santosh Shilimkar
    Signed-off-by: Santosh Shilimkar

    Santosh Shilimkar
     
  • One needs to take an RDS socket reference while using it, and release
    it once done. The rds_add_bind() code path does not do that, so let's
    fix it.

    Signed-off-by: Santosh Shilimkar
    Signed-off-by: Santosh Shilimkar

    Santosh Shilimkar
     
  • The RDS bind and release locking scheme is very inefficient. It uses
    RCU for maintaining the bind hash-table, which is great, but it also
    needs to hold a spinlock for [add/remove]_bound(). So for the overall
    use case, the hash-table concurrency speedup doesn't pay off. In fact,
    the blocking nature of synchronize_rcu() makes RDS socket shutdown too
    slow, which hurts RDS performance, since connection shutdown and
    re-connect happen quite often to maintain the RC part of the protocol.

    So we make the locking scheme simpler and more efficient by replacing
    the spinlocks with reader/writer locks and getting rid of RCU for the
    bind hash-table.

    In a subsequent patch, we also convert the global lock to a per-bucket
    lock to reduce the global lock contention.

    Signed-off-by: Santosh Shilimkar
    Signed-off-by: Santosh Shilimkar

    Santosh Shilimkar
     
  • synchronize_rcu() unnecessarily slows down the socket shutdown path.
    It is used just to kfree() the IP addresses in rds_ib_remove_ipaddr(),
    which is a perfect use case for kfree_rcu().

    So let's use that to gain some speedup.
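
    A sketch of the conversion, assuming the structure embeds a struct
    rcu_head (names illustrative):

      struct rds_ib_ipaddr {
              struct list_head list;
              __be32           ipaddr;
              struct rcu_head  rcu;
      };

      /* Before: blocks the shutdown path for a full grace period. */
      list_del_rcu(&i_ipaddr->list);
      synchronize_rcu();
      kfree(i_ipaddr);

      /* After: free asynchronously once readers are done. */
      list_del_rcu(&i_ipaddr->list);
      kfree_rcu(i_ipaddr, rcu);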

    Signed-off-by: Santosh Shilimkar
    Signed-off-by: Santosh Shilimkar

    Santosh Shilimkar
     

11 Sep, 2015

1 commit

  • Pull networking fixes from David Miller:

    1) Fix out-of-bounds array access in netfilter ipset, from Jozsef
    Kadlecsik.

    2) Use correct free operation on netfilter conntrack templates, from
    Daniel Borkmann.

    3) Fix route leak in SCTP, from Marcelo Ricardo Leitner.

    4) Fix sizeof(pointer) in mac80211, from Thierry Reding.

    5) Fix cache pointer comparison in ip6mr leading to missed unlock of
    mrt_lock. From Richard Laing.

    6) rds_conn_lookup() needs to consider network namespace in key
    comparison, from Sowmini Varadhan.

    7) Fix deadlock in TIPC code wrt broadcast link wakeups, from Kolmakov
    Dmitriy.

    8) Fix fd leaks in bpf syscall, from Daniel Borkmann.

    9) Fix error recovery when installing ipv6 multipath routes, we would
    delete the old route before we would know if we could fully commit
    to the new set of nexthops. Fix from Roopa Prabhu.

    10) Fix run-time suspend problems in r8152, from Hayes Wang.

    11) In fec, don't program the MAC address into the chip when the clocks
    are gated off. From Fugang Duan.

    12) Fix poll behavior for netlink sockets when using rx ring mmap, from
    Daniel Borkmann.

    13) Don't allocate memory with GFP_KERNEL from get_stats64 in r8169
    driver, from Corinna Vinschen.

    14) In TCP Cubic congestion control, handle idle periods better where we
    are application limited, in order to keep cwnd from growing out of
    control. From Eric Dumazet.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (65 commits)
    tcp_cubic: better follow cubic curve after idle period
    tcp: generate CA_EVENT_TX_START on data frames
    xen-netfront: respect user provided max_queues
    xen-netback: respect user provided max_queues
    r8169: Fix sleeping function called during get_stats64, v2
    ether: add IEEE 1722 ethertype - TSN
    netlink, mmap: fix edge-case leakages in nf queue zero-copy
    netlink, mmap: don't walk rx ring on poll if receive queue non-empty
    cxgb4: changes for new firmware 1.14.4.0
    net: fec: add netif status check before set mac address
    r8152: fix the runtime suspend issues
    r8152: split DRIVER_VERSION
    ipv6: fix ifnullfree.cocci warnings
    add microchip LAN88xx phy driver
    stmmac: fix check for phydev being open
    net: qlcnic: delete redundant memsets
    net: mv643xx_eth: use kzalloc
    net: jme: use kzalloc() instead of kmalloc+memset
    net: cavium: liquidio: use kzalloc in setup_glist()
    net: ipv6: use common fib_default_rule_pref
    ...

    Linus Torvalds
     

10 Sep, 2015

1 commit

  • There was no verification that an underlying transport exists when
    creating a connection; this would cause a NULL pointer dereference.

    It might happen on sockets that weren't properly bound before attempting to
    send a message, which will cause a NULL ptr deref:

    [135546.047719] kasan: GPF could be caused by NULL-ptr deref or user memory accessgeneral protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
    [135546.051270] Modules linked in:
    [135546.051781] CPU: 4 PID: 15650 Comm: trinity-c4 Not tainted 4.2.0-next-20150902-sasha-00041-gbaa1222-dirty #2527
    [135546.053217] task: ffff8800835bc000 ti: ffff8800bc708000 task.ti: ffff8800bc708000
    [135546.054291] RIP: __rds_conn_create (net/rds/connection.c:194)
    [135546.055666] RSP: 0018:ffff8800bc70fab0 EFLAGS: 00010202
    [135546.056457] RAX: dffffc0000000000 RBX: 0000000000000f2c RCX: ffff8800835bc000
    [135546.057494] RDX: 0000000000000007 RSI: ffff8800835bccd8 RDI: 0000000000000038
    [135546.058530] RBP: ffff8800bc70fb18 R08: 0000000000000001 R09: 0000000000000000
    [135546.059556] R10: ffffed014d7a3a23 R11: ffffed014d7a3a21 R12: 0000000000000000
    [135546.060614] R13: 0000000000000001 R14: ffff8801ec3d0000 R15: 0000000000000000
    [135546.061668] FS: 00007faad4ffb700(0000) GS:ffff880252000000(0000) knlGS:0000000000000000
    [135546.062836] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [135546.063682] CR2: 000000000000846a CR3: 000000009d137000 CR4: 00000000000006a0
    [135546.064723] Stack:
    [135546.065048] ffffffffafe2055c ffffffffafe23fc1 ffffed00493097bf ffff8801ec3d0008
    [135546.066247] 0000000000000000 00000000000000d0 0000000000000000 ac194a24c0586342
    [135546.067438] 1ffff100178e1f78 ffff880320581b00 ffff8800bc70fdd0 ffff880320581b00
    [135546.068629] Call Trace:
    [135546.069028] ? __rds_conn_create (include/linux/rcupdate.h:856 net/rds/connection.c:134)
    [135546.069989] ? rds_message_copy_from_user (net/rds/message.c:298)
    [135546.071021] rds_conn_create_outgoing (net/rds/connection.c:278)
    [135546.071981] rds_sendmsg (net/rds/send.c:1058)
    [135546.072858] ? perf_trace_lock (include/trace/events/lock.h:38)
    [135546.073744] ? lockdep_init (kernel/locking/lockdep.c:3298)
    [135546.074577] ? rds_send_drop_to (net/rds/send.c:976)
    [135546.075508] ? __might_fault (./arch/x86/include/asm/current.h:14 mm/memory.c:3795)
    [135546.076349] ? __might_fault (mm/memory.c:3795)
    [135546.077179] ? rds_send_drop_to (net/rds/send.c:976)
    [135546.078114] sock_sendmsg (net/socket.c:611 net/socket.c:620)
    [135546.078856] SYSC_sendto (net/socket.c:1657)
    [135546.079596] ? SYSC_connect (net/socket.c:1628)
    [135546.080510] ? trace_dump_stack (kernel/trace/trace.c:1926)
    [135546.081397] ? ring_buffer_unlock_commit (kernel/trace/ring_buffer.c:2479 kernel/trace/ring_buffer.c:2558 kernel/trace/ring_buffer.c:2674)
    [135546.082390] ? trace_buffer_unlock_commit (kernel/trace/trace.c:1749)
    [135546.083410] ? trace_event_raw_event_sys_enter (include/trace/events/syscalls.h:16)
    [135546.084481] ? do_audit_syscall_entry (include/trace/events/syscalls.h:16)
    [135546.085438] ? trace_buffer_unlock_commit (kernel/trace/trace.c:1749)
    [135546.085515] rds_ib_laddr_check(): addr 36.74.25.172 ret -99 node type -1

    Acked-by: Santosh Shilimkar
    Signed-off-by: Sasha Levin
    Signed-off-by: David S. Miller

    Sasha Levin
     

09 Sep, 2015

1 commit

  • Pull infiniband/rdma updates from Doug Ledford:
    "This is a fairly sizeable set of changes. I've put them through a
    decent amount of testing prior to sending the pull request due to
    that.

    There are still a few fixups that I know are coming, but I wanted to
    go ahead and get the big, sizable chunk into your hands sooner rather
    than waiting for those last few fixups.

    Of note is the fact that this creates what is intended to be a
    temporary area in the drivers/staging tree specifically for some
    cleanups and additions that are coming for the RDMA stack. We
    deprecated two drivers (ipath and amso1100) and are waiting to hear
    back if we can deprecate another one (ehca). We also put Intel's new
    hfi1 driver into this area because it needs to be refactored and a
    transfer library created out of the factored out code, and then it and
    the qib driver and the soft-roce driver should all be modified to use
    that library.

    I expect drivers/staging/rdma to be around for three or four kernel
    releases and then to go away as all of the work is completed and final
    deletions of deprecated drivers are done.

    Summary of changes for 4.3:

    - Create drivers/staging/rdma
    - Move amso1100 driver to staging/rdma and schedule for deletion
    - Move ipath driver to staging/rdma and schedule for deletion
    - Add hfi1 driver to staging/rdma and set TODO for move to regular
    tree
    - Initial support for namespaces to be used on RDMA devices
    - Add RoCE GID table handling to the RDMA core caching code
    - Infrastructure to support handling of devices with differing read
    and write scatter gather capabilities
    - Various iSER updates
    - Kill off unsafe usage of global mr registrations
    - Update SRP driver
    - Misc mlx4 driver updates
    - Support for the mr_alloc verb
    - Support for a netlink interface between kernel and user space cache
    daemon to speed path record queries and route resolution
    - Initial support for safe hot removal of verbs devices"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (136 commits)
    IB/ipoib: Suppress warning for send only join failures
    IB/ipoib: Clean up send-only multicast joins
    IB/srp: Fix possible protection fault
    IB/core: Move SM class defines from ib_mad.h to ib_smi.h
    IB/core: Remove unnecessary defines from ib_mad.h
    IB/hfi1: Add PSM2 user space header to header_install
    IB/hfi1: Add CSRs for CONFIG_SDMA_VERBOSITY
    mlx5: Fix incorrect wc pkey_index assignment for GSI messages
    IB/mlx5: avoid destroying a NULL mr in reg_user_mr error flow
    IB/uverbs: reject invalid or unknown opcodes
    IB/cxgb4: Fix if statement in pick_local_ip6adddrs
    IB/sa: Fix rdma netlink message flags
    IB/ucma: HW Device hot-removal support
    IB/mlx4_ib: Disassociate support
    IB/uverbs: Enable device removal when there are active user space applications
    IB/uverbs: Explicitly pass ib_dev to uverbs commands
    IB/uverbs: Fix race between ib_uverbs_open and remove_one
    IB/uverbs: Fix reference counting usage of event files
    IB/core: Make ib_dealloc_pd return void
    IB/srp: Create an insecure all physical rkey only if needed
    ...

    Linus Torvalds
     

31 Aug, 2015

3 commits

  • The majority of callers never check the return value, and even if they
    did, they can't do anything about a failure.

    All possible failure cases represent a bug in the caller, so just
    WARN_ON inside the function instead.

    This fixes a few random errors:
    net/rds/iw.c infinite loops while it fails (racing with EBUSY?).

    This also lays the groundwork to get rid of error returns from the
    drivers. Most drivers do not error; the few that do are broken, since
    the failure cannot be handled.

    Since uverbs can legitimately make use of EBUSY, open code the
    check.

    Signed-off-by: Jason Gunthorpe
    Reviewed-by: Chuck Lever
    Signed-off-by: Doug Ledford

    Jason Gunthorpe
     
  • The pd now has a local_dma_lkey member which completely replaces
    ib_get_dma_mr, use it instead.
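
    A sketch of the substitution this enables (assuming a caller that
    previously kept a DMA MR around just for its lkey):

      /* Before: allocate an all-memory DMA MR only to get an lkey. */
      mr = ib_get_dma_mr(pd, IB_ACCESS_LOCAL_WRITE);
      sge.lkey = mr->lkey;

      /* After: the PD already carries a device-wide local DMA lkey. */
      sge.lkey = pd->local_dma_lkey;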

    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Doug Ledford

    Jason Gunthorpe
     
  • Signed-off-by: Sagi Grimberg
    Signed-off-by: Doug Ledford

    Sagi Grimberg