25 Nov, 2015

1 commit

  • Sasha's found a NULL pointer dereference in the RDS connection code when
    sending a message to an apparently unbound socket. The problem is caused
    by the bound-socket check in rds_sendmsg(), which reads the rs_bound_addr
    field without taking a lock on the socket. This opens a race window in
    rds_bind() where rs_bound_addr has already been set but the transport has
    not, leading to a NULL pointer dereference when 'trans' is dereferenced
    in __rds_conn_create().
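
    A minimal sketch of the locking idea, assuming the fix takes the socket
    lock around the bound-address check in rds_sendmsg() (simplified):

      lock_sock(sk);
      if (daddr == 0 || rs->rs_bound_addr == 0) {
              release_sock(sk);
              ret = -ENOTCONN; /* bind is incomplete: don't create a conn */
              goto out;
      }
      release_sock(sk);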

    Vegard wrote a reproducer for this issue, so kindly ask him to share if
    you're interested.

    I cannot reproduce the NULL pointer dereference using Vegard's reproducer
    with this patch, whereas I could without.

    Complete earlier incomplete fix to CVE-2015-6937:

    74e98eb08588 ("RDS: verify the underlying transport exists before creating a connection")

    Cc: David S. Miller
    Cc: stable@vger.kernel.org

    Reviewed-by: Vegard Nossum
    Reviewed-by: Sasha Levin
    Acked-by: Santosh Shilimkar
    Signed-off-by: Quentin Casasnovas
    Signed-off-by: David S. Miller

    Quentin Casasnovas
     

08 Nov, 2015

2 commits

  • Merge second patch-bomb from Andrew Morton:

    - most of the rest of MM

    - procfs

    - lib/ updates

    - printk updates

    - bitops infrastructure tweaks

    - checkpatch updates

    - nilfs2 update

    - signals

    - various other misc bits: coredump, seqfile, kexec, pidns, zlib, ipc,
    dma-debug, dma-mapping, ...

    * emailed patches from Andrew Morton: (102 commits)
    ipc,msg: drop dst nil validation in copy_msg
    include/linux/zutil.h: fix usage example of zlib_adler32()
    panic: release stale console lock to always get the logbuf printed out
    dma-debug: check nents in dma_sync_sg*
    dma-mapping: tidy up dma_parms default handling
    pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode
    kexec: use file name as the output message prefix
    fs, seqfile: always allow oom killer
    seq_file: reuse string_escape_str()
    fs/seq_file: use seq_* helpers in seq_hex_dump()
    coredump: change zap_threads() and zap_process() to use for_each_thread()
    coredump: ensure all coredumping tasks have SIGNAL_GROUP_COREDUMP
    signal: remove jffs2_garbage_collect_thread()->allow_signal(SIGCONT)
    signal: introduce kernel_signal_stop() to fix jffs2_garbage_collect_thread()
    signal: turn dequeue_signal_lock() into kernel_dequeue_signal()
    signals: kill block_all_signals() and unblock_all_signals()
    nilfs2: fix gcc uninitialized-variable warnings in powerpc build
    nilfs2: fix gcc unused-but-set-variable warnings
    MAINTAINERS: nilfs2: add header file for tracing
    nilfs2: add tracepoints for analyzing reading and writing metadata files
    ...

    Linus Torvalds
     
  • Pull rdma updates from Doug Ledford:
    "This is my initial round of 4.4 merge window patches. There are a few
    other things I wish to get in for 4.4 that aren't in this pull, as
    this represents what has gone through merge/build/run testing and not
    what is the last few items for which testing is not yet complete.

    - "Checksum offload support in user space" enablement
    - Misc cxgb4 fixes, add T6 support
    - Misc usnic fixes
    - 32 bit build warning fixes
    - Misc ocrdma fixes
    - Multicast loopback prevention extension
    - Extend the GID cache to store and return attributes of GIDs
    - Misc iSER updates
    - iSER clustering update
    - Network NameSpace support for rdma CM
    - Work Request cleanup series
    - New Memory Registration API"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (76 commits)
    IB/core, cma: Make __attribute_const__ declarations sparse-friendly
    IB/core: Remove old fast registration API
    IB/ipath: Remove fast registration from the code
    IB/hfi1: Remove fast registration from the code
    RDMA/nes: Remove old FRWR API
    IB/qib: Remove old FRWR API
    iw_cxgb4: Remove old FRWR API
    RDMA/cxgb3: Remove old FRWR API
    RDMA/ocrdma: Remove old FRWR API
    IB/mlx4: Remove old FRWR API support
    IB/mlx5: Remove old FRWR API support
    IB/srp: Dont allocate a page vector when using fast_reg
    IB/srp: Remove srp_finish_mapping
    IB/srp: Convert to new registration API
    IB/srp: Split srp_map_sg
    RDS/IW: Convert to new memory registration API
    svcrdma: Port to new memory registration API
    xprtrdma: Port to new memory registration API
    iser-target: Port to new memory registration API
    IB/iser: Port to new fast registration API
    ...

    Linus Torvalds
     

07 Nov, 2015

1 commit

  • mm, page_alloc: distinguish between being unable to sleep, unwilling to
    sleep and avoiding waking kswapd

    __GFP_WAIT has been used to identify atomic context in callers that hold
    spinlocks or are in interrupts. They are expected to be high priority and
    have access to one of two watermarks lower than "min", which can be
    referred to as the "atomic reserve". __GFP_HIGH users get access to the
    first lower watermark and can be called the "high priority reserve".

    Over time, callers had a requirement to not block when fallback options
    were available. Some have abused __GFP_WAIT, leading to a situation where
    an optimistic allocation with a fallback option can access atomic
    reserves.

    This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
    cannot sleep and have no alternative. High priority users continue to use
    __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
    are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies
    callers that want to wake kswapd for background reclaim. __GFP_WAIT is
    redefined as a caller that is willing to enter direct reclaim and wake
    kswapd for background reclaim.

    This patch then converts a number of sites:

    o __GFP_ATOMIC is used by callers that are high priority and have memory
    pools for those requests. GFP_ATOMIC uses this flag.

    o Callers that have a limited mempool to guarantee forward progress clear
    __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
    into this category where kswapd will still be woken but atomic reserves
    are not used as there is a one-entry mempool to guarantee progress.

    o Callers that are checking if they are non-blocking should use the
    helper gfpflags_allow_blocking() where possible (see the sketch after
    this list), because checking for __GFP_WAIT as was done historically can
    now trigger false positives. Some exceptions like dm-crypt.c exist, where
    the code's intent is clearer if __GFP_DIRECT_RECLAIM is used instead of
    the helper due to flag manipulations.

    o Callers that built their own GFP flags instead of starting with GFP_KERNEL
    and friends now also need to specify __GFP_KSWAPD_RECLAIM.
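
    A minimal sketch of the helper's use, assuming a caller that branches on
    whether blocking is allowed (gfpflags_allow_blocking() simply tests
    __GFP_DIRECT_RECLAIM):

      static void *fill_buffer(size_t len, gfp_t gfp)
      {
              /* Blocking allowed: the allocation may enter direct reclaim. */
              if (gfpflags_allow_blocking(gfp))
                      return kmalloc(len, gfp);

              /* Atomic context: fail fast rather than sleep. */
              return kmalloc(len, GFP_NOWAIT);
      }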

    The first key hazard to watch out for is callers that removed __GFP_WAIT
    and were depending on access to atomic reserves for inconspicuous
    reasons. In some cases it may be appropriate for them to use __GFP_HIGH.

    The second key hazard is callers that assembled their own combination of
    GFP flags instead of starting with something like GFP_KERNEL. They may
    now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
    if it's missed in most cases as other activity will wake kswapd.

    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Vitaly Wool <vitalywool@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     

03 Nov, 2015

2 commits

  • To further improve RDS connection scalability on massive systems where
    the number of sockets grows into the tens of thousands, there is a need
    for a larger bind hashtable. A pre-allocated 8K or 16K table is not very
    flexible in terms of memory utilisation. The rhashtable infrastructure
    gives us the flexibility to grow the hashtable based on use, and also
    comes with efficient built-in bucket (chain) handling.
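
    A sketch of the rhashtable parameters such a conversion might use (the
    values and rds field names here are illustrative; the params fields
    follow the kernel's rhashtable API):

      static const struct rhashtable_params rs_bind_params = {
              .nelem_hint  = 768,
              .key_len     = sizeof(u64),    /* addr + port packed key */
              .key_offset  = offsetof(struct rds_sock, rs_bound_key),
              .head_offset = offsetof(struct rds_sock, rs_bound_node),
              .max_size    = 16384,
              .min_size    = 1024,
      };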

    Reviewed-by: David Miller
    Signed-off-by: Santosh Shilimkar
    Signed-off-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    santosh.shilimkar@oracle.com
     
  • As the result of rds_iw_flush_mr_pool() is checked nowhere, change its
    return type from int to void. Also remove the now-unused variable rc,
    as there is nothing to return.

    Signed-off-by: Saurabh Sengar
    Signed-off-by: David S. Miller

    Saurabh Sengar
     

29 Oct, 2015

3 commits

  • Get rid of fast_reg page list and its construction.
    Instead, just pass the RDS sg list to ib_map_mr_sg
    and post the new ib_reg_wr.

    This is done both for server IW RDMA_READ registration
    and the client remote key registration.

    Signed-off-by: Sagi Grimberg
    Acked-by: Christoph Hellwig
    Acked-by: Santosh Shilimkar
    Signed-off-by: Doug Ledford

    Sagi Grimberg
     
  • Doug Ledford
     
  • Add support for network namespaces in the ib_cma module. This is
    accomplished by:

    1. Adding network namespace parameter for rdma_create_id. This parameter is
    used to populate the network namespace field in rdma_id_private.
    rdma_create_id keeps a reference on the network namespace.
    2. Using the network namespace from the rdma_id instead of init_net inside
    of ib_cma, when listening on an ID and when looking for an ID for an
    incoming request.
    3. Decrementing the reference count for the appropriate network namespace
    when calling rdma_destroy_id.

    In order to preserve the current behavior init_net is passed when calling
    from other modules.
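
    A minimal sketch of the resulting call pattern, assuming the namespace
    becomes the first argument (my_event_handler and my_ctx are hypothetical;
    in-kernel callers pass init_net to preserve current behavior):

      struct rdma_cm_id *id;

      /* Create a CM ID bound to the caller's network namespace. */
      id = rdma_create_id(&init_net, my_event_handler, my_ctx,
                          RDMA_PS_TCP, IB_QPT_RC);
      if (IS_ERR(id))
              return PTR_ERR(id);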

    Signed-off-by: Guy Shapiro
    Signed-off-by: Haggai Eran
    Signed-off-by: Yotam Kenneth
    Signed-off-by: Shachar Raindel
    Signed-off-by: Doug Ledford

    Guy Shapiro
     

28 Oct, 2015

1 commit

  • Either of pskb_pull() or pskb_trim() may fail under low memory conditions.
    If rds_tcp_data_recv() ignores such failures, the application will
    receive corrupted data because the skb has not been correctly
    carved to the RDS datagram size.

    Avoid this by handling pskb_pull()/pskb_trim() failure in the same
    manner as skb_clone() failure: bail out of rds_tcp_data_recv(), and
    retry via the deferred call to rds_send_worker() that gets set up on
    ENOMEM from rds_tcp_read_sock().
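
    A hedged sketch of the check (variable names are illustrative, not the
    exact rds_tcp_data_recv() code); note that pskb_pull() fails by
    returning NULL while pskb_trim() fails by returning non-zero:

      clone = skb_clone(skb, arg->gfp);
      if (!clone || !pskb_pull(clone, offset) || pskb_trim(clone, to_copy)) {
              kfree_skb(clone);       /* safe on NULL */
              desc->error = -ENOMEM;  /* rds_tcp_read_sock() defers a retry */
              goto out;
      }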

    Signed-off-by: Sowmini Varadhan
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     

19 Oct, 2015

1 commit

  • Sowmini found a hang with rds-ping while testing RDS over TCP. It's a
    corner case and doesn't always happen. The issue is not reproducible
    with the IB transport. It's clear from the dump below why we see it
    with RDS TCP.

    [] do_tcp_setsockopt+0xb5/0x740
    [] tcp_setsockopt+0x24/0x30
    [] sock_common_setsockopt+0x14/0x20
    [] rds_tcp_xmit_prepare+0x5d/0x70 [rds_tcp]
    [] rds_send_xmit+0xd7/0x740 [rds]
    [] rds_send_pong+0x142/0x180 [rds]
    [] rds_recv_incoming+0x274/0x330 [rds]
    [] ? ttwu_queue+0x11e/0x130
    [] ? skb_copy_bits+0x6d/0x2c0
    [] rds_tcp_data_recv+0x2f0/0x3d0 [rds_tcp]
    [] tcp_read_sock+0x96/0x1c0
    [] ? rds_tcp_recv_init+0x40/0x40 [rds_tcp]
    [] ? sock_def_write_space+0xa0/0xa0
    [] rds_tcp_data_ready+0xa1/0xf0 [rds_tcp]
    [] tcp_data_queue+0x379/0x5b0
    [] ? rds_tcp_write_space+0xbb/0x110 [rds_tcp]
    [] tcp_rcv_established+0x2e2/0x6e0
    [] tcp_v4_do_rcv+0x122/0x220
    [] tcp_v4_rcv+0x867/0x880
    [] ip_local_deliver_finish+0xa3/0x220

    This happens because the rds_send_xmit() chain wants to take the
    sock_lock, which is already taken by tcp_v4_rcv() on its way to
    rds_tcp_data_ready(). Commit db6526dcb51b ("RDS: use
    rds_send_xmit() state instead of RDS_LL_SEND_FULL") was trying to
    opportunistically finish the send request in the same thread context.

    But because of the recursive lock hang above with RDS TCP, the send
    work from rds_send_pong() needs to be deferred to the worker to avoid
    the lockup. Given that an RDS ping is more of a connectivity test than
    a performance-critical path, this should be OK even for transports
    like IB.

    Reported-by: Sowmini Varadhan
    Acked-by: Sowmini Varadhan
    Signed-off-by: Santosh Shilimkar
    Signed-off-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    santosh.shilimkar@oracle.com
     

13 Oct, 2015

2 commits

  • Consider the following "duelling syn" sequence between two peers A and B:
           A                B
           SYN1      -->
                     <--    SYN2
           SYN2ACK   -->
                     <--    SYN1ACK

    Note that the SYN/ACK has already been sent out by TCP before
    rds_tcp_accept_one() gets invoked as part of callbacks.

    If the inet_addr(A) is numerically less than inet_addr(B),
    the arbitration scheme in rds_tcp_accept_one() will prefer the
    TCP connection triggered by SYN1, and will send a CLOSE for the
    SYN2 (just after the SYN2ACK was sent).

    Since B also follows the same arbitration scheme, it will send the SYN-ACK
    for SYN1 that will set up a healthy ESTABLISHED connection on both sides.
    B will also get a CLOSE for SYN2, which should result in the cleanup
    of the TCP state machine for SYN2, but it should not trigger any
    stale RDS-TCP callbacks (such as ->writespace, ->state_change etc),
    that would disrupt the progress of the SYN2 based RDS-TCP connection.

    Thus the arbitration scheme in rds_tcp_accept_one() should restore
    rds_tcp callbacks for the winner before setting them up for the
    new accept socket, and also make sure that conn->c_outgoing
    is set to 0 so that we do not trigger any reconnect attempts on the
    passive side of the tcp socket in the future, in conformance with
    commit c82ac7e69efe ("net/rds: RDS-TCP: only initiate reconnect attempt
    on outgoing TCP socket.")
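
    A hedged sketch of the arbitration rule (an illustrative helper, not
    the exact rds_tcp_accept_one() code): the duel is won by the connection
    initiated by the numerically smaller address.

      /* True if the incoming connection (from faddr) should be kept. */
      static bool rds_tcp_keep_incoming(__be32 laddr, __be32 faddr)
      {
              return ntohl(faddr) < ntohl(laddr);
      }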

    Signed-off-by: Sowmini Varadhan
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     
  • The IP address passed to rds_bind() should be vetted by the
    transport's ->laddr_check() for a previously bound transport.
    This needs to be done to avoid cases where, for example,
    the application has asked for an IB transport,
    but the IP address passed to bind is only usable on
    ethernet interfaces.

    Signed-off-by: Sowmini Varadhan
    Acked-by: Santosh Shilimkar
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     

08 Oct, 2015

2 commits

  • Santosh Shilimkar says:

    ====================
    RDS: connection scalability and performance improvements

    [v4]
    Re-sending the same patches from v3 again since my repost of
    patch 05/14 from v3 was whitespace damaged.

    [v3]
    Updated patch "[PATCH v2 05/14] RDS: defer the over_batch work to
    send worker" as per David Miller's comment [4] to avoid the magic
    value usage. Patch now makes use of already available but unused
    send_batch_count module parameter. Rest of the patches are same as
    earlier version v2 [3]

    [v2]:
    Dropped "[PATCH 05/15] RDS: increase size of hash-table to 8K" from
    earlier version [1]. I plan to address the hash table scalability using
    re-sizable hash tables as suggested by David Laight and David Miller [2]

    This series addresses RDS connection bottlenecks on massive workloads
    and improves RDMA performance by almost 3X. RDS TCP also gets a small
    gain of about 12%.

    RDS is being used in massive systems with high scalability, where
    several hundred thousand end points and tens of thousands of local
    processes are operating over tens of thousands of sockets. Being RC
    (reliable connection) based, socket bind and release happen very often,
    and any inefficiency in bind hash lookups hurts the overall system
    performance. The RDS bind hash-table uses a global spin-lock, which is
    the biggest bottleneck. To make matters worse, it uses RCU inside the
    global lock for the hash buckets.
    This is addressed by simply using a per-bucket rw lock, which makes the
    locking simple and very efficient. The hash table size is still an issue
    and I plan to address it by using re-sizable hash tables as suggested on
    the list.

    For the RDS RDMA improvement, the completion handling is revamped so
    that we can do batched completions. Both send and receive completion
    handlers are split logically to achieve this. With RDS 8K messages being
    one of the key use cases, the MR pool is adapted to hold 8K MRs along
    with the default 1M MRs. While doing this, a few fixes and a couple of
    bottlenecks seen with rds_sendmsg() are addressed.

    The series applies against 4.3-rc1 as well as net-next. It's tested on
    Oracle hardware with an IB fabric for both bcopy and RDMA modes. RDS TCP
    is tested with an iXGB NIC. Like last time, the iWARP transport is
    untested with these changes. The patchset is also available at the git
    repo below:

    git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux.git net/rds/4.3-v3

    As a side note, the IB HCA driver I used for testing misses at least 3
    important patches upstream needed to see the full-blown IB performance,
    and I am hoping to get those into mainline with their help.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • This patch splits up struct ib_send_wr so that all non-trivial verbs
    use their own structure which embeds struct ib_send_wr. This
    dramatically shrinks the size of a WR for most common operations:

    sizeof(struct ib_send_wr) (old): 96

    sizeof(struct ib_send_wr): 48
    sizeof(struct ib_rdma_wr): 64
    sizeof(struct ib_atomic_wr): 96
    sizeof(struct ib_ud_wr): 88
    sizeof(struct ib_fast_reg_wr): 88
    sizeof(struct ib_bind_mw_wr): 96
    sizeof(struct ib_sig_handover_wr): 80

    And with Sagi's pending MR rework the fast registration WR will also be
    down to a reasonable size:

    sizeof(struct ib_fastreg_wr): 64
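
    A sketch of the embedding pattern described above (the shape matches
    the commit's description; the full field lists live in the real
    headers):

      struct ib_rdma_wr {
              struct ib_send_wr wr;          /* generic fields come first */
              u64               remote_addr;
              u32               rkey;
      };

      /* Drivers recover the specific WR from the generic one. */
      static inline struct ib_rdma_wr *rdma_wr(struct ib_send_wr *wr)
      {
              return container_of(wr, struct ib_rdma_wr, wr);
      }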

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Bart Van Assche [srp, srpt]
    Reviewed-by: Chuck Lever [sunrpc]
    Tested-by: Haggai Eran
    Tested-by: Sagi Grimberg
    Tested-by: Steve Wise

    Christoph Hellwig
     

06 Oct, 2015

10 commits

  • 8K message sizes are a pretty important use case for current RDS
    workloads, so we make provision to have 8K MRs available from the pool.
    Based on the number of SGs in the RDS message, we pick a pool to use.

    Also, to make sure that we don't under-utilise MRs when, say, 8K
    messages are dominating, which could lead to the 8K pool being
    exhausted, we fall back to the 1M pool till the 8K pool recovers.

    This helps to at least push ~55 kB/s bidirectional data, which
    is a nice improvement.
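
    A hedged sketch of the pool selection (the pool and field names here
    are illustrative, not the exact rds_ib code):

      /* Prefer the 8K pool for small messages, unless it is exhausted. */
      if (nr_sg <= RDS_MR_8K_MSG_SIZE && !pool_exhausted(rds_ibdev->mr_8k_pool))
              pool = rds_ibdev->mr_8k_pool;
      else
              pool = rds_ibdev->mr_1m_pool;  /* fall back till 8K recovers */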

    Signed-off-by: Santosh Shilimkar
    Signed-off-by: Santosh Shilimkar

    Santosh Shilimkar
     
  • All HCA drivers seem to populate the max_mr caps, and a few of
    them populate both max_mr and max_fmr.

    Hence update the RDS code to make use of max_mr.

    Signed-off-by: Santosh Shilimkar
    Signed-off-by: Santosh Shilimkar

    Santosh Shilimkar
     
  • Fix the warning below by marking rds_ib_fmr_wq static:

    net/rds/ib_rdma.c:87:25: warning: symbol 'rds_ib_fmr_wq' was not declared. Should it be static?

    Signed-off-by: Santosh Shilimkar
    Signed-off-by: Santosh Shilimkar

    Santosh Shilimkar
     
  • rds_ib_mr already keeps the pool handle it is associated with. Let's
    use that instead of the roundabout way of fetching it from
    rds_ib_device.

    No functional change.

    Signed-off-by: Santosh Shilimkar
    Signed-off-by: Santosh Shilimkar

    Santosh Shilimkar
     
  • The RDS IB MR pool has its own workqueue, 'rds_ib_fmr_wq', so we need
    to use queue_delayed_work() on it to kick the work. This was hurting
    performance, since pool maintenance was triggered less often from the
    other path.

    Signed-off-by: Santosh Shilimkar
    Signed-off-by: Santosh Shilimkar

    Santosh Shilimkar
     
  • Just in case we are still handling the QP receive completion while the
    rds_ibdev is released, drop the connection instead of crashing the kernel.

    Signed-off-by: Santosh Shilimkar
    Signed-off-by: Santosh Shilimkar

    Santosh Shilimkar
     
  • Similar to what we did with receive CQ completion handling, we split
    the transmit completion handler so that it lets us implement batched
    work-completion handling.

    We re-use the cq_poll routine and make use of RDS_IB_SEND_OP to
    distinguish send vs receive completion event handler invocations.

    Signed-off-by: Santosh Shilimkar
    Signed-off-by: Santosh Shilimkar

    Santosh Shilimkar
     
  • For better performance, we split the receive completion IRQ handler.
    That lets us acknowledge several WCE events in one call. We also limit
    the WCs to a maximum of 32 to avoid latency. Acknowledging several
    completions in one call instead of one call each will provide better
    performance, since fewer mutual-exclusion locks are taken.

    In the next patch, send completion is also split, re-using the
    poll_cq() routine, and hence the code is moved to ib_cm.c.
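
    A minimal sketch of the batched polling, assuming the 32-entry cap
    described above (the handler name is illustrative):

      struct ib_wc wcs[32];
      int nr, i;

      /* Drain completions in batches of up to 32 per ib_poll_cq() call. */
      while ((nr = ib_poll_cq(cq, ARRAY_SIZE(wcs), wcs)) > 0)
              for (i = 0; i < nr; i++)
                      rds_ib_handle_wc(&wcs[i]);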

    Signed-off-by: Santosh Shilimkar
    Signed-off-by: Santosh Shilimkar

    Santosh Shilimkar
     
  • In the transport-independent rds_sendmsg(), we shouldn't make decisions
    based on RDS_LL_SEND_FULL, which is used to manage the ring for
    RDMA-based transports. We can safely issue rds_send_xmit() and then use
    its return value to decide on deferred work. This will also fix the
    scenario where at times we see connections stuck with the LL_SEND_FULL
    bit set and never cleared.

    We kick krdsd any time we see -ENOMEM or -EAGAIN from the ring
    allocation code.

    Signed-off-by: Santosh Shilimkar
    Signed-off-by: Santosh Shilimkar

    Santosh Shilimkar
     
  • The current process gives up if its send work is over the batch limit.
    The work queue will get kicked to finish off any other requests.
    This fixes the remainder condition from commit 443be0e5affe ("RDS: make
    sure not to loop forever inside rds_send_xmit").

    The restart condition is only for the case where we reached the
    over_batch code for some other reason, so we just retry once more
    before giving up.

    While at it, make sure we use the already-available 'send_batch_count'
    module parameter instead of a magic value. The batch count threshold
    of 1024 came via commit 443be0e5affe ("RDS: make sure not to loop
    forever inside rds_send_xmit"). The idea is to process as big a batch
    as we can, but at the same time not hold up other processes waiting to
    send. Hence back off after the send_batch_count limit (1024) to avoid
    soft lockups.
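
    A hedged sketch of the back-off point (simplified; the real
    rds_send_xmit() flow carries more state):

      /* Stop after send_batch_count messages and let the krdsd worker
       * finish the remainder instead of looping in this context. */
      if (batch_count++ >= send_batch_count) {
              ret = -EAGAIN;
              goto over_batch;
      }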

    Signed-off-by: Santosh Shilimkar
    Signed-off-by: Santosh Shilimkar

    Santosh Shilimkar
     

05 Oct, 2015

3 commits

  • For the same reasons as commit 2f5338442425 ("tcp: allow splice() to
    build full TSO packets") and commit 35f9c09fe9c7 ("tcp: tcp_sendpages()
    should call tcp_push() once"), rds_tcp_xmit may have multiple pages to
    send, so use MSG_MORE and MSG_SENDPAGE_NOTLAST as hints to
    tcp_sendpage().
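
    A sketch of the hinting pattern, assuming every page but the last gets
    both flags (variable names are illustrative):

      unsigned int flags = MSG_DONTWAIT | MSG_NOSIGNAL;

      /* More pages coming: let TCP coalesce into full TSO packets. */
      if (!last_page)
              flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;

      ret = tcp_sendpage(sk, page, offset, len, flags);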

    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     
  • Using the value of RDS_TCP_DEFAULT_BUFSIZE (128K) clobbers efficient
    use of TSO because it inflates the size_goal that is computed in
    tcp_sendmsg/tcp_sendpage and skews packet latency; the default values
    for these parameters actually result in significantly better
    performance.

    In request-response tests using rds-stress with a packet size of
    100K with 16 threads (test parameters -q 100000 -a 256 -t16 -d16)
    between a single pair of IP addresses achieves a throughput of
    6-8 Gbps. Without this patch, throughput maxes at 2-3 Gbps under
    equivalent conditions on these platforms.

    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     
  • Commit f711a6ae062c ("net/rds: RDS-TCP: Always create a new rds_sock
    for an incoming connection.") modified rds-tcp so that an incoming SYN
    would ignore an existing "client" TCP connection which had the local
    port set to the transient port. The motivation for ignoring the existing
    "client" connection in f711a6ae was to avoid race conditions and an
    endless duel of reconnect attempts triggered by a restart/abort of one
    of the nodes in the TCP connection.

    However, having separate sockets for active and passive sides
    is avoidable, and the simpler model of a single TCP socket for
    both send and receives of all RDS connections associated with
    that tcp socket makes for easier observability. We avoid the race
    conditions from f711a6ae by attempting reconnects in rds_conn_shutdown
    if, and only if, the (new) c_outgoing bit is set for RDS_TRANS_TCP.
    The c_outgoing bit is initialized in __rds_conn_create().

    A side-effect of re-using the client rds_connection for an incoming
    SYN is the potential of encountering duelling SYNs, i.e., we
    have an outgoing RDS_CONN_CONNECTING socket when we get the incoming
    SYN. The logic to arbitrate this criss-crossing SYN exchange in
    rds_tcp_accept_one() has been modified to emulate the BGP state
    machine: the smaller IP address should back off from the connection attempt.

    Signed-off-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    Sowmini Varadhan
     

01 Oct, 2015

4 commits

  • One global lock protecting hash-tables with 1024 buckets isn't
    efficient, and it shows up in massive systems with truckloads of RDS
    sockets serving multiple databases. The perf data clearly highlights
    the contention on the rw lock in these massive workloads.

    When the contention gets worse, the code gets into a state where it
    decides to back off on the lock. So while it has interrupts disabled,
    it sits and backs off on this lock acquisition. This causes the system
    to become sluggish, and eventually all sorts of bad things happen.

    The simple fix is to move the lock into the hash bucket and use a
    per-bucket lock to improve the scalability.
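
    An illustrative per-bucket layout (a hypothetical struct; the actual
    patch may arrange the lock and chain differently):

      #define BIND_HASH_SIZE 1024

      struct bind_hash_bucket {
              rwlock_t          lock;   /* protects only this chain */
              struct hlist_head chain;
      };

      static struct bind_hash_bucket bind_hash_table[BIND_HASH_SIZE];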

    Signed-off-by: Santosh Shilimkar
    Signed-off-by: Santosh Shilimkar

    Santosh Shilimkar
     
  • One needs to take an RDS socket reference while using it, and release
    it once done. The rds_add_bind() code path does not do that, so let's
    fix it.

    Signed-off-by: Santosh Shilimkar
    Signed-off-by: Santosh Shilimkar

    Santosh Shilimkar
     
  • The RDS bind and release locking scheme is very inefficient. It uses
    RCU for maintaining the bind hash-table, which is great, but it also
    needs to hold a spinlock for [add/remove]_bound(). So for the overall
    use case, the hash-table concurrency speedup doesn't pay off. In fact,
    the blocking nature of synchronize_rcu() makes RDS socket shutdown too
    slow, which hurts RDS performance, since connection shutdown and
    re-connect happen quite often to maintain the RC part of the protocol.

    So we make the locking scheme simpler and more efficient by replacing
    the spinlocks with reader/writer locks and getting rid of RCU for the
    bind hash-table.

    In a subsequent patch, we also convert the global lock to a per-bucket
    lock to reduce the global lock contention.

    Signed-off-by: Santosh Shilimkar
    Signed-off-by: Santosh Shilimkar

    Santosh Shilimkar
     
  • synchronize_rcu() unnecessarily slows down the socket shutdown path.
    It is used just to kfree() the IP addresses in rds_ib_remove_ipaddr(),
    which is a perfect use case for kfree_rcu().

    So let's use that to gain some speedup.
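
    A sketch of the conversion, assuming the structure embeds a struct
    rcu_head (names illustrative):

      struct rds_ib_ipaddr {
              struct list_head list;
              __be32           ipaddr;
              struct rcu_head  rcu;
      };

      /* Before: blocks the shutdown path for a full grace period. */
      list_del_rcu(&i_ipaddr->list);
      synchronize_rcu();
      kfree(i_ipaddr);

      /* After: free asynchronously once readers are done. */
      list_del_rcu(&i_ipaddr->list);
      kfree_rcu(i_ipaddr, rcu);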

    Signed-off-by: Santosh Shilimkar
    Signed-off-by: Santosh Shilimkar

    Santosh Shilimkar
     

11 Sep, 2015

1 commit

  • Pull networking fixes from David Miller:

    1) Fix out-of-bounds array access in netfilter ipset, from Jozsef
    Kadlecsik.

    2) Use correct free operation on netfilter conntrack templates, from
    Daniel Borkmann.

    3) Fix route leak in SCTP, from Marcelo Ricardo Leitner.

    4) Fix sizeof(pointer) in mac80211, from Thierry Reding.

    5) Fix cache pointer comparison in ip6mr leading to missed unlock of
    mrt_lock. From Richard Laing.

    6) rds_conn_lookup() needs to consider network namespace in key
    comparison, from Sowmini Varadhan.

    7) Fix deadlock in TIPC code wrt broadcast link wakeups, from Kolmakov
    Dmitriy.

    8) Fix fd leaks in bpf syscall, from Daniel Borkmann.

    9) Fix error recovery when installing ipv6 multipath routes, we would
    delete the old route before we would know if we could fully commit
    to the new set of nexthops. Fix from Roopa Prabhu.

    10) Fix run-time suspend problems in r8152, from Hayes Wang.

    11) In fec, don't program the MAC address into the chip when the clocks
    are gated off. From Fugang Duan.

    12) Fix poll behavior for netlink sockets when using rx ring mmap, from
    Daniel Borkmann.

    13) Don't allocate memory with GFP_KERNEL from get_stats64 in r8169
    driver, from Corinna Vinschen.

    14) In TCP Cubic congestion control, handle idle periods better where we
    are application limited, in order to keep cwnd from growing out of
    control. From Eric Dumazet.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (65 commits)
    tcp_cubic: better follow cubic curve after idle period
    tcp: generate CA_EVENT_TX_START on data frames
    xen-netfront: respect user provided max_queues
    xen-netback: respect user provided max_queues
    r8169: Fix sleeping function called during get_stats64, v2
    ether: add IEEE 1722 ethertype - TSN
    netlink, mmap: fix edge-case leakages in nf queue zero-copy
    netlink, mmap: don't walk rx ring on poll if receive queue non-empty
    cxgb4: changes for new firmware 1.14.4.0
    net: fec: add netif status check before set mac address
    r8152: fix the runtime suspend issues
    r8152: split DRIVER_VERSION
    ipv6: fix ifnullfree.cocci warnings
    add microchip LAN88xx phy driver
    stmmac: fix check for phydev being open
    net: qlcnic: delete redundant memsets
    net: mv643xx_eth: use kzalloc
    net: jme: use kzalloc() instead of kmalloc+memset
    net: cavium: liquidio: use kzalloc in setup_glist()
    net: ipv6: use common fib_default_rule_pref
    ...

    Linus Torvalds
     

10 Sep, 2015

1 commit

  • There was no verification that an underlying transport exists when
    creating a connection; this would cause a NULL pointer dereference.

    It might happen on sockets that weren't properly bound before attempting to
    send a message, which will cause a NULL ptr deref:

    [135546.047719] kasan: GPF could be caused by NULL-ptr deref or user memory accessgeneral protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
    [135546.051270] Modules linked in:
    [135546.051781] CPU: 4 PID: 15650 Comm: trinity-c4 Not tainted 4.2.0-next-20150902-sasha-00041-gbaa1222-dirty #2527
    [135546.053217] task: ffff8800835bc000 ti: ffff8800bc708000 task.ti: ffff8800bc708000
    [135546.054291] RIP: __rds_conn_create (net/rds/connection.c:194)
    [135546.055666] RSP: 0018:ffff8800bc70fab0 EFLAGS: 00010202
    [135546.056457] RAX: dffffc0000000000 RBX: 0000000000000f2c RCX: ffff8800835bc000
    [135546.057494] RDX: 0000000000000007 RSI: ffff8800835bccd8 RDI: 0000000000000038
    [135546.058530] RBP: ffff8800bc70fb18 R08: 0000000000000001 R09: 0000000000000000
    [135546.059556] R10: ffffed014d7a3a23 R11: ffffed014d7a3a21 R12: 0000000000000000
    [135546.060614] R13: 0000000000000001 R14: ffff8801ec3d0000 R15: 0000000000000000
    [135546.061668] FS: 00007faad4ffb700(0000) GS:ffff880252000000(0000) knlGS:0000000000000000
    [135546.062836] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [135546.063682] CR2: 000000000000846a CR3: 000000009d137000 CR4: 00000000000006a0
    [135546.064723] Stack:
    [135546.065048] ffffffffafe2055c ffffffffafe23fc1 ffffed00493097bf ffff8801ec3d0008
    [135546.066247] 0000000000000000 00000000000000d0 0000000000000000 ac194a24c0586342
    [135546.067438] 1ffff100178e1f78 ffff880320581b00 ffff8800bc70fdd0 ffff880320581b00
    [135546.068629] Call Trace:
    [135546.069028] ? __rds_conn_create (include/linux/rcupdate.h:856 net/rds/connection.c:134)
    [135546.069989] ? rds_message_copy_from_user (net/rds/message.c:298)
    [135546.071021] rds_conn_create_outgoing (net/rds/connection.c:278)
    [135546.071981] rds_sendmsg (net/rds/send.c:1058)
    [135546.072858] ? perf_trace_lock (include/trace/events/lock.h:38)
    [135546.073744] ? lockdep_init (kernel/locking/lockdep.c:3298)
    [135546.074577] ? rds_send_drop_to (net/rds/send.c:976)
    [135546.075508] ? __might_fault (./arch/x86/include/asm/current.h:14 mm/memory.c:3795)
    [135546.076349] ? __might_fault (mm/memory.c:3795)
    [135546.077179] ? rds_send_drop_to (net/rds/send.c:976)
    [135546.078114] sock_sendmsg (net/socket.c:611 net/socket.c:620)
    [135546.078856] SYSC_sendto (net/socket.c:1657)
    [135546.079596] ? SYSC_connect (net/socket.c:1628)
    [135546.080510] ? trace_dump_stack (kernel/trace/trace.c:1926)
    [135546.081397] ? ring_buffer_unlock_commit (kernel/trace/ring_buffer.c:2479 kernel/trace/ring_buffer.c:2558 kernel/trace/ring_buffer.c:2674)
    [135546.082390] ? trace_buffer_unlock_commit (kernel/trace/trace.c:1749)
    [135546.083410] ? trace_event_raw_event_sys_enter (include/trace/events/syscalls.h:16)
    [135546.084481] ? do_audit_syscall_entry (include/trace/events/syscalls.h:16)
    [135546.085438] ? trace_buffer_unlock_commit (kernel/trace/trace.c:1749)
    [135546.085515] rds_ib_laddr_check(): addr 36.74.25.172 ret -99 node type -1

    Acked-by: Santosh Shilimkar
    Signed-off-by: Sasha Levin
    Signed-off-by: David S. Miller

    Sasha Levin
     

09 Sep, 2015

1 commit

  • Pull infiniband/rdma updates from Doug Ledford:
    "This is a fairly sizeable set of changes. I've put them through a
    decent amount of testing prior to sending the pull request due to
    that.

    There are still a few fixups that I know are coming, but I wanted to
    go ahead and get the big, sizable chunk into your hands sooner rather
    than waiting for those last few fixups.

    Of note is the fact that this creates what is intended to be a
    temporary area in the drivers/staging tree specifically for some
    cleanups and additions that are coming for the RDMA stack. We
    deprecated two drivers (ipath and amso1100) and are waiting to hear
    back if we can deprecate another one (ehca). We also put Intel's new
    hfi1 driver into this area because it needs to be refactored and a
    transfer library created out of the factored out code, and then it and
    the qib driver and the soft-roce driver should all be modified to use
    that library.

    I expect drivers/staging/rdma to be around for three or four kernel
    releases and then to go away as all of the work is completed and final
    deletions of deprecated drivers are done.

    Summary of changes for 4.3:

    - Create drivers/staging/rdma
    - Move amso1100 driver to staging/rdma and schedule for deletion
    - Move ipath driver to staging/rdma and schedule for deletion
    - Add hfi1 driver to staging/rdma and set TODO for move to regular
    tree
    - Initial support for namespaces to be used on RDMA devices
    - Add RoCE GID table handling to the RDMA core caching code
    - Infrastructure to support handling of devices with differing read
    and write scatter gather capabilities
    - Various iSER updates
    - Kill off unsafe usage of global mr registrations
    - Update SRP driver
    - Misc mlx4 driver updates
    - Support for the mr_alloc verb
    - Support for a netlink interface between kernel and user space cache
    daemon to speed path record queries and route resolution
    - Initial support for safe hot removal of verbs devices"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (136 commits)
    IB/ipoib: Suppress warning for send only join failures
    IB/ipoib: Clean up send-only multicast joins
    IB/srp: Fix possible protection fault
    IB/core: Move SM class defines from ib_mad.h to ib_smi.h
    IB/core: Remove unnecessary defines from ib_mad.h
    IB/hfi1: Add PSM2 user space header to header_install
    IB/hfi1: Add CSRs for CONFIG_SDMA_VERBOSITY
    mlx5: Fix incorrect wc pkey_index assignment for GSI messages
    IB/mlx5: avoid destroying a NULL mr in reg_user_mr error flow
    IB/uverbs: reject invalid or unknown opcodes
    IB/cxgb4: Fix if statement in pick_local_ip6adddrs
    IB/sa: Fix rdma netlink message flags
    IB/ucma: HW Device hot-removal support
    IB/mlx4_ib: Disassociate support
    IB/uverbs: Enable device removal when there are active user space applications
    IB/uverbs: Explicitly pass ib_dev to uverbs commands
    IB/uverbs: Fix race between ib_uverbs_open and remove_one
    IB/uverbs: Fix reference counting usage of event files
    IB/core: Make ib_dealloc_pd return void
    IB/srp: Create an insecure all physical rkey only if needed
    ...

    Linus Torvalds
     

31 Aug, 2015

3 commits

  • The majority of callers never check the return value, and even if they
    did, they can't do anything about a failure.

    All possible failure cases represent a bug in the caller, so just
    WARN_ON inside the function instead.

    This fixes a few random errors:
    net/rds/iw.c infinite loops while it fails (racing with EBUSY?).

    This also lays the groundwork to get rid of error returns from the
    drivers. Most drivers do not error; the few that do are broken, since
    the failure cannot be handled.

    Since uverbs can legitimately make use of EBUSY, open code the
    check.

    Signed-off-by: Jason Gunthorpe
    Reviewed-by: Chuck Lever
    Signed-off-by: Doug Ledford

    Jason Gunthorpe
     
  • The pd now has a local_dma_lkey member which completely replaces
    ib_get_dma_mr, use it instead.
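
    A sketch of the substitution this enables (assuming a caller that
    previously kept a DMA MR around just for its lkey):

      /* Before: allocate an all-memory DMA MR only to get an lkey. */
      mr = ib_get_dma_mr(pd, IB_ACCESS_LOCAL_WRITE);
      sge.lkey = mr->lkey;

      /* After: the PD already carries a device-wide local DMA lkey. */
      sge.lkey = pd->local_dma_lkey;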

    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Doug Ledford

    Jason Gunthorpe
     
  • Signed-off-by: Sagi Grimberg
    Signed-off-by: Doug Ledford

    Sagi Grimberg