Eric Lee / smarc-fsl-linux-kernel

04 Nov, 2010

2 commits

8200a59f2 rds: Remove kfreed tcp conn from list

All the rds_tcp_connection objects are stored in a list, but when
one is freed it should be removed from there.

Signed-off-by: Pavel Emelyanov
Signed-off-by: David S. Miller

Pavel Emelyanov
2010-11-04 09:50:07 +0800
58c490bab rds: Lost locking in loop connection freeing

The conn is removed from the list there, and this requires
proper lock protection.

Signed-off-by: Pavel Emelyanov
Signed-off-by: David S. Miller

Pavel Emelyanov
2010-11-04 09:50:06 +0800

31 Oct, 2010

5 commits

d139ff090 RDS: Let rds_message_alloc_sgs() return NULL

Even with the previous fix, we still are reading the iovecs once
to determine SGs needed, and then again later on. Preallocating
space for sg lists as part of rds_message seemed like a good idea
but it might be better to not do this. While working to redo that
code, this patch attempts to protect against userspace rewriting
the rds_iovec array between the first and second accesses.

The consequences of this would be either a too-small or too-large
sg list array. Too large is not an issue. This patch changes all
callers of message_alloc_sgs to handle running out of preallocated
sgs, and fail gracefully.

Signed-off-by: Andy Grover
Signed-off-by: David S. Miller

Andy Grover
2010-10-31 07:34:18 +0800
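
The "fail gracefully" behaviour described in d139ff090 can be sketched in userspace C as follows. This is a minimal illustration, not the kernel code: sketch_msg, sketch_alloc_sgs, sketch_attach and the fixed pool size are hypothetical names and values chosen for the example. The point is only that the allocator returns NULL when the preallocated entries run out, and the caller turns that into an error instead of writing past the array.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SKETCH_MAX_SGS 16   /* assumption: size of the preallocated pool */

/* Hypothetical stand-in for one scatter/gather entry. */
struct sketch_sg {
    void *page;
    unsigned int len;
};

/* Hypothetical message with a preallocated sg pool. */
struct sketch_msg {
    unsigned int total_sgs;   /* entries reserved when the message was sized */
    unsigned int used_sgs;    /* entries already handed out */
    struct sketch_sg sgs[SKETCH_MAX_SGS];
};

/* Hand out nents entries, or NULL if the preallocated space is exhausted. */
static struct sketch_sg *sketch_alloc_sgs(struct sketch_msg *m, unsigned int nents)
{
    assert(m->total_sgs <= SKETCH_MAX_SGS);
    assert(m->used_sgs <= m->total_sgs);

    if (nents == 0 || nents > m->total_sgs - m->used_sgs)
        return NULL;

    struct sketch_sg *first = &m->sgs[m->used_sgs];
    m->used_sgs += nents;
    return first;
}

/* A caller that handles the NULL return instead of assuming success. */
static int sketch_attach(struct sketch_msg *m, unsigned int nents)
{
    struct sketch_sg *sg = sketch_alloc_sgs(m, nents);

    if (!sg)
        return -ENOMEM;
    for (unsigned int i = 0; i < nents; i++) {
        sg[i].page = NULL;
        sg[i].len = 0;
    }
    return 0;
}
```

A too-small pool now produces an error return at the call site; a too-large pool simply leaves entries unused.
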
fc8162e3c RDS: Copy rds_iovecs into kernel memory instead of rereading from userspace

Change rds_rdma_pages to take a passed-in rds_iovec array instead
of doing copy_from_user itself.

Change rds_cmsg_rdma_args to copy rds_iovec array once only. This
eliminates the possibility of userspace changing it after our
sanity checks.

Implement stack-based storage for small numbers of iovecs, based
on net/socket.c, to save an alloc in the extremely common case.

Although this patch reduces iovec copies in cmsg_rdma_args to 1,
we still do another one in rds_rdma_extra_size. Getting rid of
that one will be trickier, so it'll be a separate patch.

Signed-off-by: Andy Grover
Signed-off-by: David S. Miller

Andy Grover
2010-10-31 07:34:17 +0800
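
The copy-once idea in fc8162e3c can be illustrated with a small userspace C sketch. Everything here is an assumption for illustration: sketch_copy_in stands in for copy_from_user(), sketch_iovec for struct rds_iovec, and the stack threshold of 8 is arbitrary. The sketch takes one private snapshot of the caller's array, validates it, and then uses the same snapshot, so a concurrent writer cannot change the data between the check and the use.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_STACK_VECS 8   /* assumption: size of the on-stack fast path */

/* Hypothetical stand-in for struct rds_iovec. */
struct sketch_iovec {
    uint64_t addr;
    uint64_t bytes;
};

/* Stand-in for copy_from_user(); returns 0 on success. */
static int sketch_copy_in(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
    return 0;
}

/* Validate and sum a shared array using a single private copy. */
static int sketch_sum_bytes(const struct sketch_iovec *shared, size_t n,
                            uint64_t *out)
{
    struct sketch_iovec stack_vecs[SKETCH_STACK_VECS];
    struct sketch_iovec *vecs = stack_vecs;
    uint64_t total = 0;
    int ret = 0;

    assert(out != NULL);
    if (shared == NULL || n == 0)
        return -1;

    /* Small arrays use the stack; larger ones fall back to the heap. */
    if (n > SKETCH_STACK_VECS) {
        vecs = calloc(n, sizeof(*vecs));
        if (!vecs)
            return -1;
    }

    /* The only read of the shared array. */
    if (sketch_copy_in(vecs, shared, n * sizeof(*vecs)) != 0) {
        ret = -1;
        goto out;
    }

    /* Pass 1: sanity checks on the snapshot. */
    for (size_t i = 0; i < n; i++) {
        if (vecs[i].bytes == 0 || vecs[i].bytes > UINT32_MAX) {
            ret = -1;
            goto out;
        }
    }

    /* Pass 2: use the same snapshot that was checked. */
    for (size_t i = 0; i < n; i++)
        total += vecs[i].bytes;
    *out = total;

out:
    if (vecs != stack_vecs)
        free(vecs);
    return ret;
}
```
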
f4a3fc03c RDS: Clean up error handling in rds_cmsg_rdma_args

We don't need to set ret = 0 at the end -- it's initialized to 0.

Also, don't increment s_send_rdma stat if we're exiting with an
error.

Signed-off-by: Andy Grover
Signed-off-by: David S. Miller

Andy Grover
2010-10-31 07:34:17 +0800
a09f69c49 RDS: Return -EINVAL if rds_rdma_pages returns an error

rds_cmsg_rdma_args would still return success even if rds_rdma_pages
returned an error (or overflowed).

Signed-off-by: Andy Grover
Signed-off-by: David S. Miller

Andy Grover
2010-10-31 07:34:16 +0800
1b1f693d7 net: fix rds_iovec page count overflow

As reported by Thomas Pollet, the rdma page counting can overflow. We
get the rdma sizes in 64-bit unsigned entities, but then limit it to
UINT_MAX bytes and shift them down to pages (so with a possible "+1" for
an unaligned address).

So each individual page count fits comfortably in an 'unsigned int' (not
even close to overflowing into signed), but as they are added up, they
might end up resulting in a signed return value. Which would be wrong.

Catch the case of tot_pages turning negative, and return the appropriate
error code.

Reported-by: Thomas Pollet
Signed-off-by: Linus Torvalds
Signed-off-by: Andy Grover
Signed-off-by: David S. Miller

Linus Torvalds
2010-10-31 07:34:16 +0800
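
As a rough illustration of the overflow fixed in 1b1f693d7, the following userspace C sketch counts pages per vector and refuses a sum that would no longer fit in a non-negative int. The names (sketch_iovec, sketch_pages_in_vec, sketch_total_pages) and the 4 KiB page size are assumptions for the example, not the kernel's code. The patch catches tot_pages turning negative; the sketch instead checks before adding, because signed overflow is undefined in standard C.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12   /* assumption: 4 KiB pages */

/* Hypothetical stand-in for struct rds_iovec: 64-bit address and length. */
struct sketch_iovec {
    uint64_t addr;
    uint64_t bytes;
};

/* Pages touched by one vector, or 0 if the vector is empty or too large. */
static unsigned int sketch_pages_in_vec(const struct sketch_iovec *vec)
{
    if (vec->bytes == 0 || vec->bytes > UINT_MAX)
        return 0;
    if (vec->addr > UINT64_MAX - (vec->bytes - 1))
        return 0;   /* address range would wrap around */

    uint64_t first = vec->addr >> SKETCH_PAGE_SHIFT;
    uint64_t last = (vec->addr + vec->bytes - 1) >> SKETCH_PAGE_SHIFT;

    /* An unaligned start can add one page beyond bytes / page size. */
    return (unsigned int)(last - first + 1);
}

/* Total page count, or -1 if a vector is invalid or the sum exceeds INT_MAX. */
static int sketch_total_pages(const struct sketch_iovec *vecs, size_t n)
{
    unsigned int total = 0;

    assert(vecs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        unsigned int nr = sketch_pages_in_vec(&vecs[i]);

        if (nr == 0)
            return -1;
        /* Each count is small, but the running sum may not be. */
        if (nr > (unsigned int)INT_MAX - total)
            return -1;
        total += nr;
    }
    return (int)total;
}
```
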

21 Oct, 2010

2 commits

2198a10b5 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6

Conflicts:
net/core/dev.c

David S. Miller
2010-10-21 23:43:05 +0800
ff51bf841 rds: make local functions/variables static

The RDS protocol has lots of functions that should be
declared static. rds_message_get/add_version_extension is
removed since it is defined but never used.

Signed-off-by: Stephen Hemminger
Signed-off-by: David S. Miller

stephen hemminger
2010-10-21 19:26:39 +0800

16 Oct, 2010

1 commit

799c10559 De-pessimize rds_page_copy_user

Don't try to "optimize" rds_page_copy_user() by using kmap_atomic() and
the unsafe atomic user mode accessor functions. It's actually slower
than the straightforward code on any reasonable modern CPU.

Back when the code was written (although probably not by the time it was
actually merged), 32-bit x86 may have been the dominant
architecture. And there kmap_atomic() can be a lot faster than kmap()
(unless you have very good locality, in which case the virtual address
caching by kmap() can overcome all the downsides).

But these days, x86-64 may not be more populous, but it's getting there
(and if you care about performance, it's definitely already there -
you'd have upgraded your CPUs already in the last few years). And on
x86-64, the non-kmap_atomic() version is faster, simply because the code
is simpler and doesn't have the "re-try page fault" case.

People with old hardware are not likely to care about RDS anyway, and
the optimization for the 32-bit case is simply buggy, since it doesn't
verify the user addresses properly.

Reported-by: Dan Rosenberg
Acked-by: Andrew Morton
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds

Linus Torvalds
2010-10-16 02:09:28 +0800

27 Sep, 2010

1 commit

e40051d13 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6

Conflicts:
drivers/net/qlcnic/qlcnic_init.c
net/ipv4/ip_output.c

David S. Miller
2010-09-27 16:03:03 +0800

25 Sep, 2010

1 commit

f064af1e5 net: fix a lockdep splat

We have for each socket :

One spinlock (sk_slock.slock)
One rwlock (sk_callback_lock)

Possible scenarios are :

(A) (this is used in net/sunrpc/xprtsock.c)
read_lock(&sk->sk_callback_lock) (without blocking BH)
(BH runs)
spin_lock(&sk->sk_slock.slock);
...
read_lock(&sk->sk_callback_lock);
...

(B)
write_lock_bh(&sk->sk_callback_lock)
stuff
write_unlock_bh(&sk->sk_callback_lock)

(C)
spin_lock_bh(&sk->sk_slock)
...
write_lock_bh(&sk->sk_callback_lock)
stuff
write_unlock_bh(&sk->sk_callback_lock)
spin_unlock_bh(&sk->sk_slock)

This (C) case conflicts with (A) :

CPU1 [A]                            CPU2 [C]
read_lock(callback_lock)
(BH runs)                           spin_lock_bh(slock)
(waits for spin_lock(slock))
                                    (waits for write_lock_bh(callback_lock))

We have one problematic (C) use case in inet_csk_listen_stop() :

local_bh_disable();
bh_lock_sock(child); // spin_lock_bh(&sk->sk_slock)
WARN_ON(sock_owned_by_user(child));
...
sock_orphan(child); // write_lock_bh(&sk->sk_callback_lock)

lockdep is not happy with this, as reported by Tetsuo Handa.

It seems the only way to deal with this is to use read_lock_bh(callbacklock)
everywhere.

Thanks to Jarek for pointing out a bug in my first attempt and suggesting
this solution.

Reported-by: Tetsuo Handa
Tested-by: Tetsuo Handa
Signed-off-by: Eric Dumazet
CC: Jarek Poplawski
Tested-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2010-09-25 13:26:10 +0800

20 Sep, 2010

3 commits

aef3ea33e rds: spin_lock_irq() is not nestable

This is basically just a cleanup. IRQs were disabled on the previous
line so we don't need to do it again here. In the current code IRQs
would get turned on one line earlier than intended.

Signed-off-by: Dan Carpenter
Signed-off-by: David S. Miller

Dan Carpenter
2010-09-20 02:59:44 +0800
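
Why nesting the _irq variants re-enables interrupts early, as described in aef3ea33e, can be modelled with a toy userspace C sketch. This is not kernel code: a single boolean stands in for the CPU's interrupt-enable state, no real locking is done, and all sketch_* names are hypothetical. The model only shows that an unconditional "unlock and enable" inside an outer IRQ-disabled region turns interrupts on before the outer region ends.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the CPU interrupt-enable flag. */
static bool sketch_irqs_enabled = true;

/* Models spin_lock_irq(): disables interrupts unconditionally. */
static void sketch_lock_irq(void)
{
    sketch_irqs_enabled = false;
}

/* Models spin_unlock_irq(): enables interrupts unconditionally. */
static void sketch_unlock_irq(void)
{
    assert(!sketch_irqs_enabled);
    sketch_irqs_enabled = true;
}

/* Models spin_lock_irqsave(): remembers the previous state. */
static bool sketch_lock_irqsave(void)
{
    bool flags = sketch_irqs_enabled;

    sketch_irqs_enabled = false;
    return flags;
}

/* Models spin_unlock_irqrestore(): puts the remembered state back. */
static void sketch_unlock_irqrestore(bool flags)
{
    sketch_irqs_enabled = flags;
}

/* Inner section uses the _irq variants: reports the state seen afterwards. */
static bool sketch_nested_irq_variant(void)
{
    bool flags = sketch_lock_irqsave();   /* outer lock, IRQs now off */

    sketch_lock_irq();                    /* inner lock, redundant disable */
    sketch_unlock_irq();                  /* inner unlock turns IRQs back on */

    bool enabled_too_early = sketch_irqs_enabled;

    sketch_unlock_irqrestore(flags);      /* outer unlock */
    return enabled_too_early;
}

/* Inner section leaves the IRQ state alone, as the cleanup does. */
static bool sketch_nested_plain_variant(void)
{
    bool flags = sketch_lock_irqsave();   /* outer lock, IRQs now off */

    /* inner plain lock/unlock would go here; IRQ state is untouched */

    bool enabled_too_early = sketch_irqs_enabled;

    sketch_unlock_irqrestore(flags);
    return enabled_too_early;
}
```
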
f4fa7f380 rds: double unlock in rds_ib_cm_handle_connect()

We unlock after we goto out.

Signed-off-by: Dan Carpenter
Signed-off-by: David S. Miller

Dan Carpenter
2010-09-20 02:59:44 +0800
9b9d2e00b rds: signedness bug

In the original code if the copy_from_user() fails in rds_rdma_pages()
then the error handling fails and we get a stack trace from kmalloc().

Signed-off-by: Dan Carpenter
Signed-off-by: David S. Miller

Dan Carpenter
2010-09-20 02:59:43 +0800
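
One plausible shape of the signedness bug in 9b9d2e00b is sketched below in userspace C. The commit message does not spell out the variable involved, so this is an assumption: a negative error code is stored in an unsigned variable, the error check can never fire, and the value is later used as an allocation size. All sketch_* names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical page counter: a positive count, or a negative errno. */
static int sketch_count_pages(int fail)
{
    return fail ? -EFAULT : 3;
}

/*
 * Buggy shape: the result is stored unsigned, so an error such as -EFAULT
 * becomes a very large count. A check like "nr_pages < 0" could never be
 * true here, and the huge value would be passed on to the allocator.
 */
static size_t sketch_request_unsigned(int fail)
{
    unsigned int nr_pages = (unsigned int)sketch_count_pages(fail);

    return (size_t)nr_pages * sizeof(void *);
}

/* Fixed shape: keep the result signed so the error can be detected. */
static int sketch_request_signed(int fail, size_t *request)
{
    int nr_pages = sketch_count_pages(fail);

    assert(request != NULL);
    if (nr_pages < 0)
        return nr_pages;
    *request = (size_t)nr_pages * sizeof(void *);
    return 0;
}
```
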

09 Sep, 2010

25 commits

20c72bd5f RDS: Implement masked atomic operations

Add two CMSGs for masked versions of cswp and fadd. args
struct modified to use a union for different atomic op type's
arguments. Change IB to do masked atomic ops. Atomic op type
in rds_message similarly unionized.

Signed-off-by: Andy Grover

Andy Grover
2010-09-09 09:16:51 +0800
59f740a6a RDS/IB: print string constants in more places

This prints the constant identifier for work completion status and rdma
cm event types, like we already do for IB event types.

A core string array helper is added that each string type uses.

Signed-off-by: Zach Brown

Zach Brown
2010-09-09 09:16:50 +0800
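
The "core string array helper" mentioned in 59f740a6a might look roughly like the following userspace C sketch. The enum, the strings, and the sketch_* names are invented for illustration and do not correspond to real IB or RDMA CM constants. The helper bounds-checks the index and falls back to a fixed string for holes or out-of-range values.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event type; not the real IB event enum. */
enum sketch_event {
    SKETCH_EV_CQ_ERR,
    SKETCH_EV_QP_FATAL,
    SKETCH_EV_PORT_ACTIVE,
};

/* Shared helper: look up a name, tolerating holes and bad indexes. */
static const char *sketch_str_array(const char *const *array, size_t elements,
                                    size_t index)
{
    assert(array != NULL);
    if (index < elements && array[index])
        return array[index];
    return "unknown";
}

/* Designated initializers keep each string next to its identifier. */
static const char *const sketch_event_strings[] = {
    [SKETCH_EV_CQ_ERR]      = "SKETCH_EV_CQ_ERR",
    [SKETCH_EV_QP_FATAL]    = "SKETCH_EV_QP_FATAL",
    [SKETCH_EV_PORT_ACTIVE] = "SKETCH_EV_PORT_ACTIVE",
};

/* Per-type wrapper; other string types would add their own table. */
static const char *sketch_event_str(enum sketch_event ev)
{
    size_t elements = sizeof(sketch_event_strings) /
                      sizeof(sketch_event_strings[0]);

    return sketch_str_array(sketch_event_strings, elements, (size_t)ev);
}
```
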
4518071ac RDS: cancel connection work structs as we shut down

Nothing was canceling the send and receive work that might have been
queued as a conn was being destroyed.

Signed-off-by: Zach Brown

Zach Brown
2010-09-09 09:16:49 +0800
ffcec0e11 RDS: don't call rds_conn_shutdown() from rds_conn_destroy()

rds_conn_shutdown() can return before the connection is shut down when
it encounters an existing state that it doesn't understand. This lets
rds_conn_destroy() then start tearing down the conn from under paths
that are still using it.

It's more reliable to queue the shutdown work and wait for krdsd to complete the
shutdown callback. This stopped some hangs I was seeing where krdsd was
trying to shut down a freed conn.

Signed-off-by: Zach Brown

Zach Brown
2010-09-09 09:16:48 +0800
5adb5bc65 RDS: have sockets get transport module references

Right now there's nothing to stop the various paths that use
rs->rs_transport from racing with rmmod and executing freed transport
code. The simple fix is to have binding to a transport also hold a
reference to the transport's module, removing this class of races.

We already had an unused t_owner field which was set for the modular
transports and which wasn't set for the built-in loop transport.

Signed-off-by: Zach Brown

Zach Brown
2010-09-09 09:16:47 +0800
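
The reference-on-bind idea in 5adb5bc65 can be sketched in single-threaded userspace C. The sketch_* types are hypothetical, and plain integers replace the kernel's module reference counting; the owner field plays the role of t_owner and is NULL for a built-in transport. Binding takes a reference and fails if the module is already unloading, and unloading is only allowed once every bound socket has released its reference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a loadable module's lifetime state. */
struct sketch_module {
    unsigned int refs;
    bool unloading;
};

/* Hypothetical transport; owner is NULL when the transport is built in. */
struct sketch_transport {
    const char *name;
    struct sketch_module *owner;
};

struct sketch_sock {
    struct sketch_transport *transport;
};

/* Take a reference unless the module is on its way out. */
static bool sketch_module_try_get(struct sketch_module *m)
{
    if (!m)
        return true;    /* sketch's choice: built-in code needs no reference */
    if (m->unloading)
        return false;
    m->refs++;
    return true;
}

static void sketch_module_put(struct sketch_module *m)
{
    if (!m)
        return;
    assert(m->refs > 0);
    m->refs--;
}

/* Binding pins the transport's module for the lifetime of the binding. */
static int sketch_bind(struct sketch_sock *s, struct sketch_transport *t)
{
    if (!sketch_module_try_get(t->owner))
        return -1;
    s->transport = t;
    return 0;
}

static void sketch_release(struct sketch_sock *s)
{
    if (s->transport) {
        sketch_module_put(s->transport->owner);
        s->transport = NULL;
    }
}

/* Unloading is only safe when no socket still holds a reference. */
static bool sketch_module_can_unload(const struct sketch_module *m)
{
    return m->refs == 0;
}
```
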
77510481c RDS: remove old rs_transport comment

rs_transport is now also used by the rdma paths once the socket is
bound. We don't need this stale comment to tell us what cscope can.

Signed-off-by: Zach Brown

Zach Brown
2010-09-09 09:16:46 +0800
fe8ff6b58 RDS: lock rds_conn_count decrement in rds_conn_destroy()

rds_conn_destroy() can race with all other modifications of the
rds_conn_count but it was modifying the count without locking.

Signed-off-by: Zach Brown

Zach Brown
2010-09-09 09:16:45 +0800
ea819867b RDS/IB: protect the list of IB devices

The RDS IB device list wasn't protected by any locking. Traversal in
both the get_mr and FMR flushing paths could race with addition and
removal.

List manipulation is done with RCU primitives and is protected by the
write side of a rwsem. The list traversal in the get_mr fast path is
protected by an RCU read critical section. The FMR list traversal is
more problematic because it can block while traversing the list. We
protect this with the read side of the rwsem.

Signed-off-by: Zach Brown

Zach Brown
2010-09-09 09:16:44 +0800
1bde04a63 RDS/IB: print IB event strings as well as their number

It's nice to not have to go digging in the code to see which event
occurred. It's easy to throw together a quick array that maps the ib
event enums to their strings. I didn't see anything in the stack that
does this translation for us, but I also didn't look very hard.

Signed-off-by: Zach Brown

Zach Brown
2010-09-09 09:16:43 +0800
8576f374a RDS: flush fmrs before allocating new ones

Flushing FMRs is somewhat expensive, and is currently kicked off when
the interrupt handler notices that we are getting low. The result of
this is that FMR flushing only happens from the interrupt cpus.

This spreads the load more effectively by triggering flushes just before
we allocate a new FMR.

Signed-off-by: Chris Mason

Chris Mason
2010-09-09 09:16:42 +0800
b4e1da3c9 RDS: properly use sg_init_table

This is only needed to keep debugging code from bugging.

Signed-off-by: Chris Mason

Chris Mason
2010-09-09 09:16:41 +0800
f046011cd RDS/IB: track signaled sends

We're seeing bugs today where IB connection shutdown clears the send
ring while the tasklet is processing completed sends. Implementation
details cause this to dereference a null pointer. Shutdown needs to
wait for send completion to stop before tearing down the connection. We
can't simply wait for the ring to empty because it may contain
unsignaled sends that will never be processed.

This patch tracks the number of signaled sends that we've posted and
waits for them to complete. It also makes sure that the tasklet has
finished executing.

Signed-off-by: Zach Brown

Zach Brown
2010-09-09 09:16:40 +0800
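
The counting scheme in f046011cd can be sketched with C11 atomics in userspace. This is an illustration under assumptions: sketch_conn and the sketch_* functions are hypothetical, and where the kernel would sleep until the count drops to zero, the sketch only exposes a predicate that shutdown could wait on. Unsignaled sends are deliberately not counted, since no completion will ever arrive for them.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical connection state: only the counter matters here. */
struct sketch_conn {
    atomic_int signaled_sends;   /* posted signaled sends not yet completed */
};

static void sketch_conn_init(struct sketch_conn *c)
{
    atomic_init(&c->signaled_sends, 0);
}

/* Posting path: count only the sends that will generate a completion. */
static void sketch_post_send(struct sketch_conn *c, bool signaled)
{
    if (signaled)
        atomic_fetch_add(&c->signaled_sends, 1);
}

/* Completion path: drop the count for each signaled completion processed. */
static void sketch_complete_sends(struct sketch_conn *c, int nr_signaled)
{
    int before = atomic_fetch_sub(&c->signaled_sends, nr_signaled);

    assert(before >= nr_signaled);
    (void)before;
}

/* Shutdown would wait until this holds before tearing down the ring. */
static bool sketch_sends_drained(struct sketch_conn *c)
{
    return atomic_load(&c->signaled_sends) == 0;
}
```
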
ef87b7ea3 RDS: remove __init and __exit annotation

The trivial amount of memory saved isn't worth the cost of dealing with section
mismatches.

Signed-off-by: Zach Brown

Zach Brown
2010-09-09 09:16:39 +0800
c20f5b963 RDS/IB: Use SLAB_HWCACHE_ALIGN flag for kmem_cache_create()

We are *definitely* counting cycles as closely as DaveM, so
ensure hwcache alignment for our recv ring control structs.

Signed-off-by: Andy Grover

Andy Grover
2010-09-09 09:16:38 +0800
d455ab640 RDS/IB: always process recv completions

The recv refill path was leaking fragments because the recv event handler had
marked a ring element as free without freeing its frag. This was happening
because it wasn't processing receives when the conn wasn't marked up or
connecting, as can be the case if it races with rmmod.

Two observations support always processing receives in the callback.

First, buildup should only post receives, thus triggering recv event handler
calls, once it has built up all the state to handle them. Teardown should
destroy the CQ and drain the ring before tearing down the state needed to
process recvs. Both appear to be true today.

Second, this test was fundamentally racy. There is nothing to stop rmmod and
connection destruction from swooping in the moment after the conn state was
sampled but before real receive processing starts.

Signed-off-by: Zach Brown

Zach Brown
2010-09-09 09:16:36 +0800
80c51be56 RDS: return to a single-threaded krdsd

We were seeing very nasty bugs due to fundamental assumptions the current code
makes about concurrent work struct processing. The code simply isn't able to
handle concurrent connection shutdown work function execution today, for
example, which became very much possible once a multi-threaded krdsd was
introduced. The problem compounds as additional work structs are added to the
mix.

krdsd is no longer performance critical now that send and receive posting and
FMR flushing are done elsewhere, so the safest fix is to move back to the
single threaded krdsd that the current code was built around.

Signed-off-by: Zach Brown

Zach Brown
2010-09-09 09:16:35 +0800
515e079da RDS/IB: create a work queue for FMR flushing

This patch moves the FMR flushing work into its own multi-threaded work queue.
This is to maintain performance in preparation for returning the main krdsd
work queue back to a single threaded work queue to avoid deep-rooted
concurrency bugs.

This is also good because it further separates FMRs, which might be removed
some day, from the rest of the code base.

Signed-off-by: Zach Brown

Zach Brown
2010-09-09 09:16:34 +0800
8aeb1ba66 RDS/IB: destroy connections on rmmod

IB connections were not being destroyed during rmmod.

First, the IB device removal callback was recently changed to disconnect
connections that used the device being removed rather than destroying them. So
connections that still had devices during rmmod were not being destroyed.

Second, rds_ib_destroy_nodev_conns() was being called before connections were
disassociated from devices. It would almost never find connections in the
nodev list.

We first get rid of rds_ib_destroy_conns(), which is no longer called, and
refactor the existing caller into the main body of the function and get rid of
the list and lock wrappers.

Then we call rds_ib_destroy_nodev_conns() *after* ib_unregister_client() has
removed the IB device from all the conns and put the conns on the nodev list.

The result is that IB connections are destroyed by rmmod.

Signed-off-by: Zach Brown

Zach Brown
2010-09-09 09:16:33 +0800
24fa163a4 RDS/IB: wait for IB dev freeing work to finish during rmmod

The RDS IB client removal callback can queue work to drop the final reference
to an IB device. We have to make sure that this function has returned before
we complete rmmod or the work threads can try to execute freed code.

Signed-off-by: Zach Brown

Zach Brown
2010-09-09 09:16:32 +0800
b6fb0df12 RDS/IB: Make ib_recv_refill return void

Signed-off-by: Andy Grover

Andy Grover
2010-09-09 09:16:31 +0800
fbf4d7e3d RDS: Remove unused XLIST_PTR_TAIL and xlist_protect()

Not used.

Signed-off-by: Andy Grover

Andy Grover
2010-09-09 09:16:06 +0800
c9455d999 RDS: whitespace

Andy Grover
2010-09-09 09:15:32 +0800
7a0ff5dbd RDS: use delayed work for the FMR flushes

Using a delayed work queue helps us make sure a healthy number of FMRs
have queued up over the limit. It makes for a large improvement in RDMA
iops.

Signed-off-by: Chris Mason

Chris Mason
2010-09-09 09:15:30 +0800
eabb73227 rds: more FMRs are faster

When we add more FMRs, we flush them less often and so we go faster.

Signed-off-by: Chris Mason

Chris Mason
2010-09-09 09:15:29 +0800
6fa70da60 rds: recycle FMRs through lockless lists

FMR allocation and recycling is performance critical and fairly lock
intensive. The current code has a per connection lock that all
processes bang on and it becomes a major bottleneck on large systems.

This changes things to use a number of cmpxchg based lists instead,
allowing us to go through the whole FMR lifecycle without locking inside
RDS.

Zach Brown pointed out that our usage of cmpxchg for xlist removal is
racy if someone manages to remove and add back an FMR struct into the list
while another CPU can see the FMR's address at the head of the list.

The second CPU might assume the list hasn't changed when in fact any
number of operations might have happened in between the deletion and
reinsertion.

This commit maintains a per cpu count of CPUs that are currently
in xlist removal, and establishes a grace period to make sure that
nobody can see an entry we have just removed from the list.

Signed-off-by: Chris Mason

Chris Mason
2010-09-09 09:15:28 +0800
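
A cmpxchg-based list of the kind described in 6fa70da60 can be sketched with C11 atomics in userspace. This is not the RDS xlist code: the sketch_* names are hypothetical, and the per-CPU grace period from the commit is not implemented. To stay clear of the remove-and-reinsert race the commit describes, the sketch only pushes with compare-and-swap and detaches the whole list with a single atomic exchange; it never pops a single head element, which is where that race arises.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical list node; a real FMR struct would embed something similar. */
struct sketch_node {
    struct sketch_node *next;
    int id;
};

struct sketch_list {
    _Atomic(struct sketch_node *) head;
};

static void sketch_list_init(struct sketch_list *l)
{
    atomic_init(&l->head, NULL);
}

/* Lock-free push: retry until the head has not changed under us. */
static void sketch_push(struct sketch_list *l, struct sketch_node *n)
{
    struct sketch_node *old = atomic_load(&l->head);

    assert(n != NULL);
    do {
        /* On failure, old is refreshed with the current head. */
        n->next = old;
    } while (!atomic_compare_exchange_weak(&l->head, &old, n));
}

/*
 * Detach every node at once. Because the head is replaced by NULL in one
 * step and never compared against a previously read pointer, a node that
 * was removed and reinserted in between cannot be mistaken for an
 * unchanged list.
 */
static struct sketch_node *sketch_take_all(struct sketch_list *l)
{
    return atomic_exchange(&l->head, NULL);
}
```

The detached chain is private to the caller and can be walked without further synchronization.
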