04 Nov, 2010

2 commits


31 Oct, 2010

5 commits

  • Even with the previous fix, we still are reading the iovecs once
    to determine SGs needed, and then again later on. Preallocating
    space for sg lists as part of rds_message seemed like a good idea
    but it might be better to not do this. While working to redo that
    code, this patch attempts to protect against userspace rewriting
    the rds_iovec array between the first and second accesses.

    The consequences of this would be either a too-small or too-large
    sg list array. Too large is not an issue. This patch changes all
    callers of message_alloc_sgs to handle running out of preallocated
    sgs, and fail gracefully.

    Signed-off-by: Andy Grover
    Signed-off-by: David S. Miller

    Andy Grover
     
  • Change rds_rdma_pages to take a passed-in rds_iovec array instead
    of doing copy_from_user itself.

    Change rds_cmsg_rdma_args to copy rds_iovec array once only. This
    eliminates the possibility of userspace changing it after our
    sanity checks.

    Implement stack-based storage for small numbers of iovecs, based
    on net/socket.c, to save an alloc in the extremely common case.

    Although this patch reduces iovec copies in cmsg_rdma_args to 1,
    we still do another one in rds_rdma_extra_size. Getting rid of
    that one will be trickier, so it'll be a separate patch.

    Signed-off-by: Andy Grover
    Signed-off-by: David S. Miller

    Andy Grover
     
  • We don't need to set ret = 0 at the end -- it's initialized to 0.

    Also, don't increment s_send_rdma stat if we're exiting with an
    error.

    Signed-off-by: Andy Grover
    Signed-off-by: David S. Miller

    Andy Grover
     
  • rds_cmsg_rdma_args would still return success even if rds_rdma_pages
    returned an error (or overflowed).

    Signed-off-by: Andy Grover
    Signed-off-by: David S. Miller

    Andy Grover
     
  • As reported by Thomas Pollet, the rdma page counting can overflow. We
    get the rdma sizes in 64-bit unsigned entities, but then limit them to
    UINT_MAX bytes and shift them down to pages (so with a possible "+1" for
    an unaligned address).

    So each individual page count fits comfortably in an 'unsigned int' (not
    even close to overflowing into signed), but as they are added up, they
    might end up resulting in a signed return value. Which would be wrong.

    Catch the case of tot_pages turning negative, and return the appropriate
    error code.

    Reported-by: Thomas Pollet
    Signed-off-by: Linus Torvalds
    Signed-off-by: Andy Grover
    Signed-off-by: David S. Miller

    Linus Torvalds
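
The page-count overflow described in the commit above can be illustrated with a minimal userspace sketch. This is not the kernel code: `struct sketch_iovec`, `SKETCH_PAGE_SHIFT`, `sketch_pages_in_vec` and `sketch_count_pages` are hypothetical names, and 4 KiB pages with a 32-bit `int` are assumed. The commit message says the fix catches tot_pages turning negative after the addition; because signed overflow is undefined in portable C, this sketch checks for the overflow before adding instead.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's PAGE_SHIFT and struct rds_iovec. */
#define SKETCH_PAGE_SHIFT 12

struct sketch_iovec {
	uint64_t addr;
	uint64_t bytes;
};

/*
 * Pages touched by [addr, addr + bytes). Returns 0 for an empty range,
 * a range that wraps the address space, or one longer than UINT_MAX bytes.
 */
static unsigned int sketch_pages_in_vec(const struct sketch_iovec *vec)
{
	uint64_t first, last;

	if (vec->bytes == 0 || vec->bytes > (uint64_t)UINT_MAX)
		return 0;
	if (vec->addr + vec->bytes <= vec->addr)
		return 0;

	first = vec->addr >> SKETCH_PAGE_SHIFT;
	last = (vec->addr + vec->bytes - 1) >> SKETCH_PAGE_SHIFT;
	return (unsigned int)(last - first + 1);
}

/*
 * Sum the page counts of all vectors. Each count fits in an unsigned int,
 * but the running total is returned as an int, so reject any total that
 * would exceed INT_MAX instead of letting it go negative.
 */
static int sketch_count_pages(const struct sketch_iovec *vecs, size_t n)
{
	int tot_pages = 0;
	size_t i;

	assert(vecs != NULL || n == 0);

	for (i = 0; i < n; i++) {
		unsigned int nr_pages = sketch_pages_in_vec(&vecs[i]);

		if (nr_pages == 0)
			return -EINVAL;
		/* Check before adding: signed overflow is undefined in C. */
		if (nr_pages > (unsigned int)(INT_MAX - tot_pages))
			return -EINVAL;
		tot_pages += (int)nr_pages;
	}
	return tot_pages;
}
```

A caller would treat any negative return as an error and only use non-negative totals to size its page array.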
     

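The copy-once change to rds_cmsg_rdma_args described in the 31 Oct commits above (copy the rds_iovec array a single time, with stack storage for small arrays) can be sketched in userspace C. Everything here is an assumption for illustration: `struct sketch_snapshot`, `SKETCH_STACK_VECS` and the helper names are hypothetical, and `memcpy` stands in for the kernel's copy_from_user. The point is only that validation and use both read the private snapshot, so a later change to the shared array has no effect.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical threshold for the stack-based fast path. */
#define SKETCH_STACK_VECS 8

struct sketch_iovec {
	uint64_t addr;
	uint64_t bytes;
};

struct sketch_snapshot {
	struct sketch_iovec stack_vecs[SKETCH_STACK_VECS];
	struct sketch_iovec *vecs;	/* stack_vecs or a heap allocation */
	size_t nr;
};

/*
 * Copy the shared array exactly once. Small arrays use the embedded
 * storage and avoid an allocation; larger ones fall back to the heap.
 * Returns 0 on success, -1 if the allocation fails.
 */
static int sketch_snapshot_init(struct sketch_snapshot *s,
				const struct sketch_iovec *shared, size_t nr)
{
	assert(s != NULL && (shared != NULL || nr == 0));

	s->vecs = s->stack_vecs;
	s->nr = nr;
	if (nr > SKETCH_STACK_VECS) {
		/* calloc rejects nr * size overflow for us. */
		s->vecs = calloc(nr, sizeof(*s->vecs));
		if (s->vecs == NULL)
			return -1;
	}
	if (nr > 0)
		memcpy(s->vecs, shared, nr * sizeof(*s->vecs));
	return 0;
}

/* Free the heap copy, if any, and reset the snapshot. */
static void sketch_snapshot_release(struct sketch_snapshot *s)
{
	if (s->vecs != s->stack_vecs)
		free(s->vecs);
	s->vecs = NULL;
	s->nr = 0;
}
```

After `sketch_snapshot_init` returns, all sanity checks and all later accesses should go through `s->vecs`; re-reading the shared array would reintroduce the window the commit closes.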
21 Oct, 2010

2 commits


16 Oct, 2010

1 commit

  • Don't try to "optimize" rds_page_copy_user() by using kmap_atomic() and
    the unsafe atomic user mode accessor functions. It's actually slower
    than the straightforward code on any reasonable modern CPU.

    Back when the code was written (although probably not by the time it was
    actually merged), 32-bit x86 may have been the dominant
    architecture. And there kmap_atomic() can be a lot faster than kmap()
    (unless you have very good locality, in which case the virtual address
    caching by kmap() can overcome all the downsides).

    But these days, x86-64 may not be more populous, but it's getting there
    (and if you care about performance, it's definitely already there -
    you'd have upgraded your CPUs already in the last few years). And on
    x86-64, the non-kmap_atomic() version is faster, simply because the code
    is simpler and doesn't have the "re-try page fault" case.

    People with old hardware are not likely to care about RDS anyway, and
    the optimization for the 32-bit case is simply buggy, since it doesn't
    verify the user addresses properly.

    Reported-by: Dan Rosenberg
    Acked-by: Andrew Morton
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

27 Sep, 2010

1 commit


25 Sep, 2010

1 commit

  • For each socket we have:

    One spinlock (sk_slock.slock)
    One rwlock (sk_callback_lock)

    Possible scenarios are:

    (A) (this is used in net/sunrpc/xprtsock.c)
    read_lock(&sk->sk_callback_lock) (without blocking BH)

    spin_lock(&sk->sk_slock.slock);
    ...
    read_lock(&sk->sk_callback_lock);
    ...

    (B)
    write_lock_bh(&sk->sk_callback_lock)
    stuff
    write_unlock_bh(&sk->sk_callback_lock)

    (C)
    spin_lock_bh(&sk->sk_slock)
    ...
    write_lock_bh(&sk->sk_callback_lock)
    stuff
    write_unlock_bh(&sk->sk_callback_lock)
    spin_unlock_bh(&sk->sk_slock)

    This (C) case conflicts with (A):

    CPU1 [A]                         CPU2 [C]
    read_lock(callback_lock)
                                     spin_lock_bh(slock)

    We have one problematic (C) use case in inet_csk_listen_stop():

    local_bh_disable();
    bh_lock_sock(child); // spin_lock_bh(&sk->sk_slock)
    WARN_ON(sock_owned_by_user(child));
    ...
    sock_orphan(child); // write_lock_bh(&sk->sk_callback_lock)

    lockdep is not happy with this, as reported by Tetsuo Handa.

    It seems the only way to deal with this is to use
    read_lock_bh(callback_lock) everywhere.

    Thanks to Jarek for pointing out a bug in my first attempt and
    suggesting this solution.

    Reported-by: Tetsuo Handa
    Tested-by: Tetsuo Handa
    Signed-off-by: Eric Dumazet
    CC: Jarek Poplawski
    Tested-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

20 Sep, 2010

3 commits


09 Sep, 2010

25 commits

  • Add two CMSGs for masked versions of cswp and fadd. The args
    struct is modified to use a union for the different atomic op
    types' arguments. Change IB to do masked atomic ops. The atomic
    op type in rds_message is similarly unionized.

    Signed-off-by: Andy Grover

    Andy Grover
     
  • This prints the constant identifier for work completion status and rdma
    cm event types, like we already do for IB event types.

    A core string array helper is added that each string type uses.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • Nothing was canceling the send and receive work that might have been
    queued as a conn was being destroyed.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • rds_conn_shutdown() can return before the connection is shut down when
    it encounters an existing state that it doesn't understand. This lets
    rds_conn_destroy() then start tearing down the conn from under paths
    that are still using it.

    It's more reliable to queue the shutdown work and wait for krdsd to
    complete the shutdown callback. This stopped some hangs I was seeing
    where krdsd was trying to shut down a freed conn.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • Right now there's nothing to stop the various paths that use
    rs->rs_transport from racing with rmmod and executing freed transport
    code. The simple fix is to have binding to a transport also hold a
    reference to the transport's module, removing this class of races.

    We already had an unused t_owner field which was set for the modular
    transports and which wasn't set for the built-in loop transport.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • rs_transport is now also used by the rdma paths once the socket is
    bound. We don't need this stale comment to tell us what cscope can.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • rds_conn_destroy() can race with all other modifications of the
    rds_conn_count but it was modifying the count without locking.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • The RDS IB device list wasn't protected by any locking. Traversal in
    both the get_mr and FMR flushing paths could race with addition and
    removal.

    List manipulation is done with RCU primitives and is protected by the
    write side of a rwsem. The list traversal in the get_mr fast path is
    protected by a rcu read critical section. The FMR list traversal is
    more problematic because it can block while traversing the list. We
    protect this with the read side of the rwsem.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • It's nice to not have to go digging in the code to see which event
    occurred. It's easy to throw together a quick array that maps the ib
    event enums to their strings. I didn't see anything in the stack that
    does this translation for us, but I also didn't look very hard.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • Flushing FMRs is somewhat expensive, and is currently kicked off when
    the interrupt handler notices that we are getting low. The result of
    this is that FMR flushing only happens from the interrupt cpus.

    This spreads the load more effectively by triggering flushes just before
    we allocate a new FMR.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • This is only needed to keep debugging code from bugging.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • We're seeing bugs today where IB connection shutdown clears the send
    ring while the tasklet is processing completed sends. Implementation
    details cause this to dereference a null pointer. Shutdown needs to
    wait for send completion to stop before tearing down the connection. We
    can't simply wait for the ring to empty because it may contain
    unsignaled sends that will never be processed.

    This patch tracks the number of signaled sends that we've posted and
    waits for them to complete. It also makes sure that the tasklet has
    finished executing.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • The trivial amount of memory saved isn't worth the cost of dealing with section
    mismatches.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • We are *definitely* counting cycles as closely as DaveM, so
    ensure hwcache alignment for our recv ring control structs.

    Signed-off-by: Andy Grover

    Andy Grover
     
  • The recv refill path was leaking fragments because the recv event handler had
    marked a ring element as free without freeing its frag. This was happening
    because it wasn't processing receives when the conn wasn't marked up or
    connecting, as can be the case if it races with rmmod.

    Two observations support always processing receives in the callback.

    First, buildup should only post receives, thus triggering recv event handler
    calls, once it has built up all the state to handle them. Teardown should
    destroy the CQ and drain the ring before tearing down the state needed to
    process recvs. Both appear to be true today.

    Second, this test was fundamentally racy. There is nothing to stop rmmod and
    connection destruction from swooping in the moment after the conn state was
    sampled but before real receive processing starts.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • We were seeing very nasty bugs due to a fundamental assumption the current code
    makes about concurrent work struct processing. The code simply isn't able to
    handle concurrent connection shutdown work function execution today, for
    example, which is very much possible once a multi-threaded krdsd was
    introduced. The problem compounds as additional work structs are added to the
    mix.

    krdsd is no longer performance critical now that send and receive posting and
    FMR flushing are done elsewhere, so the safest fix is to move back to the
    single threaded krdsd that the current code was built around.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • This patch moves the FMR flushing work into its own multi-threaded work queue.
    This is to maintain performance in preparation for returning the main krdsd
    work queue back to a single threaded work queue to avoid deep-rooted
    concurrency bugs.

    This is also good because it further separates FMRs, which might be removed
    some day, from the rest of the code base.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • IB connections were not being destroyed during rmmod.

    First, the IB device removal callback was recently changed to disconnect
    connections that used the device being removed rather than destroying them.
    So connections that still had devices during rmmod were not being destroyed.

    Second, rds_ib_destroy_nodev_conns() was being called before connections were
    disassociated from devices. It would almost never find connections in the
    nodev list.

    We first get rid of rds_ib_destroy_conns(), which is no longer called, and
    refactor the existing caller into the main body of the function and get rid of
    the list and lock wrappers.

    Then we call rds_ib_destroy_nodev_conns() *after* ib_unregister_client() has
    removed the IB device from all the conns and put the conns on the nodev list.

    The result is that IB connections are destroyed by rmmod.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • The RDS IB client removal callback can queue work to drop the final reference
    to an IB device. We have to make sure that this function has returned before
    we complete rmmod or the work threads can try to execute freed code.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • Signed-off-by: Andy Grover

    Andy Grover
     
  • Not used.

    Signed-off-by: Andy Grover

    Andy Grover
     
  • Andy Grover
     
  • Using a delayed work queue helps us make sure a healthy number of FMRs
    have queued up over the limit. It makes for a large improvement in RDMA
    iops.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • When we add more FMRs, we flush them less often and so we go faster.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • FMR allocation and recycling is performance critical and fairly lock
    intensive. The current code has a per connection lock that all
    processes bang on and it becomes a major bottleneck on large systems.

    This changes things to use a number of cmpxchg based lists instead,
    allowing us to go through the whole FMR lifecycle without locking inside
    RDS.

    Zach Brown pointed out that our usage of cmpxchg for xlist removal is
    racy if someone manages to remove and add back an FMR struct into the list
    while another CPU can see the FMR's address at the head of the list.

    The second CPU might assume the list hasn't changed when in fact any
    number of operations might have happened in between the deletion and
    reinsertion.

    This commit maintains a per cpu count of CPUs that are currently
    in xlist removal, and establishes a grace period to make sure that
    nobody can see an entry we have just removed from the list.

    Signed-off-by: Chris Mason

    Chris Mason
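
The cmpxchg-based list idea in the commit above can be sketched with C11 atomics. This is not the xlist code from the patch: `struct sketch_node`, `struct sketch_list` and the `sketch_*` helpers are hypothetical names. Pushing with a compare-and-swap loop is safe on its own. The race the commit describes arises when a single head entry is removed with cmpxchg, because the head pointer can be removed and re-added in between and still compare equal. The commit handles that with a per-cpu count and a grace period; this sketch swaps in a simpler, different technique and sidesteps the problem by detaching the entire list with one atomic exchange.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct sketch_node {
	struct sketch_node *next;
	int id;
};

struct sketch_list {
	_Atomic(struct sketch_node *) head;
};

static void sketch_list_init(struct sketch_list *list)
{
	assert(list != NULL);
	atomic_init(&list->head, (struct sketch_node *)NULL);
}

/*
 * Lock-free push: link the node in front of the current head and retry
 * if another thread changed the head in the meantime. On failure the
 * compare-exchange reloads the current head into 'old'.
 */
static void sketch_push(struct sketch_list *list, struct sketch_node *node)
{
	struct sketch_node *old = atomic_load(&list->head);

	do {
		node->next = old;
	} while (!atomic_compare_exchange_weak(&list->head, &old, node));
}

/*
 * Detach every node at once. A single exchange never compares the head
 * against a stale value, so a remove-and-re-add by another thread cannot
 * fool it. The caller owns the returned chain exclusively.
 */
static struct sketch_node *sketch_take_all(struct sketch_list *list)
{
	return atomic_exchange(&list->head, (struct sketch_node *)NULL);
}
```

The returned chain is in last-in, first-out order and can be walked without further synchronization, since no other thread can reach it through the list head any more.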