09 Mar, 2011

1 commit

  • Recently had this bug halt reported to me:

    kernel BUG at net/rds/send.c:329!
    Oops: Exception in kernel mode, sig: 5 [#1]
    SMP NR_CPUS=1024 NUMA pSeries
    Modules linked in: rds sunrpc ipv6 dm_mirror dm_region_hash dm_log ibmveth sg
    ext4 jbd2 mbcache sd_mod crc_t10dif ibmvscsic scsi_transport_srp scsi_tgt
    dm_mod [last unloaded: scsi_wait_scan]
    NIP: d000000003ca68f4 LR: d000000003ca67fc CTR: d000000003ca8770
    REGS: c000000175cab980 TRAP: 0700 Not tainted (2.6.32-118.el6.ppc64)
    MSR: 8000000000029032 CR: 44000022 XER: 00000000
    TASK = c00000017586ec90[1896] 'krdsd' THREAD: c000000175ca8000 CPU: 0
    GPR00: 0000000000000150 c000000175cabc00 d000000003cb7340 0000000000002030
    GPR04: ffffffffffffffff 0000000000000030 0000000000000000 0000000000000030
    GPR08: 0000000000000001 0000000000000001 c0000001756b1e30 0000000000010000
    GPR12: d000000003caac90 c000000000fa2500 c0000001742b2858 c0000001742b2a00
    GPR16: c0000001742b2a08 c0000001742b2820 0000000000000001 0000000000000001
    GPR20: 0000000000000040 c0000001742b2814 c000000175cabc70 0800000000000000
    GPR24: 0000000000000004 0200000000000000 0000000000000000 c0000001742b2860
    GPR28: 0000000000000000 c0000001756b1c80 d000000003cb68e8 c0000001742b27b8
    NIP [d000000003ca68f4] .rds_send_xmit+0x4c4/0x8a0 [rds]
    LR [d000000003ca67fc] .rds_send_xmit+0x3cc/0x8a0 [rds]
    Call Trace:
    [c000000175cabc00] [d000000003ca67fc] .rds_send_xmit+0x3cc/0x8a0 [rds]
    (unreliable)
    [c000000175cabd30] [d000000003ca7e64] .rds_send_worker+0x54/0x100 [rds]
    [c000000175cabdb0] [c0000000000b475c] .worker_thread+0x1dc/0x3c0
    [c000000175cabed0] [c0000000000baa9c] .kthread+0xbc/0xd0
    [c000000175cabf90] [c000000000032114] .kernel_thread+0x54/0x70
    Instruction dump:
    4bfffd50 60000000 60000000 39080001 935f004c f91f0040 41820024 813d017c
    7d094a78 7d290074 7929d182 394a0020 40e2ff68 4bffffa4 39200000
    Kernel panic - not syncing: Fatal exception
    Call Trace:
    [c000000175cab560] [c000000000012e04] .show_stack+0x74/0x1c0 (unreliable)
    [c000000175cab610] [c0000000005a365c] .panic+0x80/0x1b4
    [c000000175cab6a0] [c00000000002fbcc] .die+0x21c/0x2a0
    [c000000175cab750] [c000000000030000] ._exception+0x110/0x220
    [c000000175cab910] [c000000000004b9c] program_check_common+0x11c/0x180

    Signed-off-by: David S. Miller

    Neil Horman
     

23 Nov, 2010

1 commit

Changed Makefile to use <modules>-y instead of <modules>-objs
    because -objs is deprecated and not mentioned in
    Documentation/kbuild/makefiles.txt.

Also, use the ccflags-y variable instead of EXTRA_CFLAGS, because
    EXTRA_CFLAGS is deprecated and should now be replaced.

    Last but not least, took out if-conditionals.

    Signed-off-by: Tracey Dent
    Signed-off-by: David S. Miller

    Tracey Dent
     

18 Nov, 2010

1 commit

  • In rds_cmsg_rdma_args(), the user-provided args->nr_local value is
    restricted to less than UINT_MAX. This seems to need a tighter upper
    bound, since the calculation of total iov_size can overflow, resulting
    in a small sock_kmalloc() allocation. This would probably just result
    in walking off the heap and crashing when calling rds_rdma_pages() with
    a high count value. If it somehow doesn't crash here, then memory
    corruption could occur soon after.
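
    A userspace sketch of the kind of bound-plus-size check involved (the
    struct, cap, and function names here are hypothetical, not the actual
    RDS fix): reject the count before the multiplication can wrap.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the kernel's iovec-style descriptor. */
struct toy_iovec { uint64_t addr; uint64_t bytes; };

#define TOY_MAX_IOVECS 1024 /* illustrative cap, not the kernel's bound */

/*
 * With only "nr_local < UINT_MAX" as the limit, the multiplication
 * below can wrap to a small value, so the allocation comes back short
 * and later page-walking code overruns it.  Bound the count tightly
 * before computing the size.
 */
static int toy_iovec_alloc_size(uint64_t nr_local, size_t *out)
{
	if (nr_local == 0 || nr_local > TOY_MAX_IOVECS)
		return -EINVAL;
	*out = (size_t)nr_local * sizeof(struct toy_iovec);
	return 0;
}
```

    The point is that the bound check happens on the raw 64-bit count, so
    the size computation itself can no longer overflow.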

    Signed-off-by: Dan Rosenberg
    Signed-off-by: David S. Miller

    Dan Rosenberg
     

09 Nov, 2010

1 commit


04 Nov, 2010

2 commits


31 Oct, 2010

5 commits

Even with the previous fix, we are still reading the iovecs once
    to determine SGs needed, and then again later on. Preallocating
    space for sg lists as part of rds_message seemed like a good idea
    but it might be better to not do this. While working to redo that
    code, this patch attempts to protect against userspace rewriting
    the rds_iovec array between the first and second accesses.

    The consequences of this would be either a too-small or too-large
    sg list array. Too large is not an issue. This patch changes all
    callers of message_alloc_sgs to handle running out of preallocated
    sgs, and fail gracefully.
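
    A toy model of the fail-gracefully pattern (all names hypothetical;
    the real code hands out entries from rds_message's preallocated sg
    array): the allocator signals exhaustion, and the caller returns an
    error instead of hitting a BUG().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy model of a message with a preallocated scatterlist array. */
struct toy_sg { void *page; };
struct toy_message {
	unsigned int sg_used;
	unsigned int sg_total;
	struct toy_sg sgs[8];
};

/*
 * Hand out the next 'nents' entries from the preallocated array, or
 * NULL when the reservation (sized from the first iovec pass) turns
 * out to be too small -- e.g. because userspace rewrote the iovecs.
 */
static struct toy_sg *toy_message_alloc_sgs(struct toy_message *rm,
					    unsigned int nents)
{
	if (nents == 0 || nents > rm->sg_total - rm->sg_used)
		return NULL;
	rm->sg_used += nents;
	return &rm->sgs[rm->sg_used - nents];
}

/* Callers must handle exhaustion and fail gracefully, not BUG(). */
static int toy_setup_rdma(struct toy_message *rm, unsigned int nents)
{
	struct toy_sg *sg = toy_message_alloc_sgs(rm, nents);

	if (!sg)
		return -ENOMEM;
	return 0;
}
```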

    Signed-off-by: Andy Grover
    Signed-off-by: David S. Miller

    Andy Grover
     
  • Change rds_rdma_pages to take a passed-in rds_iovec array instead
    of doing copy_from_user itself.

    Change rds_cmsg_rdma_args to copy rds_iovec array once only. This
    eliminates the possibility of userspace changing it after our
    sanity checks.

    Implement stack-based storage for small numbers of iovecs, based
    on net/socket.c, to save an alloc in the extremely common case.

    Although this patch reduces iovec copies in cmsg_rdma_args to 1,
    we still do another one in rds_rdma_extra_size. Getting rid of
    that one will be trickier, so it'll be a separate patch.
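
    A userspace sketch of the copy-once approach (names and the
    UIO_FASTIOV-style constant are illustrative): snapshot the whole
    array in a single copy, with stack storage for small counts as in
    net/socket.c, and do all later validation against the snapshot.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct toy_iovec { uint64_t addr; uint64_t bytes; };

#define TOY_FASTIOV 8 /* small-count fast path, like UIO_FASTIOV */

/*
 * Copy the "user" array exactly once into kernel-side storage (stack
 * for small counts, heap otherwise).  All later validation and page
 * counting must read this snapshot, never the user memory again, so
 * userspace cannot change the values between check and use.
 */
static int snapshot_iovecs(const struct toy_iovec *user_vec, unsigned int nr,
			   struct toy_iovec *fast, struct toy_iovec **out)
{
	struct toy_iovec *iov = fast;

	if (nr > TOY_FASTIOV) {
		iov = malloc(nr * sizeof(*iov));
		if (!iov)
			return -ENOMEM;
	}
	/* Stands in for a single copy_from_user() of the whole array. */
	memcpy(iov, user_vec, nr * sizeof(*iov));
	*out = iov;
	return 0;
}
```

    Once the snapshot exists, a concurrent writer to the original buffer
    can no longer invalidate the sanity checks.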

    Signed-off-by: Andy Grover
    Signed-off-by: David S. Miller

    Andy Grover
     
  • We don't need to set ret = 0 at the end -- it's initialized to 0.

    Also, don't increment s_send_rdma stat if we're exiting with an
    error.

    Signed-off-by: Andy Grover
    Signed-off-by: David S. Miller

    Andy Grover
     
  • rds_cmsg_rdma_args would still return success even if rds_rdma_pages
    returned an error (or overflowed).

    Signed-off-by: Andy Grover
    Signed-off-by: David S. Miller

    Andy Grover
     
  • As reported by Thomas Pollet, the rdma page counting can overflow. We
    get the rdma sizes in 64-bit unsigned entities, but then limit it to
    UINT_MAX bytes and shift them down to pages (so with a possible "+1" for
    an unaligned address).

    So each individual page count fits comfortably in an 'unsigned int' (not
    even close to overflowing into signed), but as they are added up, they
    might end up resulting in a signed return value. Which would be wrong.

    Catch the case of tot_pages turning negative, and return the appropriate
    error code.
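
    A sketch of the accumulation check in plain C (names and the page
    shift are illustrative; it is written in an overflow-avoiding form,
    since signed wrap is undefined in standard C, whereas the kernel fix
    can simply catch the total going negative):

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_PAGE_SHIFT 12 /* 4K pages, for illustration */

/*
 * Each entry is capped at UINT_MAX bytes, so its own page count fits
 * comfortably in an int -- but the running total can still climb past
 * INT_MAX across many entries.  Check before adding so the signed sum
 * never overflows.
 */
static int count_pages(const uint64_t *bytes, unsigned int nr, int *total)
{
	int tot_pages = 0;
	unsigned int i;

	for (i = 0; i < nr; i++) {
		unsigned int nr_pages;

		if (bytes[i] > UINT_MAX)
			return -EMSGSIZE;
		/* "+ 1" allows for an unaligned start address */
		nr_pages = (unsigned int)(bytes[i] >> TOY_PAGE_SHIFT) + 1;
		if (nr_pages > (unsigned int)INT_MAX - (unsigned int)tot_pages)
			return -EINVAL;
		tot_pages += (int)nr_pages;
	}
	*total = tot_pages;
	return 0;
}
```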

    Reported-by: Thomas Pollet
    Signed-off-by: Linus Torvalds
    Signed-off-by: Andy Grover
    Signed-off-by: David S. Miller

    Linus Torvalds
     

21 Oct, 2010

2 commits


16 Oct, 2010

1 commit

  • Don't try to "optimize" rds_page_copy_user() by using kmap_atomic() and
    the unsafe atomic user mode accessor functions. It's actually slower
    than the straightforward code on any reasonable modern CPU.

Back when the code was written (although probably not by the time it was
    actually merged), 32-bit x86 may have been the dominant
    architecture. And there kmap_atomic() can be a lot faster than kmap()
    (unless you have very good locality, in which case the virtual address
    caching by kmap() can overcome all the downsides).

    But these days, x86-64 may not be more populous, but it's getting there
    (and if you care about performance, it's definitely already there -
you'd have upgraded your CPUs already in the last few years). And on
    x86-64, the non-kmap_atomic() version is faster, simply because the code
    is simpler and doesn't have the "re-try page fault" case.

    People with old hardware are not likely to care about RDS anyway, and
    the optimization for the 32-bit case is simply buggy, since it doesn't
    verify the user addresses properly.

    Reported-by: Dan Rosenberg
    Acked-by: Andrew Morton
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

27 Sep, 2010

1 commit


25 Sep, 2010

1 commit

  • We have for each socket :

    One spinlock (sk_slock.slock)
    One rwlock (sk_callback_lock)

    Possible scenarios are :

    (A) (this is used in net/sunrpc/xprtsock.c)
read_lock(&sk->sk_callback_lock) (without blocking BH)
    <BH>
    spin_lock(&sk->sk_slock.slock);
    ...
    read_lock(&sk->sk_callback_lock);
    ...

    (B)
    write_lock_bh(&sk->sk_callback_lock)
    stuff
    write_unlock_bh(&sk->sk_callback_lock)

    (C)
    spin_lock_bh(&sk->sk_slock)
    ...
    write_lock_bh(&sk->sk_callback_lock)
    stuff
    write_unlock_bh(&sk->sk_callback_lock)
    spin_unlock_bh(&sk->sk_slock)

    This (C) case conflicts with (A) :

CPU1 [A]                      CPU2 [C]
    read_lock(callback_lock)
    <BH>                          spin_lock_bh(slock)
    <wait to spin_lock(slock)>
                                  <wait to write_lock_bh(callback_lock)>

    We have one problematic (C) use case in inet_csk_listen_stop() :

    local_bh_disable();
    bh_lock_sock(child); // spin_lock_bh(&sk->sk_slock)
    WARN_ON(sock_owned_by_user(child));
    ...
    sock_orphan(child); // write_lock_bh(&sk->sk_callback_lock)

    lockdep is not happy with this, as reported by Tetsuo Handa

It seems the only way to deal with this is to use read_lock_bh(callbacklock)
    everywhere.

    Thanks to Jarek for pointing a bug in my first attempt and suggesting
    this solution.

    Reported-by: Tetsuo Handa
    Tested-by: Tetsuo Handa
    Signed-off-by: Eric Dumazet
    CC: Jarek Poplawski
    Tested-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

20 Sep, 2010

3 commits


09 Sep, 2010

21 commits

  • Add two CMSGs for masked versions of cswp and fadd. args
    struct modified to use a union for different atomic op type's
    arguments. Change IB to do masked atomic ops. Atomic op type
    in rds_message similarly unionized.
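
    An illustrative layout for such an argument struct (field names are
    guesses, not the actual uapi): the per-op arguments, including the
    masked variants' mask words, share storage in a union selected by the
    CMSG type.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical shape of the args struct: compare-and-swap and
 * fetch-and-add arguments -- including the masked variants' mask
 * words -- occupy the same storage, since only one op is in use.
 */
struct toy_atomic_args {
	int op; /* which member of the union is valid */
	union {
		struct {
			uint64_t compare, swap;
			uint64_t compare_mask, swap_mask;
		} cswp;
		struct {
			uint64_t add;
			uint64_t nocarry_mask;
		} fadd;
	};
};
```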

    Signed-off-by: Andy Grover

    Andy Grover
     
  • This prints the constant identifier for work completion status and rdma
    cm event types, like we already do for IB event types.

    A core string array helper is added that each string type uses.
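
    The helper pattern might look roughly like this (signature and names
    are assumptions, not the actual RDS code): a bounds-checked lookup
    into a string table indexed by the constant's value, with a fallback
    for out-of-range or unnamed entries.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Shared lookup that every string table goes through. */
static const char *toy_str_array(const char *const *strings, size_t len,
				 size_t index)
{
	if (index < len && strings[index])
		return strings[index];
	return "unknown";
}

/* One table per value space, indexed by the constant itself. */
static const char *const toy_wc_status[] = {
	[0] = "success",
	[1] = "local_length_err",
	[2] = "local_qp_op_err",
};
#define TOY_WC_NR (sizeof(toy_wc_status) / sizeof(toy_wc_status[0]))
```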

    Signed-off-by: Zach Brown

    Zach Brown
     
  • Nothing was canceling the send and receive work that might have been
    queued as a conn was being destroyed.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • rds_conn_shutdown() can return before the connection is shut down when
    it encounters an existing state that it doesn't understand. This lets
    rds_conn_destroy() then start tearing down the conn from under paths
    that are still using it.

It's more reliable to queue the shutdown work and wait for krdsd to complete the
    shutdown callback. This stopped some hangs I was seeing where krdsd was
    trying to shut down a freed conn.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • Right now there's nothing to stop the various paths that use
    rs->rs_transport from racing with rmmod and executing freed transport
    code. The simple fix is to have binding to a transport also hold a
    reference to the transport's module, removing this class of races.

    We already had an unused t_owner field which was set for the modular
    transports and which wasn't set for the built-in loop transport.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • rs_transport is now also used by the rdma paths once the socket is
    bound. We don't need this stale comment to tell us what cscope can.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • rds_conn_destroy() can race with all other modifications of the
    rds_conn_count but it was modifying the count without locking.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • The RDS IB device list wasn't protected by any locking. Traversal in
both the get_mr and FMR flushing paths could race with addition and
    removal.

List manipulation is done with RCU primitives and is protected by the
    write side of a rwsem. The list traversal in the get_mr fast path is
    protected by an RCU read critical section. The FMR list traversal is
    more problematic because it can block while traversing the list. We
    protect this with the read side of the rwsem.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • It's nice to not have to go digging in the code to see which event
occurred. It's easy to throw together a quick array that maps the IB
    event enums to their strings. I didn't see anything in the stack that
    does this translation for us, but I also didn't look very hard.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • Flushing FMRs is somewhat expensive, and is currently kicked off when
    the interrupt handler notices that we are getting low. The result of
    this is that FMR flushing only happens from the interrupt cpus.

    This spreads the load more effectively by triggering flushes just before
    we allocate a new FMR.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • This is only needed to keep debugging code from bugging.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • We're seeing bugs today where IB connection shutdown clears the send
    ring while the tasklet is processing completed sends. Implementation
    details cause this to dereference a null pointer. Shutdown needs to
    wait for send completion to stop before tearing down the connection. We
    can't simply wait for the ring to empty because it may contain
    unsignaled sends that will never be processed.

    This patch tracks the number of signaled sends that we've posted and
    waits for them to complete. It also makes sure that the tasklet has
    finished executing.
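
    The track-and-wait idea can be modeled in userspace with a counter
    and a condition variable (pthreads stand in for the kernel's atomics
    and wait queues; all names are hypothetical):

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t ring_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t ring_drained = PTHREAD_COND_INITIALIZER;
static int signaled_in_flight;

/* Called when posting a send with a completion requested. */
static void post_signaled_send(void)
{
	pthread_mutex_lock(&ring_lock);
	signaled_in_flight++;
	pthread_mutex_unlock(&ring_lock);
}

/* Called from the completion handler for each signaled send. */
static void send_completed(void)
{
	pthread_mutex_lock(&ring_lock);
	if (--signaled_in_flight == 0)
		pthread_cond_broadcast(&ring_drained);
	pthread_mutex_unlock(&ring_lock);
}

/* Shutdown: wait for every signaled send to finish before teardown. */
static void wait_for_signaled_sends(void)
{
	pthread_mutex_lock(&ring_lock);
	while (signaled_in_flight > 0)
		pthread_cond_wait(&ring_drained, &ring_lock);
	pthread_mutex_unlock(&ring_lock);
}

/* Test driver: completes two outstanding sends from another thread. */
static void *completion_thread(void *arg)
{
	(void)arg;
	send_completed();
	send_completed();
	return NULL;
}
```

    Only signaled sends are counted, since unsignaled ones never generate
    a completion and would make the wait hang forever.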

    Signed-off-by: Zach Brown

    Zach Brown
     
  • The trivial amount of memory saved isn't worth the cost of dealing with section
    mismatches.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • We are *definitely* counting cycles as closely as DaveM, so
    ensure hwcache alignment for our recv ring control structs.

    Signed-off-by: Andy Grover

    Andy Grover
     
  • The recv refill path was leaking fragments because the recv event handler had
    marked a ring element as free without freeing its frag. This was happening
    because it wasn't processing receives when the conn wasn't marked up or
    connecting, as can be the case if it races with rmmod.

    Two observations support always processing receives in the callback.

    First, buildup should only post receives, thus triggering recv event handler
    calls, once it has built up all the state to handle them. Teardown should
    destroy the CQ and drain the ring before tearing down the state needed to
    process recvs. Both appear to be true today.

    Second, this test was fundamentally racy. There is nothing to stop rmmod and
    connection destruction from swooping in the moment after the conn state was
sampled but before real receive processing starts.

    Signed-off-by: Zach Brown

    Zach Brown
     
We were seeing very nasty bugs due to fundamental assumptions the current code
    makes about concurrent work struct processing. The code simply isn't able to
    handle concurrent connection shutdown work function execution today, for
    example, which is very much possible once a multi-threaded krdsd was
    introduced. The problem compounds as additional work structs are added to the
    mix.

krdsd is no longer performance critical now that send and receive posting and
    FMR flushing are done elsewhere, so the safest fix is to move back to the
    single threaded krdsd that the current code was built around.

    Signed-off-by: Zach Brown

    Zach Brown
     
This patch moves the FMR flushing work into its own multi-threaded work queue.
    This is to maintain performance in preparation for returning the main krdsd
    work queue back to a single threaded work queue to avoid deep-rooted
    concurrency bugs.

    This is also good because it further separates FMRs, which might be removed
    some day, from the rest of the code base.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • IB connections were not being destroyed during rmmod.

First, the IB device removal callback was recently changed to disconnect
    connections that used the removing device rather than destroy them. So
    connections with devices during rmmod were not being destroyed.

Second, rds_ib_destroy_nodev_conns() was being called before connections were
    disassociated from devices. It would almost never find connections in the
    nodev list.

We first get rid of rds_ib_destroy_conns(), which is no longer called,
    refactor its remaining caller into the main body of the function, and get
    rid of the list and lock wrappers.

    Then we call rds_ib_destroy_nodev_conns() *after* ib_unregister_client() has
    removed the IB device from all the conns and put the conns on the nodev list.

    The result is that IB connections are destroyed by rmmod.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • The RDS IB client removal callback can queue work to drop the final reference
    to an IB device. We have to make sure that this function has returned before
    we complete rmmod or the work threads can try to execute freed code.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • Signed-off-by: Andy Grover

    Andy Grover
     
  • Not used.

    Signed-off-by: Andy Grover

    Andy Grover