26 May, 2011

1 commit

  • The RDMA CM currently infers the QP type from the port space selected
    by the user. In the future (eg with RDMA_PS_IB or XRC), there may not
    be a 1-1 correspondence between port space and QP type. For netlink
    export of RDMA CM state, we want to export the QP type to userspace,
    so it is cleaner to explicitly associate a QP type to an ID.

    Modify rdma_create_id() to allow the user to specify the QP type, and
    use it to make our selections of datagram versus connected mode.

    Signed-off-by: Sean Hefty
    Signed-off-by: Roland Dreier

    Sean Hefty
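
A minimal sketch of the idea in the message above, assuming hypothetical `demo_*` names throughout (the kernel's own types are `enum rdma_port_space` and `enum ib_qp_type`; nothing below reproduces the real rdma_create_id() signature). The ID records the QP type it was created with, so the datagram-versus-connected decision no longer has to be derived from the port space.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's port space and QP type enums. */
enum demo_port_space { DEMO_PS_TCP, DEMO_PS_UDP, DEMO_PS_IB };
enum demo_qp_type { DEMO_QPT_RC, DEMO_QPT_UD };

/* Hypothetical, heavily reduced CM ID: it stores both values explicitly. */
struct demo_cm_id {
    enum demo_port_space ps;
    enum demo_qp_type qp_type;
};

/* The caller states the QP type instead of having it inferred from ps. */
static struct demo_cm_id demo_create_id(enum demo_port_space ps,
                                        enum demo_qp_type qp_type)
{
    struct demo_cm_id id = { .ps = ps, .qp_type = qp_type };
    return id;
}

/* Datagram versus connected mode is decided from the stored QP type only. */
static bool demo_is_datagram(const struct demo_cm_id *id)
{
    return id->qp_type == DEMO_QPT_UD;
}
```

With this shape, one port space can carry IDs of either mode, which is the case the message anticipates for RDMA_PS_IB.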
     

31 Mar, 2011

1 commit


24 Mar, 2011

2 commits

  • In preparation for removing ext2 non-atomic bit operations from
    asm/bitops.h, this converts ext2 non-atomic bit operations to
    little-endian bit operations.

    Signed-off-by: Akinobu Mita
    Cc: Andy Grover
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • asm-generic/bitops/le.h is only intended to be included directly from
    asm-generic/bitops/ext2-non-atomic.h or asm-generic/bitops/minix-le.h
    which implement generic ext2 or minix bit operations.

    This stops including asm-generic/bitops/le.h directly and uses ext2
    non-atomic bit operations instead.

    It seems odd to use ext2_*_bit() on rds, but it will be replaced with
    __{set,clear,test}_bit_le() after introducing little endian bit operations
    for all architectures. This indirect step is necessary to maintain
    bisectability for some architectures which have their own little-endian
    bit operations.

    Signed-off-by: Akinobu Mita
    Cc: Andy Grover
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
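
For readers unfamiliar with little-endian bit operations, here is a minimal userspace sketch, assuming hypothetical `demo_*` helpers that work on a byte array (the kernel helpers named above operate on `unsigned long` words and are not reproduced here). The point is only the numbering: bit `nr` lives in byte `nr / 8` at position `nr % 8`, independent of host byte order.

```c
#include <stddef.h>

/* Set bit nr using little-endian bit numbering (hypothetical helper). */
static void demo_set_bit_le(size_t nr, unsigned char *map)
{
    map[nr >> 3] |= (unsigned char)(1u << (nr & 7));
}

/* Clear bit nr using the same numbering. */
static void demo_clear_bit_le(size_t nr, unsigned char *map)
{
    map[nr >> 3] &= (unsigned char)~(1u << (nr & 7));
}

/* Return 1 if bit nr is set, 0 otherwise. */
static int demo_test_bit_le(size_t nr, const unsigned char *map)
{
    return (map[nr >> 3] >> (nr & 7)) & 1;
}
```

Because the layout is defined per byte, a bitmap written this way has the same on-wire or on-disk form on big-endian and little-endian machines.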
     

17 Mar, 2011

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1480 commits)
    bonding: enable netpoll without checking link status
    xfrm: Refcount destination entry on xfrm_lookup
    net: introduce rx_handler results and logic around that
    bonding: get rid of IFF_SLAVE_INACTIVE netdev->priv_flag
    bonding: wrap slave state work
    net: get rid of multiple bond-related netdevice->priv_flags
    bonding: register slave pointer for rx_handler
    be2net: Bump up the version number
    be2net: Copyright notice change. Update to Emulex instead of ServerEngines
    e1000e: fix kconfig for crc32 dependency
    netfilter ebtables: fix xt_AUDIT to work with ebtables
    xen network backend driver
    bonding: Improve syslog message at device creation time
    bonding: Call netif_carrier_off after register_netdevice
    bonding: Incorrect TX queue offset
    net_sched: fix ip_tos2prio
    xfrm: fix __xfrm_route_forward()
    be2net: Fix UDP packet detected status in RX compl
    Phonet: fix aligned-mode pipe socket buffer header reserve
    netxen: support for GbE port settings
    ...

    Fix up conflicts in drivers/staging/brcm80211/brcmsmac/wl_mac80211.c
    with the staging updates.

    Linus Torvalds
     

16 Mar, 2011

1 commit

  • * 'for-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: fix build failure introduced by s/freezeable/freezable/
    workqueue: add system_freezeable_wq
    rds/ib: use system_wq instead of rds_ib_fmr_wq
    net/9p: replace p9_poll_task with a work
    net/9p: use system_wq instead of p9_mux_wq
    xfs: convert to alloc_workqueue()
    reiserfs: make commit_wq use the default concurrency level
    ocfs2: use system_wq instead of ocfs2_quota_wq
    ext4: convert to alloc_workqueue()
    scsi/scsi_tgt_lib: scsi_tgtd isn't used in memory reclaim path
    scsi/be2iscsi,qla2xxx: convert to alloc_workqueue()
    misc/iwmc3200top: use system_wq instead of dedicated workqueues
    i2o: use alloc_workqueue() instead of create_workqueue()
    acpi: kacpi*_wq don't need WQ_MEM_RECLAIM
    fs/aio: aio_wq isn't used in memory reclaim path
    input/tps6507x-ts: use system_wq instead of dedicated workqueue
    cpufreq: use system_wq instead of dedicated workqueues
    wireless/ipw2x00: use system_wq instead of dedicated workqueues
    arm/omap: use system_wq in mailbox
    workqueue: use WQ_MEM_RECLAIM instead of WQ_RESCUER

    Linus Torvalds
     

11 Mar, 2011

1 commit


09 Mar, 2011

1 commit

  • Recently had this bug halt reported to me:

    kernel BUG at net/rds/send.c:329!
    Oops: Exception in kernel mode, sig: 5 [#1]
    SMP NR_CPUS=1024 NUMA pSeries
    Modules linked in: rds sunrpc ipv6 dm_mirror dm_region_hash dm_log ibmveth sg
    ext4 jbd2 mbcache sd_mod crc_t10dif ibmvscsic scsi_transport_srp scsi_tgt
    dm_mod [last unloaded: scsi_wait_scan]
    NIP: d000000003ca68f4 LR: d000000003ca67fc CTR: d000000003ca8770
    REGS: c000000175cab980 TRAP: 0700 Not tainted (2.6.32-118.el6.ppc64)
    MSR: 8000000000029032 CR: 44000022 XER: 00000000
    TASK = c00000017586ec90[1896] 'krdsd' THREAD: c000000175ca8000 CPU: 0
    GPR00: 0000000000000150 c000000175cabc00 d000000003cb7340 0000000000002030
    GPR04: ffffffffffffffff 0000000000000030 0000000000000000 0000000000000030
    GPR08: 0000000000000001 0000000000000001 c0000001756b1e30 0000000000010000
    GPR12: d000000003caac90 c000000000fa2500 c0000001742b2858 c0000001742b2a00
    GPR16: c0000001742b2a08 c0000001742b2820 0000000000000001 0000000000000001
    GPR20: 0000000000000040 c0000001742b2814 c000000175cabc70 0800000000000000
    GPR24: 0000000000000004 0200000000000000 0000000000000000 c0000001742b2860
    GPR28: 0000000000000000 c0000001756b1c80 d000000003cb68e8 c0000001742b27b8
    NIP [d000000003ca68f4] .rds_send_xmit+0x4c4/0x8a0 [rds]
    LR [d000000003ca67fc] .rds_send_xmit+0x3cc/0x8a0 [rds]
    Call Trace:
    [c000000175cabc00] [d000000003ca67fc] .rds_send_xmit+0x3cc/0x8a0 [rds]
    (unreliable)
    [c000000175cabd30] [d000000003ca7e64] .rds_send_worker+0x54/0x100 [rds]
    [c000000175cabdb0] [c0000000000b475c] .worker_thread+0x1dc/0x3c0
    [c000000175cabed0] [c0000000000baa9c] .kthread+0xbc/0xd0
    [c000000175cabf90] [c000000000032114] .kernel_thread+0x54/0x70
    Instruction dump:
    4bfffd50 60000000 60000000 39080001 935f004c f91f0040 41820024 813d017c
    7d094a78 7d290074 7929d182 394a0020 40e2ff68 4bffffa4 39200000
    Kernel panic - not syncing: Fatal exception
    Call Trace:
    [c000000175cab560] [c000000000012e04] .show_stack+0x74/0x1c0 (unreliable)
    [c000000175cab610] [c0000000005a365c] .panic+0x80/0x1b4
    [c000000175cab6a0] [c00000000002fbcc] .die+0x21c/0x2a0
    [c000000175cab750] [c000000000030000] ._exception+0x110/0x220
    [c000000175cab910] [c000000000004b9c] program_check_common+0x11c/0x180

    Signed-off-by: David S. Miller

    Neil Horman
     

01 Feb, 2011

1 commit

  • With cmwq, there's no reason to use dedicated rds_ib_fmr_wq - it's not
    in the memory reclaim path and the maximum number of concurrent work
    items is bound by the number of devices. Drop it and use system_wq
    instead. This makes rds_ib_fmr_init/exit() no-ops, so both are removed.

    Signed-off-by: Tejun Heo
    Cc: Andy Grover

    Tejun Heo
     

20 Jan, 2011

1 commit

  • Clean up some unused macros in net/*.
    1. Left over after code changes, e.g. PGV_FROM_VMALLOC, KMEM_SAFETYZONE.
    2. Never used since being introduced to the kernel,
    e.g. P9_RDMA_MAX_SGE, UTIL_CTRL_PKT_SIZE.

    Signed-off-by: Shan Wei
    Acked-by: Sjur Braendeland
    Signed-off-by: David S. Miller

    Shan Wei
     

23 Nov, 2010

1 commit

  • Changed Makefile to use -y instead of -objs
    because -objs is deprecated and not mentioned in
    Documentation/kbuild/makefiles.txt.

    Also, use the ccflags-$ flag instead of EXTRA_CFLAGS because EXTRA_CFLAGS is
    deprecated and should now be switched.

    Last but not least, took out if-conditionals.

    Signed-off-by: Tracey Dent
    Signed-off-by: David S. Miller

    Tracey Dent
     

18 Nov, 2010

1 commit

  • In rds_cmsg_rdma_args(), the user-provided args->nr_local value is
    restricted to less than UINT_MAX. This seems to need a tighter upper
    bound, since the calculation of total iov_size can overflow, resulting
    in a small sock_kmalloc() allocation. This would probably just result
    in walking off the heap and crashing when calling rds_rdma_pages() with
    a high count value. If it somehow doesn't crash here, then memory
    corruption could occur soon after.

    Signed-off-by: Dan Rosenberg
    Signed-off-by: David S. Miller

    Dan Rosenberg
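
The overflow described above can be illustrated with a small userspace sketch. Everything named `demo_*` or `DEMO_*` is hypothetical, including the cap of 1024 entries, which is chosen only for this example and is not taken from the patch. The sketch rejects counts that are too large before multiplying, so the computed allocation size can never wrap to a small value.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's struct rds_iovec. */
struct demo_iovec {
    uint64_t addr;
    uint64_t bytes;
};

/* Hypothetical upper bound on the element count, for this sketch only. */
#define DEMO_MAX_IOVECS 1024u

/*
 * Compute nr_local * sizeof(struct demo_iovec) without wrapping.
 * Returns 0 and stores the size in *out on success, -1 if the
 * count is rejected.
 */
static int demo_iov_size(uint64_t nr_local, size_t *out)
{
    if (nr_local == 0 || nr_local > DEMO_MAX_IOVECS)
        return -1;
    /* Generic guard: the division cannot overflow, the product could. */
    if (nr_local > SIZE_MAX / sizeof(struct demo_iovec))
        return -1;
    *out = (size_t)nr_local * sizeof(struct demo_iovec);
    return 0;
}
```

The second check is redundant given the small cap, but it keeps the function safe if the cap is ever raised.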
     

09 Nov, 2010

1 commit


04 Nov, 2010

2 commits


31 Oct, 2010

5 commits

  • Even with the previous fix, we still are reading the iovecs once
    to determine SGs needed, and then again later on. Preallocating
    space for sg lists as part of rds_message seemed like a good idea
    but it might be better to not do this. While working to redo that
    code, this patch attempts to protect against userspace rewriting
    the rds_iovec array between the first and second accesses.

    The consequences of this would be either a too-small or too-large
    sg list array. Too large is not an issue. This patch changes all
    callers of message_alloc_sgs to handle running out of preallocated
    sgs, and fail gracefully.

    Signed-off-by: Andy Grover
    Signed-off-by: David S. Miller

    Andy Grover
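
A rough sketch of "fail gracefully when the preallocated sgs run out", assuming a hypothetical `demo_msg` with a fixed array of 16 slots (the real rds_message layout and the real message_alloc_sgs signature are not reproduced here). The allocator hands out consecutive slots and returns NULL instead of walking past the end, leaving the message state unchanged on failure.

```c
#include <stddef.h>

/* Hypothetical scatter/gather entry. */
struct demo_sg {
    void *page;
    unsigned int len;
};

/* Hypothetical message with a fixed pool of preallocated entries. */
struct demo_msg {
    unsigned int total_sgs; /* slots sized from the first pass, at most 16 */
    unsigned int used_sgs;  /* slots handed out so far, never above total_sgs */
    struct demo_sg sgs[16];
};

/*
 * Hand out nents consecutive entries, or NULL if fewer remain.
 * Callers must check for NULL and unwind.
 */
static struct demo_sg *demo_alloc_sgs(struct demo_msg *m, unsigned int nents)
{
    struct demo_sg *first;

    if (nents == 0 || nents > m->total_sgs - m->used_sgs)
        return NULL;
    first = &m->sgs[m->used_sgs];
    m->used_sgs += nents;
    return first;
}
```

If the second pass over the iovecs asks for more entries than the first pass sized, the caller sees NULL and can return an error rather than overrun the array.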
     
  • Change rds_rdma_pages to take a passed-in rds_iovec array instead
    of doing copy_from_user itself.

    Change rds_cmsg_rdma_args to copy rds_iovec array once only. This
    eliminates the possibility of userspace changing it after our
    sanity checks.

    Implement stack-based storage for small numbers of iovecs, based
    on net/socket.c, to save an alloc in the extremely common case.

    Although this patch reduces iovec copies in cmsg_rdma_args to 1,
    we still do another one in rds_rdma_extra_size. Getting rid of
    that one will be trickier, so it'll be a separate patch.

    Signed-off-by: Andy Grover
    Signed-off-by: David S. Miller

    Andy Grover
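
The copy-once pattern with stack storage for small counts might look like the following userspace sketch. All `demo_*` names and the on-stack capacity of 8 are assumptions for illustration, and `memcpy` stands in for copy_from_user. The array is snapshotted exactly once, and both validation and use read only the private copy, so a concurrent writer to the shared array cannot change values after they were checked.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical on-stack capacity for the common small case. */
#define DEMO_FAST_IOVECS 8

/* Hypothetical stand-in for the kernel's struct rds_iovec. */
struct demo_iovec {
    uint64_t addr;
    uint64_t bytes;
};

/*
 * Snapshot the shared array once, then validate and sum the private copy.
 * Returns 0 and stores the byte total on success, -1 on any failure.
 */
static int demo_sum_bytes(const struct demo_iovec *shared, size_t n,
                          uint64_t *total)
{
    struct demo_iovec stack_vec[DEMO_FAST_IOVECS];
    struct demo_iovec *vec = stack_vec;
    uint64_t sum = 0;
    int ret = 0;

    if (n == 0 || n > SIZE_MAX / sizeof(*vec))
        return -1;
    if (n > DEMO_FAST_IOVECS) {
        /* Larger requests fall back to the heap. */
        vec = malloc(n * sizeof(*vec));
        if (!vec)
            return -1;
    }
    /* The single copy; later code never rereads the shared array. */
    memcpy(vec, shared, n * sizeof(*vec));

    for (size_t i = 0; i < n; i++) {
        if (vec[i].bytes == 0 || vec[i].bytes > UINT32_MAX) {
            ret = -1;
            goto out;
        }
        sum += vec[i].bytes;
    }
    *total = sum;
out:
    if (vec != stack_vec)
        free(vec);
    return ret;
}
```

The stack array avoids an allocation when the count is small, and the single exit path frees the heap buffer only when one was actually used.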
     
  • We don't need to set ret = 0 at the end -- it's initialized to 0.

    Also, don't increment s_send_rdma stat if we're exiting with an
    error.

    Signed-off-by: Andy Grover
    Signed-off-by: David S. Miller

    Andy Grover
     
  • rds_cmsg_rdma_args would still return success even if rds_rdma_pages
    returned an error (or overflowed).

    Signed-off-by: Andy Grover
    Signed-off-by: David S. Miller

    Andy Grover
     
  • As reported by Thomas Pollet, the rdma page counting can overflow. We
    get the rdma sizes in 64-bit unsigned entities, but then limit it to
    UINT_MAX bytes and shift them down to pages (so with a possible "+1" for
    an unaligned address).

    So each individual page count fits comfortably in an 'unsigned int' (not
    even close to overflowing into signed), but as they are added up, they
    might end up resulting in a signed return value. Which would be wrong.

    Catch the case of tot_pages turning negative, and return the appropriate
    error code.

    Reported-by: Thomas Pollet
    Signed-off-by: Linus Torvalds
    Signed-off-by: Andy Grover
    Signed-off-by: David S. Miller

    Linus Torvalds
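
A small userspace sketch of the page-count accumulation, with hypothetical `demo_*` names and an assumed 4 KiB page size. One deliberate difference from the description above: the patch catches tot_pages after it has turned negative, whereas this sketch checks before adding, because signed overflow is undefined behaviour in standard C.

```c
#include <limits.h>
#include <stdint.h>

/* Assumed page size for this sketch: 4 KiB. */
#define DEMO_PAGE_SHIFT 12

/* Hypothetical stand-in for the kernel's struct rds_iovec. */
struct demo_iovec {
    uint64_t addr;
    uint64_t bytes;
};

/* Pages spanned by one region, or 0 if the region is invalid. */
static unsigned int demo_pages_in_vec(const struct demo_iovec *v)
{
    uint64_t first, last;

    if (v->bytes == 0 || v->bytes > UINT_MAX)
        return 0;
    if (v->addr + v->bytes < v->addr) /* address range wraps */
        return 0;
    first = v->addr >> DEMO_PAGE_SHIFT;
    last = (v->addr + v->bytes - 1) >> DEMO_PAGE_SHIFT;
    return (unsigned int)(last - first + 1);
}

/* Total pages across all regions, or -1 on an invalid region or overflow. */
static int demo_total_pages(const struct demo_iovec *vec, unsigned int n)
{
    int tot = 0;

    for (unsigned int i = 0; i < n; i++) {
        unsigned int nr = demo_pages_in_vec(&vec[i]);

        if (nr == 0)
            return -1;
        /* Refuse the addition if it would exceed INT_MAX. */
        if (nr > (unsigned int)(INT_MAX - tot))
            return -1;
        tot += (int)nr;
    }
    return tot;
}
```

Each region contributes at most about 2^20 pages, so it takes a couple of thousand maximal regions to reach the limit, which matches the point that the individual counts are fine and only the sum is dangerous.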
     

21 Oct, 2010

2 commits


16 Oct, 2010

1 commit

  • Don't try to "optimize" rds_page_copy_user() by using kmap_atomic() and
    the unsafe atomic user mode accessor functions. It's actually slower
    than the straightforward code on any reasonable modern CPU.

    Back when the code was written (although probably not by the time it was
    actually merged), 32-bit x86 may have been the dominant
    architecture. And there kmap_atomic() can be a lot faster than kmap()
    (unless you have very good locality, in which case the virtual address
    caching by kmap() can overcome all the downsides).

    But these days, x86-64 may not be more populous, but it's getting there
    (and if you care about performance, it's definitely already there -
    you'd have upgraded your CPUs already in the last few years). And on
    x86-64, the non-kmap_atomic() version is faster, simply because the code
    is simpler and doesn't have the "re-try page fault" case.

    People with old hardware are not likely to care about RDS anyway, and
    the optimization for the 32-bit case is simply buggy, since it doesn't
    verify the user addresses properly.

    Reported-by: Dan Rosenberg
    Acked-by: Andrew Morton
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

27 Sep, 2010

1 commit


25 Sep, 2010

1 commit

  • We have for each socket :

    One spinlock (sk_slock.slock)
    One rwlock (sk_callback_lock)

    Possible scenarios are :

    (A) (this is used in net/sunrpc/xprtsock.c)
    read_lock(&sk->sk_callback_lock) (without blocking BH)
    <BH>
    spin_lock(&sk->sk_slock.slock);
    ...
    read_lock(&sk->sk_callback_lock);
    ...

    (B)
    write_lock_bh(&sk->sk_callback_lock)
    stuff
    write_unlock_bh(&sk->sk_callback_lock)

    (C)
    spin_lock_bh(&sk->sk_slock)
    ...
    write_lock_bh(&sk->sk_callback_lock)
    stuff
    write_unlock_bh(&sk->sk_callback_lock)
    spin_unlock_bh(&sk->sk_slock)

    This (C) case conflicts with (A) :

    CPU1 [A]                         CPU2 [C]
    read_lock(callback_lock)
    <BH>                             spin_lock_bh(slock)
    <wait to spin_lock(slock)>
                                     <wait to write_lock_bh(callback_lock)>

    We have one problematic (C) use case in inet_csk_listen_stop() :

    local_bh_disable();
    bh_lock_sock(child); // spin_lock_bh(&sk->sk_slock)
    WARN_ON(sock_owned_by_user(child));
    ...
    sock_orphan(child); // write_lock_bh(&sk->sk_callback_lock)

    lockdep is not happy with this, as reported by Tetsuo Handa

    It seems the only way to deal with this is to use read_lock_bh(callbacklock)
    everywhere.

    Thanks to Jarek for pointing out a bug in my first attempt and suggesting
    this solution.

    Reported-by: Tetsuo Handa
    Tested-by: Tetsuo Handa
    Signed-off-by: Eric Dumazet
    CC: Jarek Poplawski
    Tested-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

20 Sep, 2010

3 commits


09 Sep, 2010

12 commits

  • Add two CMSGs for masked versions of cswp and fadd. The args
    struct is modified to use a union for the arguments of the
    different atomic op types. Change IB to do masked atomic ops. The
    atomic op type in rds_message is similarly unionized.

    Signed-off-by: Andy Grover

    Andy Grover
     
  • This prints the constant identifier for work completion status and rdma
    cm event types, like we already do for IB event types.

    A core string array helper is added that each string type uses.

    Signed-off-by: Zach Brown

    Zach Brown
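
A string array helper of the kind described above could be sketched as follows. The enum, the `DEMO_STR` macro and all `demo_*` names are hypothetical and do not reproduce the RDS code. The helper bounds-checks the index and falls back to a fixed string, so an unexpected value can never index past the table.

```c
#include <stddef.h>

/* Hypothetical event enum, standing in for the real status/event types. */
enum demo_event {
    DEMO_EV_CONNECTED,
    DEMO_EV_DISCONNECTED,
    DEMO_EV_ERROR,
};

/* Shared helper: return the table entry, or "unknown" if out of range. */
static const char *demo_str_array(const char *const *array, size_t elements,
                                  size_t index)
{
    if (index < elements && array[index])
        return array[index];
    return "unknown";
}

/* Builds one table entry whose string is the identifier itself. */
#define DEMO_STR(x) [x] = #x

static const char *const demo_event_strings[] = {
    DEMO_STR(DEMO_EV_CONNECTED),
    DEMO_STR(DEMO_EV_DISCONNECTED),
    DEMO_STR(DEMO_EV_ERROR),
};

/* Per-type wrapper: each string type supplies only its own table. */
static const char *demo_event_str(unsigned int ev)
{
    return demo_str_array(demo_event_strings,
                          sizeof(demo_event_strings) /
                              sizeof(demo_event_strings[0]),
                          ev);
}
```

Using designated initializers keeps each string next to its enum value, so reordering the enum does not silently mislabel events.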
     
  • Nothing was canceling the send and receive work that might have been
    queued as a conn was being destroyed.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • rds_conn_shutdown() can return before the connection is shut down when
    it encounters an existing state that it doesn't understand. This lets
    rds_conn_destroy() then start tearing down the conn from under paths
    that are still using it.

    It's more reliable to queue the shutdown work and wait for krdsd to complete
    the shutdown callback. This stopped some hangs I was seeing where krdsd was
    trying to shut down a freed conn.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • Right now there's nothing to stop the various paths that use
    rs->rs_transport from racing with rmmod and executing freed transport
    code. The simple fix is to have binding to a transport also hold a
    reference to the transport's module, removing this class of races.

    We already had an unused t_owner field which was set for the modular
    transports and which wasn't set for the built-in loop transport.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • rs_transport is now also used by the rdma paths once the socket is
    bound. We don't need this stale comment to tell us what cscope can.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • rds_conn_destroy() can race with all other modifications of the
    rds_conn_count but it was modifying the count without locking.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • The RDS IB device list wasn't protected by any locking. Traversal in
    both the get_mr and FMR flushing paths could race with addition and
    removal.

    List manipulation is done with RCU primitives and is protected by the
    write side of a rwsem. The list traversal in the get_mr fast path is
    protected by a rcu read critical section. The FMR list traversal is
    more problematic because it can block while traversing the list. We
    protect this with the read side of the rwsem.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • It's nice to not have to go digging in the code to see which event
    occurred. It's easy to throw together a quick array that maps the ib
    event enums to their strings. I didn't see anything in the stack that
    does this translation for us, but I also didn't look very hard.

    Signed-off-by: Zach Brown

    Zach Brown
     
  • Flushing FMRs is somewhat expensive, and is currently kicked off when
    the interrupt handler notices that we are getting low. The result of
    this is that FMR flushing only happens from the interrupt cpus.

    This spreads the load more effectively by triggering flushes just before
    we allocate a new FMR.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • This is only needed to keep debugging code from bugging.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • We're seeing bugs today where IB connection shutdown clears the send
    ring while the tasklet is processing completed sends. Implementation
    details cause this to dereference a null pointer. Shutdown needs to
    wait for send completion to stop before tearing down the connection. We
    can't simply wait for the ring to empty because it may contain
    unsignaled sends that will never be processed.

    This patch tracks the number of signaled sends that we've posted and
    waits for them to complete. It also makes sure that the tasklet has
    finished executing.

    Signed-off-by: Zach Brown

    Zach Brown
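
The counting scheme described above can be sketched in userspace with C11 atomics. The `demo_*` names are hypothetical, and the sketch only exposes a predicate; the real shutdown path would sleep until the count reaches zero and also wait for the tasklet, neither of which is modelled here. Only signaled sends are counted, because unsignaled sends never produce a completion to wait for.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical connection state: only the counter matters for this sketch. */
struct demo_conn {
    atomic_int signaled_sends; /* signaled sends posted but not yet completed */
};

/* Post a send; only signaled sends will ever generate a completion. */
static void demo_post_send(struct demo_conn *c, bool signaled)
{
    if (signaled)
        atomic_fetch_add(&c->signaled_sends, 1);
}

/*
 * Handle one send completion. Returns true when it was the last
 * outstanding signaled send, the point at which a waiter could be woken.
 */
static bool demo_complete_send(struct demo_conn *c)
{
    return atomic_fetch_sub(&c->signaled_sends, 1) == 1;
}

/* Shutdown may clear the ring only once no completions are pending. */
static bool demo_can_teardown(struct demo_conn *c)
{
    return atomic_load(&c->signaled_sends) == 0;
}
```

Waiting on this counter rather than on ring emptiness sidesteps the problem that a ring holding only unsignaled sends would never drain.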