27 May, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
    RDMA/cma: Save PID of ID's owner
    RDMA/cma: Add support for netlink statistics export
    RDMA/cma: Pass QP type into rdma_create_id()
    RDMA: Update exported headers list
    RDMA/cma: Export enum cma_state in
    RDMA/nes: Add a check for strict_strtoul()
    RDMA/cxgb3: Don't post zero-byte read if endpoint is going away
    RDMA/cxgb4: Use completion objects for event blocking
    IB/srp: Fix integer -> pointer cast warnings
    IB: Add devnode methods to cm_class and umad_class
    IB/mad: Return EPROTONOSUPPORT when an RDMA device lacks the QP required
    IB/uverbs: Add devnode method to set path/mode
    RDMA/ucma: Add .nodename/.mode to tell userspace where to create device node
    RDMA: Add netlink infrastructure
    RDMA: Add error handling to ib_core_init()

    Linus Torvalds
     

26 May, 2011

5 commits

  • Roland Dreier
     
  • Save the PID associated with an RDMA CM ID for reporting via netlink.

    Signed-off-by: Nir Muchtar
    Signed-off-by: Roland Dreier

    Nir Muchtar
     
  • Add callbacks and data types for statistics export of all current
    devices/ids. The schema for RDMA CM is a series of netlink messages.
    Each one contains an rdma_cm_id_stats struct. Additionally, two netlink
    attributes are created for the addresses for each message (if
    applicable).

    The attribute types used are:
    RDMA_NL_RDMA_CM_ATTR_SRC_ADDR (The source address for this ID)
    RDMA_NL_RDMA_CM_ATTR_DST_ADDR (The destination address for this ID)
    sockaddr_* structs are encapsulated within these attributes.

    In other words, every transaction contains a series of messages like:

    -------message 1-------
    struct rdma_cm_id_stats {
        __u32 qp_num;
        __u32 bound_dev_if;
        __u32 port_space;
        __s32 pid;
        __u8 cm_state;
        __u8 node_type;
        __u8 port_num;
        __u8 reserved;
    };
    RDMA_NL_RDMA_CM_ATTR_SRC_ADDR attribute - contains the source address
    RDMA_NL_RDMA_CM_ATTR_DST_ADDR attribute - contains the destination address
    -------end 1-------
    -------message 2-------
    struct rdma_cm_id_stats
    RDMA_NL_RDMA_CM_ATTR_SRC_ADDR attribute
    RDMA_NL_RDMA_CM_ATTR_DST_ADDR attribute
    -------end 2-------

    Signed-off-by: Nir Muchtar
    Signed-off-by: Roland Dreier

    Nir Muchtar
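
The fixed part of each message above is four 32-bit fields followed by four 8-bit fields. A minimal userspace sketch that mirrors that layout with <stdint.h> types and checks that it packs into 20 bytes with no implicit padding. The name cm_id_stats_mirror is hypothetical; the authoritative definition is the kernel's exported header, not this copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace mirror of struct rdma_cm_id_stats, with
 * <stdint.h> types standing in for __u32/__s32/__u8. */
struct cm_id_stats_mirror {
	uint32_t qp_num;
	uint32_t bound_dev_if;
	uint32_t port_space;
	int32_t  pid;
	uint8_t  cm_state;
	uint8_t  node_type;
	uint8_t  port_num;
	uint8_t  reserved;   /* explicit padding: record ends on a 32-bit boundary */
};

/* 4 * 4 bytes + 4 * 1 byte = 20 bytes; checked at compile time. */
static_assert(offsetof(struct cm_id_stats_mirror, cm_state) == 16,
	      "u8 fields start right after the four 32-bit fields");
static_assert(sizeof(struct cm_id_stats_mirror) == 20,
	      "no implicit padding in the record");
```

Because the record has no implicit padding, a userspace consumer should see the same offsets the kernel writes on common ABIs; that is an assumption about typical compilers, not a guarantee stated in the commit.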
     
  • The RDMA CM currently infers the QP type from the port space selected
    by the user. In the future (eg with RDMA_PS_IB or XRC), there may not
    be a 1-1 correspondence between port space and QP type. For netlink
    export of RDMA CM state, we want to export the QP type to userspace,
    so it is cleaner to explicitly associate a QP type to an ID.

    Modify rdma_create_id() to allow the user to specify the QP type, and
    use it to make our selections of datagram versus connected mode.

    Signed-off-by: Sean Hefty
    Signed-off-by: Roland Dreier

    Sean Hefty
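
A small sketch of the difference between the two approaches, using hypothetical mirror enums rather than the kernel's rdma_port_space and ib_qp_type, and assuming for illustration that the UDP and IPoIB port spaces are the ones that implied datagram mode.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirrors; only values needed for the illustration. */
enum ps_mirror { PS_TCP, PS_UDP, PS_IPOIB, PS_IB };
enum qpt_mirror { QPT_RC, QPT_UD };

/* Old style (sketch): infer datagram mode from the port space. This
 * stops working once one port space can carry several QP types. */
static bool is_datagram_from_ps(enum ps_mirror ps)
{
	return ps == PS_UDP || ps == PS_IPOIB;
}

/* New style (sketch): the caller names the QP type when the ID is
 * created, and datagram vs connected mode follows from it directly. */
static bool is_datagram_from_qpt(enum qpt_mirror qpt)
{
	return qpt == QPT_UD;
}
```

With the QP type stored on the ID, a port space such as PS_IB no longer has to answer the datagram question on its own.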
     
  • Move cma.c's internal definition of enum cma_state to enum rdma_cm_state
    in an exported header so that it can be exported via RDMA netlink.

    Signed-off-by: Nir Muchtar
    Signed-off-by: Roland Dreier

    Nir Muchtar
     

25 May, 2011

3 commits

  • It should check if strict_strtoul() succeeds before using
    'wqm_quanta_value'.

    Signed-off-by: Liu Yuan

    [ Convert to kstrtoul() directly while we're here. - Roland ]

    Signed-off-by: Roland Dreier

    Liu Yuan
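
A userspace analogue of the pattern the fix adopts: parse first, and write the destination only if parsing succeeded. parse_ulong() is a hypothetical helper built on the C library's strtoul(); it is not kstrtoul(), whose exact acceptance rules may differ.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Parse a decimal string into *res. Returns 0 on success or a negative
 * errno value; *res is left untouched on any failure. */
static int parse_ulong(const char *s, unsigned long *res)
{
	char *end;
	unsigned long val;

	/* Reject empty input, signs and leading whitespace, which
	 * strtoul() would otherwise accept. */
	if (*s < '0' || *s > '9')
		return -EINVAL;
	errno = 0;
	val = strtoul(s, &end, 10);
	if (errno == ERANGE)
		return -ERANGE;
	/* Allow one trailing newline, as sysfs writes usually carry one. */
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;
	*res = val;
	return 0;
}
```

A store routine would then return early on a nonzero result, and only assign the parsed value to something like wqm_quanta_value afterwards.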
     
  • tx_ack() wasn't checking the endpoint state and consequently would
    attempt to post the p2p 0B read on an endpoint/QP that is closing or
    aborting. This causes a NULL pointer dereference crash.

    Signed-off-by: Steve Wise
    Signed-off-by: Roland Dreier

    Steve Wise
     
  • There exists a race condition when using wait_queue_head_t objects
    that are declared on the stack. This was being done in a few places
    where we are sending work requests to the FW and awaiting replies, but
    we don't have an endpoint structure with an embedded c4iw_wr_wait
    struct. So the code was allocating it locally on the stack. Bad
    design. The race is:

    1) thread on cpuX declares the wait_queue_head_t on the stack, then
    posts a firmware WR with that wait object ptr as the cookie to be
    returned in the WR reply. This thread will proceed to block in
    wait_event_timeout() but before it does:

    2) An interrupt runs on cpuY with the WR reply. fw6_msg() handles
    this and calls c4iw_wake_up(). c4iw_wake_up() sets the condition
    variable in the c4iw_wr_wait object to TRUE and will call
    wake_up(), but before it calls wake_up():

    3) The thread on cpuX calls c4iw_wait_for_reply(), which calls
    wait_event_timeout(). The wait_event_timeout() macro checks the
    condition variable and returns immediately since it is TRUE. So
    this thread never blocks/sleeps. The function then returns
    effectively deallocating the c4iw_wr_wait object that was on the
    stack.

    4) So at this point cpuY has a pointer to the c4iw_wr_wait object
    that is no longer valid. Furthermore, it's pointing to a stack frame
    that might now be in use by some other context/thread. So cpuY
    continues execution and calls wake_up() on a ptr to a wait object
    that has been effectively deallocated.

    This race, when it hits, can cause a crash in wake_up(), which I've
    seen under heavy stress. It can also corrupt the referenced stack
    which can cause any number of failures.

    The fix:

    Use struct completion, which supports on-stack declarations.
    Completions use a spinlock around setting the condition to true and
    the wake up so that steps 2 and 4 above are atomic and step 3 can
    never happen in-between.

    Signed-off-by: Steve Wise

    Steve Wise
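
The fix relies on the completion keeping "set the condition" and "wake the waiter" inside one critical section. A minimal pthreads sketch of that idea, an analogue rather than the kernel's struct completion: complete_sketch() sets done and broadcasts while holding the mutex, and the waiter must reacquire that mutex before it can return, so it cannot slip in between the two steps. All names are hypothetical, and a thread stands in for the interrupt handler.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct completion_sketch {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	int done;
};

static void init_completion_sketch(struct completion_sketch *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = 0;
}

/* Signalling side: flag and wakeup change under one mutex, so the
 * waiter cannot observe done == 1 between the two (step 3 above). */
static void complete_sketch(struct completion_sketch *c)
{
	pthread_mutex_lock(&c->lock);
	c->done = 1;
	pthread_cond_broadcast(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

/* Waiting side: re-checks done under the same mutex, which handles
 * both spurious wakeups and replies that arrived before the wait. */
static void wait_for_completion_sketch(struct completion_sketch *c)
{
	pthread_mutex_lock(&c->lock);
	while (!c->done)
		pthread_cond_wait(&c->cond, &c->lock);
	pthread_mutex_unlock(&c->lock);
}

/* Stand-in for the handler that delivers the firmware reply. */
static void *reply_thread(void *arg)
{
	complete_sketch(arg);
	return NULL;
}

/* The completion lives on this function's stack. Returns 1 once the
 * reply has been seen, -1 if the helper thread could not start. */
static int send_and_wait(void)
{
	struct completion_sketch wr_done;
	pthread_t t;
	int seen;

	init_completion_sketch(&wr_done);
	if (pthread_create(&t, NULL, reply_thread, &wr_done) != 0)
		return -1;
	wait_for_completion_sketch(&wr_done);
	seen = wr_done.done;
	/* Reap the helper thread before the stack object goes away. */
	pthread_join(t, NULL);
	pthread_cond_destroy(&wr_done.cond);
	pthread_mutex_destroy(&wr_done.lock);
	return seen;
}
```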
     

24 May, 2011

5 commits


23 May, 2011

1 commit

  • After discovering that wide use of prefetch on modern CPUs
    could be a net loss instead of a win, net drivers which were
    relying on the implicit inclusion of prefetch.h via the list
    headers showed up in the resulting cleanup fallout. Give
    them an explicit include via the following $0.02 script.

    =========================================
    #!/bin/bash
    MANUAL=""
    for i in `git grep -l 'prefetch(.*)' .` ; do
        grep -q '<linux/prefetch.h>' $i
        if [ $? = 0 ] ; then
            continue
        fi

        (   echo '?^#include <linux/?a'
            echo '#include <linux/prefetch.h>'
            echo .
            echo w
            echo q
        ) | ed -s $i > /dev/null 2>&1
        if [ $? != 0 ]; then
            echo $i needs manual fixup
            MANUAL="$i $MANUAL"
        fi
    done
    echo ------------------- 8\<----------------------
    echo vi $MANUAL
    =========================================

    [ Fixed up some incorrect #include placements, and added some
    non-network drivers and the fib_trie.c case - Linus ]
    Signed-off-by: Linus Torvalds

    Paul Gortmaker
     

21 May, 2011

3 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1446 commits)
    macvlan: fix panic if lowerdev in a bond
    tg3: Add braces around 5906 workaround.
    tg3: Fix NETIF_F_LOOPBACK error
    macvlan: remove one synchronize_rcu() call
    networking: NET_CLS_ROUTE4 depends on INET
    irda: Fix error propagation in ircomm_lmp_connect_response()
    irda: Kill set but unused variable 'bytes' in irlan_check_command_param()
    irda: Kill set but unused variable 'clen' in ircomm_connect_indication()
    rxrpc: Fix set but unused variable 'usage' in rxrpc_get_transport()
    be2net: Kill set but unused variable 'req' in lancer_fw_download()
    irda: Kill set but unused vars 'saddr' and 'daddr' in irlan_provider_connect_indication()
    atl1c: atl1c_resume() is only used when CONFIG_PM_SLEEP is defined.
    rxrpc: Fix set but unused variable 'usage' in rxrpc_get_peer().
    rxrpc: Kill set but unused variable 'local' in rxrpc_UDP_error_handler()
    rxrpc: Kill set but unused variable 'sp' in rxrpc_process_connection()
    rxrpc: Kill set but unused variable 'sp' in rxrpc_rotate_tx_window()
    pkt_sched: Kill set but unused variable 'protocol' in tc_classify()
    isdn: capi: Use pr_debug() instead of ifdefs.
    tg3: Update version to 3.119
    tg3: Apply rx_discards fix to 5719/5720
    ...

    Fix up trivial conflicts in arch/x86/Kconfig and net/mac80211/agg-tx.c
    as per Davem.

    Linus Torvalds
     
  • Add basic RDMA netlink infrastructure that allows for registration of
    RDMA clients for which data is to be exported and supplies message
    construction callbacks.

    Signed-off-by: Nir Muchtar

    [ Reorganize a few things, add CONFIG_NET dependency. - Roland ]

    Signed-off-by: Roland Dreier

    Roland Dreier
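
A hypothetical model of the registration-and-dispatch idea: each client registers under an index with a table of message-construction callbacks, and an incoming request is routed to one of them. The names (add_client, dispatch, cm_stats_dump) are invented for illustration and are not the kernel's RDMA netlink API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* A callback fills buf with one reply message; returns bytes written. */
typedef int (*dump_cb)(char *buf, size_t len);

struct nl_client {
	int index;           /* client id carried in each request       */
	int nops;            /* number of callbacks in cbs[]            */
	const dump_cb *cbs;  /* callbacks that build the reply messages */
};

#define MAX_CLIENTS 8
static struct nl_client clients[MAX_CLIENTS];
static int nclients;

static int add_client(int index, int nops, const dump_cb *cbs)
{
	for (int i = 0; i < nclients; i++)
		if (clients[i].index == index)
			return -EEXIST;
	if (nclients == MAX_CLIENTS)
		return -ENOSPC;
	clients[nclients++] = (struct nl_client){ index, nops, cbs };
	return 0;
}

/* Route a request to client `index`, operation `op`. */
static int dispatch(int index, int op, char *buf, size_t len)
{
	for (int i = 0; i < nclients; i++)
		if (clients[i].index == index)
			return (op >= 0 && op < clients[i].nops) ?
				clients[i].cbs[op](buf, len) : -EINVAL;
	return -EINVAL;
}

/* Example callback for a hypothetical CM statistics dump. */
static int cm_stats_dump(char *buf, size_t len)
{
	if (len < 1)
		return -EINVAL;
	buf[0] = 'S';        /* placeholder for a real message */
	return 1;
}

static const dump_cb cm_cbs[] = { cm_stats_dump };
```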
     
  • Fail RDMA midlayer initialization if sysfs setup fails.

    Signed-off-by: Nir Muchtar
    Signed-off-by: Roland Dreier

    Nir Muchtar
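
The usual kernel idiom for this kind of fix is goto-based unwinding: when a later step fails, undo the earlier ones in reverse order and return the error. A sketch with hypothetical stub steps, where fail_at selects which one fails; the step names in the comments are assumptions, not a description of ib_core_init().

```c
#include <assert.h>

static int fail_at;   /* which step should fail: 0 = none */
static int live;      /* number of steps currently set up  */

/* Hypothetical setup step: fails if n == fail_at. */
static int step(int n)
{
	if (n == fail_at)
		return -1;
	live++;
	return 0;
}

static void undo(void)
{
	live--;
}

static int core_init(void)
{
	int ret;

	ret = step(1);            /* e.g. allocate a workqueue   */
	if (ret)
		return ret;
	ret = step(2);            /* e.g. sysfs setup            */
	if (ret)
		goto err_step1;
	ret = step(3);            /* e.g. some later subsystem   */
	if (ret)
		goto err_step2;
	return 0;

err_step2:
	undo();                   /* tear down step 2 */
err_step1:
	undo();                   /* tear down step 1 */
	return ret;
}
```

The labels fall through in reverse order, so each failure point releases exactly what was set up before it.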
     

20 May, 2011

1 commit


19 May, 2011

1 commit


12 May, 2011

2 commits


11 May, 2011

1 commit


10 May, 2011

10 commits

  • The IW_CM_EVENT_STATUS_xxx values were used in only a couple of places;
    cma.c uses -Exxx values instead, and so do the amso1100, cxgb3 and cxgb4
    drivers -- only nes was using the enum values (with the mild consequence
    that all nes connection failures were treated as generic errors rather
    than reported as timeouts or rejections).

    We can fix this confusion by getting rid of enum iw_cm_event_status and
    using a plain int for struct iw_cm_event.status, and converting nes to
    use -Exxx as the other iWARP drivers do.

    This also gets rid of the warning

    drivers/infiniband/core/cma.c: In function 'cma_iw_handler':
    drivers/infiniband/core/cma.c:1333:3: warning: case value '4294967185' not in enumerated type 'enum iw_cm_event_status'
    drivers/infiniband/core/cma.c:1336:3: warning: case value '4294967186' not in enumerated type 'enum iw_cm_event_status'
    drivers/infiniband/core/cma.c:1332:3: warning: case value '4294967192' not in enumerated type 'enum iw_cm_event_status'

    Signed-off-by: Roland Dreier
    Reviewed-by: Steve Wise
    Reviewed-by: Sean Hefty
    Reviewed-by: Faisal Latif

    Roland Dreier
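
The odd case values in those warnings are negative errno constants converted to unsigned 32-bit. With Linux's numbering, -ECONNRESET (-104), -ETIMEDOUT (-110) and -ECONNREFUSED (-111) become 2^32 - 104 = 4294967192, 4294967186 and 4294967185. Because the enum has only non-negative values, gcc gives it an unsigned underlying type and converts the case labels accordingly. A small sketch reproducing the arithmetic and switching on a plain int status; the string mapping is illustrative only.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* A negative errno viewed as unsigned 32-bit wraps modulo 2^32,
 * e.g. -111 -> 4294967296 - 111 = 4294967185. */
static uint32_t as_u32(int status)
{
	return (uint32_t)status;
}

/* With status declared as plain int, -Exxx case labels need no
 * conversion and no warning is produced. Strings are illustrative. */
static const char *iw_status_name(int status)
{
	switch (status) {
	case 0:
		return "ok";
	case -ECONNRESET:
	case -ECONNREFUSED:
		return "rejected";
	case -ETIMEDOUT:
		return "timeout";
	default:
		return "error";
	}
}
```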
     
  • Commit 44c10138fd4b ("PCI: Change all drivers to use pci_device->revision")
    already converted this driver to using the revision field of struct
    pci_dev but commit bb9171448deb ("IB/ipath: Misc changes to prepare
    for IB7220 introduction") later reverted that change for some strange
    reason. Restore the change.

    Signed-off-by: Sergei Shtylyov
    Acked-by: Mike Marciniszyn
    Signed-off-by: Roland Dreier

    Sergei Shtylyov
     
  • The time limit test now correctly checks against current jiffies to
    avoid the hang.

    Signed-off-by: Mitko Haralanov
    Signed-off-by: Mike Marciniszyn
    Signed-off-by: Roland Dreier

    Mitko Haralanov
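
A sketch of a hang-free time limit in the style of the kernel's time_after() macro: compute the limit once, then compare it against the current tick count on every iteration using a signed difference, so counter wraparound is handled. fake_jiffies and poll_with_limit() are hypothetical stand-ins; the driver's actual loop is not shown in the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* True if tick count a is later than b, even across wraparound. */
static bool ticks_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

static unsigned long fake_jiffies;   /* hypothetical tick counter */

static bool never_ready(void)
{
	return false;
}

/* Poll until ready() or until `timeout` ticks pass. The limit is fixed
 * once; each check uses the *current* tick count, so the loop ends.
 * Returns true on timeout. */
static bool poll_with_limit(bool (*ready)(void), unsigned long timeout)
{
	unsigned long limit = fake_jiffies + timeout;

	while (!ready()) {
		if (ticks_after(fake_jiffies, limit))
			return true;
		fake_jiffies++;   /* simulate a timer tick */
	}
	return false;
}
```

Starting fake_jiffies just below the wrap point is a quick way to check that the signed comparison still terminates the loop.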
     
  • A few more EEH fixes:

    c4iw_wait_for_reply(): detect fatal EEH condition on timeout and
    return an error.

    The iw_cxgb4 driver was only calling ib_deregister_device() on an EEH
    event followed by an ib_register_device() when the device was
    reinitialized. However, the RDMA core doesn't allow multiple
    iterations of register/deregister by the provider. See
    drivers/infiniband/core/sysfs.c: ib_device_unregister_sysfs() where
    the kobject ref is held until the device is deallocated in
    ib_deallocate_device(). Calling deregister adds this kobj reference,
    and then a subsequent register call will generate a WARN_ON() from the
    kobject subsystem because the kobject is being initialized but is
    already initialized with the ref held.

    So the provider must deregister and dealloc when resetting for an EEH
    event, then alloc/register to re-initialize. To do this, we cannot
    use the device ptr as our ULD handle since it will change with each
    reallocation. This commit adds a ULD context struct which is used as
    the ULD handle, and then contains the device pointer and other state
    needed.

    Signed-off-by: Steve Wise
    Signed-off-by: Roland Dreier

    Steve Wise
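
A sketch of the handle indirection, with hypothetical names: the lower-level driver holds one stable context pointer, and the RDMA device it refers to is freed and reallocated across a reset, so the ULD handle never goes stale. malloc()/free() stand in for the real allocate/register and deregister/deallocate calls.

```c
#include <assert.h>
#include <stdlib.h>

struct rdma_dev_sketch {
	int generation;               /* which allocation this is */
};

/* The stable ULD handle: outlives every device allocation. */
struct uld_ctx_sketch {
	struct rdma_dev_sketch *dev;  /* NULL while torn down */
	int resets;                   /* other state survives resets */
};

static int dev_up(struct uld_ctx_sketch *ctx)
{
	ctx->dev = malloc(sizeof(*ctx->dev));   /* "alloc + register" */
	if (!ctx->dev)
		return -1;
	ctx->dev->generation = ctx->resets;
	return 0;
}

static void dev_down(struct uld_ctx_sketch *ctx)
{
	free(ctx->dev);                          /* "deregister + dealloc" */
	ctx->dev = NULL;
}

/* EEH-style reset: full teardown, then a fresh device. The ctx pointer
 * handed to the lower driver does not change. */
static int dev_reset(struct uld_ctx_sketch *ctx)
{
	dev_down(ctx);
	ctx->resets++;
	return dev_up(ctx);
}
```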
     
  • The driver was never really waiting for RDMA_WR/FINI completions
    because the condition variable used to determine if the completion
    happened was never reset, and this condition variable is reused for
    both connection setup and teardown. This causes various driver
    crashes under heavy loads due to releasing resources too early.

    The fix is to use atomic bits to correctly reset the condition
    immediately after the completion is detected.

    Signed-off-by: Steve Wise
    Signed-off-by: Roland Dreier

    Steve Wise
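
One way to express the fix's idea with C11 atomics, as analogues of the kernel's set_bit()/test_and_clear_bit(): the completion path sets a bit, and the waiter tests and clears it in one atomic step, so the flag is already reset when the same endpoint later waits for a different event. The bit names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define RDMA_WR_DONE (1u << 0)   /* hypothetical bit assignments */
#define FINI_DONE    (1u << 1)

static atomic_uint ep_flags;

/* Completion path: record that the event happened. */
static void mark_done(unsigned bit)
{
	atomic_fetch_or(&ep_flags, bit);
}

/* Waiter: returns true once per mark_done() of the same bit, and
 * clears the bit in the same atomic operation. */
static bool test_and_clear(unsigned bit)
{
	return atomic_fetch_and(&ep_flags, ~bit) & bit;
}
```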
     
    Parens are missing: '|' has a higher precedence than '?'.

    Signed-off-by: Roel Kluin
    Acked-by: Steve Wise
    Signed-off-by: Roland Dreier

    Roel Kluin
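
A short illustration of this bug class, with made-up flag values: without parentheses, base | cond is evaluated first and used as the condition of ?:, so the bits in base are silently dropped.

```c
#include <assert.h>

#define FLAG_A 0x10u   /* arbitrary illustrative values */
#define FLAG_B 0x20u

/* Parsed as (base | cond) ? FLAG_A : FLAG_B. */
static unsigned buggy(unsigned base, int cond)
{
	return base | cond ? FLAG_A : FLAG_B;
}

/* What was intended: OR one of the flags into base. */
static unsigned fixed(unsigned base, int cond)
{
	return base | (cond ? FLAG_A : FLAG_B);
}
```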
     
  • c4iw_uld_add() must return ERR_PTR() values instead of NULL on failure.

    Signed-off-by: Steve Wise
    Signed-off-by: Roland Dreier

    Steve Wise
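
For context, a userspace re-creation of the kernel's ERR_PTR()/IS_ERR()/PTR_ERR() convention, modelled on include/linux/err.h with MAX_ERRNO of 4095: error codes are encoded in the last page of the address space. A caller that checks IS_ERR() treats NULL as a valid pointer, which is why a NULL return on failure slips through. uld_add_sketch() is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_ERRNO 4095

static inline void *err_ptr(long error)
{
	return (void *)error;
}

static inline long ptr_err(const void *ptr)
{
	return (long)ptr;
}

/* Errors occupy the top MAX_ERRNO addresses; NULL is not among them. */
static inline bool is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical add() hook whose caller checks is_err(), not NULL. */
static int dummy_dev;
static void *uld_add_sketch(bool fail)
{
	if (fail)
		return err_ptr(-ENOMEM);
	return &dummy_dev;
}
```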
     
    Concurrent ingress CLOSE and ULP ABORT operations cause a crash due
    to a race condition where the close path releases the EP lock and then
    tries to move the QP state to CLOSED. This must be done inside the EP
    lock to avoid the race.

    Signed-off-by: Steve Wise
    Signed-off-by: Roland Dreier

    Steve Wise
     
  • Lustre requires that clients bind to a privileged port number before
    connecting to a remote server. On larger clusters (typically more
    than about 1000 nodes), the number of privileged ports is exhausted,
    resulting in lustre being unusable.

    To handle this, we add support for reusable addresses to the rdma_cm.
    This mimics the behavior of the socket option SO_REUSEADDR. A user
    may set an rdma_cm_id to reuse an address before calling
    rdma_bind_addr() (explicitly or implicitly). If set, other
    rdma_cm_id's may be bound to the same address, provided that they all
    have reuse enabled, and there are no active listens.

    If rdma_listen() is called on an rdma_cm_id that has reuse enabled, it
    will only succeed if there are no other id's bound to that same
    address. The reuse option is exported to user space. The behavior of
    the kernel reuse implementation was verified against that given by
    sockets.

    This patch is derived from a patch by Ira Weiny

    Signed-off-by: Sean Hefty
    Signed-off-by: Roland Dreier

    Hefty, Sean
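
A hypothetical model of the sharing rules as stated above, for IDs bound to one address: a new binding succeeds only if it and every existing ID have reuse enabled and none is listening, and a listen succeeds only if no other ID is bound. This illustrates the rules; it is not the rdma_cm code.

```c
#include <assert.h>
#include <stdbool.h>

struct bound_id {
	bool reuse;
	bool listening;
};

#define MAX_BOUND 16
struct addr_binding {            /* all IDs bound to one address */
	struct bound_id ids[MAX_BOUND];
	int n;
};

/* Returns the new slot index, or -1 if the bind is refused. */
static int bind_id(struct addr_binding *b, bool reuse)
{
	if (b->n == MAX_BOUND)
		return -1;
	for (int i = 0; i < b->n; i++)
		if (!reuse || !b->ids[i].reuse || b->ids[i].listening)
			return -1;
	b->ids[b->n] = (struct bound_id){ .reuse = reuse, .listening = false };
	return b->n++;
}

/* Listen succeeds only if the caller's ID (assumed to be a valid slot)
 * is the only one bound to the address. */
static bool listen_id(struct addr_binding *b, int slot)
{
	if (b->n != 1)
		return false;
	b->ids[slot].listening = true;
	return true;
}
```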
     
  • cma_use_port() assumes that the sockaddr is an IPv4 address. Since
    IPv6 addressing is supported (and also to support other address
    families) make the code more generic in its address handling.

    Signed-off-by: Sean Hefty
    Signed-off-by: Roland Dreier

    Hefty, Sean
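
A sketch of family-aware address handling: dispatch on sa_family and cast to sockaddr_in or sockaddr_in6 only after checking it, instead of assuming IPv4. get_port() is a hypothetical helper using the standard socket headers; cma_use_port() itself is not reproduced here.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Returns the port in host byte order, or -1 for an unsupported
 * address family. */
static int get_port(const struct sockaddr *sa)
{
	switch (sa->sa_family) {
	case AF_INET:
		return ntohs(((const struct sockaddr_in *)sa)->sin_port);
	case AF_INET6:
		return ntohs(((const struct sockaddr_in6 *)sa)->sin6_port);
	default:
		return -1;
	}
}
```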
     

04 May, 2011

1 commit


30 Apr, 2011

1 commit

  • This updates the network drivers so that they don't access the
    ethtool_cmd::speed field directly, but use ethtool_cmd_speed()
    instead.

    For most of the drivers, these changes are purely cosmetic and don't
    fix any problem, such as for those 1GbE/10GbE drivers that indirectly
    call their own ethtool get_settings()/mii_ethtool_gset(). The changes
    are meant to enforce code consistency and provide robustness with
    future larger throughputs, at the expense of a few CPU cycles for each
    ethtool operation.

    All drivers compiled with make allyesconfig on x86_64 have been
    updated.

    Tested: make allyesconfig on x86_64 + e1000e/bnx2x work
    Signed-off-by: David Decotigny
    Signed-off-by: David S. Miller

    David Decotigny
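
The accessor matters because the speed in Mb/s is split across two 16-bit fields, speed and speed_hi. A userspace sketch of the same arithmetic as the kernel's ethtool_cmd_speed(), using a hypothetical two-field mirror rather than the full struct ethtool_cmd.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of just the two speed fields. */
struct speed_fields {
	uint16_t speed;      /* low 16 bits of the speed in Mb/s  */
	uint16_t speed_hi;   /* high 16 bits of the speed in Mb/s */
};

/* Combine the halves, as ethtool_cmd_speed() does. */
static uint32_t cmd_speed(const struct speed_fields *ep)
{
	return ((uint32_t)ep->speed_hi << 16) | ep->speed;
}

/* Split a 32-bit speed into the two halves. */
static void cmd_speed_set(struct speed_fields *ep, uint32_t speed)
{
	ep->speed = (uint16_t)speed;
	ep->speed_hi = (uint16_t)(speed >> 16);
}
```

100000 Mb/s does not fit in 16 bits and lands as speed_hi = 1, speed = 34464, so a driver reading .speed directly would report 34464 for such a link.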
     

27 Apr, 2011

2 commits


20 Apr, 2011

2 commits


31 Mar, 2011

1 commit