11 Jul, 2007

2 commits

  • The IB CM should include the HCA ACK delay when calculating the local
    ACK timeout value to use for RC QPs. If the HCA ACK delay is large
    enough relative to the packet life time, then if it is not taken into
    account, the calculated timeout value ends up being too small, which
    can result in "retry exceeded" errors.

    Signed-off-by: Sean Hefty
    Signed-off-by: Roland Dreier

    Sean Hefty
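
The effect described in the commit above can be sketched in a few lines of C. This is a minimal illustration, not the CM's actual code: it assumes that timeouts are encoded as exponents (value n meaning 4.096 us * 2^n), that the round trip is twice the packet life time, and that the result is capped at 31. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumption: an encoded value n means 4.096 us * 2^n. Working in units
 * of 4.096 us, an encoded value n is simply 2^n.
 */
static uint64_t decode_units(uint8_t encoded)
{
	assert(encoded < 63);	/* keep the shift and later sums well defined */
	return (uint64_t)1 << encoded;
}

/* Smallest encoded value whose duration covers `units`, capped at 31. */
static uint8_t encode_units(uint64_t units)
{
	uint8_t n = 0;

	while (n < 31 && ((uint64_t)1 << n) < units)
		n++;
	return n;
}

/* Hypothetical helper: round trip (2 * packet life time) plus ACK delay. */
static uint8_t local_ack_timeout(uint8_t packet_life_time,
				 uint8_t hca_ack_delay)
{
	uint64_t round_trip = 2 * decode_units(packet_life_time);

	return encode_units(round_trip + decode_units(hca_ack_delay));
}

/* For comparison: the same calculation ignoring the HCA ACK delay. */
static uint8_t local_ack_timeout_no_delay(uint8_t packet_life_time)
{
	return encode_units(2 * decode_units(packet_life_time));
}
```

With a packet life time of 10 and an HCA ACK delay of 15, the delay dominates the round trip, so ignoring it yields a much smaller encoded timeout than including it.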
     
  • MADs sent to the SA should use the default P_Key (0x7fff/0xffff).
    There's no requirement that the default P_Key is stored at index 0 in
    the local P_Key table, so add code to the sa_query module to look up
    the index of the default P_Key when creating an address handle for the
    SA (which is done any time the P_Key table might change), and use this
    index for all SA queries.

    Signed-off-by: Sean Hefty
    Signed-off-by: Roland Dreier

    Sean Hefty
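
The lookup described above amounts to a linear search of the P_Key table. The sketch below is a standalone illustration with a hypothetical function name; it assumes that the low 15 bits identify the partition and the top bit is the membership bit, which is why 0x7fff and 0xffff both count as the default P_Key.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PKEY_BASE_MASK 0x7fff	/* assumption: low 15 bits name the partition */

/*
 * Return the index of the first table entry whose partition matches `pkey`
 * (ignoring the membership bit), or -1 if there is no such entry.
 */
static int find_pkey_index(const uint16_t *table, size_t len, uint16_t pkey)
{
	size_t i;

	assert(table != NULL || len == 0);
	for (i = 0; i < len; i++)
		if ((table[i] & PKEY_BASE_MASK) == (pkey & PKEY_BASE_MASK))
			return (int)i;
	return -1;
}
```

A caller would repeat the search whenever the table might have changed and cache the resulting index, rather than assuming index 0.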
     

22 May, 2007

2 commits

  • * 'for-linus' of master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband:
    IB/cm: Improve local id allocation
    IPoIB/cm: Fix SRQ WR leak
    IB/ipoib: Fix typos in error messages
    IB/mlx4: Check if SRQ is full when posting receive
    IB/mlx4: Pass send queue sizes from userspace to kernel
    IB/mlx4: Fix check of opcode in mlx4_ib_post_send()
    mlx4_core: Fix array overrun in dump_dev_cap_flags()
    IB/mlx4: Fix RESET to RESET and RESET to ERROR transitions
    IB/mthca: Fix RESET to ERROR transition
    IB/mlx4: Set GRH:HopLimit when sending globally routed MADs
    IB/mthca: Set GRH:HopLimit when building MLX headers
    IB/mlx4: Fix check of max_qp_dest_rdma in modify QP
    IB/mthca: Fix use-after-free on device restart
    IB/ehca: Return proper error code if register_mr fails
    IPoIB: Handle P_Key table reordering
    IB/core: Use start_port() and end_port()
    IB/core: Add helpers for uncached GID and P_Key searches
    IB/ipath: Fix potential deadlock with multicast spinlocks
    IB/core: Free umem when mm is already gone

    Linus Torvalds
     
  • The first thing mm.h does is include sched.h, solely for the can_do_mlock()
    inline function, which dereferences "current". By dealing with can_do_mlock(),
    mm.h can be detached from sched.h, which is good. See below for why.

    This patch
    a) removes unconditional inclusion of sched.h from mm.h
    b) makes can_do_mlock() normal function in mm/mlock.c
    c) exports can_do_mlock() to not break compilation
    d) adds sched.h inclusions back to files that were getting it indirectly.
    e) adds less bloated headers to some files (asm/signal.h, jiffies.h) that were
    getting them indirectly

    Net result is:
    a) mm.h users would get less code to open, read, preprocess, parse, ... if
    they don't need sched.h
    b) sched.h stops being dependency for significant number of files:
    on x86_64 allmodconfig touching sched.h results in recompile of 4083 files,
    after patch it's only 3744 (-8.3%).

    Cross-compile tested on

    all arm defconfigs, all mips defconfigs, all powerpc defconfigs,
    alpha alpha-up
    arm
    i386 i386-up i386-defconfig i386-allnoconfig
    ia64 ia64-up
    m68k
    mips
    parisc parisc-up
    powerpc powerpc-up
    s390 s390-up
    sparc sparc-up
    sparc64 sparc64-up
    um-x86_64
    x86_64 x86_64-up x86_64-defconfig x86_64-allnoconfig

    as well as my two usual configs.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

19 May, 2007

1 commit


09 May, 2007

2 commits

  • When memory pinned with ib_umem_get() is released, ib_umem_release()
    needs to subtract the amount of memory being unpinned from
    mm->locked_vm. However, ib_umem_release() may be called with
    mm->mmap_sem already held for writing if the memory is being released
    as part of an munmap() call, so it is sometimes necessary to defer
    this accounting into a workqueue.

    However, the work struct used to defer this accounting is dynamically
    allocated before it is queued, so there is the possibility of failing
    that allocation. If the allocation fails, then ib_umem_release has no
    choice except to bail out and leave the process with a permanently
    elevated locked_vm.

    Fix this by allocating the structure to defer accounting as part of
    the original struct ib_umem, so there's no possibility of failing a
    later allocation if creating the struct ib_umem and pinning memory
    succeeds.

    Signed-off-by: Roland Dreier

    Roland Dreier
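
The fix above follows a general pattern: embed the deferred-work item in the object it belongs to, so the release path never allocates. The sketch below shows that pattern in plain userspace C. All types and names are hypothetical stand-ins (a counter for mm->locked_vm, a linked list for the workqueue), not the kernel's ib_umem code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the process's locked_vm counter. */
struct mock_mm {
	unsigned long locked_vm;
};

/* Work item; embedded in the region rather than allocated at release. */
struct deferred_acct {
	struct deferred_acct *next;
};

struct pinned_region {
	struct mock_mm *mm;
	unsigned long npages;
	struct deferred_acct acct;
};

static struct deferred_acct *pending;	/* simplistic stand-in workqueue */

/* The only allocation happens here; if it fails, nothing was pinned. */
static struct pinned_region *region_get(struct mock_mm *mm,
					unsigned long npages)
{
	struct pinned_region *r = malloc(sizeof(*r));

	if (!r)
		return NULL;
	r->mm = mm;
	r->npages = npages;
	mm->locked_vm += npages;
	return r;
}

/* Release cannot fail: the deferred path only links the embedded item. */
static void region_release(struct pinned_region *r, int must_defer)
{
	assert(r != NULL);
	if (!must_defer) {
		r->mm->locked_vm -= r->npages;
		free(r);
		return;
	}
	r->acct.next = pending;
	pending = &r->acct;
}

/* Runs later, outside the context that forced the deferral. */
static void run_deferred(void)
{
	while (pending) {
		struct deferred_acct *a = pending;
		struct pinned_region *r = (struct pinned_region *)
			((char *)a - offsetof(struct pinned_region, acct));

		pending = a->next;
		r->mm->locked_vm -= r->npages;
		free(r);	/* the region owns its embedded work item */
	}
}
```

Because the region outlives the release call on the deferred path, the deferred handler is the one that frees it.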
     
  • Export ib_umem_get()/ib_umem_release() and put low-level drivers in
    control of when to call ib_umem_get() to pin and DMA map userspace,
    rather than always calling it in ib_uverbs_reg_mr() before calling the
    low-level driver's reg_user_mr method.

    Also move these functions to be in the ib_core module instead of
    ib_uverbs, so that driver modules using them do not depend on
    ib_uverbs.

    This has a number of advantages:
    - It is better design from the standpoint of making generic code a
    library that can be used or overridden by device-specific code as
    the details of specific devices dictate.
    - Drivers that do not need to pin userspace memory regions do not
    need to take the performance hit of calling ib_umem_get(). For
    example, although I have not tried to implement it in this patch,
    the ipath driver should be able to avoid pinning memory and just
    use copy_{to,from}_user() to access userspace memory regions.
    - Buffers that need special mapping treatment can be identified by
    the low-level driver. For example, it may be possible to solve
    some Altix-specific memory ordering issues with mthca CQs in
    userspace by mapping CQ buffers with extra flags.
    - Drivers that need to pin and DMA map userspace memory for things
    other than memory regions can use ib_umem_get() directly, instead
    of hacks using extra parameters to their reg_phys_mr method. For
    example, the mlx4 driver that is pending being merged needs to pin
    and DMA map QP and CQ buffers, but it does not need to create a
    memory key for these buffers. So the cleanest solution is for mlx4
    to call ib_umem_get() in the create_qp and create_cq methods.

    Signed-off-by: Roland Dreier

    Roland Dreier
     

08 May, 2007

1 commit

  • * 'for-linus' of master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband:
    IPoIB: Convert to NAPI
    IB: Return "maybe missed event" hint from ib_req_notify_cq()
    IB: Add CQ comp_vector support
    IB/ipath: Fix a race condition when generating ACKs
    IB/ipath: Fix two more spin lock problems
    IB/fmr_pool: Add prefix to all printks
    IB/srp: Set proc_name
    IB/srp: Add orig_dgid sysfs attribute to scsi_host
    IPoIB/cm: Don't crash if remote side uses one QP for both directions
    RDMA/cxgb3: Support for new abort logic
    RDMA/cxgb3: Initialize cpu_idx field in cpl_close_listserv_req message
    RDMA/cxgb3: Fail qp creation if the requested max_inline is too large
    RDMA/cxgb3: Fix TERM codes
    IPoIB/cm: Fix error handling in ipoib_cm_dev_open()
    IB/ipath: Don't corrupt pending mmap list when unmapped objects are freed
    IB/mthca: Work around kernel QP starvation
    IB/ipath: Don't put QP in timeout queue if waiting to send
    IB/ipath: Don't call spin_lock_irq() from interrupt context

    Linus Torvalds
     

07 May, 2007

2 commits

  • The semantics defined by the InfiniBand specification say that
    completion events are only generated when a completion is added to a
    completion queue (CQ) after completion notification is requested. In
    other words, this means that the following race is possible:

      while (CQ is not empty)
        ib_poll_cq(CQ);
      // new completion is added after while loop is exited
      ib_req_notify_cq(CQ);
      // no event is generated for the existing completion

    To close this race, the IB spec recommends doing another poll of the
    CQ after requesting notification.

    However, it is not always possible to arrange code this way (for
    example, we have found that NAPI for IPoIB cannot poll after
    requesting notification). Also, some hardware (eg Mellanox HCAs)
    actually will generate an event for completions added before the call
    to ib_req_notify_cq() -- which is allowed by the spec, since there's
    no way for any upper-layer consumer to know exactly when a completion
    was really added -- so the extra poll of the CQ is just a waste.

    Motivated by this, we add a new flag "IB_CQ_REPORT_MISSED_EVENTS" for
    ib_req_notify_cq() so that it can return a hint about whether a
    completion may have been added before the request for notification.
    The return value of ib_req_notify_cq() is extended so:

      < 0   means an error occurred while requesting notification
      == 0  means notification was requested successfully, and if
            IB_CQ_REPORT_MISSED_EVENTS was passed in, then no
            events were missed and it is safe to wait for another
            event.
      > 0   is only returned if IB_CQ_REPORT_MISSED_EVENTS was
            passed in. It means that the consumer must poll the
            CQ again to make sure it is empty to avoid the race
            described above.

    We add a flag to enable this behavior rather than turning it on
    unconditionally, because checking for missed events may incur
    significant overhead for some low-level drivers, and consumers that
    don't care about the results of this test shouldn't be forced to pay
    for the test.

    Signed-off-by: Roland Dreier

    Roland Dreier
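
A consumer can use the extended return value described above to drain a CQ without losing a late completion. The sketch below runs against a mock CQ, not the real verbs API: the mock types, the flag name MOCK_REPORT_MISSED_EVENTS, and the way the race is simulated are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_REPORT_MISSED_EVENTS 0x1	/* hypothetical flag value */

/* Mock CQ: `late` completions arrive just after the CQ is seen empty. */
struct mock_cq {
	int pending;
	int late;
	int armed;
};

/* Returns 1 if a completion was consumed, 0 if the CQ looked empty. */
static int mock_poll_cq(struct mock_cq *cq)
{
	assert(cq != NULL);
	if (cq->pending > 0) {
		cq->pending--;
		return 1;
	}
	/* Simulate the race: new completions land after the empty poll. */
	cq->pending += cq->late;
	cq->late = 0;
	return 0;
}

/* Returns > 0 if the flag is set and a completion may have been missed. */
static int mock_req_notify_cq(struct mock_cq *cq, int flags)
{
	assert(cq != NULL);
	cq->armed = 1;
	if ((flags & MOCK_REPORT_MISSED_EVENTS) && cq->pending > 0)
		return 1;
	return 0;
}

/* Poll until empty, re-polling whenever the notify call hints at a miss. */
static int drain_cq(struct mock_cq *cq)
{
	int handled = 0;

	do {
		while (mock_poll_cq(cq) > 0)
			handled++;
	} while (mock_req_notify_cq(cq, MOCK_REPORT_MISSED_EVENTS) > 0);

	return handled;
}
```

The outer loop only repeats when the hint is positive, so a consumer on hardware that never misses events pays for no extra poll.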
     
  • Add a num_comp_vectors member to struct ib_device and extend
    ib_create_cq() to pass in a comp_vector parameter -- this parallels
    the userspace libibverbs API. Update all hardware drivers to set
    num_comp_vectors to 1 and have all ULPs pass 0 for the comp_vector
    value. Pass the value of num_comp_vectors to userspace rather than
    hard-coding a value of 1.

    We want multiple CQ event vector support (via MSI-X or similar for
    adapters that can generate multiple interrupts), but it's not clear
    how many vectors we want, or how we want to deal with policy issues
    such as how to decide which vector to use or how to set up interrupt
    affinity. This patch is useful for experimenting, since no core
    changes will be necessary when updating a driver to support multiple
    vectors, and we know that we want to make at least these changes
    anyway.

    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: Roland Dreier

    Michael S. Tsirkin
     

03 May, 2007

1 commit

  • I noticed that many source files include <linux/pci.h> while they do
    not appear to need it. Here is an attempt to clean it all up.

    In order to find all possibly affected files, I searched for all
    files including <linux/pci.h> but without any other occurrence of "pci"
    or "PCI". I removed the include statement from all of these, then I
    compiled an allmodconfig kernel on both i386 and x86_64 and fixed the
    false positives manually.

    My tests covered 66% of the affected files, so there could be false
    positives remaining. Untested files are:

    arch/alpha/kernel/err_common.c
    arch/alpha/kernel/err_ev6.c
    arch/alpha/kernel/err_ev7.c
    arch/ia64/sn/kernel/huberror.c
    arch/ia64/sn/kernel/xpnet.c
    arch/m68knommu/kernel/dma.c
    arch/mips/lib/iomap.c
    arch/powerpc/platforms/pseries/ras.c
    arch/ppc/8260_io/enet.c
    arch/ppc/8260_io/fcc_enet.c
    arch/ppc/8xx_io/enet.c
    arch/ppc/syslib/ppc4xx_sgdma.c
    arch/sh64/mach-cayman/iomap.c
    arch/xtensa/kernel/xtensa_ksyms.c
    arch/xtensa/platform-iss/setup.c
    drivers/i2c/busses/i2c-at91.c
    drivers/i2c/busses/i2c-mpc.c
    drivers/media/video/saa711x.c
    drivers/misc/hdpuftrs/hdpu_cpustate.c
    drivers/misc/hdpuftrs/hdpu_nexus.c
    drivers/net/au1000_eth.c
    drivers/net/fec_8xx/fec_main.c
    drivers/net/fec_8xx/fec_mii.c
    drivers/net/fs_enet/fs_enet-main.c
    drivers/net/fs_enet/mac-fcc.c
    drivers/net/fs_enet/mac-fec.c
    drivers/net/fs_enet/mac-scc.c
    drivers/net/fs_enet/mii-bitbang.c
    drivers/net/fs_enet/mii-fec.c
    drivers/net/ibm_emac/ibm_emac_core.c
    drivers/net/lasi_82596.c
    drivers/parisc/hppb.c
    drivers/sbus/sbus.c
    drivers/video/g364fb.c
    drivers/video/platinumfb.c
    drivers/video/stifb.c
    drivers/video/valkyriefb.c
    include/asm-arm/arch-ixp4xx/dma.h
    sound/oss/au1550_ac97.c

    I would welcome test reports for these files. I am fine with removing
    the untested files from the patch if the general opinion is that these
    changes aren't safe. The tested part would still be nice to have.

    Note that this patch depends on another header fixup patch I submitted
    to LKML yesterday:
    [PATCH] scatterlist.h needs types.h
    http://lkml.org/lkml/2007/3/01/141

    Signed-off-by: Jean Delvare
    Cc: Badari Pulavarty
    Signed-off-by: Greg Kroah-Hartman

    Jean Delvare
     

17 Feb, 2007

2 commits

  • Extend rdma_cm to support multicast communication. Multicast support
    is added to the existing RDMA_PS_UDP port space, as well as a new
    RDMA_PS_IPOIB port space. The latter port space allows joining the
    multicast groups used by IPoIB, which enables offloading IPoIB traffic
    to a separate QP. The port space determines the signature used in the
    MGID when joining the group. The newly added RDMA_PS_IPOIB also
    allows for unicast operations, similar to RDMA_PS_UDP.

    Supporting RDMA_PS_IPOIB requires changing how UD QPs are initialized,
    since we can no longer assume that the Q_Key is constant. This requires
    saving the Q_Key to use when attaching to a device, so that it is
    available when creating the QP. The Q_Key information is exported to
    the user through the existing rdma_init_qp_attr() interface.

    Multicast support is also exported to userspace through the rdma_ucm.

    Signed-off-by: Roland Dreier

    Sean Hefty
     
  • The IB SA tracks multicast join/leave requests on a per port basis and
    does not do any reference counting: if two users of the same port join
    the same group, and one leaves that group, then the SA will remove the
    port from the group even though one user who wants to remain a
    member is left. Therefore, in order to support multiple users of the
    same multicast group from the same port, we need to perform reference
    counting locally.

    To do this, add a multicast submodule to ib_sa to perform reference
    counting of multicast join/leave operations. Modify ib_ipoib (the
    only in-kernel user of multicast) to use the new interface.

    Signed-off-by: Roland Dreier

    Sean Hefty
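
The local reference counting described above can be sketched as follows. This is a simplified illustration with hypothetical names: integer group ids stand in for MGIDs, and two counters stand in for the join and leave requests that would actually be sent to the SA.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_GROUPS 8

/* Hypothetical per-port table; a slot is free when refcount is 0. */
struct mcast_group {
	int id;
	int refcount;
};

struct mcast_port {
	struct mcast_group groups[MAX_GROUPS];
	int sa_joins;	/* join requests that reached the SA */
	int sa_leaves;	/* leave requests that reached the SA */
};

static struct mcast_group *find_group(struct mcast_port *port, int id)
{
	for (int i = 0; i < MAX_GROUPS; i++)
		if (port->groups[i].refcount > 0 && port->groups[i].id == id)
			return &port->groups[i];
	return NULL;
}

/* Returns 0 on success, -1 if the table is full. */
static int mcast_join(struct mcast_port *port, int id)
{
	struct mcast_group *g;

	assert(port != NULL);
	g = find_group(port, id);
	if (g) {
		g->refcount++;		/* already joined: no SA traffic */
		return 0;
	}
	for (int i = 0; i < MAX_GROUPS; i++) {
		if (port->groups[i].refcount == 0) {
			port->groups[i].id = id;
			port->groups[i].refcount = 1;
			port->sa_joins++;	/* first user: join at the SA */
			return 0;
		}
	}
	return -1;
}

/* Returns 0 on success, -1 if the port is not a member of the group. */
static int mcast_leave(struct mcast_port *port, int id)
{
	struct mcast_group *g;

	assert(port != NULL);
	g = find_group(port, id);
	if (!g)
		return -1;
	if (--g->refcount == 0)
		port->sa_leaves++;	/* last user: leave at the SA */
	return 0;
}
```

Only the first join and the last leave for a group generate SA requests, so one user leaving does not remove the port while another user is still a member.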
     

05 Feb, 2007

3 commits

  • Make the untyped data region in ib_user_mad have type u64 so that it
    gets aligned properly. This avoids alignment faults in ib_umad when
    casting the data field to an rmpp_mad and accessing the 64-bit tid
    field on architectures like ia64.

    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Roland Dreier

    Jason Gunthorpe
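
The alignment effect of the type change above can be observed with offsetof. The structures below are hypothetical: the 12-byte header is an assumption chosen to make the difference visible and does not reproduce the real ib_user_mad layout.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 12-byte header; the real header differs. */
struct mock_hdr {
	uint32_t id;
	uint32_t status;
	uint32_t timeout_ms;
};

/* Untyped data region: starts right after the header, at any offset. */
struct mad_bytes {
	struct mock_hdr hdr;
	uint8_t data[];
};

/* u64 data region: the compiler pads so that data is u64-aligned. */
struct mad_u64 {
	struct mock_hdr hdr;
	uint64_t data[];
};

/* Payload overlay with a 64-bit field, loosely modelled on a MAD tid. */
struct mock_payload {
	uint64_t tid;
};

static size_t bytes_data_offset(void)
{
	return offsetof(struct mad_bytes, data);
}

static size_t u64_data_offset(void)
{
	size_t off = offsetof(struct mad_u64, data);

	/* Member offsets are always a multiple of the member's alignment. */
	assert(off % alignof(uint64_t) == 0);
	return off;
}
```

On ABIs where uint64_t requires 8-byte alignment, the byte-typed region starts at an offset that is not a multiple of 8, so casting it to a structure like mock_payload would be a misaligned access. The u64-typed region is suitably aligned on any ABI.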
     
  • struct ib_wc currently only includes the local QP number: this matches
    the IB spec, but seems mostly useless. The following patch replaces
    this with the pointer to qp itself, and updates all low level drivers
    and all users.

    This has the following advantages:
    - Ability to get a per-qp context through wc->qp->qp_context
    - Existing drivers already have the qp pointer ready in poll cq, so
    this change actually saves a tiny bit (extra memory read) on data path
    (for ehca it would actually be expensive to find the QP pointer when
    polling a CQ, but ehca does not support SRQ so we can leave wc->qp as
    NULL for ehca)
    - Users that need the QP number can still get it through wc->qp->qp_num

    Use case:

    In IPoIB connected mode code, I have a common CQ shared by multiple
    QPs. To track connection usage, I need a way to get at some per-QP
    context upon the completion, and I would like to avoid allocating
    context object per work request just to stick a QP pointer into it.
    With this code, I can just use wc->qp->qp_context.

    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: Roland Dreier

    Michael S. Tsirkin
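
The use case above, several QPs sharing one CQ, can be illustrated with mock structures. The types below are simplified stand-ins holding only the fields discussed in the commit message, and struct conn_state is a hypothetical per-connection context.

```c
#include <assert.h>
#include <stddef.h>

/* Mock versions of the verbs structures; only the fields used here. */
struct mock_qp {
	unsigned int qp_num;
	void *qp_context;
};

struct mock_wc {
	struct mock_qp *qp;	/* previously only the local QP number */
};

/* Hypothetical per-connection state reached through qp_context. */
struct conn_state {
	unsigned long completions;
};

/* Handle one completion from a CQ shared by many QPs. */
static void handle_wc(const struct mock_wc *wc)
{
	struct conn_state *conn;

	assert(wc != NULL && wc->qp != NULL);
	conn = wc->qp->qp_context;	/* no per-work-request context needed */
	conn->completions++;
}
```

The QP number remains reachable through the same pointer, as wc->qp->qp_num.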
     
  • The header uses struct kref, so it should include <linux/kref.h>
    explicitly to avoid hidden include dependencies.

    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: Roland Dreier

    Michael S. Tsirkin
     

16 Dec, 2006

1 commit

  • The ib_dma_alloc_coherent() wrapper uses a u64* for the dma_handle
    parameter, unlike dma_alloc_coherent, which uses dma_addr_t*. This
    means that we need a temporary variable to handle the case when
    ib_dma_alloc_coherent() just falls through directly to
    dma_alloc_coherent() on architectures where sizeof u64 != sizeof
    dma_addr_t.

    Signed-off-by: Roland Dreier

    Roland Dreier
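
The temporary-variable approach described above looks roughly like this. The sketch is standalone userspace C: mock_dma_addr_t is assumed to be 32 bits to model an architecture where the two types differ in size, and mock_dma_alloc() is a hypothetical stand-in for dma_alloc_coherent() that returns a made-up bus address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumption: a platform where DMA addresses are narrower than 64 bits. */
typedef uint32_t mock_dma_addr_t;

/* Stand-in allocator: returns memory and a fake bus address. */
static void *mock_dma_alloc(size_t size, mock_dma_addr_t *handle)
{
	void *cpu_addr = malloc(size);

	if (cpu_addr)
		*handle = 0xdeadb000u;
	return cpu_addr;
}

/* Wrapper taking a 64-bit handle, as the ib_dma_* wrapper does. */
static void *wrapper_alloc(size_t size, uint64_t *dma_handle)
{
	mock_dma_addr_t tmp;
	void *cpu_addr;

	assert(dma_handle != NULL);
	/*
	 * Casting dma_handle to mock_dma_addr_t * would write only part of
	 * the 64-bit value, so go through a correctly typed temporary.
	 */
	cpu_addr = mock_dma_alloc(size, &tmp);
	if (cpu_addr)
		*dma_handle = tmp;
	return cpu_addr;
}
```

Assigning from the temporary widens the value properly, so the caller's 64-bit handle carries no stale upper bits.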
     

14 Dec, 2006

1 commit


13 Dec, 2006

6 commits


30 Nov, 2006

1 commit

  • The ib_cm_establish() function is replaced with a more generic
    ib_cm_notify(). This routine is used to notify the CM that failover
    has occurred, so that future CM messages (LAP, DREQ) reach the remote
    CM. (Currently, we continue to use the original path.) This bumps the
    userspace CM ABI.

    New alternate path information is captured when a LAP message is sent
    or received. This allows QP attributes to be initialized for the user
    when a new path is loaded after failover occurs.

    Signed-off-by: Sean Hefty
    Signed-off-by: Roland Dreier

    Sean Hefty
     

03 Nov, 2006

1 commit


31 Oct, 2006

1 commit


23 Sep, 2006

8 commits

  • Relevant SA queries are actually "greater than" / "less than", not
    "greater than or equal" / "less than or equal" as the names imply.
    (See IB spec 1.2 Vol 1, 15.2.5.16 PATHRECORD/Table 205 PathRecord)

    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: Roland Dreier

    Michael S. Tsirkin
     
  • Document that rdma_accept() sends a reject and moves the QP to the
    error state when it fails.

    Signed-off-by: Or Gerlitz
    Signed-off-by: Sean Hefty
    Signed-off-by: Roland Dreier

    Or Gerlitz
     
  • Clarify that rdma_destroy_id cancels outstanding asynchronous operations on the
    associated id.

    Signed-off-by: Or Gerlitz
    Signed-off-by: Sean Hefty
    Signed-off-by: Roland Dreier

    Or Gerlitz
     
  • Require users to register with SA module, to prevent the sa_query
    module text from going away while an SA query callback is still
    running. Update all in-tree users for the new interface.

    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: Sean Hefty
    Signed-off-by: Roland Dreier

    Michael S. Tsirkin
     
  • Modifications to the existing rdma header files, core files, drivers,
    and ulp files to support iWARP, including:
    - Hook iWARP CM into the build system and use it in rdma_cm.
    - Convert enum ib_node_type to enum rdma_node_type, which includes
    the possibility of RDMA_NODE_RNIC, and update everything for this.

    Signed-off-by: Tom Tucker
    Signed-off-by: Steve Wise
    Signed-off-by: Roland Dreier

    Tom Tucker
     
  • Add an iWARP Connection Manager (CM), which abstracts connection
    management for iWARP devices (RNICs). It is a logical instance of the
    xx_cm where xx is the transport type (ib or iw). The symbols exported
    are used by the transport independent rdma_cm module, and are
    available also for transport dependent ULPs.

    Signed-off-by: Tom Tucker
    Signed-off-by: Steve Wise
    Signed-off-by: Roland Dreier

    Tom Tucker
     
  • Pass a struct ib_udata to the low-level driver's ->modify_srq() and
    ->modify_qp() methods, so that it can get to the device-specific data
    passed in by the userspace driver.

    Signed-off-by: Ralph Campbell
    Signed-off-by: Roland Dreier

    Ralph Campbell
     
  • Add a ib_uverbs_resize_cq_resp.driver_data field so that low-level
    drivers can return data from a resize CQ operation to userspace. Have
    ib_uverbs_resize_cq() only copy the cqe field, to avoid having to bump
    the userspace ABI.

    Signed-off-by: Ralph Campbell
    Signed-off-by: Roland Dreier

    Ralph Campbell
     

19 Sep, 2006

1 commit


25 Jul, 2006

1 commit

  • Validate MADs sent by userspace clients for spec compliance with
    C13-18.1.1 (prevent duplicate requests and responses sent on the
    same port). Without this, RMPP transactions get aborted because
    of duplicate packets.

    This patch is similar to that provided by Jack Morgenstein.

    Signed-off-by: Sean Hefty
    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: Jack Morgenstein
    Signed-off-by: Roland Dreier

    Sean Hefty
     

15 Jul, 2006

2 commits

  • ib_fmr_pool_map_phys gets the virtual address by pointer but never writes
    there, and users (e.g. srp) seem to assume this and ignore the value
    returned. This patch cleans up the API to get the VA by value, and updates
    all users.

    Signed-off-by: Michael S. Tsirkin
    Acked-by: Roland Dreier
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael S. Tsirkin
     
  • The device address contains unsigned character arrays, which contain raw GID
    addresses. The GIDs may not be naturally aligned, so do not cast them to
    structures or unions.

    Signed-off-by: Sean Hefty
    Signed-off-by: Michael S. Tsirkin
    Cc: Roland Dreier
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael S. Tsirkin
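
One common way to follow the advice above is to copy the GID out with memcpy() instead of casting a pointer into the byte array. The sketch below uses a hypothetical union mock_gid and an illustrative offset; it is not the kernel's own definition or helper.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define GID_LEN 16

/* Mock GID union; its 64-bit members make it want 8-byte alignment. */
union mock_gid {
	uint8_t raw[GID_LEN];
	struct {
		uint64_t subnet_prefix;
		uint64_t interface_id;
	} global;
};

/*
 * Copy a GID out of a raw byte array at an arbitrary offset. memcpy()
 * makes no alignment assumption, unlike casting (src + offset) to
 * union mock_gid * and dereferencing it.
 */
static union mock_gid read_gid(const uint8_t *src, size_t offset)
{
	union mock_gid gid;

	assert(src != NULL);
	memcpy(gid.raw, src + offset, GID_LEN);
	return gid;
}
```

The caller gets a properly aligned copy and can then use the structured members safely.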
     

05 Jul, 2006

1 commit

  • * git://git.infradead.org/hdrinstall-2.6:
    Remove export of include/linux/isdn/tpam.h
    Remove and from userspace export
    Restrict headers exported to userspace for SPARC and SPARC64
    Add empty Kbuild files for 'make headers_install' in remaining arches.
    Add Kbuild file for Alpha 'make headers_install'
    Add Kbuild file for SPARC 'make headers_install'
    Add Kbuild file for IA64 'make headers_install'
    Add Kbuild file for S390 'make headers_install'
    Add Kbuild file for i386 'make headers_install'
    Add Kbuild file for x86_64 'make headers_install'
    Add Kbuild file for PowerPC 'make headers_install'
    Add generic Kbuild files for 'make headers_install'
    Basic implementation of 'make headers_check'
    Basic implementation of 'make headers_install'

    Linus Torvalds