02 Nov, 2007

1 commit


10 Oct, 2007

7 commits

  • The IB CM provides a message received acknowledged (MRA) message that
    can be sent to indicate that a REQ or REP message has been received, but
    will require more time to process than the timeout specified by those
    messages. In many cases, the application may not know how long it will
    take to respond to a CM message, but most of the time it will respond
    before a retry has been sent. Rather than sending an
    MRA in response to all messages just to handle the case where a longer
    timeout is needed, it is more efficient to queue the MRA for sending in
    case a duplicate message is received.

    This avoids sending an MRA when it is not needed, but limits the number
    of times that a REQ or REP will be resent. It also provides for a
    simpler implementation than generating the MRA based on a timer event.
    (That is, trying to send the MRA after receiving the first REQ or REP if
    a response has not been generated, so that it is received at the remote
    side before a duplicate REQ or REP has been received.)

    Signed-off-by: Sean Hefty
    Signed-off-by: Roland Dreier

    Sean Hefty
     
  • The declaration of struct ib_user_mad_reg_req.method_mask[] exported
    to userspace was an array of __u32, but the kernel internally treated
    it as a bitmap made up of longs. This makes a difference for 64-bit
    big-endian kernels, where numbering the bits in an array of __u32 gives:

    |31.....0|63....32|95....64|127...96|

    while numbering the bits in an array of longs gives:

    |63..............0|127............64|

    64-bit userspace can handle this by just treating method_mask[] as an
    array of longs, but 32-bit userspace is really stuck: the meaning of
    the bits in method_mask[] depends on whether the kernel is 32-bit or
    64-bit, and there's no sane way for userspace to know that.

    Fix this by updating the declaration to make it clear that
    method_mask[] is an array of longs, and using a compat_ioctl method to
    convert to an array of 64-bit longs to handle the 32-on-64 problem.
    This fixes the interface description to match existing behavior (so
    working binaries continue to work) in almost all situations, and gives
    consistent semantics in the case of 32-bit userspace that can run on
    either a 32-bit or 64-bit kernel, so that the same binary can work for
    both 32-on-32 and 32-on-64 systems.
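
    The two bit numberings described above can be reproduced with a small,
    host-independent model. This is only an illustrative sketch, not kernel
    code: the helper names slot_as_u32_array() and slot_as_be64_longs() are
    hypothetical. Each returns the index of the 32-bit slot of method_mask[]
    that holds a given bit number under one of the two interpretations.

```c
/* Slot holding bit `bit` when the mask is numbered as an array of __u32. */
static unsigned slot_as_u32_array(unsigned bit)
{
    return bit / 32;
}

/*
 * Slot holding bit `bit` when a 64-bit big-endian kernel treats the same
 * memory as an array of longs. On big-endian, the most significant 32 bits
 * of each long come first in memory.
 */
static unsigned slot_as_be64_longs(unsigned bit)
{
    unsigned long_index = bit / 64;
    unsigned bit_in_long = bit % 64;
    unsigned half = (bit_in_long >= 32) ? 0 : 1;

    return long_index * 2 + half;
}
```

    For bit 0 the two helpers disagree (slot 0 versus slot 1), which is the
    mismatch that 32-bit userspace cannot detect on its own.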

    Signed-off-by: Roland Dreier

    Roland Dreier
     
  • Add support for setting the P_Key index of sent MADs and getting the
    P_Key index of received MADs. This requires a change to the layout of
    the ABI structure struct ib_user_mad_hdr, so to avoid breaking
    compatibility, we default to the old (unchanged) ABI and add a new
    ioctl IB_USER_MAD_ENABLE_PKEY that allows applications that are aware
    of the new ABI to opt into using it.

    We plan on switching to the new ABI by default in a year or so, and
    this patch adds a warning that is printed when an application uses the
    old ABI, to push people towards converting to the new ABI.

    Signed-off-by: Roland Dreier
    Reviewed-by: Sean Hefty
    Reviewed-by: Hal Rosenstock

    Roland Dreier
     
  • During ib_umem_get(), determine whether all pages from the memory
    region are hugetlb pages and report this in the "hugetlb" member.
    Low-level drivers can use this information if they need it.

    Signed-off-by: Joachim Fenkes
    Signed-off-by: Roland Dreier

    Joachim Fenkes
     
  • Export the ability to set the type of service to user space. Model
    the interface after setsockopt.

    Signed-off-by: Sean Hefty
    Signed-off-by: Roland Dreier

    Sean Hefty
     
  • Provide support to specify a type of service for a communication
    identifier. A new function call is used when dealing with IPv4
    addresses. For IPv6 addresses, the ToS is specified through the
    traffic class field in the sockaddr_in6 structure.

    Signed-off-by: Sean Hefty

    [ The comments Eitan Zahavi and myself have made over the v1 post
    were fully addressed. ]

    Reviewed-by: Or Gerlitz
    Signed-off-by: Roland Dreier

    Sean Hefty
     
  • The QoS annex defines new fields for path records. Add them to the
    ib_sa for consumers that want to use them.

    Signed-off-by: Sean Hefty
    Reviewed-by: Or Gerlitz
    Signed-off-by: Roland Dreier

    Sean Hefty
     

04 Aug, 2007

3 commits


11 Jul, 2007

2 commits

  • The IB CM should include the HCA ACK delay when calculating the local
    ACK timeout value to use for RC QPs. If the HCA ACK delay is large
    enough relative to the packet life time, then if it is not taken into
    account, the calculated timeout value ends up being too small, which
    can result in "retry exceeded" errors.
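
    To see why ignoring the HCA ACK delay yields a timeout that is too
    small, here is a rough sketch. It assumes the usual IB encoding, where
    a timeout value n means 4.096 us * 2^n, and it assumes the timeout must
    cover twice the packet life time plus the ACK delay. The function names
    are hypothetical and the rounding used by the kernel may differ.

```c
#include <stdint.h>

/* Duration of an encoded IB time value, in units of 4.096 us. */
static uint64_t ib_time_units(unsigned exponent)
{
    return (uint64_t)1 << exponent;
}

/*
 * Smallest 5-bit encoded timeout that covers a round trip (twice the
 * packet life time) plus the HCA ACK delay. Inputs are encoded values
 * in the range 0..31.
 */
static unsigned local_ack_timeout(unsigned packet_life_time,
                                  unsigned hca_ack_delay)
{
    uint64_t needed = 2 * ib_time_units(packet_life_time) +
                      ib_time_units(hca_ack_delay);
    unsigned exponent = 0;

    while (exponent < 31 && ib_time_units(exponent) < needed)
        exponent++;
    return exponent;
}
```

    With a packet life time of 10 and an ACK delay of 16, the round trip
    alone would suggest 11, while this sketch returns 17.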

    Signed-off-by: Sean Hefty
    Signed-off-by: Roland Dreier

    Sean Hefty
     
  • MADs sent to the SA should use the default P_Key (0x7fff/0xffff).
    There's no requirement that the default P_Key is stored at index 0 in
    the local P_Key table, so add code to the sa_query module to look up
    the index of the default P_Key when creating an address handle for the
    SA (which is done any time the P_Key table might change), and use this
    index for all SA queries.
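
    The lookup described above amounts to a linear search that ignores the
    membership bit. A minimal sketch, assuming the P_Key table is available
    as a plain array; find_default_pkey_index() and PKEY_BASE_MASK are
    hypothetical names, not the helpers used by sa_query:

```c
#include <stddef.h>
#include <stdint.h>

/* Low 15 bits of a P_Key; the top bit is the membership bit. */
#define PKEY_BASE_MASK 0x7fff

/* Return the index of the default P_Key (0x7fff or 0xffff), or -1. */
static int find_default_pkey_index(const uint16_t *table, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if ((table[i] & PKEY_BASE_MASK) == PKEY_BASE_MASK)
            return (int)i;
    }
    return -1;
}
```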

    Signed-off-by: Sean Hefty
    Signed-off-by: Roland Dreier

    Sean Hefty
     

22 May, 2007

2 commits

  • * 'for-linus' of master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband:
    IB/cm: Improve local id allocation
    IPoIB/cm: Fix SRQ WR leak
    IB/ipoib: Fix typos in error messages
    IB/mlx4: Check if SRQ is full when posting receive
    IB/mlx4: Pass send queue sizes from userspace to kernel
    IB/mlx4: Fix check of opcode in mlx4_ib_post_send()
    mlx4_core: Fix array overrun in dump_dev_cap_flags()
    IB/mlx4: Fix RESET to RESET and RESET to ERROR transitions
    IB/mthca: Fix RESET to ERROR transition
    IB/mlx4: Set GRH:HopLimit when sending globally routed MADs
    IB/mthca: Set GRH:HopLimit when building MLX headers
    IB/mlx4: Fix check of max_qp_dest_rdma in modify QP
    IB/mthca: Fix use-after-free on device restart
    IB/ehca: Return proper error code if register_mr fails
    IPoIB: Handle P_Key table reordering
    IB/core: Use start_port() and end_port()
    IB/core: Add helpers for uncached GID and P_Key searches
    IB/ipath: Fix potential deadlock with multicast spinlocks
    IB/core: Free umem when mm is already gone

    Linus Torvalds
     
  • The first thing mm.h does is include sched.h, solely for the can_do_mlock()
    inline function, which dereferences "current". By dealing with can_do_mlock(),
    mm.h can be detached from sched.h, which is good. See below for why.

    This patch
    a) removes unconditional inclusion of sched.h from mm.h
    b) makes can_do_mlock() normal function in mm/mlock.c
    c) exports can_do_mlock() to not break compilation
    d) adds sched.h inclusions back to files that were getting it indirectly.
    e) adds less bloated headers to some files (asm/signal.h, jiffies.h) that were
    getting them indirectly

    Net result is:
    a) mm.h users would get less code to open, read, preprocess, parse, ... if
    they don't need sched.h
    b) sched.h stops being dependency for significant number of files:
    on x86_64 allmodconfig touching sched.h results in recompile of 4083 files,
    after patch it's only 3744 (-8.3%).

    Cross-compile tested on

    all arm defconfigs, all mips defconfigs, all powerpc defconfigs,
    alpha alpha-up
    arm
    i386 i386-up i386-defconfig i386-allnoconfig
    ia64 ia64-up
    m68k
    mips
    parisc parisc-up
    powerpc powerpc-up
    s390 s390-up
    sparc sparc-up
    sparc64 sparc64-up
    um-x86_64
    x86_64 x86_64-up x86_64-defconfig x86_64-allnoconfig

    as well as my two usual configs.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

19 May, 2007

1 commit


09 May, 2007

2 commits

  • When memory pinned with ib_umem_get() is released, ib_umem_release()
    needs to subtract the amount of memory being unpinned from
    mm->locked_vm. However, ib_umem_release() may be called with
    mm->mmap_sem already held for writing if the memory is being released
    as part of an munmap() call, so it is sometimes necessary to defer
    this accounting into a workqueue.

    However, the work struct used to defer this accounting is dynamically
    allocated before it is queued, so there is the possibility of failing
    that allocation. If the allocation fails, then ib_umem_release has no
    choice except to bail out and leave the process with a permanently
    elevated locked_vm.

    Fix this by allocating the structure to defer accounting as part of
    the original struct ib_umem, so there's no possibility of failing a
    later allocation if creating the struct ib_umem and pinning memory
    succeeds.
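
    The idea of embedding the deferred-work structure can be sketched in
    userspace as follows. All names here (mock_work, mock_umem and so on)
    are hypothetical stand-ins rather than the kernel's types, and the
    "work" is run directly instead of being queued:

```c
#include <stddef.h>

/* Recover the enclosing structure from a pointer to one of its members. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct mock_work {
    void (*fn)(struct mock_work *work);
};

struct mock_umem {
    unsigned long npages;      /* pages pinned by this region */
    unsigned long *locked_vm;  /* stands in for mm->locked_vm */
    struct mock_work work;     /* embedded, so no allocation at release */
};

/* Deferred accounting: subtract the unpinned pages from locked_vm. */
static void mock_umem_account(struct mock_work *work)
{
    struct mock_umem *umem = container_of(work, struct mock_umem, work);

    *umem->locked_vm -= umem->npages;
}

static void mock_umem_release_deferred(struct mock_umem *umem)
{
    umem->work.fn = mock_umem_account;
    /* A kernel would queue the work item; this sketch just runs it. */
    umem->work.fn(&umem->work);
}
```

    Because the work item lives inside the structure that was already
    allocated, the release path has nothing left that can fail.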

    Signed-off-by: Roland Dreier

    Roland Dreier
     
  • Export ib_umem_get()/ib_umem_release() and put low-level drivers in
    control of when to call ib_umem_get() to pin and DMA map userspace,
    rather than always calling it in ib_uverbs_reg_mr() before calling the
    low-level driver's reg_user_mr method.

    Also move these functions to be in the ib_core module instead of
    ib_uverbs, so that driver modules using them do not depend on
    ib_uverbs.

    This has a number of advantages:
    - It is better design from the standpoint of making generic code a
      library that can be used or overridden by device-specific code as
      the details of specific devices dictate.
    - Drivers that do not need to pin userspace memory regions do not
      need to take the performance hit of calling ib_umem_get(). For
      example, although I have not tried to implement it in this patch,
      the ipath driver should be able to avoid pinning memory and just
      use copy_{to,from}_user() to access userspace memory regions.
    - Buffers that need special mapping treatment can be identified by
      the low-level driver. For example, it may be possible to solve
      some Altix-specific memory ordering issues with mthca CQs in
      userspace by mapping CQ buffers with extra flags.
    - Drivers that need to pin and DMA map userspace memory for things
      other than memory regions can use ib_umem_get() directly, instead
      of hacks using extra parameters to their reg_phys_mr method. For
      example, the mlx4 driver that is pending being merged needs to pin
      and DMA map QP and CQ buffers, but it does not need to create a
      memory key for these buffers. So the cleanest solution is for mlx4
      to call ib_umem_get() in the create_qp and create_cq methods.

    Signed-off-by: Roland Dreier

    Roland Dreier
     

08 May, 2007

1 commit

  • * 'for-linus' of master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband:
    IPoIB: Convert to NAPI
    IB: Return "maybe missed event" hint from ib_req_notify_cq()
    IB: Add CQ comp_vector support
    IB/ipath: Fix a race condition when generating ACKs
    IB/ipath: Fix two more spin lock problems
    IB/fmr_pool: Add prefix to all printks
    IB/srp: Set proc_name
    IB/srp: Add orig_dgid sysfs attribute to scsi_host
    IPoIB/cm: Don't crash if remote side uses one QP for both directions
    RDMA/cxgb3: Support for new abort logic
    RDMA/cxgb3: Initialize cpu_idx field in cpl_close_listserv_req message
    RDMA/cxgb3: Fail qp creation if the requested max_inline is too large
    RDMA/cxgb3: Fix TERM codes
    IPoIB/cm: Fix error handling in ipoib_cm_dev_open()
    IB/ipath: Don't corrupt pending mmap list when unmapped objects are freed
    IB/mthca: Work around kernel QP starvation
    IB/ipath: Don't put QP in timeout queue if waiting to send
    IB/ipath: Don't call spin_lock_irq() from interrupt context

    Linus Torvalds
     

07 May, 2007

2 commits

  • The semantics defined by the InfiniBand specification say that
    completion events are only generated when a completion is added to a
    completion queue (CQ) after completion notification is requested. In
    other words, the following race is possible:

        while (CQ is not empty)
            ib_poll_cq(CQ);
        // new completion is added after while loop is exited
        ib_req_notify_cq(CQ);
        // no event is generated for the existing completion

    To close this race, the IB spec recommends doing another poll of the
    CQ after requesting notification.

    However, it is not always possible to arrange code this way (for
    example, we have found that NAPI for IPoIB cannot poll after
    requesting notification). Also, some hardware (e.g., Mellanox HCAs)
    actually will generate an event for completions added before the call
    to ib_req_notify_cq() -- which is allowed by the spec, since there's
    no way for any upper-layer consumer to know exactly when a completion
    was really added -- so the extra poll of the CQ is just a waste.

    Motivated by this, we add a new flag "IB_CQ_REPORT_MISSED_EVENTS" for
    ib_req_notify_cq() so that it can return a hint about whether a
    completion may have been added before the request for notification.
    The return value of ib_req_notify_cq() is extended so:

      < 0   means an error occurred while requesting notification
      == 0  means notification was requested successfully, and if
            IB_CQ_REPORT_MISSED_EVENTS was passed in, then no
            events were missed and it is safe to wait for another
            event.
      > 0   is only returned if IB_CQ_REPORT_MISSED_EVENTS was
            passed in. It means that the consumer must poll the
            CQ again to make sure it is empty to avoid the race
            described above.

    We add a flag to enable this behavior rather than turning it on
    unconditionally, because checking for missed events may incur
    significant overhead for some low-level drivers, and consumers that
    don't care about the results of this test shouldn't be forced to pay
    for the test.
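
    A consumer might use the extended return value roughly as in the sketch
    below. This is a userspace simulation under stated assumptions: mock_cq,
    mock_poll_cq(), mock_req_notify_cq() and REPORT_MISSED_EVENTS are
    hypothetical stand-ins for the real CQ, ib_poll_cq(), ib_req_notify_cq()
    and IB_CQ_REPORT_MISSED_EVENTS, and the "late" field simulates
    completions that arrive just before notification is armed.

```c
#include <stdbool.h>

#define REPORT_MISSED_EVENTS 1  /* stands in for IB_CQ_REPORT_MISSED_EVENTS */

struct mock_cq {
    int pending;  /* completions currently in the queue */
    int late;     /* completions that race with the notify request */
    bool armed;   /* notification has been requested */
};

/* Remove one completion; return 1 if one was polled, 0 if CQ is empty. */
static int mock_poll_cq(struct mock_cq *cq)
{
    if (cq->pending == 0)
        return 0;
    cq->pending--;
    return 1;
}

/* Arm notification; return > 0 if completions may have been missed. */
static int mock_req_notify_cq(struct mock_cq *cq, int flags)
{
    cq->pending += cq->late;  /* the racing completions land now */
    cq->late = 0;
    cq->armed = true;
    return (flags & REPORT_MISSED_EVENTS) && cq->pending > 0;
}

/* Drain the CQ and arm it, polling again only when the hint says so. */
static int drain_and_arm(struct mock_cq *cq)
{
    int handled = 0;

    do {
        while (mock_poll_cq(cq) > 0)
            handled++;
    } while (mock_req_notify_cq(cq, REPORT_MISSED_EVENTS) > 0);

    return handled;
}
```

    The extra poll happens only when the hint is positive, so hardware that
    never misses events does not pay for a redundant pass over the CQ.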

    Signed-off-by: Roland Dreier

    Roland Dreier
     
  • Add a num_comp_vectors member to struct ib_device and extend
    ib_create_cq() to pass in a comp_vector parameter -- this parallels
    the userspace libibverbs API. Update all hardware drivers to set
    num_comp_vectors to 1 and have all ULPs pass 0 for the comp_vector
    value. Pass the value of num_comp_vectors to userspace rather than
    hard-coding a value of 1.

    We want multiple CQ event vector support (via MSI-X or similar for
    adapters that can generate multiple interrupts), but it's not clear
    how many vectors we want, or how we want to deal with policy issues
    such as how to decide which vector to use or how to set up interrupt
    affinity. This patch is useful for experimenting, since no core
    changes will be necessary when updating a driver to support multiple
    vectors, and we know that we want to make at least these changes
    anyway.

    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: Roland Dreier

    Michael S. Tsirkin
     

03 May, 2007

1 commit

  • I noticed that many source files include <linux/pci.h> while they do
    not appear to need it. Here is an attempt to clean it all up.

    In order to find all possibly affected files, I searched for all
    files including <linux/pci.h> but without any other occurrence of "pci"
    or "PCI". I removed the include statement from all of these, then I
    compiled an allmodconfig kernel on both i386 and x86_64 and fixed the
    false positives manually.

    My tests covered 66% of the affected files, so there could be false
    positives remaining. Untested files are:

    arch/alpha/kernel/err_common.c
    arch/alpha/kernel/err_ev6.c
    arch/alpha/kernel/err_ev7.c
    arch/ia64/sn/kernel/huberror.c
    arch/ia64/sn/kernel/xpnet.c
    arch/m68knommu/kernel/dma.c
    arch/mips/lib/iomap.c
    arch/powerpc/platforms/pseries/ras.c
    arch/ppc/8260_io/enet.c
    arch/ppc/8260_io/fcc_enet.c
    arch/ppc/8xx_io/enet.c
    arch/ppc/syslib/ppc4xx_sgdma.c
    arch/sh64/mach-cayman/iomap.c
    arch/xtensa/kernel/xtensa_ksyms.c
    arch/xtensa/platform-iss/setup.c
    drivers/i2c/busses/i2c-at91.c
    drivers/i2c/busses/i2c-mpc.c
    drivers/media/video/saa711x.c
    drivers/misc/hdpuftrs/hdpu_cpustate.c
    drivers/misc/hdpuftrs/hdpu_nexus.c
    drivers/net/au1000_eth.c
    drivers/net/fec_8xx/fec_main.c
    drivers/net/fec_8xx/fec_mii.c
    drivers/net/fs_enet/fs_enet-main.c
    drivers/net/fs_enet/mac-fcc.c
    drivers/net/fs_enet/mac-fec.c
    drivers/net/fs_enet/mac-scc.c
    drivers/net/fs_enet/mii-bitbang.c
    drivers/net/fs_enet/mii-fec.c
    drivers/net/ibm_emac/ibm_emac_core.c
    drivers/net/lasi_82596.c
    drivers/parisc/hppb.c
    drivers/sbus/sbus.c
    drivers/video/g364fb.c
    drivers/video/platinumfb.c
    drivers/video/stifb.c
    drivers/video/valkyriefb.c
    include/asm-arm/arch-ixp4xx/dma.h
    sound/oss/au1550_ac97.c

    I would welcome test reports for these files. I am fine with removing
    the untested files from the patch if the general opinion is that these
    changes aren't safe. The tested part would still be nice to have.

    Note that this patch depends on another header fixup patch I submitted
    to LKML yesterday:
    [PATCH] scatterlist.h needs types.h
    http://lkml.org/lkml/2007/3/01/141

    Signed-off-by: Jean Delvare
    Cc: Badari Pulavarty
    Signed-off-by: Greg Kroah-Hartman

    Jean Delvare
     

17 Feb, 2007

2 commits

  • Extend rdma_cm to support multicast communication. Multicast support
    is added to the existing RDMA_PS_UDP port space, as well as a new
    RDMA_PS_IPOIB port space. The latter port space allows joining the
    multicast groups used by IPoIB, which enables offloading IPoIB traffic
    to a separate QP. The port space determines the signature used in the
    MGID when joining the group. The newly added RDMA_PS_IPOIB also
    allows for unicast operations, similar to RDMA_PS_UDP.

    Supporting RDMA_PS_IPOIB requires changing how UD QPs are initialized,
    since we can no longer assume that the Q_Key is constant. This requires
    saving the Q_Key to use when attaching to a device, so that it is
    available when creating the QP. The Q_Key information is exported to
    the user through the existing rdma_init_qp_attr() interface.

    Multicast support is also exported to userspace through the rdma_ucm.

    Signed-off-by: Roland Dreier

    Sean Hefty
     
  • The IB SA tracks multicast join/leave requests on a per port basis and
    does not do any reference counting: if two users of the same port join
    the same group, and one leaves that group, then the SA will remove the
    port from the group even though one user still wants to remain a
    member. Therefore, in order to support multiple users of the
    same multicast group from the same port, we need to perform reference
    counting locally.

    To do this, add a multicast submodule to ib_sa to perform reference
    counting of multicast join/leave operations. Modify ib_ipoib (the
    only in-kernel user of multicast) to use the new interface.
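
    The local reference counting described above can be sketched as
    follows. The structure and function names are hypothetical, and the
    counters sa_joins and sa_leaves merely record when a request would
    actually be sent to the SA:

```c
struct mock_mcast_group {
    int refcount;   /* local users of this group on this port */
    int sa_joins;   /* join requests that would reach the SA */
    int sa_leaves;  /* leave requests that would reach the SA */
};

/* Only the first local user triggers a join at the SA. */
static void mock_mcast_join(struct mock_mcast_group *group)
{
    if (group->refcount++ == 0)
        group->sa_joins++;
}

/* Only the last local user triggers a leave at the SA. */
static void mock_mcast_leave(struct mock_mcast_group *group)
{
    if (--group->refcount == 0)
        group->sa_leaves++;
}
```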

    Signed-off-by: Roland Dreier

    Sean Hefty
     

05 Feb, 2007

3 commits

  • Make the untyped data region in ib_user_mad have type u64 so that it
    gets aligned properly. This avoids alignment faults in ib_umad when
    casting the data field to an rmpp_mad and accessing the 64-bit tid
    field on architectures like ia64.
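
    The effect of the type change on alignment can be illustrated with two
    mock layouts. This is a sketch only: the 12-byte header and the struct
    names are hypothetical and do not reproduce the real ib_user_mad layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 12-byte header preceding the data region. */
struct mock_hdr {
    uint32_t id;
    uint32_t status;
    uint32_t timeout_ms;
};

/* Byte-typed data region: placed right after the header. */
struct mock_mad_u8 {
    struct mock_hdr hdr;
    uint8_t data[8];
};

/* 64-bit-typed data region: the compiler pads it to u64 alignment. */
struct mock_mad_u64 {
    struct mock_hdr hdr;
    uint64_t data[1];
};

static size_t data_offset_u8(void)
{
    return offsetof(struct mock_mad_u8, data);
}

static size_t data_offset_u64(void)
{
    return offsetof(struct mock_mad_u64, data);
}
```

    With the byte-typed region the data starts at offset 12, so a 64-bit
    field at its start can be misaligned; with the u64 region the offset is
    always a multiple of the platform's u64 alignment.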

    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Roland Dreier

    Jason Gunthorpe
     
  • struct ib_wc currently only includes the local QP number: this matches
    the IB spec, but seems mostly useless. The following patch replaces
    this with the pointer to qp itself, and updates all low level drivers
    and all users.

    This has the following advantages:
    - Ability to get a per-qp context through wc->qp->qp_context
    - Existing drivers already have the qp pointer ready in poll cq, so
    this change actually saves a tiny bit (extra memory read) on data path
    (for ehca it would actually be expensive to find the QP pointer when
    polling a CQ, but ehca does not support SRQ so we can leave wc->qp as
    NULL for ehca)
    - Users that need the QP number can still get it through wc->qp->qp_num

    Use case:

    In IPoIB connected mode code, I have a common CQ shared by multiple
    QPs. To track connection usage, I need a way to get at some per-QP
    context upon the completion, and I would like to avoid allocating
    context object per work request just to stick a QP pointer into it.
    With this code, I can just use wc->qp->qp_context.
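
    The use case can be sketched with mock types. The names mock_qp,
    mock_wc and mock_conn are hypothetical; only the wc->qp->qp_context
    access pattern is taken from the description above.

```c
struct mock_qp {
    unsigned qp_num;
    void *qp_context;  /* per-QP context set by the consumer */
};

struct mock_wc {
    struct mock_qp *qp;  /* pointer to the QP instead of just its number */
};

struct mock_conn {
    int completions;  /* per-connection usage tracking */
};

/* Handle a completion from a CQ shared by several QPs. */
static void mock_handle_wc(const struct mock_wc *wc)
{
    struct mock_conn *conn = wc->qp->qp_context;

    conn->completions++;
}
```

    No per-work-request context object is needed: the connection is found
    through the QP pointer carried in the completion.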

    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: Roland Dreier

    Michael S. Tsirkin
     
  • The header uses struct kref, so it should include <linux/kref.h>
    explicitly to avoid hidden include dependencies.

    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: Roland Dreier

    Michael S. Tsirkin
     

16 Dec, 2006

1 commit

  • The ib_dma_alloc_coherent() wrapper uses a u64* for the dma_handle
    parameter, unlike dma_alloc_coherent, which uses dma_addr_t*. This
    means that we need a temporary variable to handle the case when
    ib_dma_alloc_coherent() just falls through directly to
    dma_alloc_coherent() on architectures where sizeof u64 != sizeof
    dma_addr_t.
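
    The temporary-variable pattern looks roughly like the sketch below. It
    is a userspace model: mock_dma_addr_t is deliberately 32 bits wide to
    mimic an architecture where the sizes differ, and mock_dma_alloc() and
    mock_ib_dma_alloc() are hypothetical stand-ins for dma_alloc_coherent()
    and the ib_dma_alloc_coherent() wrapper.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t mock_dma_addr_t;  /* pretend dma_addr_t is 32 bits wide */

static unsigned char mock_buffer[64];

/* Stand-in for the underlying allocator: writes a 32-bit handle. */
static void *mock_dma_alloc(size_t size, mock_dma_addr_t *handle)
{
    if (size > sizeof(mock_buffer))
        return NULL;
    *handle = 0x1000;
    return mock_buffer;
}

/* Wrapper that exposes a u64 handle to its callers. */
static void *mock_ib_dma_alloc(size_t size, uint64_t *dma_handle)
{
    mock_dma_addr_t tmp = 0;  /* correctly sized temporary */
    void *ret = mock_dma_alloc(size, &tmp);

    *dma_handle = tmp;  /* widen explicitly; all 64 bits are written */
    return ret;
}
```

    Passing the caller's u64 pointer straight through would let the
    allocator write only part of the 64-bit value, leaving the rest stale.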

    Signed-off-by: Roland Dreier

    Roland Dreier
     

14 Dec, 2006

1 commit


13 Dec, 2006

6 commits


30 Nov, 2006

1 commit

  • The ib_cm_establish() function is replaced with a more generic
    ib_cm_notify(). This routine is used to notify the CM that failover
    has occurred, so that future CM messages (LAP, DREQ) reach the remote
    CM. (Currently, we continue to use the original path.) This bumps the
    userspace CM ABI.

    New alternate path information is captured when a LAP message is sent
    or received. This allows QP attributes to be initialized for the user
    when a new path is loaded after failover occurs.

    Signed-off-by: Sean Hefty
    Signed-off-by: Roland Dreier

    Sean Hefty
     

03 Nov, 2006

1 commit


31 Oct, 2006

1 commit


23 Sep, 2006

2 commits