20 Dec, 2011

3 commits


07 Dec, 2011

1 commit

  • Commit cfcde11c3d7a ("IB/mlx4: Use flow counters on IBoE ports") added
    code that sets elements of counters[] to -1 if no counter is allocated,
    but then goes ahead and passes every entry to mlx4_counter_free() on
    shutdown. This is a bad idea, especially if MLX4_DEV_CAP_FLAG_COUNTERS
    isn't set so there isn't even an underlying bitmap to free from.

    Tested-by: Sean Hefty
    Signed-off-by: Roland Dreier

    Roland Dreier
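
The pattern behind this fix can be illustrated in user space. What follows is a minimal sketch, not the driver code: `NUM_PORTS`, `NUM_COUNTERS`, `counter_free()` and `free_port_counters()` are hypothetical stand-ins, and only the idea of skipping entries set to -1 is taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_PORTS    2   /* hypothetical port count for this sketch */
#define NUM_COUNTERS 8   /* hypothetical size of the counter pool */

/* Stand-in for the allocator bitmap: true means the counter is in use. */
static bool counter_in_use[NUM_COUNTERS];

/* Hypothetical stand-in for mlx4_counter_free(); idx must be a valid index. */
static void counter_free(int idx)
{
    assert(idx >= 0 && idx < NUM_COUNTERS);
    counter_in_use[idx] = false;
}

/*
 * Free only the entries that were really allocated; -1 marks "no counter".
 * Returns the number of counters released.
 */
static int free_port_counters(const int counters[NUM_PORTS])
{
    int freed = 0;

    for (int p = 0; p < NUM_PORTS; ++p) {
        if (counters[p] != -1) {
            counter_free(counters[p]);
            ++freed;
        }
    }
    return freed;
}
```

Passing a -1 entry straight to the free routine would trip the range check in this sketch, which mirrors why the unconditional free on shutdown was a problem.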
     

30 Nov, 2011

3 commits

  • Roland Dreier
     
  • Commit f2c31e32b37 ("net: fix NULL dereferences in check_peer_redir()")
    forgot to take care of infiniband uses of dst neighbours.

    Many thanks to Marc Aurele who provided a nice bug report and feedback.

    Reported-by: Marc Aurele La France
    Signed-off-by: Eric Dumazet
    Cc: David Miller
    Signed-off-by: Roland Dreier

    Eric Dumazet
     
  • The following can occur with ipoib when processing a multicast response:

    BUG: soft lockup - CPU#0 stuck for 67s! [ib_mad1:982]
    Modules linked in: ...
    CPU 0:
    Modules linked in: ...
    Pid: 982, comm: ib_mad1 Not tainted 2.6.32-131.0.15.el6.x86_64 #1 ProLiant DL160 G5
    RIP: 0010:[] [] _spin_unlock_irqrestore+0x17/0x20
    RSP: 0018:ffff8802119ed860 EFLAGS: 00000246
    RAX: 0000000000000004 RBX: ffff8802119ed860 RCX: 000000000000a299
    RDX: ffff88021086c700 RSI: 0000000000000246 RDI: 0000000000000246
    RBP: ffffffff8100bc8e R08: ffff880210ac229c R09: 0000000000000000
    R10: ffff88021278aab8 R11: 0000000000000000 R12: ffff8802119ed860
    R13: ffffffff8100be6e R14: 0000000000000001 R15: 0000000000000003
    FS: 0000000000000000(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
    CR2: 00000000006d4840 CR3: 0000000209aa5000 CR4: 00000000000406f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Call Trace:
    [] ? ipoib_mcast_send+0x157/0x480 [ib_ipoib]
    [] ? apic_timer_interrupt+0xe/0x20
    [] ? apic_timer_interrupt+0xe/0x20
    [] ? ipoib_path_lookup+0x124/0x2d0 [ib_ipoib]
    [] ? ipoib_start_xmit+0x17c/0x430 [ib_ipoib]
    [] ? dev_hard_start_xmit+0x2c8/0x3f0
    [] ? sch_direct_xmit+0x15a/0x1c0
    [] ? dev_queue_xmit+0x388/0x4d0
    [] ? ipoib_mcast_join_finish+0x2c7/0x510 [ib_ipoib]
    [] ? ipoib_mcast_sendonly_join_complete+0x1b8/0x1f0 [ib_ipoib]
    [] ? mcast_work_handler+0x1a6/0x710 [ib_sa]
    [] ? ib_send_mad+0xfe/0x3c0 [ib_mad]
    [] ? ib_get_cached_lmc+0xa3/0xb0 [ib_core]
    [] ? join_handler+0xeb/0x200 [ib_sa]
    [] ? ib_sa_mcmember_rec_callback+0x5c/0xa0 [ib_sa]
    [] ? recv_handler+0x3c/0x70 [ib_sa]
    [] ? ib_mad_completion_handler+0x844/0x9d0 [ib_mad]
    [] ? ib_mad_completion_handler+0x0/0x9d0 [ib_mad]
    [] ? worker_thread+0x170/0x2a0
    [] ? autoremove_wake_function+0x0/0x40
    [] ? worker_thread+0x0/0x2a0
    [] ? kthread+0x96/0xa0
    [] ? child_rip+0xa/0x20

    Coinciding with the stack trace is the following message:

    ib0: ib_address_create failed

    The code below in ipoib_mcast_join_finish() will note the above
    failure in the address handle but otherwise continue:

    ah = ipoib_create_ah(dev, priv->pd, &av);
    if (!ah) {
            ipoib_warn(priv, "ib_address_create failed\n");
    } else {

    The while loop at the bottom of ipoib_mcast_join_finish() will attempt
    to send queued multicast packets in mcast->pkt_queue and eventually
    end up in ipoib_mcast_send():

    if (!mcast->ah) {
            if (skb_queue_len(&mcast->pkt_queue) < IPOIB_MAX_MCAST_QUEUE)
                    skb_queue_tail(&mcast->pkt_queue, skb);
            else {
                    ++dev->stats.tx_dropped;
                    dev_kfree_skb_any(skb);
            }

    My read is that the code will requeue the packet and return to the
    ipoib_mcast_join_finish() while loop and the stage is set for the
    "hung" task diagnostic as the while loop never sees a non-NULL ah, and
    will do nothing to resolve.

    There are GFP_ATOMIC allocations in the provider routines, so this
    failure is possible and should be dealt with.

    The test that induced the failure is associated with a host SM on the
    same server during a shutdown.

    This patch causes ipoib_mcast_join_finish() to exit with an error
    which will flush the queued mcast packets. Nothing is done to unwind
    the QP attached state so that subsequent sends from above will retry
    the join.

    Reviewed-by: Ram Vepa
    Reviewed-by: Gary Leshner
    Signed-off-by: Mike Marciniszyn
    Signed-off-by: Roland Dreier

    Mike Marciniszyn
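
The requeue loop described in this commit message can be modelled in a few lines of user-space C. This is only a sketch under stated assumptions: `struct mcast_sketch`, `mcast_send()`, `mcast_flush()` and `join_finish()` are hypothetical names, packets are reduced to counters, and the `-ENOMEM` return value is chosen for illustration rather than taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MAX_MCAST_QUEUE 3   /* hypothetical stand-in for IPOIB_MAX_MCAST_QUEUE */

struct mcast_sketch {
    bool have_ah;   /* models mcast->ah != NULL */
    int queued;     /* models the length of mcast->pkt_queue */
    int sent;
    int dropped;
};

/* Models the send path: without an address handle the packet is requeued. */
static void mcast_send(struct mcast_sketch *m)
{
    if (!m->have_ah) {
        if (m->queued < MAX_MCAST_QUEUE)
            m->queued++;
        else
            m->dropped++;
        return;
    }
    m->sent++;
}

/* Drops everything still queued, as the error exit is described to do. */
static void mcast_flush(struct mcast_sketch *m)
{
    m->dropped += m->queued;
    m->queued = 0;
}

/* ah_ok models whether address handle creation succeeded. */
static int join_finish(struct mcast_sketch *m, bool ah_ok)
{
    m->have_ah = ah_ok;
    if (!ah_ok)
        return -ENOMEM;   /* assumed error code; bail out before the drain loop */

    while (m->queued > 0) {
        m->queued--;      /* dequeue one packet */
        mcast_send(m);    /* cannot requeue here because have_ah is true */
    }
    return 0;
}
```

Without the early return, the drain loop would dequeue a packet, `mcast_send()` would put it straight back, and the queue length would never reach zero, which is the soft lockup shown in the trace.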
     

29 Nov, 2011

3 commits

  • Don't over-schedule QSFP work on driver initialization. It could end
    up being run simultaneously on two different CPUs resulting in bad
    EEPROM reads. In combination with setting the physical IB link state
    prior to the IBC being brought out of reset, this can cause the link
    state machine to start training early with wrong settings.

    Signed-off-by: Mitko Haralanov
    Signed-off-by: Mike Marciniszyn
    Signed-off-by: Roland Dreier

    Mike Marciniszyn
     
  • Fix logic so that we don't retry with MPAv1 once we have done that
    already. Otherwise, we end up retrying with MPAv1 even when it's not
    needed on getting peer aborts - and this could lead to a kernel panic.

    Signed-off-by: Kumar Sanghvi
    Signed-off-by: Roland Dreier

    Kumar Sanghvi
     
  • Fix another place in the code where logic dealing with the t4_cqe was
    using the wrong QID. This fixes the counting logic so that it tests
    against the SQ QID instead of the RQ QID when counting RCQES.

    Signed-off-by: Jonathan Lallinger
    Signed-off-by: Steve Wise
    Signed-off-by: Roland Dreier

    Jonathan Lallinger
     

09 Nov, 2011

1 commit


07 Nov, 2011

1 commit

  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of
    net: inet_timewait_sock doesnt need
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     

05 Nov, 2011

4 commits

  • Roland Dreier
     
  • The following panic can occur when flushing a QP:

    RIP: 0010:[] [] qib_send_complete+0x3b/0x190 [ib_qib]
    RSP: 0018:ffff8803cdc6fc90 EFLAGS: 00010046
    RAX: 0000000000000000 RBX: ffff8803d84ba000 RCX: 0000000000000000
    RDX: 0000000000000005 RSI: ffffc90015a53430 RDI: ffff8803d84ba000
    RBP: ffff8803cdc6fce0 R08: ffff8803cdc6fc90 R09: 0000000000000001
    R10: 00000000ffffffff R11: 0000000000000000 R12: ffff8803d84ba0c0
    R13: ffff8803d84ba5cc R14: 0000000000000800 R15: 0000000000000246
    FS: 0000000000000000(0000) GS:ffff880036600000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
    CR2: 0000000000000034 CR3: 00000003e44f9000 CR4: 00000000000406f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process qib/0 (pid: 1350, threadinfo ffff8803cdc6e000, task ffff88042728a100)
    Stack:
    53544c5553455201 0000000100000005 0000000000000000 ffff8803d84ba000
    0000000000000000 0000000000000000 0000000000000000 0000000000000000
    0000000000000000 0000000000000001 ffff8803cdc6fd30 ffffffffa0165d7a
    Call Trace:
    [] qib_make_rc_req+0x36a/0xe80 [ib_qib]
    [] ? qib_make_rc_req+0x0/0xe80 [ib_qib]
    [] qib_do_send+0xf3/0xb60 [ib_qib]
    [] ? thread_return+0x4e/0x777
    [] ? qib_do_send+0x0/0xb60 [ib_qib]
    [] worker_thread+0x170/0x2a0
    [] ? autoremove_wake_function+0x0/0x40
    [] ? worker_thread+0x0/0x2a0
    [] kthread+0x96/0xa0
    [] child_rip+0xa/0x20
    [] ? kthread+0x0/0xa0
    [] ? child_rip+0x0/0x20
    RIP [] qib_send_complete+0x3b/0x190 [ib_qib]

    The RC error state flush logic in qib_make_rc_req() could return all
    of the acked wqes and potentially have emptied the queue. It would
    then unconditionally try to return a flush completion via
    qib_send_complete() for an invalid wqe, or worse a valid one that is
    not queued. The panic results when the completion code tries to
    maintain an MR reference count for a NULL MR.

    This fix modifies the logic to send only one completion per
    qib_make_rc_req() call, changing the completion status from
    IB_WC_SUCCESS to IB_WC_WR_FLUSH_ERR as the completions progress.

    The outer loop will call as many times as necessary to flush the queue.

    Reviewed-by: Ram Vepa
    Signed-off-by: Mike Marciniszyn
    Signed-off-by: Roland Dreier

    Mike Marciniszyn
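
The "one completion per call" rule can be sketched as follows. This is a simplified user-space model, not the qib code: `struct sq_sketch` and `flush_one()` are hypothetical, the indices are plain counters rather than ring positions, and the exact field handling in the driver may differ.

```c
#include <assert.h>
#include <stddef.h>

enum wc_status_sketch { WC_SUCCESS, WC_WR_FLUSH_ERR };

/*
 * Hypothetical send queue model:
 *   [last, acked)  entries already acked, still owed a completion
 *   [acked, head)  entries that must be flushed with an error
 */
struct sq_sketch {
    unsigned int last;
    unsigned int acked;
    unsigned int head;
};

/*
 * Complete at most one work request per call.
 * Returns 1 and sets *status if a completion was generated,
 * or 0 if the queue is empty and nothing may be completed.
 */
static int flush_one(struct sq_sketch *q, enum wc_status_sketch *status)
{
    assert(status != NULL);

    if (q->last == q->head)
        return 0;   /* empty queue: never complete an invalid entry */

    *status = (q->last != q->acked) ? WC_SUCCESS : WC_WR_FLUSH_ERR;
    if (q->last == q->acked)
        q->acked++; /* keep acked from falling behind last */
    q->last++;
    return 1;
}
```

An outer loop that keeps calling `flush_one()` until it returns 0 drains the queue, with the status switching from success to flush error once the acked entries are exhausted.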
     
  • The current driver never does DMA unmapping on these buffers. Fix that
    by adding DMA unmapping to the task cleanup callback, and DMA mapping to
    the task init function (drop the headers_initialized micro-optimization).

    Signed-off-by: Or Gerlitz
    Signed-off-by: Roland Dreier

    Or Gerlitz
     
  • The driver counted on the transactional nature of iSCSI login/text
    flows and used the same buffer for both the request and the response.
    We also went further and did DMA mapping only once, with
    DMA_FROM_DEVICE, which violates the DMA mapping API. Fix that by
    using different buffers, one for requests and one for responses, and
    use the correct DMA mapping direction for each.

    Signed-off-by: Or Gerlitz
    Signed-off-by: Roland Dreier

    Or Gerlitz
     

04 Nov, 2011

1 commit

  • The num_free field of mthca_buddy is an array of unsigned int, but it
    was allocated as an array of pointers. On 64-bit platforms this
    allocates twice as much memory as required. Fix this by allocating the
    correct size for the type.

    This is the same bug just fixed in mlx4 by Eli Cohen.

    Signed-off-by: Roland Dreier

    Roland Dreier
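
The sizing mistake is easy to reproduce outside the kernel. The sketch below assumes a hypothetical `struct buddy_sketch` with `max_order + 1` counters and uses `calloc()` in place of the kernel allocator; it only illustrates sizing by element type rather than by pointer type.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the buddy allocator state. */
struct buddy_sketch {
    unsigned int *num_free;   /* one counter per order */
    int max_order;
};

/* Bytes requested when each element is sized as a pointer (the bug pattern). */
static size_t num_free_bytes_buggy(int max_order)
{
    return (size_t)(max_order + 1) * sizeof(unsigned int *);
}

/* Bytes requested when each element is sized by the pointed-to type. */
static size_t num_free_bytes_fixed(const struct buddy_sketch *b)
{
    return (size_t)(b->max_order + 1) * sizeof(*b->num_free);
}

/* Allocate the counters using sizeof(*ptr), so the size tracks the type. */
static int buddy_init(struct buddy_sketch *b, int max_order)
{
    assert(max_order >= 0);
    b->max_order = max_order;
    b->num_free = calloc((size_t)max_order + 1, sizeof(*b->num_free));
    return b->num_free ? 0 : -1;
}
```

On a typical LP64 platform `sizeof(unsigned int *)` is 8 and `sizeof(unsigned int)` is 4, so the pointer-sized variant requests twice the memory.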
     

02 Nov, 2011

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (62 commits)
    mlx4_core: Deprecate log_num_vlan module param
    IB/mlx4: Don't set VLAN in IBoE WQEs' control segment
    IB/mlx4: Enable 4K mtu for IBoE
    RDMA/cxgb4: Mark QP in error before disabling the queue in firmware
    RDMA/cxgb4: Serialize calls to CQ's comp_handler
    RDMA/cxgb3: Serialize calls to CQ's comp_handler
    IB/qib: Fix issue with link states and QSFP cables
    IB/mlx4: Configure extended active speeds
    mlx4_core: Add extended port capabilities support
    IB/qib: Hold links until tuning data is available
    IB/qib: Clean up checkpatch issue
    IB/qib: Remove s_lock around header validation
    IB/qib: Precompute timeout jiffies to optimize latency
    IB/qib: Use RCU for qpn lookup
    IB/qib: Eliminate divide/mod in converting idx to egr buf pointer
    IB/qib: Decode path MTU optimization
    IB/qib: Optimize RC/UC code by IB operation
    IPoIB: Use the right function to do DMA unmap pages
    RDMA/cxgb4: Use correct QID in insert_recv_cqe()
    RDMA/cxgb4: Make sure flush CQ entries are collected on connection close
    ...

    Linus Torvalds
     
  • …sc', 'mlx4', 'misc', 'nes', 'qib' and 'xrc' into for-next

    Roland Dreier
     

01 Nov, 2011

11 commits

  • Some kernel components (infiniband and perf) pin user space memory by
    increasing the page count, and account that memory as "mlocked".

    The difference between mlocking and pinning is:

    A. mlocked pages are marked with PG_mlocked and are exempt from
    swapping. Page migration may move them around though.
    They are kept on a special LRU list.

    B. Pinned pages cannot be moved because something needs to
    directly access physical memory. They may not be on any
    LRU list.

    I recently saw an mlockalled process where mm->locked_vm became
    bigger than the virtual size of the process (!) because some
    memory was accounted for twice:

    Once when the page was mlocked and once when the Infiniband
    layer increased the refcount because it needed to pin the RDMA
    memory.

    This patch introduces a separate counter for pinned pages and
    accounts them separately.

    Signed-off-by: Christoph Lameter
    Cc: Mike Marciniszyn
    Cc: Roland Dreier
    Cc: Sean Hefty
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
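
The double accounting can be shown with a toy model. This is a sketch only: `struct mm_sketch` and the three helper functions are hypothetical, and the field names are assumptions loosely modelled on the `locked_vm` counter mentioned above plus a separate pinned counter.

```c
#include <assert.h>

/* Hypothetical per-process accounting, in pages. */
struct mm_sketch {
    unsigned long total_vm;    /* virtual size of the process */
    unsigned long locked_vm;   /* pages accounted as mlocked */
    unsigned long pinned_vm;   /* pages pinned, e.g. for RDMA */
};

/* mlock()/mlockall() accounting. */
static void account_mlock(struct mm_sketch *mm, unsigned long npages)
{
    mm->locked_vm += npages;
}

/* Old behaviour: pinned pages are charged to the mlock counter as well. */
static void account_pin_old(struct mm_sketch *mm, unsigned long npages)
{
    mm->locked_vm += npages;
}

/* New behaviour: pinned pages get their own counter. */
static void account_pin_new(struct mm_sketch *mm, unsigned long npages)
{
    mm->pinned_vm += npages;
}
```

With a fully mlocked process of 100 pages that also pins 50 of them, the old scheme reports 150 locked pages, more than the process maps, while the new scheme reports 100 locked and 50 pinned.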
     
  • These files were getting the moduleparam infrastructure from the
    implicit presence of module.h being everywhere, but that is going
    away soon.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     
  • These were getting it implicitly via device.h --> module.h but
    we are going to stop that when we clean up the headers.

    Fix these in advance so the tree remains bisect-clean.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     
  • They had been getting it implicitly via device.h but we can't
    rely on that for the future, due to a pending cleanup so fix
    it now.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     
  • They get it via module.h (via device.h) but we want to clean that up.
    When we do, we'll get things like:

    CC [M] drivers/infiniband/core/sysfs.o
    sysfs.c:361: error: 'S_IRUGO' undeclared here (not in a function)
    sysfs.c:654: error: 'S_IWUSR' undeclared here (not in a function)

    so add in the stat header it is using explicitly in advance.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     
  • There's no need to set the vlan-related fields in an IBoE send WQE
    control segment:

    - the vlan to be used by a UD QP is set in the datagram segment.
    - for GSI (CM) QP, all the headers down to 8021q and MAC are built by
    the software anyway.

    Signed-off-by: Or Gerlitz
    Signed-off-by: Roland Dreier

    Or Gerlitz
     
  • The IBoE port MTU is derived from the corresponding Ethernet netdevice
    MTU, which can support jumbo frames of 9K, and hence surely supports
    the max IB mtu of 4K.

    Signed-off-by: Or Gerlitz
    Signed-off-by: Roland Dreier

    Or Gerlitz
     
  • QPs need to be moved to error before telling the firmware to shut down
    the queue. Otherwise, the application can submit WRs that will never
    get fetched by the hardware and never flushed by the driver.

    Signed-off-by: Kumar Sanghvi
    Acked-by: Steve Wise
    Signed-off-by: Roland Dreier

    Tom Tucker
     
  • Commit 01e7da6ba53c ("RDMA/cxgb4: Make sure flush CQ entries are
    collected on connection close") introduced a potential problem where a
    CQ's comp_handler can get called simultaneously from different places
    in the iw_cxgb4 driver. This does not comply with
    Documentation/infiniband/core_locking.txt, which states that at a
    given point of time, only one callback per CQ should be active.

    This problem was reported by Parav Pandit.
    Based on discussion between Parav Pandit and Steve Wise, this patch
    fixes the above problem by serializing the calls to a CQ's
    comp_handler using a spin_lock.

    Reported-by: Parav Pandit
    Signed-off-by: Kumar Sanghvi
    Acked-by: Steve Wise
    Signed-off-by: Roland Dreier

    Kumar Sanghvi
     
  • iw_cxgb3 has a potential problem where a CQ's comp_handler can get
    called simultaneously from different places in the iw_cxgb3 driver.
    This does not comply with Documentation/infiniband/core_locking.txt,
    which states that at a given point of time, only one callback per CQ
    should be active.

    Such a problem was reported by Parav Pandit
    for the iw_cxgb4 driver. Based on discussion between Parav Pandit and
    Steve Wise, this patch fixes the above problem by serializing the
    calls to a CQ's comp_handler using a spin_lock.

    Signed-off-by: Kumar Sanghvi
    Acked-by: Steve Wise
    Signed-off-by: Roland Dreier

    Kumar Sanghvi
     
  • Fix an issue where the link would come up after replugging a cable
    even if it had been DISABLED manually.

    Signed-off-by: Mitko Haralanov
    Signed-off-by: Mike Marciniszyn
    Signed-off-by: Roland Dreier

    Mitko Haralanov
     

29 Oct, 2011

2 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (204 commits)
    [SCSI] qla4xxx: export address/port of connection (fix udev disk names)
    [SCSI] ipr: Fix BUG on adapter dump timeout
    [SCSI] megaraid_sas: Fix instance access in megasas_reset_timer
    [SCSI] hpsa: change confusing message to be more clear
    [SCSI] iscsi class: fix vlan configuration
    [SCSI] qla4xxx: fix data alignment and use nl helpers
    [SCSI] iscsi class: fix link local mispelling
    [SCSI] iscsi class: Replace iscsi_get_next_target_id with IDA
    [SCSI] aacraid: use lower snprintf() limit
    [SCSI] lpfc 8.3.27: Change driver version to 8.3.27
    [SCSI] lpfc 8.3.27: T10 additions for SLI4
    [SCSI] lpfc 8.3.27: Fix queue allocation failure recovery
    [SCSI] lpfc 8.3.27: Change algorithm for getting physical port name
    [SCSI] lpfc 8.3.27: Changed worst case mailbox timeout
    [SCSI] lpfc 8.3.27: Miscellanous logic and interface fixes
    [SCSI] megaraid_sas: Changelog and version update
    [SCSI] megaraid_sas: Add driver workaround for PERC5/1068 kdump kernel panic
    [SCSI] megaraid_sas: Add multiple MSI-X vector/multiple reply queue support
    [SCSI] megaraid_sas: Add support for MegaRAID 9360/9380 12GB/s controllers
    [SCSI] megaraid_sas: Clear FUSION_IN_RESET before enabling interrupts
    ...

    Linus Torvalds
     
  • Set the extended active speeds based on the hardware configuration.

    Signed-off-by: Marcel Apfelbaum
    Reviewed-by: Hal Rosenstock

    [ Move FDR-10 handling into ib_link_query_port(). - Roland ]

    Signed-off-by: Roland Dreier

    Marcel Apfelbaum
     

22 Oct, 2011

8 commits

  • Hold the link state machine until the tuning data is read from the
    QSFP EEPROM so correct tuning settings are applied before the state
    machine attempts to bring the link up. Link is also held on cable
    unplug in case a different cable is used.

    Signed-off-by: Mitko Haralanov
    Signed-off-by: Mike Marciniszyn
    Signed-off-by: Roland Dreier

    Mitko Haralanov
     
  • The checkpatch issue was probably present from the initial submission.

    Signed-off-by: Mike Marciniszyn
    Signed-off-by: Roland Dreier

    Mike Marciniszyn
     
  • Review of qib_ruc_check_hdr() shows that the s_lock is not required in
    the normal case. The r_lock is held in all cases, and protects the qp
    fields that are read.

    The s_lock is still needed around the call to qib_migrate_qp() to
    ensure that the send engine sees a consistent set of fields.

    Signed-off-by: Mike Marciniszyn
    Signed-off-by: Roland Dreier

    Mike Marciniszyn
     
  • A new field is added to qib_qp called timeout_jiffies. It is
    initialized upon create and modify.

    The field is now used instead of a computation based on qp->timeout.

    Signed-off-by: Mike Marciniszyn
    Signed-off-by: Roland Dreier

    Mike Marciniszyn
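
A user-space sketch of the precomputation follows. It assumes the usual InfiniBand encoding in which the timeout field is an exponent and the interval is 4.096 microseconds times 2^timeout; `SKETCH_HZ`, `usecs_to_ticks()`, `struct qp_sketch` and `qp_set_timeout()` are hypothetical stand-ins for the kernel's tick rate, conversion helper and QP structure.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_HZ 1000ULL   /* hypothetical tick rate, ticks per second */

/* Convert microseconds to ticks, rounding up (stand-in for a kernel helper). */
static uint64_t usecs_to_ticks(uint64_t usecs)
{
    return (usecs * SKETCH_HZ + 999999ULL) / 1000000ULL;
}

struct qp_sketch {
    uint8_t timeout;            /* encoded exponent */
    uint64_t timeout_jiffies;   /* cached result, computed once */
};

/* Called on create/modify: do the conversion once instead of per packet. */
static void qp_set_timeout(struct qp_sketch *qp, uint8_t timeout)
{
    assert(timeout < 32);       /* the encoded field is 5 bits wide */
    qp->timeout = timeout;
    /* 4096 ns * 2^timeout, converted to microseconds, then to ticks */
    qp->timeout_jiffies = usecs_to_ticks((4096ULL << timeout) / 1000ULL);
}
```

The hot path then reads `timeout_jiffies` directly, trading a few bytes in the QP for the shift, divide and conversion it would otherwise repeat.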
     
  • The heavy weight spinlock in qib_lookup_qpn() is replaced with RCU.
    The hash list itself is now accessed via jhash functions instead of mod.

    The changes should benefit multiple receive contexts in different
    processors by not contending for the lock just to read the hash
    structures.

    The patch also adds a lookaside_qp (pointer) and a lookaside_qpn in
    the context. The interrupt handler will test the current packet's qpn
    against lookaside_qpn if the lookaside_qp pointer is non-NULL. The
    pointer is NULL'ed when the interrupt handler exits.

    Signed-off-by: Mike Marciniszyn
    Signed-off-by: Roland Dreier

    Mike Marciniszyn
     
  • The context init now saves a shift computed from rcvegrbufs_perchunk
    in rcvegrbufs_perchunk_shift using ilog2(). A BUG_ON() protects the
    power-of-2 assumption.

    Signed-off-by: Mike Marciniszyn
    Signed-off-by: Roland Dreier

    Mike Marciniszyn
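
Replacing divide/mod with shift/mask can be sketched like this. The names `struct rcd_sketch`, `ilog2_u32()`, `rcd_init()` and `idx_to_chunk()` are hypothetical, and an `assert()` stands in for the BUG_ON() mentioned above.

```c
#include <assert.h>
#include <stdint.h>

struct rcd_sketch {
    uint32_t bufs_perchunk;         /* must be a power of two */
    uint32_t bufs_perchunk_shift;   /* log2 of bufs_perchunk, computed once */
};

/* Integer log2 for a non-zero value (user-space stand-in for ilog2). */
static uint32_t ilog2_u32(uint32_t v)
{
    uint32_t r = 0;

    while (v >>= 1)
        ++r;
    return r;
}

static void rcd_init(struct rcd_sketch *rcd, uint32_t perchunk)
{
    /* Guard the power-of-two assumption the shift/mask relies on. */
    assert(perchunk != 0 && (perchunk & (perchunk - 1)) == 0);
    rcd->bufs_perchunk = perchunk;
    rcd->bufs_perchunk_shift = ilog2_u32(perchunk);
}

/* Same result as idx / perchunk and idx % perchunk, without dividing. */
static void idx_to_chunk(const struct rcd_sketch *rcd, uint32_t idx,
                         uint32_t *chunk, uint32_t *offset)
{
    *chunk = idx >> rcd->bufs_perchunk_shift;
    *offset = idx & (rcd->bufs_perchunk - 1);
}
```

The shift and mask are only equivalent to divide and modulo when the chunk size is a power of two, which is why the assumption is checked at init time.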
     
  • Store both the encoded and decoded MTU in the QP structure as a minor
    optimization for UC/RC receive routines.

    Signed-off-by: Mike Marciniszyn
    Signed-off-by: Roland Dreier

    Mike Marciniszyn
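
Caching the decoded MTU next to the encoded one might look like the sketch below. The enum values follow the standard InfiniBand MTU encoding (1 for 256 bytes up to 5 for 4096 bytes); `struct qp_mtu_sketch`, `mtu_enc_to_bytes()` and `qp_set_path_mtu()` are hypothetical names, not the qib fields.

```c
#include <assert.h>

/* Encoded path MTU values: 1 = 256 bytes ... 5 = 4096 bytes. */
enum mtu_enc_sketch {
    MTU_256 = 1,
    MTU_512,
    MTU_1024,
    MTU_2048,
    MTU_4096
};

struct qp_mtu_sketch {
    enum mtu_enc_sketch path_mtu;   /* encoded form, as carried on the wire */
    unsigned int pmtu;              /* decoded form in bytes, cached */
};

/* Decode an encoded MTU to bytes; returns -1 for an invalid encoding. */
static int mtu_enc_to_bytes(int enc)
{
    if (enc < MTU_256 || enc > MTU_4096)
        return -1;
    return 128 << enc;   /* 1 -> 256, 5 -> 4096 */
}

/* Decode once when the QP is modified, so receive paths can read pmtu. */
static int qp_set_path_mtu(struct qp_mtu_sketch *qp, int enc)
{
    int bytes = mtu_enc_to_bytes(enc);

    if (bytes < 0)
        return -1;
    qp->path_mtu = (enum mtu_enc_sketch)enc;
    qp->pmtu = (unsigned int)bytes;
    return 0;
}
```

The receive routines can then compare packet lengths against `pmtu` without decoding the enum on every packet.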
     
  • The memset for zeroing work completions had been unconditional.

    This patch removes the memset and moves the zeroing into the work
    completion with a more explicit field by field set. With this patch,
    non-ONLY/non-LAST packets will avoid the overhead since they will not
    generate a completion.

    Signed-off-by: Mike Marciniszyn
    Signed-off-by: Roland Dreier

    Mike Marciniszyn