18 Dec, 2019

1 commit

  • commit 324e1c402069e8d277d2a2b18ce40bde1265b96a upstream.

    In cases where I/O may be aborted, such as driver unload or link bounces,
    the system can crash due to a bad ndlp pointer.

    Example:
    RIP: 0010:lpfc_sli4_abts_err_handler+0x15/0x140 [lpfc]
    ...
    lpfc_sli4_io_xri_aborted+0x20d/0x270 [lpfc]
    lpfc_sli4_sp_handle_abort_xri_wcqe.isra.54+0x84/0x170 [lpfc]
    lpfc_sli4_fp_handle_cqe+0xc2/0x480 [lpfc]
    __lpfc_sli4_process_cq+0xc6/0x230 [lpfc]
    __lpfc_sli4_hba_process_cq+0x29/0xc0 [lpfc]
    process_one_work+0x14c/0x390

    The crash was caused by a bad ndlp address passed with the I/O indicated
    by the XRI aborted CQE. The address was not NULL, so the routine
    dereferenced the ndlp pointer. The bad ndlp also caused
    lpfc_sli4_io_xri_aborted to call an erroneous io handler. Root cause for
    the bad ndlp was an lpfc_ncmd that was aborted, put on the abort_io list,
    completed, taken off the abort_io list, and sent to lpfc_release_nvme_buf,
    where it was put back on the abort_io list because the LPFC_SBUF_XBUSY bit
    in lpfc_ncmd->flags was not cleared on the final completion.

    Rework the exchange busy handling to ensure the flags are properly set for
    both scsi and nvme.
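    The bug above can be modeled in a few lines. This is a hypothetical
    userspace sketch, not the driver's actual structures: only the flag name
    LPFC_SBUF_XBUSY comes from the log; the struct and functions are
    illustrative.

```c
#include <assert.h>

/* Hypothetical model of an I/O buffer whose release path re-queues it
 * to the abort list whenever the exchange-busy flag is still set. */
#define LPFC_SBUF_XBUSY 0x1

struct io_buf {
    unsigned int flags;
    int on_abort_list;
    int on_free_list;
};

/* Final completion must clear the exchange-busy flag. */
static void final_completion(struct io_buf *buf)
{
    buf->flags &= ~LPFC_SBUF_XBUSY;
}

/* Release path: a stale XBUSY flag sends the buffer back to the
 * abort list instead of the free list -- the failure described above. */
static void release_buf(struct io_buf *buf)
{
    if (buf->flags & LPFC_SBUF_XBUSY)
        buf->on_abort_list = 1;
    else
        buf->on_free_list = 1;
}

/* Returns 1 when the corrected ordering lands the buffer on the free list. */
static int demo_fixed_path(void)
{
    struct io_buf buf = { LPFC_SBUF_XBUSY, 0, 0 };

    final_completion(&buf);   /* flag cleared on the final completion */
    release_buf(&buf);
    return buf.on_free_list && !buf.on_abort_list;
}
```

    Without the final_completion() call clearing the flag, the same buffer
    would land back on the abort list and later be handed out with stale
    state, matching the crash signature above.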

    Fixes: c490850a0947 ("scsi: lpfc: Adapt partitioned XRI lists to efficient sharing")
    Cc: # v5.1+
    Link: https://lore.kernel.org/r/20191018211832.7917-6-jsmart2021@gmail.com
    Signed-off-by: Dick Kennedy
    Signed-off-by: James Smart
    Signed-off-by: Martin K. Petersen
    Signed-off-by: Greg Kroah-Hartman

    James Smart
     

20 Aug, 2019

1 commit

    Typical SLI-4 hardware supports up to two 4KB pages to be registered per
    XRI to contain the exchange's Scatter/Gather List. This caps the number of
    SGL elements that can be in the SGL. There is no extension mechanism to
    grow the list beyond the two pages.

    The G7 hardware adds an SGE type that allows the SGL to be vectored to a
    different scatter/gather list segment. That segment can in turn contain an
    SGE pointing to yet another segment, and so on. The initial segment must
    still be pre-registered for the XRI, but it can now be much smaller (256
    bytes) because it can be grown dynamically. This much smaller allocation
    can handle the SG list for most normal I/O, and the dynamic aspect allows
    it to support many MBs if needed.

    The implementation creates a pool of "segments", initially sized to hold
    one small initial segment per XRI. If an I/O requires additional segments,
    they are allocated from the pool. If the pool has no more segments, it is
    grown to meet the new demand. After the I/O completes, the additional
    segments are returned to the pool for use by other I/Os. Once allocated,
    the additional segments are not released, on the assumption that "if
    needed once, it will be needed again". Pools are kept on a per-hardware
    queue basis, which is typically 1:1 with a CPU but may be shared by
    multiple CPUs.
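    The grow-on-demand, never-shrink pool behavior described above can be
    sketched as a toy model (illustrative names, not driver code):

```c
#include <assert.h>

/* Toy model of a segment pool that grows on demand and never shrinks:
 * "if needed once, it will be needed again". */
struct seg_pool {
    int capacity;   /* total segments ever allocated */
    int free_cnt;   /* segments currently available */
};

static void seg_pool_init(struct seg_pool *p, int initial)
{
    p->capacity = initial;
    p->free_cnt = initial;
}

/* Take n segments; grow the pool if it cannot satisfy the request. */
static void seg_pool_get(struct seg_pool *p, int n)
{
    if (n > p->free_cnt) {
        p->capacity += n - p->free_cnt;  /* grow by the shortfall */
        p->free_cnt = n;
    }
    p->free_cnt -= n;
}

/* Return segments after I/O completion; capacity is retained. */
static void seg_pool_put(struct seg_pool *p, int n)
{
    p->free_cnt += n;
}

/* A large I/O forces growth; the capacity remains for later I/Os. */
static int demo_capacity_after_large_io(void)
{
    struct seg_pool pool;

    seg_pool_init(&pool, 4);
    seg_pool_get(&pool, 10);  /* needs 10 segments, pool grows to 10 */
    seg_pool_put(&pool, 10);
    return pool.capacity;
}
```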

    The switch to the smaller initial allocation significantly reduces the
    memory footprint of the driver (which grows only if large I/Os are
    issued). With the several thousand XRIs on an adapter, the 8KB->256B
    reduction can conserve 32MB or more.

    It has been observed with per-CPU resource pools that a resource allocated
    on CPU A may be put back on CPU B. While the get routines are distributed
    evenly, only a limited subset of CPUs may be handling the put routines.
    This can put a strain on the lpfc_put_cmd_rsp_buf_per_cpu routine because
    all the resources are being put back on a limited subset of CPUs.

    Signed-off-by: Dick Kennedy
    Signed-off-by: James Smart
    Signed-off-by: Martin K. Petersen

    James Smart
     

20 Mar, 2019

2 commits

    The driver periodically checks for adapter error in a background thread.
    If the thread detects an error, the adapter is reset, including deletion
    and reallocation of the adapter's workqueues. Simultaneously, a user-space
    request to offline the adapter may attempt many of the same steps, in
    parallel, on a different thread. Because memory was deallocated when it
    was not expected to be, the parallel offline request dereferenced a bad
    pointer.

    Add coordination between the two threads. The error recovery thread has
    precedence. So, when an error is detected, a flag is set on the adapter to
    indicate the error thread is terminating the adapter. But, before doing
    that work, it will look for a flag that is set by the offline flow, and if
    set, will wait for it to complete before then processing the error handling
    path. Similarly, in the offline thread, it first checks for whether the
    error thread is resetting the adapter, and if so, will then wait for the
    error thread to finish. Only after it has finished, will it set its flag
    and offline the adapter.
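    The handshake above reduces to two flags and a precedence rule. A
    single-threaded sketch (flag names are hypothetical, not the driver's):

```c
#include <assert.h>

/* Model of the flag handshake: the error-recovery thread has
 * precedence over the user-space offline request. */
#define FLAG_ERR_RESETTING 0x1
#define FLAG_OFFLINE_BUSY  0x2

enum action { PROCEED, WAIT_FOR_PEER };

/* Error thread: wait for an in-flight offline to finish, then claim. */
static enum action error_thread_begin(unsigned int *flags)
{
    if (*flags & FLAG_OFFLINE_BUSY)
        return WAIT_FOR_PEER;
    *flags |= FLAG_ERR_RESETTING;
    return PROCEED;
}

/* Offline thread: defer whenever the error thread owns the adapter. */
static enum action offline_begin(unsigned int *flags)
{
    if (*flags & FLAG_ERR_RESETTING)
        return WAIT_FOR_PEER;
    *flags |= FLAG_OFFLINE_BUSY;
    return PROCEED;
}

/* Error thread claims first, so a concurrent offline must wait. */
static int demo_error_precedence(void)
{
    unsigned int flags = 0;

    if (error_thread_begin(&flags) != PROCEED)
        return 0;
    return offline_begin(&flags) == WAIT_FOR_PEER;
}
```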

    Signed-off-by: Dick Kennedy
    Signed-off-by: James Smart
    Signed-off-by: Martin K. Petersen

    James Smart
     
  • The debug ktime counters that trace an io were inadvertently not placed in
    the common section of an io buffer. Thus, they generate an invalid opcode
    error when accessed.

    Move the ktime counters into the common area.

    Fixes: 0794d601d174 ("scsi: lpfc: Implement common IO buffers between NVME and SCSI")
    Signed-off-by: Dick Kennedy
    Signed-off-by: James Smart
    Signed-off-by: Martin K. Petersen

    James Smart
     

06 Feb, 2019

4 commits

  • For files modified as part of 12.2.0.0 patches, update copyright to 2019

    Signed-off-by: Dick Kennedy
    Signed-off-by: James Smart
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Martin K. Petersen

    James Smart
     
    A scsi host lock is taken on every io completion to check whether the
    abort handler is waiting on that completion. Taking this expensive lock on
    every completion is wasteful when aborts are rare.

    Replace the scsi host lock with a command-specific lock, and synchronize
    the completion and abort paths with the new lock. Ensure all flag changes
    and nulling of context pointers are done under the lock. While adding the
    lock to the task management abort path, it was found to be missing other
    synchronization; add that synchronization to match the normal paths.
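    A minimal sketch of the per-command lock pattern, with hypothetical field
    names (only the idea of a command-scoped lock comes from the log):

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Per-command lock replacing a host-wide lock: completion and abort
 * both take cmd->lock, and context pointers are nulled only under it. */
#define CMD_FLAG_ABORT_WAIT 0x1

struct cmd_buf {
    pthread_mutex_t lock;     /* per-command, not per-host */
    unsigned int flags;
    void *io_context;
    int abort_was_notified;
};

static void cmd_complete(struct cmd_buf *cmd)
{
    pthread_mutex_lock(&cmd->lock);
    if (cmd->flags & CMD_FLAG_ABORT_WAIT)
        cmd->abort_was_notified = 1;  /* wake the waiting abort handler */
    cmd->io_context = NULL;           /* nulled under the lock */
    pthread_mutex_unlock(&cmd->lock);
}

/* Completion with an abort waiter: context is cleared under the
 * command lock and the waiter is notified. */
static int demo_completion_under_cmd_lock(void)
{
    struct cmd_buf cmd = {
        PTHREAD_MUTEX_INITIALIZER, CMD_FLAG_ABORT_WAIT, &cmd, 0
    };

    cmd_complete(&cmd);
    return cmd.io_context == NULL && cmd.abort_was_notified;
}
```

    The design point is scope: contention is now between at most two parties
    (completion and abort of the same command) instead of every completion on
    the host.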

    Signed-off-by: Dick Kennedy
    Signed-off-by: James Smart
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Martin K. Petersen

    James Smart
     
  • The XRI get/put lists were partitioned per hardware queue. However, the
    adapter rarely had sufficient resources to give a large number of resources
    per queue. As such, it became common for a cpu to encounter a lack of XRI
    resource and request the upper io stack to retry after returning a BUSY
    condition. This occurred even though other cpus were idle and not using
    their resources.

    Create as efficient a scheme as possible to move resources to the cpus
    that need them. Each cpu maintains a small private pool which it allocates
    from for io. There is a watermark level that the cpu attempts to keep in
    the private pool. The private pool, when empty, pulls from the cpu's
    global pool. When the cpu's global pool is empty, it will pull from other
    cpus' global pools. As there are many cpu global pools (1 per cpu or
    hardware queue count) and as each cpu selects which cpu to pull from at
    different rates and at different times, this creates a randomizing effect
    that minimizes the number of cpus that contend with each other when they
    steal XRIs from another cpu's global pool.

    On io completion, a cpu will push the XRI back onto its private pool. A
    watermark level is maintained for the private pool such that when it is
    exceeded, XRIs are moved to the cpu's global pool so that other cpus may
    allocate them.

    On NVME, as heartbeat commands are critical to get placed on the wire, a
    single expedite pool is maintained. When a heartbeat is to be sent, it
    will allocate an XRI from the expedite pool rather than the normal cpu
    private/global pools. On any io completion, if a reduction in the expedite
    pool is seen, it will be replenished before the XRI is placed on the cpu
    private pool.

    Statistics are added to aid understanding the XRI levels on each cpu and
    their behaviors.
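    The private/global/steal flow above can be modeled compactly. This is a
    toy illustration (pool sizes, watermark, and names are invented), not the
    driver's implementation:

```c
#include <assert.h>

/* Toy model of per-CPU XRI pools: get from the private pool first,
 * then the CPU's own global pool, then steal from another CPU's
 * global pool. Puts refill the private pool and spill to the global
 * pool above a watermark. */
#define NCPUS      4
#define WATERMARK  2

struct xri_pools {
    int priv[NCPUS];    /* private pool depth per CPU */
    int glob[NCPUS];    /* global pool depth per CPU */
};

static int xri_get(struct xri_pools *p, int cpu)
{
    if (p->priv[cpu] > 0) { p->priv[cpu]--; return 1; }
    if (p->glob[cpu] > 0) { p->glob[cpu]--; return 1; }
    for (int other = 0; other < NCPUS; other++)
        if (p->glob[other] > 0) { p->glob[other]--; return 1; }
    return 0;   /* truly out of XRIs: caller returns BUSY upward */
}

static void xri_put(struct xri_pools *p, int cpu)
{
    p->priv[cpu]++;
    if (p->priv[cpu] > WATERMARK) {   /* spill excess to the global pool */
        p->priv[cpu]--;
        p->glob[cpu]++;
    }
}

/* CPU 0 is empty but makes progress by stealing from CPU 1's
 * global pool instead of returning BUSY. */
static int demo_steal_from_other_cpu(void)
{
    struct xri_pools p = { {0, 0, 0, 0}, {0, 3, 0, 0} };

    return xri_get(&p, 0) == 1 && p.glob[1] == 2;
}
```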

    Signed-off-by: Dick Kennedy
    Signed-off-by: James Smart
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Martin K. Petersen

    James Smart
     
    Once the IO buffer allocations were made shared, there was a single XRI
    buffer list shared by all hardware queues. A single list isn't great for
    performance when shared across the per-cpu hardware queues.

    Create a separate XRI IO buffer get/put list for each Hardware Queue. As
    SGLs and associated IO buffers are allocated/posted to the firmware,
    round-robin their assignment across all available Hardware Queues so that
    there is an equitable assignment.
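    The round-robin post can be sketched as follows (illustrative names; the
    only assumption beyond the log is the modulo scheme commonly used for
    such rotation):

```c
#include <assert.h>

/* As each SGL/IO buffer is posted to firmware, assign it to hardware
 * queues in rotation so every queue receives an equitable share. */
#define NUM_HDWQ 4

static void assign_round_robin(int nbufs, int counts[NUM_HDWQ])
{
    for (int i = 0; i < nbufs; i++)
        counts[i % NUM_HDWQ]++;   /* buffer i -> hardware queue i mod N */
}

/* 100 buffers over 4 queues: exactly 25 each. */
static int demo_equitable_split(void)
{
    int counts[NUM_HDWQ] = {0};

    assign_round_robin(100, counts);
    for (int q = 0; q < NUM_HDWQ; q++)
        if (counts[q] != 25)
            return 0;
    return 1;
}
```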

    Modify SCSI and NVME IO submit code paths to use the Hardware Queue logic
    for XRI allocation.

    Add a debugfs interface to display hardware queue statistics.

    Add a new empty_io_bufs counter to track whether a cpu runs out of XRIs.

    Replace common_ variables/names with io_ to make meanings clearer.

    Signed-off-by: Dick Kennedy
    Signed-off-by: James Smart
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Martin K. Petersen

    James Smart
     

08 Dec, 2018

1 commit

    The driver data structure for managing a mailbox command contained two
    context fields. Unfortunately, the contexts were considered "generic", to
    be used at the whim of the command code. Of course, one section of code
    used the fields one way while another used them differently, and
    eventually there were mixups.

    Refactor the structure so that the generic contexts become a node context
    and a buffer context, and have all code standardize on their use.

    Signed-off-by: Dick Kennedy
    Signed-off-by: James Smart
    Signed-off-by: Martin K. Petersen

    James Smart
     

11 Jul, 2018

1 commit


27 Jun, 2018

1 commit

  • The get_seconds() function suffers from a possible overflow in 2038 or
    2106, as well as jitter due to settimeofday or leap second updates, and is
    deprecated.

    As we are interested in elapsed time only, using ktime_get_seconds() to
    read the CLOCK_MONOTONIC timebase is ideal here. This also lets us remove
    the hack that tries to deal with get_seconds() going slightly backwards,
    which cannot happen with monotonic timestamps.

    Signed-off-by: Arnd Bergmann
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Martin K. Petersen

    Arnd Bergmann
     

13 Mar, 2018

2 commits

    The POST_SGL_PAGES mailbox command failed with a timeout status.

    wait_event_interruptible_timeout(), when called from the mailbox wait
    interface, gets interrupted and will randomly fail. The behavior seems
    very specific to one particular server type.

    Fix by changing from wait_event_interruptible_timeout to
    wait_for_completion_timeout.

    Signed-off-by: Dick Kennedy
    Signed-off-by: James Smart
    Signed-off-by: Martin K. Petersen

    James Smart
     
    The driver is very sloppy about the WQE structure passed between routines.
    The base struct type is a 64-byte WQE, but many routines typecast it and
    access 128-byte WQEs. There were a couple of cases in the past (already
    corrected) where the typecasts were done incorrectly and a 64-byte buffer
    was accessed as a 128-byte buffer.

    Clean this up by properly declaring WQEs as 128-byte WQEs and removing
    the typecasts. 64-byte WQEs are considered a subset of the 128-byte WQEs.
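    The subset relationship can be expressed directly in the type system.
    This is an illustrative layout, not the driver's exact field definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Declare the WQE as the full 128-byte form, with the 64-byte WQE as
 * a subset view at offset 0, so no typecasting between sizes is needed. */
union lpfc_wqe {
    uint32_t words[16];     /* 64-byte base WQE */
};

union lpfc_wqe128 {
    uint32_t words[32];     /* full 128-byte WQE */
    union lpfc_wqe wqe;     /* 64-byte subset, always at offset 0 */
};
```

    Any routine that only understands the 64-byte layout can safely take
    &wqe128->wqe; the reverse cast (treating a 64-byte buffer as 128 bytes)
    becomes impossible without an explicit, visible conversion.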

    Signed-off-by: Dick Kennedy
    Signed-off-by: James Smart
    Signed-off-by: Martin K. Petersen

    James Smart
     

13 Jun, 2017

1 commit

  • Administrator intervention is currently required to get good numbers
    when switching from running latency tests to IOPS tests.

    The configured interrupt coalescing values greatly affect the results of
    these tests. Currently, the driver has a single coalescing value set by a
    module attribute. This patch changes the driver to support
    auto-configuration of the coalescing value based on the total number of
    outstanding IOs and the average number of CQEs processed per interrupt for
    an EQ. Values are checked every 5 seconds.

    The driver defaults to the automatic selection. Automatic selection can
    be disabled by the new lpfc_auto_imax module_parameter.

    Older hardware can only change interrupt coalescing via mailbox command.
    Newer hardware supports changing it via a register. The patch supports
    both.
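    The log does not spell out the selection formula, so the following is a
    purely hypothetical illustration of the idea: each sampling period, derive
    a coalescing value from the average CQEs handled per interrupt, so heavy
    IOPS loads batch more while latency-bound loads interrupt immediately.

```c
#include <assert.h>

/* Hypothetical heuristic (not the driver's actual algorithm): map the
 * observed average CQEs per interrupt to a coalescing setting. */
static int pick_imax(int avg_cqe_per_intr)
{
    if (avg_cqe_per_intr <= 1)
        return 0;        /* latency mode: no coalescing delay */
    if (avg_cqe_per_intr <= 8)
        return 32;       /* moderate load: small delay */
    return 128;          /* IOPS mode: batch aggressively */
}
```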

    Signed-off-by: Dick Kennedy
    Signed-off-by: James Smart
    Signed-off-by: Martin K. Petersen

    James Smart
     

23 Feb, 2017

4 commits

  • Update copyrights to 2017 for all files touched in this patch set

    Signed-off-by: Dick Kennedy
    Signed-off-by: James Smart
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Martin K. Petersen

    James Smart
     
  • NVME Target: Base modifications

    This set of patches adds the base modifications for NVME target support

    The base modifications consist of:
    - Additional module parameters or configuration tuning
    - Enablement of configuration mode for NVME target. Ties into the
    queueing model put into place by the initiator basemods patches.
    - Target-specific buffer pools, dma pools, sgl pools

    [mkp: fixed space at end of file]

    Signed-off-by: Dick Kennedy
    Signed-off-by: James Smart
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Martin K. Petersen

    James Smart
     
  • NVME Initiator: Base modifications

    This patch adds base modifications for NVME initiator support.

    The base modifications consist of:
    - Formal split of SLI3 rings from SLI-4 WQs (sometimes referred to as
    rings as well) as implementation now widely varies between the two.
    - Addition of configuration modes:
    SCSI initiator only; NVME initiator only; NVME target only; and
    SCSI and NVME initiator.
    The configuration mode drives overall adapter configuration,
    offloads enabled, and resource splits.
    NVME support is only available on SLI-4 devices and newer fw.
    - Implements the following based on configuration mode:
    - Exchange resources are split by protocol; obviously, if only
    1 mode is enabled, no split occurs. The default is 50/50, and a
    module attribute allows tuning.
    - Pools and config parameters are separated per-protocol
    - Each protocol has its own set of queues, but they share interrupt
    vectors.
    SCSI:
    SLI3 devices have few queues and the original style of queue
    allocation remains.
    SLI4 devices piggy back on an "io-channel" concept that
    eventually needs to merge with scsi-mq/blk-mq support (it is
    underway). For now, the paradigm continues as it existed
    prior. io channel allocates N msix and N WQs (N=4 default)
    and either round robins or uses cpu # modulo N for scheduling.
    A bunch of module parameters allow the configuration to be
    tuned.
    NVME (initiator):
    Allocates an msix per cpu (or whatever pci_alloc_irq_vectors
    gets)
    Allocates a WQ per cpu, and maps the WQs to msix on a WQ #
    modulo msix vector count basis.
    Module parameters exist to cap/control the config if desired.
    - Each protocol has its own buffer and dma pools.

    I apologize for the size of the patch.

    Signed-off-by: Dick Kennedy
    Signed-off-by: James Smart

    ----
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Martin K. Petersen

    James Smart
     
  • This contains code cleanups that were in the prior patch set.
    This allows better review of real changes later.

    minor code cleanups:
    fix indentation, punctuation, line length
    addition/reduction of whitespace
    remove unneeded parens, braces
    lpfc_debugfs_nodelist_data: print as u64 rather than byte by byte
    convert printk(KERN_ERR ...) to pr_err()
    small print string deltas
    use num_present_cpus() rather than count them
    comment updates
    rctl/type names moved to module variable, not on stack

    Signed-off-by: Dick Kennedy
    Signed-off-by: James Smart
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Martin K. Petersen

    James Smart
     

16 Jul, 2016

2 commits


10 Apr, 2015

2 commits


17 Sep, 2014

1 commit


03 Jun, 2014

1 commit


16 Mar, 2014

1 commit


11 Sep, 2013

2 commits


24 Aug, 2013

2 commits


27 Nov, 2012

1 commit


14 Sep, 2012

3 commits


17 May, 2012

1 commit


19 Feb, 2012

1 commit

  • T10 Diff fixes and enhancements:

    - Add SLI4 Lancer support for T10 DIF / BlockGuard (121980)
    - Fix SLI4 BlockGuard behavior when protection data is generated by HBA (121980)
    - Enhance debugfs for injecting T10 DIF errors (123966, 132966)
    - Fix Incorrect usage of bghm for BlockGuard errors (127022)

    Signed-off-by: Alex Iannicelli
    Signed-off-by: James Smart
    Signed-off-by: James Bottomley

    James Smart
     

17 Oct, 2011

1 commit


27 May, 2011

1 commit


22 Dec, 2010

2 commits

  • Implement the FC and SLI async event handlers:

    - Updated MQ_CREATE_EXT mailbox structure to include fc and SLI async events.
    - Added the SLI trailer code.
    - Split physical field into type and number to reflect latest SLI spec.
    - Changed lpfc_acqe_fcoe to lpfc_acqe_fip to reflect latest Spec changes.
    - Added lpfc_acqe_fc_la structure for FC link attention async events.
    - Added lpfc_acqe_sli structure for sli async events.
    - Added lpfc_sli4_async_fc_evt routine to handle fc la async events.
    - Added lpfc_sli4_async_sli routine to handle sli async events.
    - Moved LPFC_TRAILER_CODE_FC to be handled by its own handler function.

    Signed-off-by: Alex Iannicelli
    Signed-off-by: James Smart
    Signed-off-by: James Bottomley

    James Smart
     
  • Added support for ELS RRQ command

    - Add new routine lpfc_set_rrq_active() to track XRI qualifier state.
    - Add new module parameter lpfc_enable_rrq to control RRQ operation.
    - Add logic to ELS RRQ completion handler and xri qualifier timeout
    to clear XRI qualifier state.
    - Use OX_ID from XRI_ABORTED_CQE for RRQ payload.
    - Tie the abort and XRI_ABORTED_CQE handler to RRQ generation.

    Signed-off-by: Alex Iannicelli
    Signed-off-by: James Smart
    Signed-off-by: James Bottomley

    James Smart
     

28 Jul, 2010

1 commit