21 Sep, 2021

3 commits

  • Remove the freeze/unfreeze around changes to the number of hardware
    queues. Study and retest have indicated there are no I/Os that can be
    active at this point, so there is nothing to freeze.

    nvme-fc is draining the queues in the shutdown and error recovery path
    in __nvme_fc_abort_outstanding_ios.

    This patch primarily reverts 88e837ed0f1f "nvme-fc: wait for queues to
    freeze before calling update_hr_hw_queues". It's not an exact revert, as
    it keeps the adjustment of hw queues, but only when the count changes
    (see the sketch after this entry).

    Signed-off-by: James Smart
    [dwagner: added explanation why no IO is pending]
    Signed-off-by: Daniel Wagner
    Reviewed-by: Ming Lei
    Reviewed-by: Himanshu Madhani
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig

    James Smart
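
    A minimal sketch of the update path this change leaves behind; the
    variable names are illustrative, not taken verbatim from the driver:

        /* Sketch: resize the tagset only when the hw queue count changed.
         * No freeze/unfreeze is needed because no I/O can be active here. */
        if (prior_ioq_cnt != nr_io_queues)
                blk_mq_update_nr_hw_queues(&ctrl->tag_set, nr_io_queues);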
     
  • To avoid a race between timeout and teardown, the teardown process
    first quiesces the queue, then deletes the timer and cancels the
    timeout work for the queue.

    This patch merges the admin and I/O sync ops into the queue teardown
    logic, as done in the RDMA patch 3017013dcc "nvme-rdma: avoid race
    between time out and tear down". There is no teardown_lock in nvme-fc.
    A sketch of the ordering follows this entry.

    Signed-off-by: James Smart
    Tested-by: Daniel Wagner
    Reviewed-by: Himanshu Madhani
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Daniel Wagner
    Signed-off-by: Christoph Hellwig

    James Smart
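
    A minimal sketch of the ordering described above, using the standard
    block layer helpers (illustrative, not the patch itself):

        /* Quiesce first so no new requests are dispatched, then sync the
         * queue, which deletes the timeout timer and cancels timeout work. */
        blk_mq_quiesce_queue(q);
        blk_sync_queue(q);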
     
  • In case the number of hardware queues changes, we need to update the
    tagset and the mapping of ctx to hctx first.

    If we try to create and connect the I/O queues first, the operation
    will fail (the target will reject the connect call due to the wrong
    number of queues) and we bail out of the recreate function. We then
    retry the very same operation and make no progress. A sketch of the
    corrected ordering follows this entry.

    Signed-off-by: Daniel Wagner
    Reviewed-by: Ming Lei
    Reviewed-by: Himanshu Madhani
    Reviewed-by: Hannes Reinecke
    Reviewed-by: James Smart
    Signed-off-by: Christoph Hellwig

    Daniel Wagner
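
    An illustrative sketch of the corrected ordering (names are
    illustrative, not verbatim from the driver):

        /* 1) resize the tagset and the ctx-to-hctx mapping first ... */
        if (prior_ioq_cnt != nr_io_queues)
                blk_mq_update_nr_hw_queues(&ctrl->tag_set, nr_io_queues);
        /* 2) ... only then create and connect the I/O queues, so the
         *    target sees a connect for the queue count it granted. */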
     

10 Jul, 2021

1 commit

  • Pull more block updates from Jens Axboe:
    "A combination of changes that ended up depending on both the driver
    and core branch (and/or the IDE removal), and a few late arriving
    fixes. In detail:

    - Fix io ticks wrap-around issue (Chunguang)

    - nvme-tcp sock locking fix (Maurizio)

    - s390-dasd fixes (Kees, Christoph)

    - blk_execute_rq polling support (Keith)

    - blk-cgroup RCU iteration fix (Yu)

    - nbd backend ID addition (Prasanna)

    - Partition deletion fix (Yufen)

    - Use blk_mq_alloc_disk for mmc, mtip32xx, ubd (Christoph)

    - Removal of now dead block request types due to IDE removal
    (Christoph)

    - Loop probing and control device cleanups (Christoph)

    - Device uevent fix (Christoph)

    - Misc cleanups/fixes (Tetsuo, Christoph)"

    * tag 'block-5.14-2021-07-08' of git://git.kernel.dk/linux-block: (34 commits)
    blk-cgroup: prevent rcu_sched detected stalls warnings while iterating blkgs
    block: fix the problem of io_ticks becoming smaller
    nvme-tcp: can't set sk_user_data without write_lock
    loop: remove unused variable in loop_set_status()
    block: remove the bdgrab in blk_drop_partitions
    block: grab a device refcount in disk_uevent
    s390/dasd: Avoid field over-reading memcpy()
    dasd: unexport dasd_set_target_state
    block: check disk exist before trying to add partition
    ubd: remove dead code in ubd_setup_common
    nvme: use return value from blk_execute_rq()
    block: return errors from blk_execute_rq()
    nvme: use blk_execute_rq() for passthrough commands
    block: support polling through blk_execute_rq
    block: remove REQ_OP_SCSI_{IN,OUT}
    block: mark blk_mq_init_queue_data static
    loop: rewrite loop_exit using idr_for_each_entry
    loop: split loop_lookup
    loop: don't allow deleting an unspecified loop device
    loop: move loop_ctl_mutex locking into loop_add
    ...

    Linus Torvalds
     

03 Jul, 2021

1 commit

  • Pull SCSI updates from James Bottomley:
    "This series consists of the usual driver updates (ufs, ibmvfc,
    megaraid_sas, lpfc, elx, mpi3mr, qedi, iscsi, storvsc, mpt3sas) with
    elx and mpi3mr being new drivers.

    The major core change is a rework to drop the status byte handling
    macros and the old bit shifted definitions and the rest of the updates
    are minor fixes"

    * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (287 commits)
    scsi: aha1740: Avoid over-read of sense buffer
    scsi: arcmsr: Avoid over-read of sense buffer
    scsi: ips: Avoid over-read of sense buffer
    scsi: ufs: ufs-mediatek: Add missing of_node_put() in ufs_mtk_probe()
    scsi: elx: libefc: Fix IRQ restore in efc_domain_dispatch_frame()
    scsi: elx: libefc: Fix less than zero comparison of a unsigned int
    scsi: elx: efct: Fix pointer error checking in debugfs init
    scsi: elx: efct: Fix is_originator return code type
    scsi: elx: efct: Fix link error for _bad_cmpxchg
    scsi: elx: efct: Eliminate unnecessary boolean check in efct_hw_command_cancel()
    scsi: elx: efct: Do not use id uninitialized in efct_lio_setup_session()
    scsi: elx: efct: Fix error handling in efct_hw_init()
    scsi: elx: efct: Remove redundant initialization of variable lun
    scsi: elx: efct: Fix spelling mistake "Unexected" -> "Unexpected"
    scsi: lpfc: Fix build error in lpfc_scsi.c
    scsi: target: iscsi: Remove redundant continue statement
    scsi: qla4xxx: Remove redundant continue statement
    scsi: ppa: Switch to use module_parport_driver()
    scsi: imm: Switch to use module_parport_driver()
    scsi: mpt3sas: Fix error return value in _scsih_expander_add()
    ...

    Linus Torvalds
     

01 Jul, 2021

2 commits

  • The generic blk_execute_rq() knows how to handle polled completions. Use
    that instead of implementing an nvme specific handler.

    Signed-off-by: Keith Busch
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Chaitanya Kulkarni
    Link: https://lore.kernel.org/r/20210610214437.641245-3-kbusch@kernel.org
    Signed-off-by: Jens Axboe

    Keith Busch
     
  • Pull block driver updates from Jens Axboe:
    "Pretty calm round, mostly just NVMe and a bit of MD:

    - NVMe updates (via Christoph)
    - improve the APST configuration algorithm (Alexey Bogoslavsky)
    - look for StorageD3Enable on companion ACPI device
    (Mario Limonciello)
    - allow selecting the network interface for TCP connections
    (Martin Belanger)
    - misc cleanups (Amit Engel, Chaitanya Kulkarni, Colin Ian King,
    Christoph)
    - move the ACPI StorageD3 code to drivers/acpi/ and add quirks
    for certain AMD CPUs (Mario Limonciello)
    - zoned device support for nvmet (Chaitanya Kulkarni)
    - fix the rules for changing the serial number in nvmet
    (Noam Gottlieb)
    - various small fixes and cleanups (Dan Carpenter, JK Kim,
    Chaitanya Kulkarni, Hannes Reinecke, Wesley Sheng, Geert
    Uytterhoeven, Daniel Wagner)

    - MD updates (Via Song)
    - iostats rewrite (Guoqing Jiang)
    - raid5 lock contention optimization (Gal Ofri)

    - Fall through warning fix (Gustavo)

    - Misc fixes (Gustavo, Jiapeng)"

    * tag 'for-5.14/drivers-2021-06-29' of git://git.kernel.dk/linux-block: (78 commits)
    nvmet: use NVMET_MAX_NAMESPACES to set nn value
    loop: Fix missing discard support when using LOOP_CONFIGURE
    nvme.h: add missing nvme_lba_range_type endianness annotations
    nvme: remove zeroout memset call for struct
    nvme-pci: remove zeroout memset call for struct
    nvmet: remove zeroout memset call for struct
    nvmet: add ZBD over ZNS backend support
    nvmet: add Command Set Identifier support
    nvmet: add nvmet_req_bio put helper for backends
    nvmet: add req cns error complete helper
    block: export blk_next_bio()
    nvmet: remove local variable
    nvmet: use nvme status value directly
    nvmet: use u32 type for the local variable nsid
    nvmet: use u32 for nvmet_subsys max_nsid
    nvmet: use req->cmd directly in file-ns fast path
    nvmet: use req->cmd directly in bdev-ns fast path
    nvmet: make ver stable once connection established
    nvmet: allow mn change if subsys not discovered
    nvmet: make sn stable once connection was established
    ...

    Linus Torvalds
     

10 Jun, 2021

1 commit

  • Add a new sysfs attribute, appid_store, which can be used to set the
    application identifier in the blkcg associated with a cgroup id.

    Below is the interface provided to set the app_id:

    echo "<cgroupid>:<appid>" >> /sys/class/fc/fc_udev_device/appid_store

    echo "457E:100000109b521d27" >> /sys/class/fc/fc_udev_device/appid_store

    Link: https://lore.kernel.org/r/20210608043556.274139-4-muneendra.kumar@broadcom.com
    Reviewed-by: Himanshu Madhani
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Muneendra Kumar
    Signed-off-by: Martin K. Petersen

    Muneendra Kumar
     

25 May, 2021

1 commit

  • Returning an nvme status from nvme_fc_create_association() indicates
    that the association is established, and we should honour the DNR bit.
    If it's set a reconnect attempt will just return the same error, so
    we can short-circuit the reconnect attempts and fail the connection
    directly.

    Signed-off-by: Hannes Reinecke
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Himanshu Madhani
    Reviewed-by: James Smart
    Signed-off-by: Christoph Hellwig

    Hannes Reinecke
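
    An illustrative sketch of the check described above (status handling
    simplified, variable names illustrative):

        status = nvme_fc_create_association(ctrl);
        if (status > 0 && (status & NVME_SC_DNR))
                recon = false;    /* don't reschedule the reconnect work */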
     

19 May, 2021

1 commit

  • The __nvmf_check_ready() routine used to bounce all filesystem I/O if
    the controller state isn't LIVE. However, a later patch changed the
    logic so that the rejection ends up being based on the queue live
    check. The FC transport has a slightly different sequence from rdma
    and tcp for shutting down queues/marking them non-live. FC marks its
    queue non-live after aborting all I/Os and waiting for their
    termination, leaving a rather large window for filesystem I/O to
    continue to hit the transport. Unfortunately this resulted in
    filesystem I/O or applications seeing I/O errors.

    Change the FC transport to mark the queues non-live at the first sign of
    teardown for the association (when I/O is initially terminated).

    Fixes: 73a5379937ec ("nvme-fabrics: allow to queue requests for live queues")
    Signed-off-by: James Smart
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Himanshu Madhani
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig

    James Smart
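
    An illustrative sketch of marking the queues non-live when I/O is
    first terminated (loop bounds and field names illustrative):

        for (i = 0; i < ctrl->ctrl.queue_count; i++)
                clear_bit(NVME_FC_Q_LIVE, &ctrl->queues[i].flags);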
     

04 May, 2021

1 commit

  • queue_rq() in pci only checks whether the dispatched queue (nvmeq) is
    ready, e.g. not being suspended. Since nvme_alloc_admin_tags() in the
    reset flow restarts the admin queue, users are able to submit admin
    commands to a controller before reset_work() completes. Commands
    submitted under this condition may interfere with the identify and
    I/O queue setup performed by reset_work(), and may result in the hang
    described in the following patch.

    As seen in the fabrics drivers, user commands are prevented from being
    executed under improper controller states. We can reuse this logic to
    keep the admin queue clear during reset_work(). See the sketch after
    this entry.

    Signed-off-by: Tao Chiu
    Signed-off-by: Cody Wong
    Reviewed-by: Leon Chien
    Reviewed-by: Keith Busch
    Signed-off-by: Christoph Hellwig

    Tao Chiu
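
    An illustrative sketch of reusing the fabrics-style readiness check in
    the pci queue_rq() path (helper names as used by the core/fabrics
    code, exact placement illustrative):

        if (unlikely(!nvme_check_ready(&dev->ctrl, req, true)))
                return nvme_fail_nonready_command(&dev->ctrl, req);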
     

29 Apr, 2021

1 commit

  • Pull block driver updates from Jens Axboe:

    - MD changes via Song:
    - raid5 POWER fix
    - raid1 failure fix
    - UAF fix for md cluster
    - mddev_find_or_alloc() clean up
    - Fix NULL pointer deref with external bitmap
    - Performance improvement for raid10 discard requests
    - Fix missing information of /proc/mdstat

    - rsxx const qualifier removal (Arnd)

    - Expose allocated brd pages (Calvin)

    - rnbd via Gioh Kim:
    - Change maintainer
    - Change domain address of maintainers' email
    - Add polling IO mode and document update
    - Fix memory leak and some bug detected by static code analysis
    tools
    - Code refactoring

    - Series of floppy cleanups/fixes (Denis)

    - s390 dasd fixes (Julian)

    - kerneldoc fixes (Lee)

    - null_blk double free (Lv)

    - null_blk virtual boundary addition (Max)

    - Remove xsysace driver (Michal)

    - umem driver removal (Davidlohr)

    - ataflop fixes (Dan)

    - Revalidate disk removal (Christoph)

    - Bounce buffer cleanups (Christoph)

    - Mark lightnvm as deprecated (Christoph)

    - mtip32xx init cleanups (Shixin)

    - Various fixes (Tian, Gustavo, Coly, Yang, Zhang, Zhiqiang)

    * tag 'for-5.13/drivers-2021-04-27' of git://git.kernel.dk/linux-block: (143 commits)
    async_xor: increase src_offs when dropping destination page
    drivers/block/null_blk/main: Fix a double free in null_init.
    md/raid1: properly indicate failure when ending a failed write request
    md-cluster: fix use-after-free issue when removing rdev
    nvme: introduce generic per-namespace chardev
    nvme: cleanup nvme_configure_apst
    nvme: do not try to reconfigure APST when the controller is not live
    nvme: add 'kato' sysfs attribute
    nvme: sanitize KATO setting
    nvmet: avoid queuing keep-alive timer if it is disabled
    brd: expose number of allocated pages in debugfs
    ataflop: fix off by one in ataflop_probe()
    ataflop: potential out of bounds in do_format()
    drbd: Fix fall-through warnings for Clang
    block/rnbd: Use strscpy instead of strlcpy
    block/rnbd-clt-sysfs: Remove copy buffer overlap in rnbd_clt_get_path_name
    block/rnbd-clt: Remove max_segment_size
    block/rnbd-clt: Generate kobject_uevent when the rnbd device state changes
    block/rnbd-srv: Remove unused arguments of rnbd_srv_rdma_ev
    Documentation/ABI/rnbd-clt: Add description for nr_poll_queues
    ...

    Linus Torvalds
     

03 Apr, 2021

4 commits

  • SGL support is mandatory for NVMe/FC, so make sure that the target is
    aligned with the specification.

    Signed-off-by: Max Gurtovoy
    Reviewed-by: Chaitanya Kulkarni
    Signed-off-by: Christoph Hellwig

    Max Gurtovoy
     
  • All nvme transport drivers preallocate an nvme command for each request.
    Assume that command is used for nvme_setup_cmd() instead of requiring
    drivers to pass a pointer to it. All nvme drivers must initialize the
    generic nvme_request 'cmd' to point to the transport's preallocated
    nvme_command.

    The generic nvme_request cmd pointer had previously been used only as a
    temporary copy for passthrough commands. Since it now points to the
    command that gets dispatched, passthrough commands must directly set it
    up prior to executing the request.

    Signed-off-by: Keith Busch
    Reviewed-by: Jens Axboe
    Reviewed-by: Himanshu Madhani
    Signed-off-by: Christoph Hellwig

    Keith Busch
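
    An illustrative sketch of the new contract: the transport points the
    generic request at its preallocated command before calling
    nvme_setup_cmd() (field names illustrative):

        nvme_req(rq)->cmd = &op->cmd_iu.sqe;   /* transport's preallocated command */
        ret = nvme_setup_cmd(ns, rq);          /* no separate cmd pointer passed */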
     
  • The first argument of nvme_fc_rcv_ls_req() is a pointer to the
    remoteport, named portptr, but the kernel-doc comment refers to it as
    remoteport. Fix that to get rid of the compilation warning.

    drivers/nvme//host/fc.c:1724: warning: Function parameter or member 'portptr' not described in 'nvme_fc_rcv_ls_req'
    drivers/nvme//host/fc.c:1724: warning: Excess function parameter 'remoteport' description in 'nvme_fc_rcv_ls_req'

    Signed-off-by: Chaitanya Kulkarni
    Reviewed-by: James Smart
    Signed-off-by: Christoph Hellwig

    Chaitanya Kulkarni
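
    The fix amounts to using the real parameter name in the kernel-doc
    block, roughly as follows (description wording illustrative):

        /**
         * nvme_fc_rcv_ls_req - transport entry point called by an LLDD
         *                      upon receipt of an NVME LS request.
         * @portptr: pointer to the (registered) remote port the LS was
         *           received on.
         */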
     
  • This is a prep patch so that we can move the identify data structure
    related code initialization from nvme_init_identify() into a helper.

    Rename the function nvme_init_identify() to nvme_init_ctrl_finish().

    The next patch will move the nvme_id_ctrl related initialization from
    the newly renamed function nvme_init_ctrl_finish() into the
    nvme_init_identify() helper.

    Signed-off-by: Chaitanya Kulkarni
    Signed-off-by: Christoph Hellwig

    Chaitanya Kulkarni
     

18 Mar, 2021

1 commit

  • Fabrics drivers currently reserve two tags on the admin queue. But
    given that the connect command is only run on a freshly created queue
    or after all commands have been force aborted we only need to reserve
    a single tag.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Chaitanya Kulkarni
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Daniel Wagner

    Christoph Hellwig
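
    An illustrative sketch of what this means for a fabrics admin tag set
    (field shown in isolation):

        /* one reserved tag is enough: only the connect command uses it */
        ctrl->admin_tag_set.reserved_tags = 1;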
     

11 Mar, 2021

3 commits

  • A recent patch to prevent calling __nvme_fc_abort_outstanding_ios in
    interrupt context introduced a possible race condition. A controller
    reset results in errored I/O completions, which schedule error work.
    Moving the error handling to a work element allows it to fire after
    the ctrl state transition to NVME_CTRL_CONNECTING, causing any
    outstanding I/O (used to initialize the controller) to fail and cause
    problems for connect_work.

    Add a state check to only schedule error work if not in the RESETTING
    state.

    Fixes: 19fce0470f05 ("nvme-fc: avoid calling _nvme_fc_abort_outstanding_ios from interrupt context")
    Signed-off-by: Nigel Kirkland
    Signed-off-by: James Smart
    Signed-off-by: Christoph Hellwig

    James Smart
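
    An illustrative sketch of the added check (work item and workqueue
    names illustrative):

        if (ctrl->ctrl.state != NVME_CTRL_RESETTING)
                queue_work(nvme_reset_wq, &ctrl->ioerr_work);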
     
  • When a command has been aborted we should return NVME_SC_HOST_ABORTED_CMD
    to be consistent with the other transports.

    Signed-off-by: Hannes Reinecke
    Reviewed-by: Sagi Grimberg
    Reviewed-by: James Smart
    Reviewed-by: Daniel Wagner
    Signed-off-by: Christoph Hellwig

    Hannes Reinecke
     
  • nvme_fc_terminate_exchange() is being called when exchanges are
    being deleted, and as such we should be setting the NVME_REQ_CANCELLED
    flag to have identical behaviour on all transports.

    Signed-off-by: Hannes Reinecke
    Reviewed-by: Keith Busch
    Reviewed-by: Sagi Grimberg
    Reviewed-by: James Smart
    Reviewed-by: Daniel Wagner
    Signed-off-by: Christoph Hellwig

    Hannes Reinecke
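
    An illustrative sketch of what the exchange-termination path now does
    per request (surrounding iteration omitted):

        /* mark the request as cancelled by the host during exchange teardown */
        nvme_req(rq)->flags |= NVME_REQ_CANCELLED;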
     

06 Jan, 2021

1 commit

  • Recent patches changed calling sequences. nvme_fc_abort_outstanding_ios
    used to be called from a timeout or work context. Now it is being called
    in an I/O completion context, which can be an interrupt handler.
    Unfortunately, the abort outstanding ios routine attempts to stop nvme
    queues and calls nested routines that may try to sleep, which conflicts
    with the interrupt handler.

    Correct this by replacing the direct call with the scheduling of a work
    element; the abort outstanding ios routine is then called from the work
    element.

    Fixes: 95ced8a2c72d ("nvme-fc: eliminate terminate_io use by nvme_fc_error_recovery")
    Signed-off-by: James Smart
    Reported-by: Daniel Wagner
    Tested-by: Daniel Wagner
    Signed-off-by: Christoph Hellwig

    James Smart
     

27 Oct, 2020

4 commits

  • __nvme_fc_terminate_io() is now called from only one place, reset_work.
    Consolidate and move the functionality of terminate_io into reset_work.

    In reset_work, rather than calling create_association directly,
    schedule the connect work element instead. After scheduling, flush the
    connect work element to preserve the semantic of not returning until
    connect has been attempted at least once.

    Signed-off-by: James Smart
    Signed-off-by: Christoph Hellwig

    James Smart
     
  • nvme_fc_error_recovery() special cases handling when in CONNECTING state
    and calls __nvme_fc_terminate_io(). __nvme_fc_terminate_io() itself
    special cases CONNECTING state and calls the routine to abort outstanding
    ios.

    Simplify the sequence by putting the call to abort outstanding I/Os
    directly in nvme_fc_error_recovery.

    Move the location of __nvme_fc_abort_outstanding_ios(), and
    nvme_fc_terminate_exchange() which is called by it, to avoid adding
    function prototypes for nvme_fc_error_recovery().

    Signed-off-by: James Smart
    Signed-off-by: Christoph Hellwig

    James Smart
     
  • err_work was created to handle errors (mainly I/O timeouts) while in
    CONNECTING state. The flag for err_work_active is also unneeded.

    Remove err_work_active and err_work. The actions to abort I/Os are moved
    inline to nvme_error_recovery().

    Signed-off-by: James Smart
    Signed-off-by: Christoph Hellwig

    James Smart
     
  • Whenever there are errors during CONNECTING, the driver recovers by
    aborting all outstanding ios and counts on the io completion to fail them
    and thus the connection/association they are on. However, the connection
    failure depends on a failure state from the core routines. Not all
    commands that are issued by the core routine are guaranteed to cause a
    failure of the core routine. They may be treated as a failure status and
    the status is then ignored.

    As such, whenever the transport enters error_recovery while CONNECTING,
    it will set a new flag indicating an association failed. The
    create_association routine which creates and initializes the controller,
    will monitor the state of the flag as well as the core routine error
    status and ensure the association fails if there was an error.

    Signed-off-by: James Smart
    Signed-off-by: Christoph Hellwig

    James Smart
     

23 Oct, 2020

4 commits

  • We've had several complaints about a 10s reconnect delay (the default)
    when there was an error while there is connectivity to a subsystem.
    The max_reconnects and reconnect_delay are set in common code prior to
    calling the transport to create the controller.

    This change checks if the default reconnect delay is being used, and if
    so, it adjusts it to a shorter period (2s) for the nvme-fc transport.
    It does so by calculating the controller loss tmo window, changing the
    value of the reconnect delay, and then recalculating the maximum number
    of reconnect attempts allowed.

    Signed-off-by: James Smart
    Reviewed-by: Himanshu Madhani
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig

    James Smart
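
    An illustrative sketch of the recalculation (the 2s value and helper
    use are illustrative):

        if (opts->reconnect_delay == NVMF_DEF_RECONNECT_DELAY) {
                ctrl_loss_tmo = opts->max_reconnects * opts->reconnect_delay;
                opts->reconnect_delay = 2;    /* shorter default for FC */
                opts->max_reconnects = DIV_ROUND_UP(ctrl_loss_tmo,
                                                    opts->reconnect_delay);
        }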
     
  • On reconnect, the code currently does not freeze the controller before
    possibly updating the number of hw queues for the controller.

    Add the freeze before updating the number of hw queues. Note: the
    queues are already started and remain started through the reconnect.

    Signed-off-by: James Smart
    Reviewed-by: Himanshu Madhani
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig

    James Smart
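
    An illustrative sketch of the pattern this commit adds (the
    freeze/unfreeze was removed again by the 21 Sep, 2021 entry above);
    names illustrative:

        nvme_start_freeze(&ctrl->ctrl);
        nvme_wait_freeze(&ctrl->ctrl);
        blk_mq_update_nr_hw_queues(&ctrl->tag_set, nr_io_queues);
        nvme_unfreeze(&ctrl->ctrl);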
     
  • The loop that backs out of hw io queue creation continues through index
    0, which corresponds to the admin queue as well.

    Fix the loop so it only proceeds through indexes 1..n which correspond to
    I/O queues.

    Signed-off-by: James Smart
    Reviewed-by: Himanshu Madhani
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig

    James Smart
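
    An illustrative sketch of the corrected back-out loop, stopping before
    index 0 (the admin queue); helper and variable names illustrative:

        for (i = created_queues; i >= 1; i--)
                __nvme_fc_delete_hw_queue(ctrl, &ctrl->queues[i], i);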
     
  • Currently, an I/O timeout unconditionally invokes
    nvme_fc_error_recovery(), which checks for LIVE or CONNECTING state. If
    LIVE, the routine resets the controller, which initiates a reconnect -
    which is valid. If CONNECTING, err_work is scheduled. err_work then
    calls the terminate_io routine, which also checks for CONNECTING and
    no-ops any further action on outstanding I/O. The result is that
    nothing happens to the timed-out I/O. As such, if the command was
    dropped on the wire, it will never time out or complete, and the
    connect process will hang.

    Change the behavior of the I/O timeout routine to unconditionally abort
    the I/O. I/O completion handling will note that an I/O failed due to an
    abort and will terminate the connection/association as needed. If the
    abort could not be issued, continue with a call to
    nvme_fc_error_recovery(). To ensure something different happens in
    nvme_fc_error_recovery(), rework it so that it aborts all I/Os on the
    association to force a failure.

    As I/O aborts may now occur outside of delete_association, counting for
    completion must be wary and only count those aborted during
    delete_association when TERMIO is set on the controller.

    Signed-off-by: James Smart
    Signed-off-by: Christoph Hellwig

    James Smart
     

22 Sep, 2020

1 commit

  • The lldd may have made calls to delete a remote port or local port and
    the delete is in progress when the cli then attempts to create a new
    controller. Currently, this proceeds without error although it can't be
    very successful.

    Fix this by validating that both the host port and remote port are
    present when a new controller is to be created.

    Signed-off-by: James Smart
    Reviewed-by: Himanshu Madhani
    Signed-off-by: Christoph Hellwig

    James Smart
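
    An illustrative sketch of the validation at controller-create time
    (state names from the FC transport API, placement illustrative):

        if (lport->localport.port_state != FC_OBJSTATE_ONLINE ||
            rport->remoteport.port_state != FC_OBJSTATE_ONLINE)
                return ERR_PTR(-ENOENT);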
     

22 Aug, 2020

2 commits

  • nvme_end_request is a bit misnamed, as it wraps around the
    blk_mq_complete_* API. Its semantics are also non-trivial, so give it
    a more descriptive name and add a comment explaining the semantics.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Mike Snitzer
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • On an error exit path, a negative error code should be returned
    instead of a positive return value.

    Fixes: e399441de9115 ("nvme-fabrics: Add host support for FC transport")
    Cc: James Smart
    Signed-off-by: Tianjia Zhang
    Reviewed-by: Chaitanya Kulkarni
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Tianjia Zhang
     

29 Jul, 2020

2 commits

  • Currently the FC transport sets max_hw_sectors based on the lldd's
    max sgl segment count. However, the block queue max segments is
    set based on the controller's max_segments count, which the transport
    does not set. As such, the lldd is receiving sgl lists that exceed
    its max segment count.

    Set the controller max segment count and derive max_hw_sectors from
    the max segment count.

    Signed-off-by: James Smart
    Reviewed-by: Max Gurtovoy
    Reviewed-by: Himanshu Madhani
    Reviewed-by: Ewan D. Milne
    Signed-off-by: Christoph Hellwig

    James Smart
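
    An illustrative sketch of the derivation, assuming 4k-sized segments
    (names illustrative):

        ctrl->ctrl.max_segments = lport->ops->max_sgl_segments;
        ctrl->ctrl.max_hw_sectors =
                ctrl->ctrl.max_segments << (ilog2(SZ_4K) - 9);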
     
  • A deadlock happens in the following scenario with multipath:
    1) scan_work(nvme0) detects a new nsid while nvme0
    is an optimized path to it, path nvme1 happens to be
    inaccessible.

    2) Before scan_work is complete nvme0 disconnect is initiated
    nvme_delete_ctrl_sync() sets nvme0 state to NVME_CTRL_DELETING

    3) scan_work(1) attempts to submit IO,
    but nvme_path_is_optimized() observes nvme0 is not LIVE.
    Since nvme1 is a possible path IO is requeued and scan_work hangs.

    --
    Workqueue: nvme-wq nvme_scan_work [nvme_core]
    kernel: Call Trace:
    kernel: __schedule+0x2b9/0x6c0
    kernel: schedule+0x42/0xb0
    kernel: io_schedule+0x16/0x40
    kernel: do_read_cache_page+0x438/0x830
    kernel: read_cache_page+0x12/0x20
    kernel: read_dev_sector+0x27/0xc0
    kernel: read_lba+0xc1/0x220
    kernel: efi_partition+0x1e6/0x708
    kernel: check_partition+0x154/0x244
    kernel: rescan_partitions+0xae/0x280
    kernel: __blkdev_get+0x40f/0x560
    kernel: blkdev_get+0x3d/0x140
    kernel: __device_add_disk+0x388/0x480
    kernel: device_add_disk+0x13/0x20
    kernel: nvme_mpath_set_live+0x119/0x140 [nvme_core]
    kernel: nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
    kernel: nvme_set_ns_ana_state+0x1e/0x30 [nvme_core]
    kernel: nvme_parse_ana_log+0xa1/0x180 [nvme_core]
    kernel: nvme_mpath_add_disk+0x47/0x90 [nvme_core]
    kernel: nvme_validate_ns+0x396/0x940 [nvme_core]
    kernel: nvme_scan_work+0x24f/0x380 [nvme_core]
    kernel: process_one_work+0x1db/0x380
    kernel: worker_thread+0x249/0x400
    kernel: kthread+0x104/0x140
    --

    4) Delete also hangs in flush_work(ctrl->scan_work)
    from nvme_remove_namespaces().

    Similarly, a deadlock with ana_work may happen: if ana_work has started
    and calls nvme_mpath_set_live and device_add_disk, it will
    trigger I/O. When we trigger disconnect, I/O will block because
    our accessible (optimized) path is disconnecting, but the alternate
    path is inaccessible, so I/O blocks. Then disconnect tries to flush
    the ana_work and hangs.

    [ 605.550896] Workqueue: nvme-wq nvme_ana_work [nvme_core]
    [ 605.552087] Call Trace:
    [ 605.552683] __schedule+0x2b9/0x6c0
    [ 605.553507] schedule+0x42/0xb0
    [ 605.554201] io_schedule+0x16/0x40
    [ 605.555012] do_read_cache_page+0x438/0x830
    [ 605.556925] read_cache_page+0x12/0x20
    [ 605.557757] read_dev_sector+0x27/0xc0
    [ 605.558587] amiga_partition+0x4d/0x4c5
    [ 605.561278] check_partition+0x154/0x244
    [ 605.562138] rescan_partitions+0xae/0x280
    [ 605.563076] __blkdev_get+0x40f/0x560
    [ 605.563830] blkdev_get+0x3d/0x140
    [ 605.564500] __device_add_disk+0x388/0x480
    [ 605.565316] device_add_disk+0x13/0x20
    [ 605.566070] nvme_mpath_set_live+0x5e/0x130 [nvme_core]
    [ 605.567114] nvme_update_ns_ana_state+0x2c/0x30 [nvme_core]
    [ 605.568197] nvme_update_ana_state+0xca/0xe0 [nvme_core]
    [ 605.569360] nvme_parse_ana_log+0xa1/0x180 [nvme_core]
    [ 605.571385] nvme_read_ana_log+0x76/0x100 [nvme_core]
    [ 605.572376] nvme_ana_work+0x15/0x20 [nvme_core]
    [ 605.573330] process_one_work+0x1db/0x380
    [ 605.574144] worker_thread+0x4d/0x400
    [ 605.574896] kthread+0x104/0x140
    [ 605.577205] ret_from_fork+0x35/0x40
    [ 605.577955] INFO: task nvme:14044 blocked for more than 120 seconds.
    [ 605.579239] Tainted: G OE 5.3.5-050305-generic #201910071830
    [ 605.580712] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 605.582320] nvme D 0 14044 14043 0x00000000
    [ 605.583424] Call Trace:
    [ 605.583935] __schedule+0x2b9/0x6c0
    [ 605.584625] schedule+0x42/0xb0
    [ 605.585290] schedule_timeout+0x203/0x2f0
    [ 605.588493] wait_for_completion+0xb1/0x120
    [ 605.590066] __flush_work+0x123/0x1d0
    [ 605.591758] __cancel_work_timer+0x10e/0x190
    [ 605.593542] cancel_work_sync+0x10/0x20
    [ 605.594347] nvme_mpath_stop+0x2f/0x40 [nvme_core]
    [ 605.595328] nvme_stop_ctrl+0x12/0x50 [nvme_core]
    [ 605.596262] nvme_do_delete_ctrl+0x3f/0x90 [nvme_core]
    [ 605.597333] nvme_sysfs_delete+0x5c/0x70 [nvme_core]
    [ 605.598320] dev_attr_store+0x17/0x30

    Fix this by introducing a new state: NVME_CTRL_DELETING_NOIO, which
    indicates the phase of controller deletion where I/O cannot be allowed
    to access the namespace. NVME_CTRL_DELETING still allows mpath I/O to
    be issued to the bottom device, and only after we flush the ana_work
    and scan_work (after nvme_stop_ctrl and nvme_prep_remove_namespaces)
    do we change the state to NVME_CTRL_DELETING_NOIO. We also prevent
    ana_work from re-firing by aborting early if we are not LIVE, so we
    should be safe here.

    In addition, change the transport drivers to follow the updated state
    machine.

    Fixes: 0d0b660f214d ("nvme: add ANA support")
    Reported-by: Anton Eidelman
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig

    Sagi Grimberg
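
    An illustrative sketch of the early abort described for ana_work
    (field access per the core driver of that era):

        if (ctrl->state != NVME_CTRL_LIVE)
                return;    /* don't re-fire ana_work while not LIVE */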
     

11 Jun, 2020

1 commit

  • Asynchronous event notifications do not have an associated request.
    When fcp_io() fails we unconditionally call nvme_cleanup_cmd() which
    leads to a crash.

    Fixes: 16686f3a6c3c ("nvme: move common call to nvme_cleanup_cmd to core layer")
    Signed-off-by: Daniel Wagner
    Reviewed-by: Himanshu Madhani
    Reviewed-by: Hannes Reinecke
    Reviewed-by: James Smart
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Daniel Wagner
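
    An illustrative sketch of the fix: only clean up the command when the
    operation has an associated request (flag name per the FC transport,
    placement illustrative):

        if (!(op->flags & FCOP_FLAGS_AEN))
                nvme_cleanup_cmd(op->rq);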