05 Mar, 2020

4 commits

  • commit 9515743bfb39c61aaf3d4f3219a645c8d1fe9a0e upstream.

    Completions need to be consumed in the same order the controller
    submitted them, otherwise future completion entries may overwrite ones
    we haven't handled yet. Hold the nvme queue's poll lock while completing
    new CQEs to prevent another thread from freeing command tags for reuse
    out of order (a short sketch follows this entry).

    Fixes: dabcefab45d3 ("nvme: provide optimized poll function for separate poll queues")
    Signed-off-by: Bijan Mottahedeh
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Jens Axboe
    Signed-off-by: Keith Busch
    Signed-off-by: Greg Kroah-Hartman

    Bijan Mottahedeh
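
    A minimal sketch of the idea in the entry above, assuming a per-queue
    spinlock named cq_poll_lock and illustrative helpers nvme_cqe_pending(),
    nvme_process_cq() and nvme_complete_cqes(); this is schematic, not the
    exact upstream diff:

    static int nvme_poll_sketch(struct nvme_queue *nvmeq)
    {
            u16 start, end;
            bool found;

            if (!nvme_cqe_pending(nvmeq))
                    return 0;

            spin_lock(&nvmeq->cq_poll_lock);
            found = nvme_process_cq(nvmeq, &start, &end, -1);
            /* Complete the CQEs while still holding the poll lock so another
             * poller cannot free and reuse command tags out of order. */
            nvme_complete_cqes(nvmeq, start, end);
            spin_unlock(&nvmeq->cq_poll_lock);

            return found;
    }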
     
  • [ Upstream commit fa46c6fb5d61b1f17b06d7c6ef75478b576304c7 ]

    Many users have reported nvme triggered irq_startup() warnings during
    shutdown. The driver uses the nvme queue's irq to synchronize scanning
    for completions, and enabling an interrupt affined to only offline CPUs
    triggers the alarming warning.

    Move the final CQE check to after the device has been disabled and all
    registered interrupts have been torn down, so that there is no IRQ left
    to synchronize against.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=206509
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Keith Busch
     
  • [ Upstream commit 97b2512ad000a409b4073dd1a71e4157d76675cb ]

    Delayed keep alive work is queued on the system workqueue and may be
    cancelled via nvme_stop_keep_alive() from nvme_reset_wq, nvme_fc_wq or
    nvme_wq.

    check_flush_dependency() detects mismatched attributes between the
    workqueue context used to cancel the keep alive work and the system
    workqueue. Specifically, the system workqueue does not have the
    WQ_MEM_RECLAIM flag, whereas the contexts used to cancel the keep alive
    work do.

    Example warning:

    workqueue: WQ_MEM_RECLAIM nvme-reset-wq:nvme_fc_reset_ctrl_work [nvme_fc]
    is flushing !WQ_MEM_RECLAIM events:nvme_keep_alive_work [nvme_core]

    To avoid the flags mismatch, delayed keep alive work is queued on nvme_wq.

    However, this creates a secondary concern: a work item and the work
    that cancels it may now live on the same workqueue. Specifically,
    err_work in the rdma and tcp transports flushes/cancels the keep alive
    work, which will now be on nvme_wq.

    After reviewing the transports, err_work can be moved to nvme_reset_wq.
    In fact, that aligns better with the transition into RESETTING and the
    related reset work already performed on nvme_reset_wq.

    Change nvme-rdma and nvme-tcp to perform err_work in nvme_reset_wq.

    Signed-off-by: Nigel Kirkland
    Signed-off-by: James Smart
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Nigel Kirkland
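
    A rough sketch of the change described above; the field and workqueue
    names follow nvme core and the rdma/tcp transports of that era, but
    treat the call sites as illustrative rather than the exact diff. Keep
    alive work goes on nvme_wq and error recovery goes on nvme_reset_wq, so
    nothing with WQ_MEM_RECLAIM ever flushes a non-reclaim queue:

    /* nvme core: queue the delayed keep alive work on nvme_wq
     * (WQ_MEM_RECLAIM) instead of the system workqueue. */
    queue_delayed_work(nvme_wq, &ctrl->ka_work, ctrl->kato * HZ);

    /* nvme-rdma / nvme-tcp: schedule error recovery from nvme_reset_wq,
     * which is where the related RESETTING work already runs. */
    queue_work(nvme_reset_wq, &ctrl->err_work);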
     
  • [ Upstream commit 2d570a7c0251c594489a2c16b82b14ae30345c03 ]

    When nvme_tcp_io_work() fails to send to socket due to
    connection close/reset, error_recovery work is triggered
    from nvme_tcp_state_change() socket callback.
    This cancels all the active requests in the tagset,
    which requeues them.

    The failed request, however, was also ended, and thus requeued
    individually as well, unless send returned -EPIPE. Another return code
    that should be treated the same way is -ECONNRESET.

    The double requeue caused BUG_ON(blk_queued_rq(rq)) in
    blk_mq_requeue_request(), hit from either the individual requeue of the
    failed request or the bulk requeue from
    blk_mq_tagset_busy_iter(, nvme_cancel_request, );

    Signed-off-by: Anton Eidelman
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Anton Eidelman
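
    A schematic sketch of the send-error handling described above; the
    helper names (nvme_tcp_try_send, nvme_tcp_fail_request) are
    illustrative, not the exact upstream code. For -EPIPE and -ECONNRESET
    the socket state-change callback already triggers error recovery, which
    cancels and requeues every active request, so the request must not be
    failed here a second time:

    ret = nvme_tcp_try_send(queue);
    if (ret < 0 && ret != -EAGAIN) {
            dev_err(queue->ctrl->ctrl.device,
                    "failed to send request %d\n", ret);
            /* Only fail the request individually for errors that do not
             * also trigger error recovery via the state_change callback. */
            if (ret != -EPIPE && ret != -ECONNRESET)
                    nvme_tcp_fail_request(queue->request);
    }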
     

29 Feb, 2020

1 commit

  • commit 3b7830904e17202524bad1974505a9bfc718d31f upstream.

    kmemleak reports a memory leak with the ana_log_buf allocated by
    nvme_mpath_init():

    unreferenced object 0xffff888120e94000 (size 8208):
    comm "nvme", pid 6884, jiffies 4295020435 (age 78786.312s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................
    01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] kmalloc_order+0x97/0xc0
    [] kmalloc_order_trace+0x24/0x100
    [] __kmalloc+0x24c/0x2d0
    [] nvme_mpath_init+0x23c/0x2b0
    [] nvme_init_identify+0x75f/0x1600
    [] nvme_loop_configure_admin_queue+0x26d/0x280
    [] nvme_loop_create_ctrl+0x2a7/0x710
    [] nvmf_dev_write+0xc66/0x10b9
    [] __vfs_write+0x50/0xa0
    [] vfs_write+0xf3/0x280
    [] ksys_write+0xc6/0x160
    [] __x64_sys_write+0x43/0x50
    [] do_syscall_64+0x77/0x2f0
    [] entry_SYSCALL_64_after_hwframe+0x49/0xbe

    nvme_mpath_init() is called by nvme_init_identify() which is called in
    multiple places (nvme_reset_work(), nvme_passthru_end(), etc). This
    means nvme_mpath_init() may be called multiple times before
    nvme_mpath_uninit() (which is only called on nvme_free_ctrl()).

    When nvme_mpath_init() is called multiple times, it overwrites the
    ana_log_buf pointer with a new allocation, thus leaking the previous
    allocation.

    To fix this, free ana_log_buf before allocating a new one.

    Fixes: 0d0b660f214dc490 ("nvme: add ANA support")
    Cc:
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Logan Gunthorpe
    Signed-off-by: Keith Busch
    Signed-off-by: Greg Kroah-Hartman

    Logan Gunthorpe
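
    A minimal sketch of the fix, with buffer and size field names as used
    by the nvme multipath code and error handling trimmed: since
    nvme_mpath_init() can run more than once per controller, release any
    previous ANA log buffer before allocating the new one.

    kfree(ctrl->ana_log_buf);
    ctrl->ana_log_buf = NULL;

    ctrl->ana_log_buf = kmalloc(ctrl->ana_log_size, GFP_KERNEL);
    if (!ctrl->ana_log_buf)
            return -ENOMEM;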
     

24 Feb, 2020

2 commits

  • [ Upstream commit cfa27356f835dc7755192e7b941d4f4851acbcc7 ]

    There is no real need to have a pointer to the tagset in
    struct nvme_queue, as we only need it in a single place, and that place
    can derive the used tagset from the device and qid trivially. This
    fixes a problem with stale pointer exposure when tagsets are reset,
    and also shrinks the nvme_queue structure. It also matches what most
    other transports have done since day 1.

    Reported-by: Edmund Nadolski
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Keith Busch
    Signed-off-by: Sasha Levin

    Christoph Hellwig
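
    The change can be pictured with a small helper along these lines (a
    sketch of the approach, not necessarily the exact upstream helper): the
    tagset is derived from the device and the queue id, so struct
    nvme_queue no longer needs to cache a pointer that can go stale across
    tagset resets.

    static struct blk_mq_tags *nvme_queue_tagset(struct nvme_queue *nvmeq)
    {
            if (!nvmeq->qid)
                    return nvmeq->dev->admin_tagset.tags[0];
            return nvmeq->dev->tagset.tags[nvmeq->qid - 1];
    }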
     
  • [ Upstream commit 4ac76436a6d07dec1c3c766f234aa787a16e8f65 ]

    ctrl->subsys->namespaces and subsys->namespaces are traversed with
    list_for_each_entry_rcu outside an RCU read-side critical section but
    under the protection of ctrl->subsys->lock and subsys->lock respectively.

    Hence, add the corresponding lockdep expression to the list traversal
    primitive to silence false-positive lockdep warnings, and harden RCU
    lists.

    Reported-by: kbuild test robot
    Reviewed-by: Joel Fernandes (Google)
    Signed-off-by: Amol Grover
    Signed-off-by: Keith Busch
    Signed-off-by: Sasha Levin

    Amol Grover
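
    A short sketch of such a lockdep expression, with the list and lock
    names taken from the entry above and the iterator 'ns' assumed to be
    declared in the surrounding function: the traversal is annotated with
    the lock that actually protects it, so lockdep stops warning while
    still checking the real locking rule.

    list_for_each_entry_rcu(ns, &ctrl->subsys->namespaces, list,
                            lockdep_is_held(&ctrl->subsys->lock)) {
            /* safe: either inside rcu_read_lock() or holding subsys->lock */
    }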
     

20 Feb, 2020

1 commit

  • commit f25372ffc3f6c2684b57fb718219137e6ee2b64c upstream.

    The nvme fw-activate operation will produce the warning log below; fix
    it by updating the parameter order.

    [ 113.231513] nvme nvme0: Get FW SLOT INFO log error

    Fixes: 0e98719b0e4b ("nvme: simplify the API for getting log pages")
    Reported-by: Sujith Pandel
    Reviewed-by: David Milburn
    Signed-off-by: Yi Zhang
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Yi Zhang
     

11 Feb, 2020

2 commits

  • commit 1a3f540d63152b8db0a12de508bfa03776217d83 upstream.

    After nvmet_install_queue() sets sq->ctrl, a call to nvmet_sq_destroy()
    reduces the controller refcount. In case nvmet_install_queue() fails,
    nvmet_ctrl_put() is called twice (from nvmet_sq_destroy() and from
    nvmet_execute_io_connect()/nvmet_execute_admin_connect()) instead of
    once for the queue, which leads to a use after free of the controller.
    Fix this by setting sq->ctrl to NULL when nvmet_install_queue() fails
    (see the sketch after this entry).

    The bug leads to the following Call Trace:

    [65857.994862] refcount_t: underflow; use-after-free.
    [65858.108304] Workqueue: events nvmet_rdma_release_queue_work [nvmet_rdma]
    [65858.115557] RIP: 0010:refcount_warn_saturate+0xe5/0xf0
    [65858.208141] Call Trace:
    [65858.211203] nvmet_sq_destroy+0xe1/0xf0 [nvmet]
    [65858.216383] nvmet_rdma_release_queue_work+0x37/0xf0 [nvmet_rdma]
    [65858.223117] process_one_work+0x167/0x370
    [65858.227776] worker_thread+0x49/0x3e0
    [65858.232089] kthread+0xf5/0x130
    [65858.235895] ? max_active_store+0x80/0x80
    [65858.240504] ? kthread_bind+0x10/0x10
    [65858.244832] ret_from_fork+0x1f/0x30
    [65858.249074] ---[ end trace f82d59250b54beb7 ]---

    Fixes: bb1cc74790eb ("nvmet: implement valid sqhd values in completions")
    Fixes: 1672ddb8d691 ("nvmet: Add install_queue callout")
    Signed-off-by: Israel Rukshin
    Reviewed-by: Max Gurtovoy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Keith Busch
    Signed-off-by: Greg Kroah-Hartman

    Israel Rukshin
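
    A sketch of the error path in nvmet_install_queue(), with the
    surrounding code abbreviated: clearing sq->ctrl ensures the later
    nvmet_sq_destroy() does not drop the controller reference a second
    time.

    if (ctrl->ops->install_queue) {
            ret = ctrl->ops->install_queue(req->sq);
            if (ret) {
                    pr_err("failed to install queue %d cntlid %d ret %x\n",
                           qid, ctrl->cntlid, ret);
                    req->sq->ctrl = NULL; /* avoid double nvmet_ctrl_put() */
                    return NVME_SC_INTERNAL | NVME_SC_DNR;
            }
    }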
     
  • commit 0b87a2b795d66be7b54779848ef0f3901c5e46fc upstream.

    Place the arguments in the correct order.

    Fixes: 1672ddb8d691 ("nvmet: Add install_queue callout")
    Signed-off-by: Israel Rukshin
    Reviewed-by: Max Gurtovoy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Keith Busch
    Signed-off-by: Greg Kroah-Hartman

    Israel Rukshin
     

09 Jan, 2020

4 commits

  • [ Upstream commit 7e4c6b9a5d22485acf009b3c3510a370f096dd54 ]

    If nvme.write_queues equals the number of CPUs, the driver had decreased
    the number of interrupts available such that there could only be one read
    queue even if the controller could support more. Remove the interrupt
    count reduction in this case. The driver wouldn't request more IRQs than
    it wants queues anyway.

    Reviewed-by: Jens Axboe
    Signed-off-by: Keith Busch
    Signed-off-by: Sasha Levin

    Keith Busch
     
  • [ Upstream commit 3f68baf706ec68c4120867c25bc439c845fe3e17 ]

    The number of poll or write queues should never be negative. Use
    unsigned types so that it's not possible to have the driver not
    allocate any queues.

    Reviewed-by: Jens Axboe
    Signed-off-by: Keith Busch
    Signed-off-by: Sasha Levin

    Keith Busch
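
    A sketch of what that looks like for the nvme-pci module parameters
    (the parameter descriptions are illustrative): declaring both the
    variables and the module_param() type as unsigned keeps negative values
    out of the queue-count math entirely.

    static unsigned int write_queues;
    module_param(write_queues, uint, 0644);
    MODULE_PARM_DESC(write_queues,
            "Number of queues to use for writes. If not set, reads and writes "
            "will share a queue set.");

    static unsigned int poll_queues;
    module_param(poll_queues, uint, 0644);
    MODULE_PARM_DESC(poll_queues, "Number of queues to use for polled IO.");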
     
  • [ Upstream commit c869e494ef8b5846d9ba91f1e922c23cd444f0c1 ]

    If an error occurs on one of the ios used for creating an
    association, the creating routine has error paths that are
    invoked by the command failure and the error paths will free
    up the controller resources created to that point.

    But the io failure was ultimately detected by an asynchronous
    completion routine, which unconditionally invokes the error_recovery
    path, and that path calls delete_association. Delete association
    deletes all outstanding io and then tears down the controller
    resources, so the create_association thread can be running in parallel
    with the error_recovery thread. What was seen was that the LLDD
    received a call to delete a queue, causing the LLDD to free a resource,
    then the transport called delete queue again, causing the driver to
    repeat the free call. The second free corrupted the allocator. The
    transport shouldn't be making the duplicate call, and the delete queue
    is just one of the resources being freed.

    To fix this, observe that the create_association path is completely
    serialized, with one command at a time. So the failed io completion
    will always be seen by the create_association path, and as of the
    failure there are no ios to terminate and no reason to be manipulating
    queue freeze states, etc. The serialized condition stays true until the
    controller is transitioned to the LIVE state. Thus the fix is to change
    the error recovery path to check the controller state and only invoke
    the teardown path if not already in the CONNECTING state.

    Reviewed-by: Himanshu Madhani
    Reviewed-by: Ewan D. Milne
    Signed-off-by: James Smart
    Signed-off-by: Keith Busch
    Signed-off-by: Sasha Levin

    James Smart
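
    A schematic sketch of that state check in the error recovery path; the
    function and message are illustrative, not the literal upstream diff.
    While the controller is still CONNECTING, the serialized
    create_association path sees the failed io itself, so the heavyweight
    teardown is skipped:

    static void nvme_fc_error_recovery_sketch(struct nvme_fc_ctrl *ctrl,
                                              char *errmsg)
    {
            /* Associations in CONNECTING are created one command at a time;
             * the failed io is handled by the create path, nothing to tear
             * down here. */
            if (ctrl->ctrl.state == NVME_CTRL_CONNECTING)
                    return;

            dev_warn(ctrl->ctrl.device,
                     "NVME-FC{%d}: error_recovery: %s\n", ctrl->cnum, errmsg);
            nvme_reset_ctrl(&ctrl->ctrl);
    }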
     
  • [ Upstream commit 863fbae929c7a5b64e96b8a3ffb34a29eefb9f8f ]

    In nvme-fc it's possible to have connected, active controllers while no
    references are taken on the LLDD, so the LLDD can be unloaded. The
    controller would enter a reconnect state and, as long as the LLDD
    resumed within the reconnect timeout, the controller would resume. But
    if a namespace on the controller is the root device, allowing the
    driver to unload can be problematic: reloading the driver may require
    new io to the boot device, and as it's no longer connected we get into
    a catch-22 that eventually fails, and the system locks up.

    Fix this issue by taking a module reference for every connected
    controller (which is what the core layer did to the transport
    module). Reference is cleared when the controller is removed.

    Acked-by: Himanshu Madhani
    Reviewed-by: Christoph Hellwig
    Signed-off-by: James Smart
    Signed-off-by: Keith Busch
    Signed-off-by: Sasha Levin

    James Smart
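
    Roughly, the fix pins the LLDD module for each connected controller,
    along these lines (a sketch under the assumption that the port template
    gained a 'module' owner field; error handling trimmed):

    /* on controller creation */
    if (!try_module_get(lport->ops->module)) {
            ret = -EUNATCH;
            goto out_free_ctrl;
    }

    /* on controller removal / final put */
    module_put(lport->ops->module);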
     

31 Dec, 2019

2 commits

  • [ Upstream commit 530436c45ef2e446c12538a400e465929a0b3ade ]

    Users observe IOMMU-related errors when performing discard on nvme,
    caused by non-compliant nvme devices reading beyond the end of the
    DMA-mapped ranges for the discard.

    Two different variants of this behavior have been observed: SM22XX
    controllers round up the read size to a multiple of 512 bytes, and Phison
    E12 unconditionally reads the maximum discard size allowed by the spec
    (256 segments or 4kB).

    Make nvme_setup_discard unconditionally allocate the maximum DSM buffer
    so the driver DMA maps a memory range that will always succeed.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=202665
    Signed-off-by: Eduard Hasenleithner
    [changelog, use existing define, kernel coding style]
    Signed-off-by: Keith Busch
    Signed-off-by: Sasha Levin

    Eduard Hasenleithner
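
    A minimal sketch of the allocation change in nvme_setup_discard(), with
    status handling abbreviated: the buffer is always sized for the spec
    maximum of NVME_DSM_MAX_RANGES ranges, so even a controller that reads
    the full 4kB stays inside the DMA mapping.

    static const size_t alloc_size = sizeof(struct nvme_dsm_range) *
                                     NVME_DSM_MAX_RANGES;
    struct nvme_dsm_range *range;

    range = kzalloc(alloc_size, GFP_ATOMIC | __GFP_NOWARN);
    if (!range)
            return BLK_STS_RESOURCE;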
     
  • [ Upstream commit 2dc3947b53f573e8a75ea9cbec5588df88ca502e ]

    Fix the status code of canceled requests initiated by the host according
    to TP4028 (Status Code 0x371):
    "Command Aborted By host: The command was aborted as a result of host
    action (e.g., the host disconnected the Fabric connection)."

    Also in a multipath environment, unless otherwise specified, errors of
    this type (path related) should be retried using a different path, if
    one is available.

    Signed-off-by: Max Gurtovoy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Max Gurtovoy
     

18 Dec, 2019

2 commits

  • commit 655e7aee1f0398602627a485f7dca6c29cc96cae upstream.

    Since e045fa29e893 ("PCI/MSI: Fix incorrect MSI-X masking on resume") is
    merged, we can revert the previous quirk now.

    This reverts commit 19ea025e1d28c629b369c3532a85b3df478cc5c6.

    Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=204887
    Fixes: 19ea025e1d28 ("nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T")
    Link: https://lore.kernel.org/r/20191031093408.9322-1-jian-hong@endlessm.com
    Signed-off-by: Jian-Hong Pan
    Signed-off-by: Bjorn Helgaas
    Acked-by: Christoph Hellwig
    Cc: stable@vger.kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Jian-Hong Pan
     
  • commit 22802bf742c25b1e2473c70b3b99da98af65ef4d upstream.

    Although the NVM Express specification 1.3 requires a controller
    claiming to be 1.3 or higher to implement Identify CNS 03h (Namespace
    Identification Descriptor list), the driver doesn't really need this
    identification in order to use a namespace. The code had already
    documented in comments that a failure of this command is not to be
    considered an error.

    Return success if the controller provided any response to a namespace
    identification descriptors command.

    Fixes: 538af88ea7d9de24 ("nvme: make nvme_report_ns_ids propagate error back")
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=205679
    Reported-by: Ingo Brunberg
    Cc: Sagi Grimberg
    Cc: stable@vger.kernel.org # 5.4+
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Keith Busch
    Signed-off-by: Greg Kroah-Hartman

    Keith Busch
     

09 Nov, 2019

1 commit

  • Pull block fixes from Jens Axboe:

    - Two NVMe device removal crash fixes, and a compat fixup for an ioctl
    that was introduced in this release (Anton, Charles, Max - via Keith)

    - Missing error path mutex unlock for drbd (Dan)

    - cgroup writeback fixup on dead memcg (Tejun)

    - blkcg online stats print fix (Tejun)

    * tag 'for-linus-2019-11-08' of git://git.kernel.dk/linux-block:
    cgroup,writeback: don't switch wbs immediately on dead wbs if the memcg is dead
    block: drbd: remove a stray unlock in __drbd_send_protocol()
    blkcg: make blkcg_print_stat() print stats only for online blkgs
    nvme: change nvme_passthru_cmd64 to explicitly mark rsvd
    nvme-multipath: fix crash in nvme_mpath_clear_ctrl_paths
    nvme-rdma: fix a segmentation fault during module unload

    Linus Torvalds
     

05 Nov, 2019

2 commits

  • nvme_mpath_clear_ctrl_paths() iterates through
    the ctrl->namespaces list while holding ctrl->scan_lock.
    This does not seem to be the correct way of protecting
    from concurrent list modification.

    Specifically, nvme_scan_work() sorts ctrl->namespaces
    AFTER unlocking scan_lock.

    This may result in the following (rare) crash in ctrl disconnect
    during scan_work:

    BUG: kernel NULL pointer dereference, address: 0000000000000050
    Oops: 0000 [#1] SMP PTI
    CPU: 0 PID: 3995 Comm: nvme 5.3.5-050305-generic
    RIP: 0010:nvme_mpath_clear_current_path+0xe/0x90 [nvme_core]
    ...
    Call Trace:
    nvme_mpath_clear_ctrl_paths+0x3c/0x70 [nvme_core]
    nvme_remove_namespaces+0x35/0xe0 [nvme_core]
    nvme_do_delete_ctrl+0x47/0x90 [nvme_core]
    nvme_sysfs_delete+0x49/0x60 [nvme_core]
    dev_attr_store+0x17/0x30
    sysfs_kf_write+0x3e/0x50
    kernfs_fop_write+0x11e/0x1a0
    __vfs_write+0x1b/0x40
    vfs_write+0xb9/0x1a0
    ksys_write+0x67/0xe0
    __x64_sys_write+0x1a/0x20
    do_syscall_64+0x5a/0x130
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f8d02bfb154

    Fix:
    After taking scan_lock in nvme_mpath_clear_ctrl_paths(), also take
    down_read(&ctrl->namespaces_rwsem) to make the list traversal safe (a
    sketch follows this entry).
    This will not cause deadlocks because scan_lock is never taken while
    holding namespaces_rwsem. Moreover, scan work takes namespaces_rwsem in
    the same order.

    Alternative: sort ctrl->namespaces in nvme_scan_work()
    while still holding the scan_lock.
    This would leave nvme_mpath_clear_ctrl_paths() without correct protection
    against ctrl->namespaces modification by anyone other than scan_work.

    Reviewed-by: Sagi Grimberg
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Anton Eidelman
    Signed-off-by: Keith Busch

    Anton Eidelman
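
    A sketch of the resulting locking in nvme_mpath_clear_ctrl_paths(),
    simplified from the description above: scan_lock is taken first and
    namespaces_rwsem second, the same order scan work uses, so no deadlock
    is possible.

    void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl)
    {
            struct nvme_ns *ns;

            mutex_lock(&ctrl->scan_lock);
            down_read(&ctrl->namespaces_rwsem);
            list_for_each_entry(ns, &ctrl->namespaces, list)
                    if (nvme_mpath_clear_current_path(ns))
                            kblockd_schedule_work(&ns->head->requeue_work);
            up_read(&ctrl->namespaces_rwsem);
            mutex_unlock(&ctrl->scan_lock);
    }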
     
  • In case there are controllers that are not associated with any RDMA
    device (e.g. during unsuccessful reconnection) and the user will unload
    the module, these controllers will not be freed and will access already
    freed memory. The same logic appears in other fabric drivers as well.

    Fixes: 87fd125344d6 ("nvme-rdma: remove redundant reference between ib_device and tagset")
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Max Gurtovoy
    Signed-off-by: Keith Busch

    Max Gurtovoy
     

02 Nov, 2019

1 commit

  • Pull networking fixes from David Miller:

    1) Fix free/alloc races in batmanadv, from Sven Eckelmann.

    2) Several leaks and other fixes in kTLS support of mlx5 driver, from
    Tariq Toukan.

    3) BPF devmap_hash cost calculation can overflow on 32-bit, from Toke
    Høiland-Jørgensen.

    4) Add an r8152 device ID, from Kazutoshi Noguchi.

    5) Missing include in ipv6's addrconf.c, from Ben Dooks.

    6) Use siphash in flow dissector, from Eric Dumazet. Attackers can
    easily infer the 32-bit secret otherwise etc.

    7) Several netdevice nesting depth fixes from Taehee Yoo.

    8) Fix several KCSAN reported errors, from Eric Dumazet. For example,
    when doing lockless skb_queue_empty() checks, and accessing
    sk_napi_id/sk_incoming_cpu lockless as well.

    9) Fix jumbo packet handling in RXRPC, from David Howells.

    10) Bump SOMAXCONN and tcp_max_syn_backlog values, from Eric Dumazet.

    11) Fix DMA synchronization in gve driver, from Yangchun Fu.

    12) Several bpf offload fixes, from Jakub Kicinski.

    13) Fix sk_page_frag() recursion during memory reclaim, from Tejun Heo.

    14) Fix ping latency during high traffic rates in hisilicon driver, from
    Jiangfeng Xiao.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (146 commits)
    net: fix installing orphaned programs
    net: cls_bpf: fix NULL deref on offload filter removal
    selftests: bpf: Skip write only files in debugfs
    selftests: net: reuseport_dualstack: fix uninitalized parameter
    r8169: fix wrong PHY ID issue with RTL8168dp
    net: dsa: bcm_sf2: Fix IMP setup for port different than 8
    net: phylink: Fix phylink_dbg() macro
    gve: Fixes DMA synchronization.
    inet: stop leaking jiffies on the wire
    ixgbe: Remove duplicate clear_bit() call
    Documentation: networking: device drivers: Remove stray asterisks
    e1000: fix memory leaks
    i40e: Fix receive buffer starvation for AF_XDP
    igb: Fix constant media auto sense switching when no cable is connected
    net: ethernet: arc: add the missed clk_disable_unprepare
    igb: Enable media autosense for the i350.
    igb/igc: Don't warn on fatal read failures when the device is removed
    tcp: increase tcp_max_syn_backlog max value
    net: increase SOMAXCONN to 4096
    netdevsim: Fix use-after-free during device dismantle
    ...

    Linus Torvalds
     

29 Oct, 2019

3 commits

  • groups_only mode in nvme_read_ana_log() is no longer used: remove it.

    Reviewed-by: Sagi Grimberg
    Signed-off-by: Anton Eidelman
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Anton Eidelman
     
  • The following scenario results in an IO hang:
    1) ctrl completes a request with NVME_SC_ANA_TRANSITION.
    NVME_NS_ANA_PENDING bit in ns->flags is set and ana_work is triggered.
    2) ana_work: nvme_read_ana_log() tries to get the ANA log page from the ctrl.
    This fails because ctrl disconnects.
    Therefore nvme_update_ns_ana_state() is not called
    and NVME_NS_ANA_PENDING bit in ns->flags is not cleared.
    3) ctrl reconnects: nvme_mpath_init(ctrl,...) calls
    nvme_read_ana_log(ctrl, groups_only=true).
    However, nvme_update_ana_state() does not update namespaces
    because nr_nsids = 0 (due to groups_only mode).
    4) scan_work calls nvme_validate_ns() finds the ns and re-validates OK.

    Result:
    The ctrl is now live but NVME_NS_ANA_PENDING bit in ns->flags is still set.
    Consequently ctrl will never be considered a viable path by __nvme_find_path().
    IO will hang if ctrl is the only or the last path to the namespace.

    More generally, while ctrl is reconnecting, its ANA state may change.
    And because nvme_mpath_init() requests ANA log in groups_only mode,
    these changes are not propagated to the existing ctrl namespaces.
    This may result in a mal-function or an IO hang.

    Solution:
    nvme_mpath_init() will call nvme_read_ana_log() with groups_only set to false.
    This will not harm the new ctrl case (no namespaces present),
    and will make sure the ANA state of namespaces gets updated after reconnect.

    Note: Another option would be for nvme_mpath_init() to invoke
    nvme_parse_ana_log(..., nvme_set_ns_ana_state) for each existing namespace.

    Reviewed-by: Sagi Grimberg
    Signed-off-by: Anton Eidelman
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Anton Eidelman
     
  • Busy polling usually runs without locks.
    Let's use skb_queue_empty_lockless() instead of skb_queue_empty()

    Also uses READ_ONCE() in __skb_try_recv_datagram() to address
    a similar potential problem.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
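
    A small sketch of the lockless form described above (the wrapper helper
    is illustrative): busy polling peeks at the receive queue without
    holding its lock, so the check uses the _lockless variant, which reads
    the queue head with READ_ONCE().

    static bool sk_has_rx_data(const struct sock *sk)
    {
            /* safe without taking sk_receive_queue.lock */
            return !skb_queue_empty_lockless(&sk->sk_receive_queue);
    }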
     

18 Oct, 2019

1 commit

    In the current code, the nvme driver uses a fixed 4k PRP entry size,
    but if the kernel uses a page size larger than 4k, we have to consider
    the case where bv_offset may be larger than dev->ctrl.page_size.
    Otherwise we may miss setting prp2 and the command cannot be executed
    correctly.

    Fixes: dff824b2aadb ("nvme-pci: optimize mapping of small single segment requests")
    Cc: stable@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Kevin Hao
    Signed-off-by: Keith Busch

    Kevin Hao
     

14 Oct, 2019

6 commits

  • The access to sk->sk_ll_usec should be hidden behind
    CONFIG_NET_RX_BUSY_POLL like the definition of sk_ll_usec.

    Put access to ->sk_ll_usec behind CONFIG_NET_RX_BUSY_POLL.

    Fixes: 1a9460cef5711 ("nvme-tcp: support simple polling")
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Keith Busch

    Sebastian Andrzej Siewior
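
    A sketch of the guard in the nvme-tcp queue setup path; the surrounding
    code is omitted and the exact value assigned is illustrative, the point
    is that the field only exists when busy polling is configured.

    #ifdef CONFIG_NET_RX_BUSY_POLL
            queue->sock->sk->sk_ll_usec = 1;
    #endif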
     
    Prevent simultaneous controller disabling/enabling tasks from
    interfering with each other by introducing a function that waits until
    the task has successfully transitioned the controller to the RESETTING
    state. This ensures that disabling the controller will not be
    interrupted by another reset path; otherwise a concurrent reset may
    leave the controller in the wrong state.

    Tested-by: Edmund Nadolski
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Keith Busch

    Keith Busch
     
  • A paused controller is doing critical internal activation work in the
    background. Prevent subsequent controller resets from occurring during
    this period by setting the controller state to RESETTING first. A helper
    function, nvme_try_sched_reset_work(), is introduced for these paths so
    they may continue with scheduling the reset_work after they've completed
    their uninterruptible critical section.

    Tested-by: Edmund Nadolski
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Keith Busch

    Keith Busch
     
  • A controller in the resetting state has not yet completed its recovery
    actions. The pci and fc transports were already handling this, so update
    the remaining transports to not attempt additional recovery in this
    state. Instead, just restart the request timer.

    Tested-by: Edmund Nadolski
    Reviewed-by: James Smart
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Keith Busch

    Keith Busch
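
    In timeout-handler terms, the behaviour described above amounts to
    something like this sketch (schematic, not any specific transport's
    handler):

    static enum blk_eh_timer_return nvme_timeout_sketch(struct nvme_ctrl *ctrl)
    {
            if (ctrl->state == NVME_CTRL_RESETTING)
                    return BLK_EH_RESET_TIMER; /* recovery already in flight */

            /* otherwise: start error recovery as before */
            return BLK_EH_DONE;
    }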
     
  • The admin only state was intended to fence off actions that don't
    apply to a non-IO capable controller. The only actual user of this is
    the scan_work, and pci was the only transport to ever set this state.
    The consequence of having this state is placing an additional burden on
    every other action that applies to both live and admin only controllers.

    Remove the admin only state and place the admin only burden on the only
    place that actually cares: scan_work.

    This also prepares to make it easier to temporarily pause a LIVE state
    so that we don't need to remember which state the controller had been in
    prior to the pause.

    Tested-by: Edmund Nadolski
    Reviewed-by: James Smart
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Keith Busch

    Keith Busch
     
    If a controller becomes degraded after a reset, we will not be able to
    perform any IO. We currently tear down previously created request
    queues and namespaces, but we had kept the unusable tagset. Free
    it after all queues using it have been released.

    Tested-by: Edmund Nadolski
    Reviewed-by: James Smart
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Keith Busch

    Keith Busch
     

05 Oct, 2019

2 commits

  • Commit 7fd8930f26be4

    "nvme: add a common helper to read Identify Controller data"

    has re-introduced an issue that we have attempted to work around in the
    past, in commit a310acd7a7ea ("NVMe: use split lo_hi_{read,write}q").

    The problem is that some PCIe NVMe controllers do not implement 64-bit
    outbound accesses correctly, which is why the commit above switched
    to using lo_hi_[read|write]q for all 64-bit BAR accesses occurring in
    the code.

    In the meantime, the NVMe subsystem has been refactored, and now calls
    into the PCIe support layer for NVMe via a .reg_read64() method, which
    fails to use lo_hi_readq(), and thus reintroduces the problem that the
    workaround above aimed to address.

    Given that, at the moment, .reg_read64() is only used to read the
    capability register [which is known to tolerate split reads], let's
    switch .reg_read64() to lo_hi_readq() as well.

    This fixes a boot issue on some ARM boxes with NVMe behind a Synopsys
    DesignWare PCIe host controller.

    Fixes: 7fd8930f26be4 ("nvme: add a common helper to read Identify Controller data")
    Signed-off-by: Ard Biesheuvel
    Signed-off-by: Sagi Grimberg

    Ard Biesheuvel
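
    The resulting .reg_read64() is essentially a one-liner; a sketch,
    following the nvme-pci naming of that era:

    #include <linux/io-64-nonatomic-lo-hi.h>

    static int nvme_pci_reg_read64(struct nvme_ctrl *ctrl, u32 off, u64 *val)
    {
            /* two 32-bit reads, low half first, for controllers that cannot
             * handle a single 64-bit outbound access */
            *val = lo_hi_readq(to_nvme_dev(ctrl)->bar + off);
            return 0;
    }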
     
  • nvme_update_formats may fail to revalidate the namespace and
    attempt to remove the namespace. This may lead to a deadlock
    as nvme_ns_remove will attempt to acquire the subsystem lock
    which is already acquired by the passthru command with effects.

    Move the invalid namespace removal to after the passthru command
    releases the subsystem lock.

    Reported-by: Judy Brock
    Signed-off-by: Sagi Grimberg

    Sagi Grimberg
     

28 Sep, 2019

2 commits

  • Pull NVMe changes from Sagi:

    "This set consists of various fixes and cleanups:
    - controller removal race fix from Balbir
    - quirk additions from Gabriel and Jian-Hong
    - nvme-pci power state save fix from Mario
    - Add 64bit user commands (for 64bit registers) from Marta
    - nvme-rdma/nvme-tcp fixes from Max, Mark and Me
    - Minor cleanups and nits from James, Dan and John"

    * 'nvme-5.4' of git://git.infradead.org/nvme:
    nvme-rdma: fix possible use-after-free in connect timeout
    nvme: Move ctrl sqsize to generic space
    nvme: Add ctrl attributes for queue_count and sqsize
    nvme: allow 64-bit results in passthru commands
    nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T
    nvmet-tcp: remove superflous check on request sgl
    Added QUIRKs for ADATA XPG SX8200 Pro 512GB
    nvme-rdma: Fix max_hw_sectors calculation
    nvme: fix an error code in nvme_init_subsystem()
    nvme-pci: Save PCI state before putting drive into deepest state
    nvme-tcp: fix wrong stop condition in io_work
    nvme-pci: Fix a race in controller removal
    nvmet: change ppl to lpp

    Jens Axboe
     
  • If the connect times out, we may have already destroyed the
    queue in the timeout handler, so test if the queue is still
    allocated in the connect error handler.

    Reported-by: Yi Zhang
    Signed-off-by: Sagi Grimberg

    Sagi Grimberg
     

26 Sep, 2019

1 commit

  • Current controller interrogation requires a lot of guesswork
    on how many io queues were created and what the io sq size is.
    The numbers are dependent upon core/fabric defaults, connect
    arguments, and target responses.

    Add sysfs attributes for queue_count and sqsize.

    Signed-off-by: James Smart
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Sagi Grimberg

    James Smart
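
    A sketch of one such read-only attribute; the attribute group wiring is
    omitted and the exact formatting is illustrative:

    static ssize_t sqsize_show(struct device *dev,
                               struct device_attribute *attr, char *buf)
    {
            struct nvme_ctrl *ctrl = dev_get_drvdata(dev);

            /* ctrl->sqsize holds the value negotiated at connect time */
            return snprintf(buf, PAGE_SIZE, "%u\n", ctrl->sqsize);
    }
    static DEVICE_ATTR_RO(sqsize);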