29 Oct, 2018

1 commit


13 Oct, 2018

1 commit

  • commit cf25809bec2c7df4b45df5b2196845d9a4a3c89b upstream.

    If there are errors during initial controller create, the transport
    will teardown the partially initialized controller struct and free
    the ctrl memory. Trouble is - most of those errors can occur due
    to asynchronous events such as io timeouts and subsystem
    connectivity failures. Those failures invoke async workq items to
    reset the controller and attempt reconnect. Those may be in progress
    as the main thread frees the ctrl memory, resulting in NULL ptr oops.

    Prevent this from happening by having the main ctrl failure thread
    change the state to DELETING and then synchronously cancel any
    pending queued work items. The change of state prevents the
    scheduling of resets or reconnect events.
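    The pattern described above can be sketched in userspace C; this is a
    minimal model with illustrative names, using C11 atomics in place of the
    kernel's controller state machine and cancel_work_sync():

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical controller states mirroring the pattern described above. */
enum ctrl_state { CTRL_LIVE, CTRL_RESETTING, CTRL_DELETING };

struct ctrl {
    _Atomic enum ctrl_state state;
};

/* Async paths must go through this gate before queueing reset work.
 * Once the state is DELETING, no new reset/reconnect work can be queued. */
static bool ctrl_try_schedule_reset(struct ctrl *c)
{
    enum ctrl_state expected = CTRL_LIVE;
    return atomic_compare_exchange_strong(&c->state, &expected,
                                          CTRL_RESETTING);
}

/* Main failure path: move to DELETING first, then (in the real driver)
 * synchronously cancel any already-queued items before freeing the ctrl. */
static void ctrl_begin_delete(struct ctrl *c)
{
    atomic_store(&c->state, CTRL_DELETING);
    /* ... cancel and flush pending work here before freeing memory ... */
}
```

    Because the state changes before the cancellation, a work item that races
    with the teardown either completes before the cancel or is never queued.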

    Signed-off-by: James Smart
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe
    Signed-off-by: Amit Pundir
    Signed-off-by: Greg Kroah-Hartman

    James Smart
     

10 Oct, 2018

1 commit

  • [ Upstream commit 8407879c4e0d7731f6e7e905893cecf61a7762c7 ]

    Currently we always repost the recv buffer before we send a response
    capsule back to the host. Since ordering is not guaranteed for send
    and recv completions, it is possible that we will receive a new request
    from the host before we got a send completion for the response capsule.

    Today, we pre-allocate 2x the queue depth worth of rsps, but in reality,
    under heavy load there is nothing really preventing the gap from
    expanding until we exhaust all our rsps.

    To fix this, if we don't have any pre-allocated rsps left, we dynamically
    allocate a rsp and make sure to free it when we are done. If under memory
    pressure we fail to allocate a rsp, we silently drop the command and
    wait for the host to retry.
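    A toy version of that fallback, with illustrative names rather than the
    real nvmet-rdma structures:

```c
#include <stdbool.h>
#include <stdlib.h>

/* Toy response-capsule pool: take from the pre-allocated freelist when
 * possible, fall back to a dynamic allocation when it is exhausted. */
struct rsp {
    bool allocated;   /* true if this rsp came from malloc, not the pool */
    struct rsp *next;
};

static struct rsp *free_list;

static struct rsp *get_rsp(void)
{
    struct rsp *r = free_list;
    if (r) {
        free_list = r->next;
        r->allocated = false;
        return r;
    }
    /* Pool exhausted: allocate dynamically.  On failure the real driver
     * silently drops the command and relies on the host to retry. */
    r = malloc(sizeof(*r));
    if (r)
        r->allocated = true;
    return r;
}

static void put_rsp(struct rsp *r)
{
    if (r->allocated) {
        free(r);          /* dynamically allocated: free when done */
        return;
    }
    r->next = free_list;  /* pool rsp: return it to the freelist */
    free_list = r;
}
```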

    Reported-by: Steve Wise
    Tested-by: Steve Wise
    Signed-off-by: Sagi Grimberg
    [hch: dropped a superfluous assignment]
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Sagi Grimberg
     

04 Oct, 2018

1 commit

  • [ Upstream commit afd299ca996929f4f98ac20da0044c0cdc124879 ]

    When a targetport is removed from the config, fcloop will avoid calling
    the LS done() routine thinking the targetport is gone. This leaves the
    initiator reset/reconnect hanging as it waits for a status on the
    Create_Association LS for the reconnect.

    Change the filter in the LS callback path. If tport is null (set when
    validation failed before "sending to remote port"), be sure to call
    done. This was the main bug. But continue the logic that only calls
    done if tport was set but there is no remoteport (e.g. case where
    remoteport has been removed, thus host doesn't expect a completion).

    Signed-off-by: James Smart
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    James Smart
     

26 Sep, 2018

1 commit

  • [ Upstream commit 90140624e8face94207003ac9a9d2a329b309d68 ]

    If the controller is going away, we need to unquiesce the IO queues so
    that all pending requests can fail gracefully before moving forward with
    controller deletion. Do that before we destroy the IO queues so
    blk_cleanup_queue won't block in freeze.

    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Sagi Grimberg
     

05 Sep, 2018

1 commit

  • commit f1ed3df20d2d223e0852cc4ac1f19bba869a7e3c upstream.

    In many architectures loads may be reordered with older stores to
    different locations. In the nvme driver the following two operations
    could be reordered:

    - Write shadow doorbell (dbbuf_db) into memory.
    - Read EventIdx (dbbuf_ei) from memory.

    This can result in a potential race condition between driver and VM host
    processing requests (if the given virtual NVMe controller supports the
    shadow doorbell). If that occurs, the NVMe controller may decide to
    wait for MMIO doorbell from guest operating system, and guest driver may
    decide not to issue MMIO doorbell on any of subsequent commands.
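    A userspace sketch of the required ordering, using a C11 seq_cst fence in
    place of the kernel's mb(); the doorbell/event-index comparison follows
    the usual virtio-style need-event check, and all names here are
    illustrative:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Without a full barrier between the store to the shadow doorbell and the
 * load of the event index, the CPU may reorder them and both sides can end
 * up waiting on each other. */
static _Atomic uint32_t dbbuf_db; /* shadow doorbell, written by driver  */
static _Atomic uint32_t dbbuf_ei; /* event index, written by controller  */

/* Returns nonzero if an MMIO doorbell write is still required. */
static int db_update_and_check_event(uint32_t new_db, uint32_t old_db)
{
    atomic_store_explicit(&dbbuf_db, new_db, memory_order_relaxed);

    /* Full fence: make the dbbuf_db store visible before reading dbbuf_ei. */
    atomic_thread_fence(memory_order_seq_cst);

    uint32_t ei = atomic_load_explicit(&dbbuf_ei, memory_order_relaxed);

    /* Ring the hardware doorbell iff EventIdx falls in (old_db, new_db]. */
    return (uint32_t)(new_db - ei - 1) < (uint32_t)(new_db - old_db);
}
```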

    This issue is a purely timing-dependent one, so there is no easy way to
    reproduce it. Currently the easiest known approach is to run "Oracle IO
    Numbers" (orion) that is shipped with Oracle DB:

    orion -run advanced -num_large 0 -size_small 8 -type rand -simulate \
    concat -write 40 -duration 120 -matrix row -testname nvme_test

    Where nvme_test is a .lun file that contains a list of NVMe block
    devices to run the test against. Limiting the number of vCPUs assigned
    to a given VM instance seems to increase the chances of this bug
    occurring. On a test environment with a VM that had 4 NVMe drives and
    1 vCPU assigned, the virtual NVMe controller hang could be observed
    within 10-20 minutes. That corresponds to about 400-500k IO operations
    processed (or about 100GB of IO reads/writes).

    The Orion tool was used as validation and set to run in a loop for 36
    hours (equivalent to pushing 550M IO operations). No issues were
    observed, which suggests that the patch fixes the issue.

    Fixes: f9f38e33389c ("nvme: improve performance for virtual NVMe devices")
    Signed-off-by: Michal Wnukowski
    Reviewed-by: Keith Busch
    Reviewed-by: Sagi Grimberg
    [hch: updated changelog and comment a bit]
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Greg Kroah-Hartman

    Michal Wnukowski
     

24 Aug, 2018

2 commits

  • [ Upstream commit 9b382768135ee3ff282f828c906574a8478e036b ]

    The old code in nvme_user_cmd() passed the userspace virtual address
    from nvme_passthru_cmd.metadata as the length of the metadata buffer
    as well as the address to nvme_submit_user_cmd().
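    A cut-down illustration of the bug and its fix; the struct and helper
    names are illustrative, loosely modeled on the uapi fields rather than
    copied from the driver:

```c
#include <stdint.h>

/* Cut-down view of the passthrough ioctl arguments. */
struct passthru_cmd {
    uint64_t metadata;      /* userspace pointer to the metadata buffer */
    uint32_t metadata_len;  /* its length in bytes */
};

struct submit_args {
    uint64_t meta_buffer;   /* buffer address passed to the submit path */
    uint32_t meta_len;      /* buffer length passed to the submit path  */
};

/* The buggy version passed cmd->metadata for both fields; the fix forwards
 * the actual metadata_len as the length. */
static struct submit_args build_submit_args(const struct passthru_cmd *cmd)
{
    struct submit_args a = {
        .meta_buffer = cmd->metadata,
        .meta_len    = cmd->metadata_len,  /* was: cmd->metadata */
    };
    return a;
}
```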

    Fixes: 63263d60 ("nvme: Use metadata for passthrough commands")
    Signed-off-by: Roland Dreier
    Reviewed-by: Keith Busch
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Roland Dreier
     
  • [ Upstream commit d68a90e148f5a82aa67654c5012071e31c0e4baa ]

    Controllers that are not yet enabled should not really enforce keep alive
    timeouts, but we still want to track a timeout and cleanup in case a host
    died before it enabled the controller. Hence, simply reset the keep
    alive timer when the controller is enabled.

    Suggested-by: Max Gurtovoy
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Max Gurtovoy
     

09 Aug, 2018

3 commits

  • commit d082dc1562a2ff0947b214796f12faaa87e816a9 upstream.

    The existing code to carve up the sg list expected an sg element per
    page, which can be very incorrect with IOMMUs remapping multiple memory
    pages to fewer bus addresses. Hitting this error required a large io
    payload (greater than 256k) and a system that maps on a per-page basis.
    It's possible that large ios could get by fine if the system condensed
    the sgl list into the first 64 elements.

    This patch corrects the sg list handling by specifically walking the
    sg list element by element and attempting to divide the transfer up
    on a per-sg element boundary. While doing so, it still tries to keep
    sequences under 256k, but will exceed that rule if a single sg element
    is larger than 256k.
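    The carving strategy can be sketched as follows, with toy types that
    only count the resulting sequences:

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_SEQ_BYTES (256 * 1024)  /* soft cap per transfer sequence */

struct sg_elem { uint32_t length; };

/* Walk the sg list element by element, cutting a new sequence whenever the
 * soft 256k cap would be exceeded, but never splitting a single element.
 * Returns the number of sequences produced. */
static size_t carve_sg(const struct sg_elem *sg, size_t nents)
{
    size_t seqs = 0, cur = 0;

    for (size_t i = 0; i < nents; i++) {
        if (cur && cur + sg[i].length > MAX_SEQ_BYTES) {
            seqs++;        /* close the current sequence on an sg boundary */
            cur = 0;
        }
        cur += sg[i].length; /* may exceed 256k if one element is that big */
    }
    if (cur)
        seqs++;
    return seqs;
}
```

    A single element larger than 256k forms its own sequence, matching the
    exception to the 256k rule described above.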

    Fixes: 48fa362b6c3f ("nvmet-fc: simplify sg list handling")
    Cc: # 4.14
    Signed-off-by: James Smart
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Greg Kroah-Hartman

    James Smart
     
  • commit 62314e405fa101dbb82563394f9dfc225e3f1167 upstream.

    The queue count says the highest queue that's been allocated, so don't
    reallocate a queue lower than that.

    Fixes: 147b27e4bd0 ("nvme-pci: allocate device queues storage space at probe")
    Signed-off-by: Keith Busch
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jon Derrick
    Signed-off-by: Greg Kroah-Hartman

    Keith Busch
     
  • commit 147b27e4bd08406a6abebedbb478b431ec197be1 upstream.

    Setting 'nvmeq' in nvme_init_request() may cause a race, because
    .init_request is called inside an io scheduler switch, which
    may happen while the NVMe device is being reset and its nvme queues
    are being freed and created. We don't have any sync between the two
    paths.

    This patch changes the nvmeq allocation to occur at probe time so
    there is no way we can dereference it at init_request.

    [ 93.268391] kernel BUG at drivers/nvme/host/pci.c:408!
    [ 93.274146] invalid opcode: 0000 [#1] SMP
    [ 93.278618] Modules linked in: nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss
    nfsv4 dns_resolver nfs lockd grace fscache sunrpc ipmi_ssif vfat fat
    intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel
    kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel iTCO_wdt
    intel_cstate ipmi_si iTCO_vendor_support intel_uncore mxm_wmi mei_me
    ipmi_devintf intel_rapl_perf pcspkr sg ipmi_msghandler lpc_ich dcdbas mei
    shpchp acpi_power_meter wmi dm_multipath ip_tables xfs libcrc32c sd_mod
    mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt
    fb_sys_fops ttm drm ahci libahci nvme libata crc32c_intel nvme_core tg3
    megaraid_sas ptp i2c_core pps_core dm_mirror dm_region_hash dm_log dm_mod
    [ 93.349071] CPU: 5 PID: 1842 Comm: sh Not tainted 4.15.0-rc2.ming+ #4
    [ 93.356256] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017
    [ 93.364801] task: 00000000fb8abf2a task.stack: 0000000028bd82d1
    [ 93.371408] RIP: 0010:nvme_init_request+0x36/0x40 [nvme]
    [ 93.377333] RSP: 0018:ffffc90002537ca8 EFLAGS: 00010246
    [ 93.383161] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000008
    [ 93.391122] RDX: 0000000000000000 RSI: ffff880276ae0000 RDI: ffff88047bae9008
    [ 93.399084] RBP: ffff88047bae9008 R08: ffff88047bae9008 R09: 0000000009dabc00
    [ 93.407045] R10: 0000000000000004 R11: 000000000000299c R12: ffff880186bc1f00
    [ 93.415007] R13: ffff880276ae0000 R14: 0000000000000000 R15: 0000000000000071
    [ 93.422969] FS: 00007f33cf288740(0000) GS:ffff88047ba80000(0000) knlGS:0000000000000000
    [ 93.431996] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 93.438407] CR2: 00007f33cf28e000 CR3: 000000047e5bb006 CR4: 00000000001606e0
    [ 93.446368] Call Trace:
    [ 93.449103] blk_mq_alloc_rqs+0x231/0x2a0
    [ 93.453579] blk_mq_sched_alloc_tags.isra.8+0x42/0x80
    [ 93.459214] blk_mq_init_sched+0x7e/0x140
    [ 93.463687] elevator_switch+0x5a/0x1f0
    [ 93.467966] ? elevator_get.isra.17+0x52/0xc0
    [ 93.472826] elv_iosched_store+0xde/0x150
    [ 93.477299] queue_attr_store+0x4e/0x90
    [ 93.481580] kernfs_fop_write+0xfa/0x180
    [ 93.485958] __vfs_write+0x33/0x170
    [ 93.489851] ? __inode_security_revalidate+0x4c/0x60
    [ 93.495390] ? selinux_file_permission+0xda/0x130
    [ 93.500641] ? _cond_resched+0x15/0x30
    [ 93.504815] vfs_write+0xad/0x1a0
    [ 93.508512] SyS_write+0x52/0xc0
    [ 93.512113] do_syscall_64+0x61/0x1a0
    [ 93.516199] entry_SYSCALL64_slow_path+0x25/0x25
    [ 93.521351] RIP: 0033:0x7f33ce96aab0
    [ 93.525337] RSP: 002b:00007ffe57570238 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
    [ 93.533785] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007f33ce96aab0
    [ 93.541746] RDX: 0000000000000006 RSI: 00007f33cf28e000 RDI: 0000000000000001
    [ 93.549707] RBP: 00007f33cf28e000 R08: 000000000000000a R09: 00007f33cf288740
    [ 93.557669] R10: 00007f33cf288740 R11: 0000000000000246 R12: 00007f33cec42400
    [ 93.565630] R13: 0000000000000006 R14: 0000000000000001 R15: 0000000000000000
    [ 93.573592] Code: 4c 8d 40 08 4c 39 c7 74 16 48 8b 00 48 8b 04 08 48 85 c0
    74 16 48 89 86 78 01 00 00 31 c0 c3 8d 4a 01 48 63 c9 48 c1 e1 03 eb de
    0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 85 f6 53 48 89
    [ 93.594676] RIP: nvme_init_request+0x36/0x40 [nvme] RSP: ffffc90002537ca8
    [ 93.602273] ---[ end trace 810dde3993e5f14e ]---

    Reported-by: Yi Zhang
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jon Derrick
    Signed-off-by: Greg Kroah-Hartman

    Sagi Grimberg
     

03 Aug, 2018

3 commits

  • [ Upstream commit ea48e877994f086af481427bac110aa63686c3ce ]

    Add a new lightnvm quirk to identify CNEX’s Granby controller.

    Signed-off-by: Wei Xu
    Reviewed-by: Javier González
    Reviewed-by: Matias Bjørling
    Signed-off-by: Keith Busch
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Wei Xu
     
  • [ Upstream commit 72cd4cc28e234ed7189ee508ed65ab60c80a97c8 ]

    The nvme timeout handling doesn't do anything if the pci channel is
    offline, which is the case when recovering from PCI error event, so it
    was a bad idea to sync the controller reset in this state. This patch
    flushes the reset work in the error_resume callback instead when the
    channel is back to online. This keeps AER handling serialized and
    can recover from timeouts.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=199757
    Fixes: cc1d5e749a2e ("nvme/pci: Sync controller reset for AER slot_reset")
    Reported-by: Alex Gagniuc
    Tested-by: Alex Gagniuc
    Signed-off-by: Keith Busch
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Keith Busch
     
  • [ Upstream commit 2e050f00a0f0e07467050cb4afae0234941e5bf3 ]

    For any failure after nvme_rdma_start_queue in
    nvme_rdma_configure_admin_queue, the admin queue will be freed with the
    NVME_RDMA_Q_LIVE flag still set. Once nvme_rdma_stop_queue is invoked,
    that will cause a use-after-free.
    BUG: KASAN: use-after-free in rdma_disconnect+0x1f/0xe0 [rdma_cm]

    To fix it, call nvme_rdma_stop_queue for all the failed cases after
    nvme_rdma_start_queue.

    Signed-off-by: Jianchao Wang
    Suggested-by: Sagi Grimberg
    Reviewed-by: Max Gurtovoy
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Jianchao Wang
     

17 Jul, 2018

1 commit

  • commit 815c6704bf9f1c59f3a6be380a4032b9c57b12f1 upstream.

    The controller memory buffer is remapped into a kernel address on each
    reset, but the driver was setting the submission queue base address
    only on the very first queue creation. The remapped address is likely to
    change after a reset, so accessing the old address will hit a kernel bug.

    This patch fixes that by setting the queue's CMB base address each time
    the queue is created.

    Fixes: f63572dff1421 ("nvme: unmap CMB and remove sysfs file in reset path")
    Reported-by: Christian Black
    Cc: Jon Derrick
    Cc: # 4.9+
    Signed-off-by: Keith Busch
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Scott Bauer
    Reviewed-by: Jon Derrick
    Signed-off-by: Greg Kroah-Hartman

    Keith Busch
     

21 Jun, 2018

4 commits

  • [ Upstream commit f31a21103c03bb62846409fdc60cc9faf2398cfb ]

    If the command has a separate metadata buffer attached, the request needs
    to have the integrity flag set so the driver knows to map it.

    Signed-off-by: Keith Busch
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Keith Busch
     
  • [ Upstream commit 59a2f3f00fd744dbad22593f47552037d3154ca6 ]

    When the same string-type option is specified several times,
    current option parsing may leak memory. Hence, kfree the
    previous value in this case.

    Signed-off-by: Chengguang Xu
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Chengguang Xu
     
  • [ Upstream commit d6fc6a22fc7d3df987666725496ed5dd2dd30f23 ]

    NVME_TARGET_RDMA code depends on INFINIBAND_ADDR_TRANS provided symbols.
    So declare the kconfig dependency. This is necessary to allow for
    enabling INFINIBAND without INFINIBAND_ADDR_TRANS.
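    The dependency can be declared directly in the target's Kconfig entry; a
    minimal sketch (the prompt text and surrounding options are illustrative,
    only the two INFINIBAND symbols come from the changelog):

```
config NVME_TARGET_RDMA
	tristate "NVMe over Fabrics RDMA target support"
	depends on INFINIBAND && INFINIBAND_ADDR_TRANS
	depends on NVME_TARGET
	help
	  This enables the NVMe RDMA target support, which allows exporting
	  NVMe devices over RDMA.
```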

    Signed-off-by: Greg Thelen
    Cc: Tarick Bedeir
    Signed-off-by: Doug Ledford
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Greg Thelen
     
  • [ Upstream commit 3af7a156bdc356946098e13180be66b6420619bf ]

    NVME_RDMA code depends on INFINIBAND_ADDR_TRANS provided symbols. So
    declare the kconfig dependency. This is necessary to allow for enabling
    INFINIBAND without INFINIBAND_ADDR_TRANS.

    Signed-off-by: Greg Thelen
    Cc: Tarick Bedeir
    Signed-off-by: Doug Ledford
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Greg Thelen
     

30 May, 2018

6 commits

  • [ Upstream commit 467c77d4cbefaaf65e2f44fe102d543a52fcae5b ]

    Yet another "incompatible" Samsung NVMe SSD 960 EVO and Asus motherboard
    combination. The 960 EVO disappears from the PCIe bus within a few
    minutes after boot-up when APST is in use and never comes back. Forcing
    NVME_QUIRK_NO_APST is the only way to make this drive work with this
    particular motherboard. NVME_QUIRK_NO_DEEPEST_PS doesn't work, upgrading
    motherboard's BIOS didn't help either.
    Since this is a desktop motherboard, the only drawback of not using APST
    is increased device temperature.

    Signed-off-by: Jarosław Janik
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Jarosław Janik
     
  • [ Upstream commit 74c6c71530847808d4e3be7b205719270efee80c ]

    NVMe over Fabrics 1.0 Section 5.2 "Discovery Controller Properties and
    Command Support" Figure 31 "Discovery Controller – Admin Commands"
    explicitly lists all commands but "Get Log Page" and "Identify" as
    reserved, but NetApp reports that the Linux host is sending Keep Alive
    commands to the discovery controller, which is a violation of the
    spec.

    We're already checking for discovery controllers when configuring the
    keep alive timeout, but when creating a discovery controller we're not
    hard-wiring the keep alive timeout to 0 and thus remain on
    NVME_DEFAULT_KATO for the discovery controller.

    This can be easily reproduced by issuing a direct connect to the
    discovery subsystem using:
    'nvme connect [...] --nqn=nqn.2014-08.org.nvmexpress.discovery'
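    A minimal sketch of the fix, with illustrative field names and an
    illustrative default KATO value:

```c
#include <stdbool.h>

#define NVME_DEFAULT_KATO 5  /* seconds; value chosen for illustration */

struct ctrl_opts {
    bool discovery_nqn;      /* connecting to the discovery subsystem? */
    unsigned int kato;       /* keep-alive timeout, 0 = disabled */
};

/* Hard-wire KATO to 0 when the controller being created is a discovery
 * controller, so no Keep Alive commands are ever sent to it. */
static void set_kato(struct ctrl_opts *opts)
{
    opts->kato = opts->discovery_nqn ? 0 : NVME_DEFAULT_KATO;
}
```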

    Signed-off-by: Johannes Thumshirn
    Fixes: 07bfcd09a288 ("nvme-fabrics: add a generic NVMe over Fabrics library")
    Reported-by: Martin George
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Johannes Thumshirn
     
  • [ Upstream commit 16ccfff2897613007b5eda9e29d65303c6280026 ]

    84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
    has switched to spreading irq vectors among all possible CPUs, so
    pass num_possible_cpus() as the max vectors to be assigned.

    For example, in an 8-core system with CPUs 0~3 online and 4~7
    offline/not present, see 'lscpu':

    [ming@box]$lscpu
    Architecture: x86_64
    CPU op-mode(s): 32-bit, 64-bit
    Byte Order: Little Endian
    CPU(s): 4
    On-line CPU(s) list: 0-3
    Thread(s) per core: 1
    Core(s) per socket: 2
    Socket(s): 2
    NUMA node(s): 2
    ...
    NUMA node0 CPU(s): 0-3
    NUMA node1 CPU(s):
    ...

    1) Before this patch, the allocated vectors and their affinity:
    irq 47, cpu list 0,4
    irq 48, cpu list 1,6
    irq 49, cpu list 2,5
    irq 50, cpu list 3,7

    2) After this patch, the allocated vectors and their affinity:
    irq 43, cpu list 0
    irq 44, cpu list 1
    irq 45, cpu list 2
    irq 46, cpu list 3
    irq 47, cpu list 4
    irq 48, cpu list 6
    irq 49, cpu list 5
    irq 50, cpu list 7

    Cc: Keith Busch
    Cc: Sagi Grimberg
    Cc: Thomas Gleixner
    Cc: Christoph Hellwig
    Signed-off-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Keith Busch
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     
  • [ Upstream commit 651438bb0af5213f1f70d66e75bf11d08cb5537a ]

    Triggering PPC EEH detection and handling requires a memory mapped read
    failure. The NVMe driver removed the periodic health check MMIO, so
    there's no early detection mechanism to trigger the recovery. Instead,
    the detection now happens when the nvme driver handles an IO timeout
    event. This takes the pci channel offline, so we do not want the driver
    to proceed with escalating its own recovery efforts that may conflict
    with the EEH handler.

    This patch ensures the driver will observe the channel was set to offline
    after a failed MMIO read and resets the IO timer so the EEH handler has
    a chance to recover the device.

    Signed-off-by: Wen Xiong
    [updated change log]
    Signed-off-by: Keith Busch
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Wen Xiong
     
  • [ Upstream commit bffd2b61670feef18d2535e9b53364d270a1c991 ]

    PSDT field section according to NVM_Express-1.3:
    "This field specifies whether PRPs or SGLs are used for any data
    transfer associated with the command. PRPs shall be used for all
    Admin commands for NVMe over PCIe. SGLs shall be used for all Admin
    and I/O commands for NVMe over Fabrics. This field shall be set to
    01b for NVMe over Fabrics 1.0 implementations."

    Suggested-by: Idan Burstein
    Signed-off-by: Max Gurtovoy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Keith Busch
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Max Gurtovoy
     
  • [ Upstream commit f25a2dfc20e3a3ed8fe6618c331799dd7bd01190 ]

    This patch fixes nvme queue cleanup if requesting an IRQ handler for
    the queue's vector fails. It does this by resetting the cq_vector to
    the uninitialized value of -1 so it is ignored for a controller reset.

    Signed-off-by: Jianchao Wang
    [changelog updates, removed misc whitespace changes]
    Signed-off-by: Keith Busch
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Jianchao Wang
     

16 May, 2018

1 commit

  • commit 9abd68ef454c824bfd18629033367b4382b5f390 upstream.

    Some P3100 drives have a bug where they think WRRU (weighted round robin)
    is always enabled, even though the host doesn't set it. Since they think
    it's enabled, they also look at the submission queue creation priority. We
    used to set that to MEDIUM by default, but that was removed in commit
    81c1cd98351b. This causes various issues on that drive. Add a quirk to
    still set MEDIUM priority for that controller.

    Fixes: 81c1cd98351b ("nvme/pci: Don't set reserved SQ create flags")
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe
    Signed-off-by: Keith Busch
    Signed-off-by: Greg Kroah-Hartman

    Jens Axboe
     

12 Apr, 2018

2 commits

  • [ Upstream commit 278e096063f1914fccfc77a617be9fc8dbb31b0e ]

    A test case revealed a race condition between an i/o completing on one
    thread and a parallel delete_association generating aborts for the
    outstanding ios on the controller. The i/o completion was freeing the
    target fcloop context, thus the abort task referenced the just-freed
    memory.

    Correct this by clearing the target/initiator cross pointers in the io
    completion and abort tasks before calling the callbacks. On aborts
    that detect already-finished ios, ensure the completion callback is
    called.

    Signed-off-by: James Smart
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    James Smart
     
  • [ Upstream commit 6fda20283e55b9d288cd56822ce39fc8e64f2208 ]

    The current fcloop driver gets its lport structure from the private
    area co-allocated with the fc_localport. All is fine except the
    teardown path, which wants to wait on a completion that is marked
    complete by the delete_localport callback performed after
    unregister_localport. The issue is that the nvme_fc transport frees the
    localport structure immediately after delete_localport is called,
    meaning the original routine is trying to wait on a completion that
    was just freed.

    Change such that a lport struct is allocated coincident with the
    addition and registration of a localport. The private area of the
    localport now contains just a backpointer to the real lport struct.
    Now, the completion can be waited for, and after completing, the
    new structure can be kfree'd.

    Signed-off-by: James Smart
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    James Smart
     

09 Mar, 2018

1 commit

  • commit b4b591c87f2b0f4ebaf3a68d4f13873b241aa584 upstream.

    The entire completion-suppression mechanism is currently broken because the
    HCA might retry a send operation (due to dropped ack) after the nvme
    transaction has completed.

    In order to handle this, we signal all send completions and introduce a
    separate done handler for async events as they will be handled differently
    (as they don't include in-capsule data by definition).

    Signed-off-by: Sagi Grimberg
    Reviewed-by: Max Gurtovoy
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Greg Kroah-Hartman

    Sagi Grimberg
     

03 Mar, 2018

3 commits

  • [ Upstream commit 6b018235b4daabae96d855219fae59c3fb8be417 ]

    The field was uninitialized before use.

    Signed-off-by: Ewan D. Milne
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ewan D. Milne
     
  • [ Upstream commit 249159c5f15812140fa216f9997d799ac0023a1f ]

    Some devices with IDs matching the "stripe" quirk don't actually have
    this quirk, and don't have an MDTS value. When MDTS is not set, the
    driver sets the max sectors to UINT_MAX, which is not a power of 2,
    hitting a BUG_ON from blk_queue_chunk_sectors. This patch skips setting
    chunk sectors for such devices.

    Signed-off-by: Keith Busch
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Keith Busch
     
  • [ Upstream commit 4596e752db02d47038cd7c965419789ab15d1985 ]

    There are two put references in the failure case of initial
    create_association. The first put actually frees the controller, thus the
    second put references freed memory.

    Remove the unnecessary 2nd put.

    Signed-off-by: James Smart
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    James Smart
     

04 Feb, 2018

8 commits

  • [ Upstream commit 7e5dd57ef3081ff6c03908d786ed5087f6fbb7ae ]

    The following condition, which causes a NULL pointer dereference, can
    occur in nvme_free_host_mem() when removing the pci device via
    nvme_remove(), especially after a failure of host memory allocation for
    the HMB:

    "(host_mem_descs == NULL) && (nr_host_mem_descs != 0)"

    This is because __nr_host_mem_descs__ is not cleared to 0 the way
    __host_mem_descs__ is.
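    A minimal userspace model of the fix, resetting the count together with
    the pointer (the names mirror the changelog; the logic is illustrative):

```c
#include <stdlib.h>

/* Pointer and count must be reset together; otherwise a later teardown
 * sees a non-zero count with a NULL descriptor array and dereferences it. */
static struct desc { int dummy; } *host_mem_descs;
static unsigned int nr_host_mem_descs;

static void free_host_mem(void)
{
    for (unsigned int i = 0; i < nr_host_mem_descs; i++) {
        /* release per-descriptor resources here */
    }
    free(host_mem_descs);
    host_mem_descs = NULL;
    nr_host_mem_descs = 0;   /* the missing reset that caused the oops */
}
```

    With both fields cleared, a second call runs the loop zero times and is
    harmless.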

    Signed-off-by: Minwoo Im
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Minwoo Im
     
  • [ Upstream commit 4af7f7ff92a42b6c713293c99e7982bcfcf51a70 ]

    In order to guarantee that the HCA will never get an access violation
    (either from an invalidated rkey or from the iommu) when retrying a send
    operation, we must complete a request only when both the send completion
    and the nvme cqe have arrived. We need to set the send/recv completion flags
    atomically because we might have more than a single context accessing the
    request concurrently (one is cq irq-poll context and the other is
    user-polling used in IOCB_HIPRI).

    Only then we are safe to invalidate the rkey (if needed), unmap the host
    buffers, and complete the IO.
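    The atomic completion-flags idea can be sketched with C11 atomics (flag
    names are illustrative): whichever context sets the second flag wins and
    performs the final completion.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Two completion sources may fire in either order, possibly concurrently;
 * the request must be completed exactly once, by whichever arrives second. */
#define REQ_SEND_COMPLETED (1u << 0)
#define REQ_RESP_COMPLETED (1u << 1)
#define REQ_DONE (REQ_SEND_COMPLETED | REQ_RESP_COMPLETED)

struct request { _Atomic unsigned int flags; };

/* Returns true only for the caller that observed both halves done and
 * therefore owns the final completion (rkey invalidation, unmap, complete). */
static bool req_mark(struct request *rq, unsigned int flag)
{
    unsigned int old = atomic_fetch_or(&rq->flags, flag);
    return (old | flag) == REQ_DONE && old != REQ_DONE;
}
```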

    Signed-off-by: Sagi Grimberg
    Reviewed-by: Max Gurtovoy
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Sagi Grimberg
     
  • [ Upstream commit 619c62dcc62b957d17cccde2081cad527b020883 ]

    Whenever a cmd is received, a reference is taken while looking up the
    queue. The reference is removed after the cmd is done, as the fod is
    returned for reuse. The fod may be reused for a deferred (received but
    no job context) cmd. Existing code removes the reference only if the
    fod is not reused for another command. Given the fod may be used for
    one or more ios, although a reference was taken per io, the references
    won't be matched by the frees.

    Remove the reference on every fod free. This pairs the references to
    each io.

    Signed-off-by: James Smart
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    James Smart
     
  • [ Upstream commit 244a8fe40a09c218622eb9927b9090b0a9b73a1a ]

    An hmb descriptor index out-of-bounds access occurs under the following
    conditions:
    preferred = 128MiB
    chunk_size = 4MiB
    hmmaxd = 1

    In this case, current code will not allow rmmod (which frees the hmb
    descriptors) to complete successfully.

    "descs[i]" is set in the for-loop without any check against
    "max_entries", even though only a single "descs" entry was allocated
    (max_entries = 1) in this case.

    Add a condition to the for-loop to check the descriptor index.
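    A toy version of the allocation loop with the added bound check; sizes
    are in arbitrary units and the names are illustrative:

```c
#include <stddef.h>

/* With preferred=128MiB, chunk_size=4MiB and hmmaxd=1 only one descriptor
 * exists, so the loop must stop at max_entries rather than walking off the
 * end of descs[]. */
static size_t fill_descs(size_t *descs, size_t max_entries,
                         size_t preferred, size_t chunk_size)
{
    size_t allocated = 0, i;

    for (i = 0; allocated < preferred && i < max_entries; i++) {
        size_t len = preferred - allocated;
        if (len > chunk_size)
            len = chunk_size;
        descs[i] = len;       /* the i < max_entries check keeps this safe */
        allocated += len;
    }
    return i;  /* number of descriptors actually used */
}
```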

    Fixes: 044a9df1("nvme-pci: implement the HMB entry number and size limitations")
    Signed-off-by: Minwoo Im
    Reviewed-by: Keith Busch
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Minwoo Im
     
  • [ Upstream commit 8427bbc224863e14d905c87920d4005cb3e88ac3 ]

    The NVMe device in question drops off the PCIe bus after system suspend.
    I've tried several approaches to workaround this issue, but none of them
    works:
    - NVME_QUIRK_DELAY_BEFORE_CHK_RDY
    - NVME_QUIRK_NO_DEEPEST_PS
    - Disable APST before controller shutdown
    - Delay between controller shutdown and system suspend
    - Explicitly set power state to 0 before controller shutdown

    Fortunately it's a desktop, so disabling APST won't hurt battery life.

    Also, change the quirk function name to reflect that it's for
    vendor-combination quirks.

    BugLink: https://bugs.launchpad.net/bugs/1705748
    Signed-off-by: Kai-Heng Feng
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Kai-Heng Feng
     
  • [ Upstream commit 9d7fab04b95e8c26014a9bfc1c943b8360b44c17 ]

    In case the queue is not LIVE (fully functional and connected at the nvmf
    level), we cannot allow any commands other than connect to pass through.

    Add a new queue state flag NVME_LOOP_Q_LIVE which is set after nvmf connect
    and cleared in queue teardown.

    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Sagi Grimberg
     
  • [ Upstream commit 9e0ed16ab9a9aaf670b81c9cd05b5e50defed654 ]

    In case the queue is not LIVE (fully functional and connected at the nvmf
    level), we cannot allow any commands other than connect to pass through.

    Add a new queue state flag NVME_FC_Q_LIVE which is set after nvmf connect
    and cleared in queue teardown.

    Signed-off-by: Sagi Grimberg
    Reviewed-by: James Smart
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Sagi Grimberg
     
  • [ Upstream commit 48832f8d58cfedb2f9bee11bbfbb657efb42e7e7 ]

    When the fabrics queue is not alive and fully functional, no commands
    should be allowed to pass but connect (which moves the queue to a fully
    functional state). Any other command should be failed, with either the
    temporary status BLK_STS_RESOURCE or the permanent status BLK_STS_IOERR.

    This is shared across all fabrics drivers, hence move the check to the
    fabrics library.
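    A sketch of the shared check; the constants and queue fields here are
    illustrative stand-ins, not the real fabrics API:

```c
/* Only the Connect command may pass through a queue that is not yet LIVE;
 * everything else fails with a retryable or permanent status. */
enum blk_status { BLK_STS_OK, BLK_STS_RESOURCE, BLK_STS_IOERR };

#define NVME_CMD_CONNECT 0x01  /* illustrative opcode value */

struct queue {
    int live;        /* queue fully connected at the nvmf level */
    int connecting;  /* queue still coming up: worth retrying    */
};

static enum blk_status check_ready(const struct queue *q, int opcode)
{
    if (q->live || opcode == NVME_CMD_CONNECT)
        return BLK_STS_OK;
    /* Retry later if the queue is still coming up, fail hard otherwise. */
    return q->connecting ? BLK_STS_RESOURCE : BLK_STS_IOERR;
}
```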

    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Sagi Grimberg