29 Oct, 2018

1 commit


13 Oct, 2018

1 commit

  • commit cf25809bec2c7df4b45df5b2196845d9a4a3c89b upstream.

    If there are errors during initial controller create, the transport
    will teardown the partially initialized controller struct and free
    the ctrl memory. Trouble is - most of those errors can occur due
    to asynchronous events such as io timeouts and subsystem
    connectivity failures. Those failures invoke async workq items to
    reset the controller and attempt reconnect. Those may be in progress
    as the main thread frees the ctrl memory, resulting in NULL ptr oops.

    Prevent this from happening by having the main ctrl failure thread
    change the state to DELETING and then synchronously cancel any
    pending queued work items. The change of state prevents the
    scheduling of resets or reconnect events.
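    The pattern described above can be sketched in userspace C; this is a
    minimal model with illustrative names, using C11 atomics in place of the
    kernel's controller state machine and cancel_work_sync():

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical controller states mirroring the pattern described above. */
enum ctrl_state { CTRL_LIVE, CTRL_RESETTING, CTRL_DELETING };

struct ctrl {
    _Atomic enum ctrl_state state;
};

/* Async paths must go through this gate before queueing reset work.
 * Once the state is DELETING, no new reset/reconnect work can be queued. */
static bool ctrl_try_schedule_reset(struct ctrl *c)
{
    enum ctrl_state expected = CTRL_LIVE;
    return atomic_compare_exchange_strong(&c->state, &expected,
                                          CTRL_RESETTING);
}

/* Main failure path: move to DELETING first, then (in the real driver)
 * synchronously cancel any already-queued items before freeing the ctrl. */
static void ctrl_begin_delete(struct ctrl *c)
{
    atomic_store(&c->state, CTRL_DELETING);
    /* ... cancel and flush pending work here before freeing memory ... */
}
```

    Because the state changes before the cancellation, a work item that races
    with the teardown either completes before the cancel or is never queued.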

    Signed-off-by: James Smart
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe
    Signed-off-by: Amit Pundir
    Signed-off-by: Greg Kroah-Hartman

    James Smart
     

10 Oct, 2018

1 commit

  • [ Upstream commit 8407879c4e0d7731f6e7e905893cecf61a7762c7 ]

    Currently we always repost the recv buffer before we send a response
    capsule back to the host. Since ordering is not guaranteed for send
    and recv completions, it is possible that we will receive a new request
    from the host before we got a send completion for the response capsule.

    Today, we pre-allocate 2x the queue depth worth of rsps, but in reality,
    under heavy load there is nothing really preventing the gap from
    expanding until we exhaust all our rsps.

    To fix this, if we don't have any pre-allocated rsps left, we dynamically
    allocate a rsp and make sure to free it when we are done. If under memory
    pressure we fail to allocate a rsp, we silently drop the command and
    wait for the host to retry.
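    A toy version of that fallback, with illustrative names rather than the
    real nvmet-rdma structures:

```c
#include <stdbool.h>
#include <stdlib.h>

/* Toy response-capsule pool: take from the pre-allocated freelist when
 * possible, fall back to a dynamic allocation when it is exhausted. */
struct rsp {
    bool allocated;   /* true if this rsp came from malloc, not the pool */
    struct rsp *next;
};

static struct rsp *free_list;

static struct rsp *get_rsp(void)
{
    struct rsp *r = free_list;
    if (r) {
        free_list = r->next;
        r->allocated = false;
        return r;
    }
    /* Pool exhausted: allocate dynamically.  On failure the real driver
     * silently drops the command and relies on the host to retry. */
    r = malloc(sizeof(*r));
    if (r)
        r->allocated = true;
    return r;
}

static void put_rsp(struct rsp *r)
{
    if (r->allocated) {
        free(r);          /* dynamically allocated: free when done */
        return;
    }
    r->next = free_list;  /* pool rsp: return it to the freelist */
    free_list = r;
}
```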

    Reported-by: Steve Wise
    Tested-by: Steve Wise
    Signed-off-by: Sagi Grimberg
    [hch: dropped a superfluous assignment]
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Sagi Grimberg
     

04 Oct, 2018

1 commit

  • [ Upstream commit afd299ca996929f4f98ac20da0044c0cdc124879 ]

    When a targetport is removed from the config, fcloop will avoid calling
    the LS done() routine thinking the targetport is gone. This leaves the
    initiator reset/reconnect hanging as it waits for a status on the
    Create_Association LS for the reconnect.

    Change the filter in the LS callback path. If tport is null (set when
    validation failed before "sending to remote port"), be sure to call
    done. This was the main bug. But continue the logic that only calls
    done if tport was set but there is no remoteport (e.g. case where
    remoteport has been removed, thus host doesn't expect a completion).

    Signed-off-by: James Smart
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    James Smart
     

26 Sep, 2018

1 commit

  • [ Upstream commit 90140624e8face94207003ac9a9d2a329b309d68 ]

    If the controller is going away, we need to unquiesce the IO queues so
    that all pending requests can fail gracefully before moving forward with
    controller deletion. Do that before we destroy the IO queues so
    blk_cleanup_queue won't block in freeze.

    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Sagi Grimberg
     

05 Sep, 2018

1 commit

  • commit f1ed3df20d2d223e0852cc4ac1f19bba869a7e3c upstream.

    In many architectures loads may be reordered with older stores to
    different locations. In the nvme driver the following two operations
    could be reordered:

    - Write shadow doorbell (dbbuf_db) into memory.
    - Read EventIdx (dbbuf_ei) from memory.

    This can result in a potential race condition between driver and VM host
    processing requests (if the given virtual NVMe controller supports the
    shadow doorbell). If that occurs, the NVMe controller may decide to
    wait for MMIO doorbell from guest operating system, and guest driver may
    decide not to issue MMIO doorbell on any of subsequent commands.
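    A userspace sketch of the required ordering, using a C11 seq_cst fence in
    place of the kernel's mb(); the doorbell/event-index comparison follows
    the usual virtio-style need-event check, and all names here are
    illustrative:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Without a full barrier between the store to the shadow doorbell and the
 * load of the event index, the CPU may reorder them and both sides can end
 * up waiting on each other. */
static _Atomic uint32_t dbbuf_db; /* shadow doorbell, written by driver  */
static _Atomic uint32_t dbbuf_ei; /* event index, written by controller  */

/* Returns nonzero if an MMIO doorbell write is still required. */
static int db_update_and_check_event(uint32_t new_db, uint32_t old_db)
{
    atomic_store_explicit(&dbbuf_db, new_db, memory_order_relaxed);

    /* Full fence: make the dbbuf_db store visible before reading dbbuf_ei. */
    atomic_thread_fence(memory_order_seq_cst);

    uint32_t ei = atomic_load_explicit(&dbbuf_ei, memory_order_relaxed);

    /* Ring the hardware doorbell iff EventIdx falls in (old_db, new_db]. */
    return (uint32_t)(new_db - ei - 1) < (uint32_t)(new_db - old_db);
}
```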

    This issue is a purely timing-dependent one, so there is no easy way to
    reproduce it. Currently the easiest known approach is to run "Oracle IO
    Numbers" (orion) that is shipped with Oracle DB:

    orion -run advanced -num_large 0 -size_small 8 -type rand -simulate \
    concat -write 40 -duration 120 -matrix row -testname nvme_test

    Where nvme_test is a .lun file that contains a list of NVMe block
    devices to run the test against. Limiting the number of vCPUs assigned
    to a given VM instance seems to increase the chances of this bug
    occurring. On a test environment with a VM that had 4 NVMe drives and
    1 vCPU assigned, the virtual NVMe controller hang could be observed
    within 10-20 minutes. That corresponds to about 400-500k IO operations
    processed (or about 100GB of IO reads/writes).

    The Orion tool was used as validation and set to run in a loop for 36
    hours (equivalent to pushing 550M IO operations). No issues were
    observed, which suggests that the patch fixes the issue.

    Fixes: f9f38e33389c ("nvme: improve performance for virtual NVMe devices")
    Signed-off-by: Michal Wnukowski
    Reviewed-by: Keith Busch
    Reviewed-by: Sagi Grimberg
    [hch: updated changelog and comment a bit]
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Greg Kroah-Hartman

    Michal Wnukowski
     

24 Aug, 2018

2 commits

  • [ Upstream commit 9b382768135ee3ff282f828c906574a8478e036b ]

    The old code in nvme_user_cmd() passed the userspace virtual address
    from nvme_passthru_cmd.metadata as the length of the metadata buffer
    as well as the address to nvme_submit_user_cmd().
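    A cut-down illustration of the bug and its fix; the struct and helper
    names are illustrative, loosely modeled on the uapi fields rather than
    copied from the driver:

```c
#include <stdint.h>

/* Cut-down view of the passthrough ioctl arguments. */
struct passthru_cmd {
    uint64_t metadata;      /* userspace pointer to the metadata buffer */
    uint32_t metadata_len;  /* its length in bytes */
};

struct submit_args {
    uint64_t meta_buffer;   /* buffer address passed to the submit path */
    uint32_t meta_len;      /* buffer length passed to the submit path  */
};

/* The buggy version passed cmd->metadata for both fields; the fix forwards
 * the actual metadata_len as the length. */
static struct submit_args build_submit_args(const struct passthru_cmd *cmd)
{
    struct submit_args a = {
        .meta_buffer = cmd->metadata,
        .meta_len    = cmd->metadata_len,  /* was: cmd->metadata */
    };
    return a;
}
```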

    Fixes: 63263d60 ("nvme: Use metadata for passthrough commands")
    Signed-off-by: Roland Dreier
    Reviewed-by: Keith Busch
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Roland Dreier
     
  • [ Upstream commit d68a90e148f5a82aa67654c5012071e31c0e4baa ]

    Controllers that are not yet enabled should not really enforce keep alive
    timeouts, but we still want to track a timeout and cleanup in case a host
    died before it enabled the controller. Hence, simply reset the keep
    alive timer when the controller is enabled.

    Suggested-by: Max Gurtovoy
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Max Gurtovoy
     

09 Aug, 2018

3 commits

  • commit d082dc1562a2ff0947b214796f12faaa87e816a9 upstream.

    The existing code to carve up the sg list expected an sg element per
    page, which can be very incorrect with IOMMUs remapping multiple memory
    pages to fewer bus addresses. Hitting this error required a large io
    payload (greater than 256k) and a system that maps on a per-page basis.
    It's possible that large ios could get by fine if the system condensed
    the sgl list into the first 64 elements.

    This patch corrects the sg list handling by specifically walking the
    sg list element by element and attempting to divide the transfer up
    on a per-sg element boundary. While doing so, it still tries to keep
    sequences under 256k, but will exceed that rule if a single sg element
    is larger than 256k.
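    The carving strategy can be sketched as follows, with toy types that
    only count the resulting sequences:

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_SEQ_BYTES (256 * 1024)  /* soft cap per transfer sequence */

struct sg_elem { uint32_t length; };

/* Walk the sg list element by element, cutting a new sequence whenever the
 * soft 256k cap would be exceeded, but never splitting a single element.
 * Returns the number of sequences produced. */
static size_t carve_sg(const struct sg_elem *sg, size_t nents)
{
    size_t seqs = 0, cur = 0;

    for (size_t i = 0; i < nents; i++) {
        if (cur && cur + sg[i].length > MAX_SEQ_BYTES) {
            seqs++;        /* close the current sequence on an sg boundary */
            cur = 0;
        }
        cur += sg[i].length; /* may exceed 256k if one element is that big */
    }
    if (cur)
        seqs++;
    return seqs;
}
```

    A single element larger than 256k forms its own sequence, matching the
    exception to the 256k rule described above.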

    Fixes: 48fa362b6c3f ("nvmet-fc: simplify sg list handling")
    Cc: # 4.14
    Signed-off-by: James Smart
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Greg Kroah-Hartman

    James Smart
     
  • commit 62314e405fa101dbb82563394f9dfc225e3f1167 upstream.

    The queue count says the highest queue that's been allocated, so don't
    reallocate a queue lower than that.

    Fixes: 147b27e4bd0 ("nvme-pci: allocate device queues storage space at probe")
    Signed-off-by: Keith Busch
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jon Derrick
    Signed-off-by: Greg Kroah-Hartman

    Keith Busch
     
  • commit 147b27e4bd08406a6abebedbb478b431ec197be1 upstream.

    Setting 'nvmeq' in nvme_init_request() may cause a race, because
    .init_request is called inside an io scheduler switch, which
    may happen while the NVMe device is being reset and its nvme queues
    are being freed and created. We don't have any sync between the two
    paths.

    This patch changes the nvmeq allocation to occur at probe time so
    there is no way we can dereference it at init_request.

    [ 93.268391] kernel BUG at drivers/nvme/host/pci.c:408!
    [ 93.274146] invalid opcode: 0000 [#1] SMP
    [ 93.278618] Modules linked in: nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss
    nfsv4 dns_resolver nfs lockd grace fscache sunrpc ipmi_ssif vfat fat
    intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel
    kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel iTCO_wdt
    intel_cstate ipmi_si iTCO_vendor_support intel_uncore mxm_wmi mei_me
    ipmi_devintf intel_rapl_perf pcspkr sg ipmi_msghandler lpc_ich dcdbas mei
    shpchp acpi_power_meter wmi dm_multipath ip_tables xfs libcrc32c sd_mod
    mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt
    fb_sys_fops ttm drm ahci libahci nvme libata crc32c_intel nvme_core tg3
    megaraid_sas ptp i2c_core pps_core dm_mirror dm_region_hash dm_log dm_mod
    [ 93.349071] CPU: 5 PID: 1842 Comm: sh Not tainted 4.15.0-rc2.ming+ #4
    [ 93.356256] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017
    [ 93.364801] task: 00000000fb8abf2a task.stack: 0000000028bd82d1
    [ 93.371408] RIP: 0010:nvme_init_request+0x36/0x40 [nvme]
    [ 93.377333] RSP: 0018:ffffc90002537ca8 EFLAGS: 00010246
    [ 93.383161] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000008
    [ 93.391122] RDX: 0000000000000000 RSI: ffff880276ae0000 RDI: ffff88047bae9008
    [ 93.399084] RBP: ffff88047bae9008 R08: ffff88047bae9008 R09: 0000000009dabc00
    [ 93.407045] R10: 0000000000000004 R11: 000000000000299c R12: ffff880186bc1f00
    [ 93.415007] R13: ffff880276ae0000 R14: 0000000000000000 R15: 0000000000000071
    [ 93.422969] FS: 00007f33cf288740(0000) GS:ffff88047ba80000(0000) knlGS:0000000000000000
    [ 93.431996] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 93.438407] CR2: 00007f33cf28e000 CR3: 000000047e5bb006 CR4: 00000000001606e0
    [ 93.446368] Call Trace:
    [ 93.449103] blk_mq_alloc_rqs+0x231/0x2a0
    [ 93.453579] blk_mq_sched_alloc_tags.isra.8+0x42/0x80
    [ 93.459214] blk_mq_init_sched+0x7e/0x140
    [ 93.463687] elevator_switch+0x5a/0x1f0
    [ 93.467966] ? elevator_get.isra.17+0x52/0xc0
    [ 93.472826] elv_iosched_store+0xde/0x150
    [ 93.477299] queue_attr_store+0x4e/0x90
    [ 93.481580] kernfs_fop_write+0xfa/0x180
    [ 93.485958] __vfs_write+0x33/0x170
    [ 93.489851] ? __inode_security_revalidate+0x4c/0x60
    [ 93.495390] ? selinux_file_permission+0xda/0x130
    [ 93.500641] ? _cond_resched+0x15/0x30
    [ 93.504815] vfs_write+0xad/0x1a0
    [ 93.508512] SyS_write+0x52/0xc0
    [ 93.512113] do_syscall_64+0x61/0x1a0
    [ 93.516199] entry_SYSCALL64_slow_path+0x25/0x25
    [ 93.521351] RIP: 0033:0x7f33ce96aab0
    [ 93.525337] RSP: 002b:00007ffe57570238 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
    [ 93.533785] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007f33ce96aab0
    [ 93.541746] RDX: 0000000000000006 RSI: 00007f33cf28e000 RDI: 0000000000000001
    [ 93.549707] RBP: 00007f33cf28e000 R08: 000000000000000a R09: 00007f33cf288740
    [ 93.557669] R10: 00007f33cf288740 R11: 0000000000000246 R12: 00007f33cec42400
    [ 93.565630] R13: 0000000000000006 R14: 0000000000000001 R15: 0000000000000000
    [ 93.573592] Code: 4c 8d 40 08 4c 39 c7 74 16 48 8b 00 48 8b 04 08 48 85 c0
    74 16 48 89 86 78 01 00 00 31 c0 c3 8d 4a 01 48 63 c9 48 c1 e1 03 eb de
    0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 85 f6 53 48 89
    [ 93.594676] RIP: nvme_init_request+0x36/0x40 [nvme] RSP: ffffc90002537ca8
    [ 93.602273] ---[ end trace 810dde3993e5f14e ]---

    Reported-by: Yi Zhang
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jon Derrick
    Signed-off-by: Greg Kroah-Hartman

    Sagi Grimberg
     

03 Aug, 2018

3 commits

  • [ Upstream commit ea48e877994f086af481427bac110aa63686c3ce ]

    Add a new lightnvm quirk to identify CNEX’s Granby controller.

    Signed-off-by: Wei Xu
    Reviewed-by: Javier González
    Reviewed-by: Matias Bjørling
    Signed-off-by: Keith Busch
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Wei Xu
     
  • [ Upstream commit 72cd4cc28e234ed7189ee508ed65ab60c80a97c8 ]

    The nvme timeout handling doesn't do anything if the pci channel is
    offline, which is the case when recovering from PCI error event, so it
    was a bad idea to sync the controller reset in this state. This patch
    flushes the reset work in the error_resume callback instead when the
    channel is back to online. This keeps AER handling serialized and
    can recover from timeouts.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=199757
    Fixes: cc1d5e749a2e ("nvme/pci: Sync controller reset for AER slot_reset")
    Reported-by: Alex Gagniuc
    Tested-by: Alex Gagniuc
    Signed-off-by: Keith Busch
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Keith Busch
     
  • [ Upstream commit 2e050f00a0f0e07467050cb4afae0234941e5bf3 ]

    For any failure after nvme_rdma_start_queue in
    nvme_rdma_configure_admin_queue, the admin queue will be freed with the
    NVME_RDMA_Q_LIVE flag still set. Once nvme_rdma_stop_queue is invoked,
    that will cause a use-after-free.
    BUG: KASAN: use-after-free in rdma_disconnect+0x1f/0xe0 [rdma_cm]

    To fix it, call nvme_rdma_stop_queue for all the failed cases after
    nvme_rdma_start_queue.

    Signed-off-by: Jianchao Wang
    Suggested-by: Sagi Grimberg
    Reviewed-by: Max Gurtovoy
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Jianchao Wang
     

17 Jul, 2018

1 commit

  • commit 815c6704bf9f1c59f3a6be380a4032b9c57b12f1 upstream.

    The controller memory buffer is remapped into a kernel address on each
    reset, but the driver was setting the submission queue base address
    only on the very first queue creation. The remapped address is likely to
    change after a reset, so accessing the old address will hit a kernel bug.

    This patch fixes that by setting the queue's CMB base address each time
    the queue is created.

    Fixes: f63572dff1421 ("nvme: unmap CMB and remove sysfs file in reset path")
    Reported-by: Christian Black
    Cc: Jon Derrick
    Cc: # 4.9+
    Signed-off-by: Keith Busch
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Scott Bauer
    Reviewed-by: Jon Derrick
    Signed-off-by: Greg Kroah-Hartman

    Keith Busch
     

21 Jun, 2018

4 commits

  • [ Upstream commit f31a21103c03bb62846409fdc60cc9faf2398cfb ]

    If the command has a separate metadata buffer attached, the request needs
    to have the integrity flag set so the driver knows to map it.

    Signed-off-by: Keith Busch
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Keith Busch
     
  • [ Upstream commit 59a2f3f00fd744dbad22593f47552037d3154ca6 ]

    When the same string-type option is specified several times,
    current option parsing may leak memory. Hence, kfree the
    previous value in this case.

    Signed-off-by: Chengguang Xu
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Chengguang Xu
     
  • [ Upstream commit d6fc6a22fc7d3df987666725496ed5dd2dd30f23 ]

    NVME_TARGET_RDMA code depends on INFINIBAND_ADDR_TRANS provided symbols.
    So declare the kconfig dependency. This is necessary to allow for
    enabling INFINIBAND without INFINIBAND_ADDR_TRANS.
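    The dependency can be declared directly in the target's Kconfig entry; a
    minimal sketch (the prompt text and surrounding options are illustrative,
    only the two INFINIBAND symbols come from the changelog):

```
config NVME_TARGET_RDMA
	tristate "NVMe over Fabrics RDMA target support"
	depends on INFINIBAND && INFINIBAND_ADDR_TRANS
	depends on NVME_TARGET
	help
	  This enables the NVMe RDMA target support, which allows exporting
	  NVMe devices over RDMA.
```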

    Signed-off-by: Greg Thelen
    Cc: Tarick Bedeir
    Signed-off-by: Doug Ledford
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Greg Thelen
     
  • [ Upstream commit 3af7a156bdc356946098e13180be66b6420619bf ]

    NVME_RDMA code depends on INFINIBAND_ADDR_TRANS provided symbols. So
    declare the kconfig dependency. This is necessary to allow for enabling
    INFINIBAND without INFINIBAND_ADDR_TRANS.

    Signed-off-by: Greg Thelen
    Cc: Tarick Bedeir
    Signed-off-by: Doug Ledford
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Greg Thelen
     

30 May, 2018

6 commits

  • [ Upstream commit 467c77d4cbefaaf65e2f44fe102d543a52fcae5b ]

    Yet another "incompatible" Samsung NVMe SSD 960 EVO and Asus motherboard
    combination. The 960 EVO disappears from the PCIe bus within a few
    minutes after boot-up when APST is in use and never comes back. Forcing
    NVME_QUIRK_NO_APST is the only way to make this drive work with this
    particular motherboard. NVME_QUIRK_NO_DEEPEST_PS doesn't work, upgrading
    motherboard's BIOS didn't help either.
    Since this is a desktop motherboard, the only drawback of not using APST
    is increased device temperature.

    Signed-off-by: Jarosław Janik
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Jarosław Janik
     
  • [ Upstream commit 74c6c71530847808d4e3be7b205719270efee80c ]

    NVMe over Fabrics 1.0 Section 5.2 "Discovery Controller Properties and
    Command Support" Figure 31 "Discovery Controller – Admin Commands"
    explicitly lists all commands but "Get Log Page" and "Identify" as
    reserved, but NetApp reports that the Linux host is sending Keep Alive
    commands to the discovery controller, which is a violation of the
    spec.

    We're already checking for discovery controllers when configuring the
    keep alive timeout, but when creating a discovery controller we're not
    hard-wiring the keep alive timeout to 0 and thus remain on
    NVME_DEFAULT_KATO for the discovery controller.

    This can be easily reproduced by issuing a direct connect to the
    discovery subsystem using:
    'nvme connect [...] --nqn=nqn.2014-08.org.nvmexpress.discovery'
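    A minimal sketch of the fix, with illustrative field names and an
    illustrative default KATO value:

```c
#include <stdbool.h>

#define NVME_DEFAULT_KATO 5  /* seconds; value chosen for illustration */

struct ctrl_opts {
    bool discovery_nqn;      /* connecting to the discovery subsystem? */
    unsigned int kato;       /* keep-alive timeout, 0 = disabled */
};

/* Hard-wire KATO to 0 when the controller being created is a discovery
 * controller, so no Keep Alive commands are ever sent to it. */
static void set_kato(struct ctrl_opts *opts)
{
    opts->kato = opts->discovery_nqn ? 0 : NVME_DEFAULT_KATO;
}
```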

    Signed-off-by: Johannes Thumshirn
    Fixes: 07bfcd09a288 ("nvme-fabrics: add a generic NVMe over Fabrics library")
    Reported-by: Martin George
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Johannes Thumshirn
     
  • [ Upstream commit 16ccfff2897613007b5eda9e29d65303c6280026 ]

    84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
    has switched to spreading irq vectors among all possible CPUs, so
    pass num_possible_cpus() as the max vectors to be assigned.

    For example, in an 8-core system with CPUs 0~3 online and 4~7
    offline/not present, see 'lscpu':

    [ming@box]$lscpu
    Architecture: x86_64
    CPU op-mode(s): 32-bit, 64-bit
    Byte Order: Little Endian
    CPU(s): 4
    On-line CPU(s) list: 0-3
    Thread(s) per core: 1
    Core(s) per socket: 2
    Socket(s): 2
    NUMA node(s): 2
    ...
    NUMA node0 CPU(s): 0-3
    NUMA node1 CPU(s):
    ...

    1) Before this patch, the allocated vectors and their affinity:
    irq 47, cpu list 0,4
    irq 48, cpu list 1,6
    irq 49, cpu list 2,5
    irq 50, cpu list 3,7

    2) After this patch, the allocated vectors and their affinity:
    irq 43, cpu list 0
    irq 44, cpu list 1
    irq 45, cpu list 2
    irq 46, cpu list 3
    irq 47, cpu list 4
    irq 48, cpu list 6
    irq 49, cpu list 5
    irq 50, cpu list 7

    Cc: Keith Busch
    Cc: Sagi Grimberg
    Cc: Thomas Gleixner
    Cc: Christoph Hellwig
    Signed-off-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Keith Busch
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     
  • [ Upstream commit 651438bb0af5213f1f70d66e75bf11d08cb5537a ]

    Triggering PPC EEH detection and handling requires a memory mapped read
    failure. The NVMe driver removed the periodic health check MMIO, so
    there's no early detection mechanism to trigger the recovery. Instead,
    the detection now happens when the nvme driver handles an IO timeout
    event. This takes the pci channel offline, so we do not want the driver
    to proceed with escalating its own recovery efforts that may conflict
    with the EEH handler.

    This patch ensures the driver will observe the channel was set to offline
    after a failed MMIO read and resets the IO timer so the EEH handler has
    a chance to recover the device.

    Signed-off-by: Wen Xiong
    [updated change log]
    Signed-off-by: Keith Busch
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Wen Xiong
     
  • [ Upstream commit bffd2b61670feef18d2535e9b53364d270a1c991 ]

    PSDT field section according to NVM_Express-1.3:
    "This field specifies whether PRPs or SGLs are used for any data
    transfer associated with the command. PRPs shall be used for all
    Admin commands for NVMe over PCIe. SGLs shall be used for all Admin
    and I/O commands for NVMe over Fabrics. This field shall be set to
    01b for NVMe over Fabrics 1.0 implementations."

    Suggested-by: Idan Burstein
    Signed-off-by: Max Gurtovoy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Keith Busch
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Max Gurtovoy
     
  • [ Upstream commit f25a2dfc20e3a3ed8fe6618c331799dd7bd01190 ]

    This patch fixes nvme queue cleanup if requesting an IRQ handler for
    the queue's vector fails. It does this by resetting the cq_vector to
    the uninitialized value of -1 so it is ignored for a controller reset.

    Signed-off-by: Jianchao Wang
    [changelog updates, removed misc whitespace changes]
    Signed-off-by: Keith Busch
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Jianchao Wang
     

16 May, 2018

1 commit

  • commit 9abd68ef454c824bfd18629033367b4382b5f390 upstream.

    Some P3100 drives have a bug where they think WRRU (weighted round robin)
    is always enabled, even though the host doesn't set it. Since they think
    it's enabled, they also look at the submission queue creation priority. We
    used to set that to MEDIUM by default, but that was removed in commit
    81c1cd98351b. This causes various issues on that drive. Add a quirk to
    still set MEDIUM priority for that controller.

    Fixes: 81c1cd98351b ("nvme/pci: Don't set reserved SQ create flags")
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe
    Signed-off-by: Keith Busch
    Signed-off-by: Greg Kroah-Hartman

    Jens Axboe
     

12 Apr, 2018

2 commits

  • [ Upstream commit 278e096063f1914fccfc77a617be9fc8dbb31b0e ]

    A test case revealed a race condition between an i/o completing on one
    thread and a parallel delete_association generating aborts for the
    outstanding ios on the controller. The i/o completion was freeing the
    target fcloop context, thus the abort task referenced the just-freed
    memory.

    Correct this by clearing the target/initiator cross pointers in the io
    completion and abort tasks before calling the callbacks. On aborts
    that detect already-finished ios, ensure the completion callback is
    called.

    Signed-off-by: James Smart
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    James Smart
     
  • [ Upstream commit 6fda20283e55b9d288cd56822ce39fc8e64f2208 ]

    The current fcloop driver gets its lport structure from the private
    area co-allocated with the fc_localport. All is fine except the
    teardown path, which wants to wait on a completion that is marked
    complete by the delete_localport callback performed after
    unregister_localport. The issue is that the nvme_fc transport frees the
    localport structure immediately after delete_localport is called,
    meaning the original routine is trying to wait on a completion that
    was just freed.

    Change such that a lport struct is allocated coincident with the
    addition and registration of a localport. The private area of the
    localport now contains just a backpointer to the real lport struct.
    Now, the completion can be waited for, and after completing, the
    new structure can be kfree'd.

    Signed-off-by: James Smart
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    James Smart
     

09 Mar, 2018

1 commit

  • commit b4b591c87f2b0f4ebaf3a68d4f13873b241aa584 upstream.

    The entire completion-suppression mechanism is currently broken because the
    HCA might retry a send operation (due to dropped ack) after the nvme
    transaction has completed.

    In order to handle this, we signal all send completions and introduce a
    separate done handler for async events as they will be handled differently
    (as they don't include in-capsule data by definition).

    Signed-off-by: Sagi Grimberg
    Reviewed-by: Max Gurtovoy
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Greg Kroah-Hartman

    Sagi Grimberg
     

03 Mar, 2018

3 commits

  • [ Upstream commit 6b018235b4daabae96d855219fae59c3fb8be417 ]

    The field was uninitialized before use.

    Signed-off-by: Ewan D. Milne
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ewan D. Milne
     
  • [ Upstream commit 249159c5f15812140fa216f9997d799ac0023a1f ]

    Some devices with IDs matching the "stripe" quirk don't actually have
    this quirk, and don't have an MDTS value. When MDTS is not set, the
    driver sets the max sectors to UINT_MAX, which is not a power of 2,
    hitting a BUG_ON from blk_queue_chunk_sectors. This patch skips setting
    chunk sectors for such devices.

    Signed-off-by: Keith Busch
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Keith Busch
     
  • [ Upstream commit 4596e752db02d47038cd7c965419789ab15d1985 ]

    There are two put references in the failure case of initial
    create_association. The first put actually frees the controller, thus the
    second put references freed memory.

    Remove the unnecessary 2nd put.

    Signed-off-by: James Smart
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    James Smart
     

04 Feb, 2018

8 commits

  • [ Upstream commit 7e5dd57ef3081ff6c03908d786ed5087f6fbb7ae ]

    The following condition, which causes a NULL pointer dereference, can
    occur in nvme_free_host_mem() when removing the pci device via
    nvme_remove(), especially after a failure of host memory allocation for
    the HMB:

    "(host_mem_descs == NULL) && (nr_host_mem_descs != 0)"

    This is because __nr_host_mem_descs__ is not cleared to 0 the way
    __host_mem_descs__ is.
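    A minimal userspace model of the fix, resetting the count together with
    the pointer (the names mirror the changelog; the logic is illustrative):

```c
#include <stdlib.h>

/* Pointer and count must be reset together; otherwise a later teardown
 * sees a non-zero count with a NULL descriptor array and dereferences it. */
static struct desc { int dummy; } *host_mem_descs;
static unsigned int nr_host_mem_descs;

static void free_host_mem(void)
{
    for (unsigned int i = 0; i < nr_host_mem_descs; i++) {
        /* release per-descriptor resources here */
    }
    free(host_mem_descs);
    host_mem_descs = NULL;
    nr_host_mem_descs = 0;   /* the missing reset that caused the oops */
}
```

    With both fields cleared, a second call runs the loop zero times and is
    harmless.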

    Signed-off-by: Minwoo Im
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Minwoo Im
     
  • [ Upstream commit 4af7f7ff92a42b6c713293c99e7982bcfcf51a70 ]

    In order to guarantee that the HCA will never get an access violation
    (either from an invalidated rkey or from the iommu) when retrying a send
    operation, we must complete a request only when both the send completion
    and the nvme cqe have arrived. We need to set the send/recv completion flags
    atomically because we might have more than a single context accessing the
    request concurrently (one is cq irq-poll context and the other is
    user-polling used in IOCB_HIPRI).

    Only then we are safe to invalidate the rkey (if needed), unmap the host
    buffers, and complete the IO.
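    The atomic completion-flags idea can be sketched with C11 atomics (flag
    names are illustrative): whichever context sets the second flag wins and
    performs the final completion.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Two completion sources may fire in either order, possibly concurrently;
 * the request must be completed exactly once, by whichever arrives second. */
#define REQ_SEND_COMPLETED (1u << 0)
#define REQ_RESP_COMPLETED (1u << 1)
#define REQ_DONE (REQ_SEND_COMPLETED | REQ_RESP_COMPLETED)

struct request { _Atomic unsigned int flags; };

/* Returns true only for the caller that observed both halves done and
 * therefore owns the final completion (rkey invalidation, unmap, complete). */
static bool req_mark(struct request *rq, unsigned int flag)
{
    unsigned int old = atomic_fetch_or(&rq->flags, flag);
    return (old | flag) == REQ_DONE && old != REQ_DONE;
}
```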

    Signed-off-by: Sagi Grimberg
    Reviewed-by: Max Gurtovoy
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Sagi Grimberg
     
  • [ Upstream commit 619c62dcc62b957d17cccde2081cad527b020883 ]

    Whenever a cmd is received, a reference is taken while looking up the
    queue. The reference is removed after the cmd is done, as the fod is
    returned for reuse. The fod may be reused for a deferred (received but
    no job context) cmd. Existing code removes the reference only if the
    fod is not reused for another command. Given the fod may be used for
    one or more ios, although a reference was taken per io, the references
    won't be matched by the frees.

    Remove the reference on every fod free. This pairs the references to
    each io.

    Signed-off-by: James Smart
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    James Smart
     
  • [ Upstream commit 244a8fe40a09c218622eb9927b9090b0a9b73a1a ]

    An hmb descriptor index out-of-bounds access occurs under the following
    conditions:
    preferred = 128MiB
    chunk_size = 4MiB
    hmmaxd = 1

    In this case, current code will not allow rmmod (which frees the hmb
    descriptors) to complete successfully.

    "descs[i]" is set in the for-loop without any check against
    "max_entries", even though only a single "descs" entry was allocated
    (max_entries = 1) in this case.

    Add a condition to the for-loop to check the descriptor index.
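    A toy version of the allocation loop with the added bound check; sizes
    are in arbitrary units and the names are illustrative:

```c
#include <stddef.h>

/* With preferred=128MiB, chunk_size=4MiB and hmmaxd=1 only one descriptor
 * exists, so the loop must stop at max_entries rather than walking off the
 * end of descs[]. */
static size_t fill_descs(size_t *descs, size_t max_entries,
                         size_t preferred, size_t chunk_size)
{
    size_t allocated = 0, i;

    for (i = 0; allocated < preferred && i < max_entries; i++) {
        size_t len = preferred - allocated;
        if (len > chunk_size)
            len = chunk_size;
        descs[i] = len;       /* the i < max_entries check keeps this safe */
        allocated += len;
    }
    return i;  /* number of descriptors actually used */
}
```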

    Fixes: 044a9df1("nvme-pci: implement the HMB entry number and size limitations")
    Signed-off-by: Minwoo Im
    Reviewed-by: Keith Busch
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Minwoo Im
     
  • [ Upstream commit 8427bbc224863e14d905c87920d4005cb3e88ac3 ]

    The NVMe device in question drops off the PCIe bus after system suspend.
    I've tried several approaches to workaround this issue, but none of them
    works:
    - NVME_QUIRK_DELAY_BEFORE_CHK_RDY
    - NVME_QUIRK_NO_DEEPEST_PS
    - Disable APST before controller shutdown
    - Delay between controller shutdown and system suspend
    - Explicitly set power state to 0 before controller shutdown

    Fortunately it's a desktop, so disabling APST won't hurt battery life.

    Also, change the quirk function name to reflect that it's for
    vendor-combination quirks.

    BugLink: https://bugs.launchpad.net/bugs/1705748
    Signed-off-by: Kai-Heng Feng
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Kai-Heng Feng
     
  • [ Upstream commit 9d7fab04b95e8c26014a9bfc1c943b8360b44c17 ]

    In case the queue is not LIVE (fully functional and connected at the nvmf
    level), we cannot allow any commands other than connect to pass through.

    Add a new queue state flag NVME_LOOP_Q_LIVE which is set after nvmf connect
    and cleared in queue teardown.

    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Sagi Grimberg
     
  • [ Upstream commit 9e0ed16ab9a9aaf670b81c9cd05b5e50defed654 ]

    In case the queue is not LIVE (fully functional and connected at the nvmf
    level), we cannot allow any commands other than connect to pass through.

    Add a new queue state flag NVME_FC_Q_LIVE which is set after nvmf connect
    and cleared in queue teardown.

    Signed-off-by: Sagi Grimberg
    Reviewed-by: James Smart
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Sagi Grimberg
     
  • [ Upstream commit 48832f8d58cfedb2f9bee11bbfbb657efb42e7e7 ]

    When the fabrics queue is not alive and fully functional, no commands
    should be allowed to pass but connect (which moves the queue to a fully
    functional state). Any other command should be failed, with either the
    temporary status BLK_STS_RESOURCE or the permanent status BLK_STS_IOERR.

    This is shared across all fabrics drivers, hence move the check to the
    fabrics library.
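    A sketch of the shared check; the constants and queue fields here are
    illustrative stand-ins, not the real fabrics API:

```c
/* Only the Connect command may pass through a queue that is not yet LIVE;
 * everything else fails with a retryable or permanent status. */
enum blk_status { BLK_STS_OK, BLK_STS_RESOURCE, BLK_STS_IOERR };

#define NVME_CMD_CONNECT 0x01  /* illustrative opcode value */

struct queue {
    int live;        /* queue fully connected at the nvmf level */
    int connecting;  /* queue still coming up: worth retrying    */
};

static enum blk_status check_ready(const struct queue *q, int opcode)
{
    if (q->live || opcode == NVME_CMD_CONNECT)
        return BLK_STS_OK;
    /* Retry later if the queue is still coming up, fail hard otherwise. */
    return q->connecting ? BLK_STS_RESOURCE : BLK_STS_IOERR;
}
```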

    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Sagi Grimberg