Eric Lee / smarc-fsl-linux-kernel

10 Oct, 2015

6 commits

2812dfe37 nvme: include <linux/types.h> in <linux/nvme.h>

The buildbot complains about this even if it doesn't generate
a build warning. But it's an easy fix, so here we go:

Reported-by: kbuild test robot
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2015-10-10 00:40:37 +0800
08c69640c nvme.h: add missing nvme_id_ctrl endianess annotations

Signed-off-by: Christoph Hellwig
Acked-by: Keith Busch
Signed-off-by: Jens Axboe

Christoph Hellwig
2015-10-10 00:40:37 +0800
9d99a8dda nvme: move hardware structures out of the uapi version of nvme.h

Currently all NVMe command and completion structures are exposed to userspace
through the uapi version of nvme.h. They are not an ABI between the kernel
and userspace, and will change in C-incompatible ways for future versions of
the spec. Move them to the kernel version of the file and rename the uapi
header to nvme_ioctl.h so that userspace can easily detect the presence of
the new clean header. Nvme-cli already carries a local copy of the header,
so it won't be affected by this move.

Signed-off-by: Christoph Hellwig
Acked-by: Keith Busch
Signed-off-by: Jens Axboe

Christoph Hellwig
2015-10-10 00:40:37 +0800
f11bb3e24 nvme: add a local nvme.h header

Add a new drivers/block/nvme.h which contains all the driver internal
interface.

Signed-off-by: Christoph Hellwig
Acked-by: Keith Busch
Signed-off-by: Jens Axboe

Christoph Hellwig
2015-10-10 00:40:37 +0800
0a7385ad6 NVMe: Simplify device resume on io queue failure

Releasing IO queues and disks was done in a work queue outside the
controller resume context to delete namespaces if the controller failed
after a resume from suspend. This is unnecessary since we can resume
a device asynchronously.

This patch makes resume use probe_work so it can directly remove
namespaces if the device is manageable but not IO capable. Since
deleting disks was the only reason we had the convoluted "reset_workfn",
this patch removes that unnecessary indirection.

Signed-off-by: Keith Busch
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Keith Busch
2015-10-10 00:40:36 +0800
188c3568f NVMe: Reference count open namespaces

Dynamic namespace attachment means the namespace may be removed at any
time, so the namespace reference count cannot be tied to the device
reference count. This fixes a NULL dereference if an opened namespace
is detached from a controller.

Signed-off-by: Keith Busch
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Keith Busch
2015-10-10 00:40:36 +0800
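
The idea in the commit above, a namespace that stays alive for as long as someone holds it open even after it is detached, can be sketched in userspace C. This is only an illustration under stated assumptions: struct ns_sketch and its helpers are hypothetical names, the counter is a plain int rather than the kernel's atomic kref, and there is no locking.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a namespace object with its own reference count. */
struct ns_sketch {
    int refs;        /* not atomic: illustration only */
    int *released;   /* set to 1 when the object is freed */
};

/* Create a namespace holding one reference (the controller's). */
static struct ns_sketch *ns_alloc(int *released)
{
    struct ns_sketch *ns = malloc(sizeof(*ns));
    assert(ns != NULL);
    ns->refs = 1;
    ns->released = released;
    *released = 0;
    return ns;
}

/* Taken when the block device is opened. */
static void ns_get(struct ns_sketch *ns)
{
    assert(ns->refs > 0);
    ns->refs++;
}

/* Dropped on close or on detach; the last put frees the object. */
static void ns_put(struct ns_sketch *ns)
{
    assert(ns->refs > 0);
    if (--ns->refs == 0) {
        *ns->released = 1;
        free(ns);
    }
}
```

With this shape, a detach that drops the controller's reference does not free a namespace that is still open; the memory goes away only when the last opener closes it.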

19 Aug, 2015

2 commits

81f03fedc NVMe: Add nvme subsystem reset IOCTL

Controllers can perform optional subsystem resets as introduced in NVMe
1.1. This patch adds an IOCTL to trigger the subsystem reset by writing
"NVMe" to the NSSR register.

Signed-off-by: Jon Derrick
Acked-by: Keith Busch
Signed-off-by: Jens Axboe

Jon Derrick
2015-08-19 01:56:13 +0800
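
As a small illustration of the "NVMe" value mentioned in the commit above, the four ASCII characters can be packed into a 32-bit word with the first character in the most significant byte. The helper names fourcc_be and nssr_reset_value are hypothetical and not taken from the driver; the sketch only computes the value and does not touch any register.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pack a four-character ASCII tag into 32 bits, first character in the
 * most significant byte. */
static uint32_t fourcc_be(const char *tag)
{
    assert(strlen(tag) == 4);
    return ((uint32_t)(unsigned char)tag[0] << 24) |
           ((uint32_t)(unsigned char)tag[1] << 16) |
           ((uint32_t)(unsigned char)tag[2] << 8) |
            (uint32_t)(unsigned char)tag[3];
}

/* Hypothetical helper: the value a driver would write to NSSR. */
static uint32_t nssr_reset_value(void)
{
    return fourcc_be("NVMe");   /* 'N' 'V' 'M' 'e' */
}
```

The result is 0x4E564D65, since 'N' is 0x4E, 'V' is 0x56, 'M' is 0x4D and 'e' is 0x65.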
dfbac8c7a NVMe: Add nvme subsystem reset support

Controllers that are part of an NVMe subsystem may be reset by any other controller
in the subsystem. If the device is capable of subsystem resets, this
patch adds detection for such events and performs appropriate controller
initialization upon subsystem reset detection.

The register bit is a RW1C type, so the driver needs to write a 1 to the
status bit to clear the subsystem reset occurred bit during initialization.

Signed-off-by: Keith Busch
Signed-off-by: Jens Axboe

Keith Busch
2015-08-19 01:56:11 +0800
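
The RW1C ("write 1 to clear") behaviour described in the commit above can be modelled in plain C. This is a sketch of the register semantics only, not driver code: the bit position used for the "subsystem reset occurred" flag (bit 4) is an assumption here, all other bits are treated as read-only for simplicity, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define CSTS_NSSRO      (1u << 4)    /* assumed position of the status bit */
#define CSTS_RW1C_MASK  CSTS_NSSRO   /* bits with write-1-to-clear behaviour */

/* Model of the hardware side: a RW1C bit is cleared only where software
 * writes a 1; writing 0 leaves it unchanged. */
static uint32_t csts_after_write(uint32_t csts, uint32_t written)
{
    return csts & ~(written & CSTS_RW1C_MASK);
}

/* Driver side: write a 1 to the bit to acknowledge the event. */
static uint32_t clear_nssro(uint32_t csts)
{
    uint32_t result = csts_after_write(csts, CSTS_NSSRO);
    assert((result & CSTS_NSSRO) == 0);
    return result;
}
```

Note that writing zeros, or ones to unrelated bits, leaves the flag set, which is why the driver has to write the bit explicitly during initialization.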

21 Jul, 2015

1 commit

8ffaadf74 NVMe: Use CMB for the IO SQes if available

Some controllers have a controller-side memory buffer available for use
for submissions, completions, lists, or data.

If a CMB is available, the entire CMB will be ioremapped and it will
attempt to map the IO SQes onto the CMB. The queues will be shrunk as
needed. The CMB will not be used if the queue depth is shrunk below some
threshold where it may have reduced performance over a larger queue
in system memory.

Signed-off-by: Jon Derrick
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Jon Derrick
2015-07-21 23:40:11 +0800
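
The shrink-or-give-up decision described in the commit above might look roughly like the following. This is a sketch under assumptions, not the driver's code: the function name cmb_queue_depth, the minimum depth of 64 and the page-granular split of the buffer are all illustrative choices. Only the 64-byte submission queue entry size comes from the NVMe specification.

```c
#include <assert.h>
#include <stdint.h>

#define SQE_SIZE        64u   /* bytes per submission queue entry */
#define MIN_CMB_QDEPTH  64u   /* assumed threshold; below it, stay in host memory */

/* Return the queue depth to use when placing nr_queues submission queues
 * in a CMB of cmb_size bytes, or -1 if the CMB should not be used. */
static int cmb_queue_depth(uint64_t cmb_size, unsigned nr_queues,
                           unsigned depth, unsigned page_size)
{
    assert(nr_queues > 0 && page_size > 0);

    /* Size of one queue, rounded up to a whole page. */
    uint64_t q_bytes = (uint64_t)depth * SQE_SIZE;
    q_bytes = (q_bytes + page_size - 1) / page_size * page_size;

    if (q_bytes * nr_queues <= cmb_size)
        return (int)depth;                 /* fits without shrinking */

    /* Split the CMB evenly, rounded down to a whole page per queue. */
    uint64_t per_queue = cmb_size / nr_queues;
    per_queue -= per_queue % page_size;

    uint64_t shrunk = per_queue / SQE_SIZE;
    if (shrunk < MIN_CMB_QDEPTH)
        return -1;                         /* too shallow to be worthwhile */
    return (int)shrunk;
}
```

For example, four queues of depth 1024 need 256 KiB in total; in a 64 KiB buffer each queue gets 16 KiB and the depth shrinks to 256.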

06 Jun, 2015

1 commit

a5768aa88 NVMe: Automatic namespace rescan

Namespaces may be dynamically allocated and deleted or attached and
detached. This has the driver rescan the device for namespace changes
after each device reset or namespace change asynchronous event.

There could potentially be many detached namespaces that we don't want
polluting /dev/ with unusable block handles, so this will delete disks
if the namespace is not active as indicated by the response from identify
namespace. This also skips adding the disk if no capacity is provisioned
to the namespace in the first place.

Signed-off-by: Keith Busch
Signed-off-by: Jens Axboe

Keith Busch
2015-06-06 00:58:34 +0800

22 May, 2015

3 commits

d29ec8241 nvme: submit internal commands through the block layer

Use block layer queues with an internal cmd_type to submit internally
generated NVMe commands. This both simplifies the code a lot and allows
for a better structure. For example, the LightNVM code can now construct
commands without knowing the details of the underlying I/O descriptors.
A future NVMe over network target could also inject commands, and the
SCSI translation and ioctl code could be reused for such a beast.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2015-05-22 22:37:20 +0800
e75ec752d nvme: store a struct device pointer in struct nvme_dev

Most users want the generic device, so store that in struct nvme_dev
instead of the pci_dev. This also happens to be a nice step towards
making some code reusable for non-PCI transports.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2015-05-22 22:36:33 +0800
f705f837c nvme: consolidate synchronous command submission helpers

Note that we keep the unused timeout argument, but allow callers to
pass 0 instead of a timeout if they want the default. This will allow
adding a timeout to the pass through path later on.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2015-05-22 22:36:31 +0800
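
The "pass 0 to get the default" convention from the commit above is a common C idiom. A minimal sketch, assuming a hypothetical default of 60 seconds and a hypothetical helper name (neither is taken from the driver):

```c
/* Hypothetical default used when the caller does not care. */
#define DEFAULT_TIMEOUT_MS 60000u

/* A timeout of 0 means "use the default"; any other value is used as-is. */
static unsigned effective_timeout_ms(unsigned timeout_ms)
{
    return timeout_ms ? timeout_ms : DEFAULT_TIMEOUT_MS;
}
```

Keeping the argument in the signature means a later change can start passing real timeouts from the pass-through path without touching every caller.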

08 Apr, 2015

1 commit

a67a95134 NVMe: Meta data handling through submit io ioctl

This adds support for the extended metadata formats through the submit
IO ioctl, and simplifies the rest when using a separate metadata format.

Signed-off-by: Keith Busch
Signed-off-by: Jens Axboe

Keith Busch
2015-04-08 09:11:06 +0800

20 Feb, 2015

5 commits

07836e659 NVMe: Fix potential corruption during shutdown

The driver has to end unreturned commands at some point even if the
controller has not provided a completion. The driver tried to be safe by
deleting IO queues prior to ending all unreturned commands. That should
cause the controller to internally abort inflight commands, but IO queue
deletion request does not have to be successful, so all bets are off. We
still have to make progress, so to be extra safe, this patch doesn't
clear a queue to release the dma mapping for a command until after the
pci device has been disabled.

This patch removes the special handling during device initialization
so controller recovery can be done all the time. This is possible since
initialization is not inlined with pci probe anymore.

Reported-by: Nilish Choudhury
Signed-off-by: Keith Busch

Keith Busch
2015-02-20 07:15:37 +0800
2e1d84481 NVMe: Asynchronous controller probe

This performs the longest parts of nvme device probe in scheduled work.
This speeds up probe significantly when multiple devices are in use.

Signed-off-by: Keith Busch

Keith Busch
2015-02-20 07:15:36 +0800
b3fffdefa NVMe: Register management handle under nvme class

This creates a new class type for nvme devices to register their
management character devices with. This is so we do not rely on miscdev
to provide enough minors for as many nvme devices as some people plan to
use. The previous limit was approximately 60 NVMe controllers, depending
on the platform and kernel. Now the limit is 1M, which ought to be enough
for anybody.

Since we have a new device class, it makes sense to attach the block
devices under this as well, so part of this patch moves the management
handle initialization prior to the namespaces discovery.

Signed-off-by: Keith Busch

Keith Busch
2015-02-20 07:15:36 +0800
4f1982b4e NVMe: Update SCSI Inquiry VPD 83h translation

The original translation created collisions on Inquiry VPD 83 for many
existing devices. Newer specifications provide other ways to translate,
based on the device's version, that can be used to create unique identifiers.

Version 1.1 provides an EUI64 field that uniquely identifies each
namespace, and 1.2 added the longer NGUID field for the same reason.
Both follow the IEEE EUI format and readily translate to the SCSI device
identification EUI designator type 2h. For devices implementing either,
the translation will use this type, defaulting to the EUI64 8-byte type if
implemented, then NGUID's 16-byte version if not. If neither is provided,
the 1.0 translation is used, and is updated to use the SCSI String format
to guarantee a unique identifier.

Knowing when to use the new fields depends on the nvme controller's
revision. The NVME_VS macro was not decoding this correctly, so that is
fixed in this patch and moved to a more appropriate place.

Since the Identify Namespace structure required an update for the NGUID
field, this patch adds the remaining new 1.2 fields to the structure.

Signed-off-by: Keith Busch

Keith Busch
2015-02-20 07:15:35 +0800
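
The selection order described in the commit above (EUI64 first, then NGUID, then the 1.0 fallback) can be sketched as follows. The sketch assumes that "implemented" means the field is not all zeroes and that the version value carries the major number in bits 31:16 and the minor number in bits 15:8. The macro and function names are hypothetical and are not the driver's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Encode a version as major in bits 31:16 and minor in bits 15:8. */
#define NVME_VS_SKETCH(major, minor) \
    (((uint32_t)(major) << 16) | ((uint32_t)(minor) << 8))

enum id_source { ID_EUI64, ID_NGUID, ID_V1_0_STRING };

/* Assumption: a field counts as implemented if any byte is non-zero. */
static bool is_nonzero(const uint8_t *p, size_t n)
{
    assert(p != NULL);
    for (size_t i = 0; i < n; i++)
        if (p[i] != 0)
            return true;
    return false;
}

/* Pick which identifier feeds the VPD 83h designator. */
static enum id_source pick_id_source(uint32_t vs, const uint8_t eui64[8],
                                     const uint8_t nguid[16])
{
    if (vs >= NVME_VS_SKETCH(1, 1) && is_nonzero(eui64, 8))
        return ID_EUI64;
    if (vs >= NVME_VS_SKETCH(1, 2) && is_nonzero(nguid, 16))
        return ID_NGUID;
    return ID_V1_0_STRING;
}
```

Comparing encoded versions as plain integers works because the major number sits in the higher bits.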
e1e5e5641 NVMe: Metadata format support

Adds support for NVMe metadata formats and exposes block devices for
all namespaces regardless of their format. Namespace formats that are
unusable will have disk capacity set to 0, but a handle to the block
device is created to simplify device management. A namespace is not
usable when the format requires the host to interleave block data and
metadata in a single buffer, has no provisioned storage, or has metadata
but failed to register with blk integrity.

The namespace has to be scanned in two phases to support separate
metadata formats. The first establishes the sector size and capacity
prior to invoking add_disk. If metadata is required, the capacity will
be temporarily set to 0 until it can be revalidated and registered with
the integrity extensions after add_disk completes.

The driver relies on the integrity extensions to provide the metadata
buffer. NVMe requires this be a single physically contiguous region,
so only one integrity segment is allowed per command. If the metadata
is used for T10 PI, the driver provides mappings to save and restore
the reftag physical block translation. The driver provides no-op
functions for generate and verify if metadata is not used for protection
information. This way the setup is always provided by the block layer.

If a request does not supply a required metadata buffer, the command
is failed with bad address. This could only happen if a user manually
disables verify/generate on such a disk. The only exception to where
this is okay is if the controller is capable of stripping/generating
the metadata, which is possible on some types of formats.

The metadata scatter gather list now occupies the spot in the nvme_iod
that used to be used to link retryable IOD's, but we don't do that
anymore, so the field was unused.

Signed-off-by: Keith Busch

Keith Busch
2015-02-20 07:15:35 +0800

30 Jan, 2015

1 commit

ac3dd5bd1 NVMe: avoid kmalloc/kfree for smaller IO

Currently we allocate an nvme_iod for each IO, which holds the
sg list, prps, and other IO related info. Set a threshold of
2 pages and/or 8KB of data, below which we can just embed this
in the per-command pdu in blk-mq. For any IO at or below
NVME_INT_PAGES and NVME_INT_BYTES, we save a kmalloc and kfree.

For higher IOPS, this saves up to 1% of CPU time.

Signed-off-by: Jens Axboe
Reviewed-by: Keith Busch

Jens Axboe
2015-01-30 01:25:34 +0800
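
The threshold test described in the commit above can be sketched as a simple predicate. Two assumptions are made here: both limits have to hold at once (the message says "and/or"), and the byte limit is fixed at 8192, which matches two 4 KiB pages. The helper name iod_fits_inline is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define NVME_INT_PAGES  2       /* segment limit named in the commit message */
#define NVME_INT_BYTES  8192    /* assumes 4 KiB pages: 2 pages = 8 KiB */

/* True if the IO is small enough for its bookkeeping to live in the
 * per-command area instead of a separately allocated structure. */
static bool iod_fits_inline(unsigned nr_segments, size_t bytes)
{
    return nr_segments <= NVME_INT_PAGES && bytes <= NVME_INT_BYTES;
}
```

An 8 KiB IO in two segments qualifies; one more byte or one more segment falls back to the allocating path.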

05 Nov, 2014

3 commits

a4aea5623 NVMe: Convert to blk-mq

This converts the NVMe driver to a blk-mq request-based driver.

The NVMe driver is currently bio-based and implements queue logic within
itself. By using blk-mq, a lot of these responsibilities can be moved
and simplified.

The patch is divided into the following blocks:

* Per-command data and cmdid have been moved into the struct request
field. The cmdid_data can be retrieved using blk_mq_rq_to_pdu() and id
maintenance is now handled by blk-mq through the rq->tag field.

* The logic for splitting bio's has been moved into the blk-mq layer.
The driver instead notifies the block layer about limited gap support in
SG lists.

* Timeouts are handled by blk-mq, with the driver's handling reimplemented
within nvme_timeout(). This includes both abort handling and command
cancelation.

* Assignment of nvme queues to CPUs is replaced with the blk-mq
version. The current blk-mq strategy is to assign the number of
mapped queues and CPUs to provide synergy, while the nvme driver
assigns as many nvme hw queues as possible. This can be implemented in
blk-mq if needed.

* NVMe queues are merged with the tags structure of blk-mq.

* blk-mq takes care of setup/teardown of nvme queues and guards invalid
accesses. Therefore, RCU-usage for nvme queues can be removed.

* IO tracing and accounting are handled by blk-mq and therefore removed.

* Queue suspension logic is replaced with the logic from the block
layer.

Contributions in this patch from:

Sam Bradshaw
Jens Axboe
Keith Busch
Robert Nelson

Acked-by: Keith Busch
Acked-by: Jens Axboe

Updated for new ->queue_rq() prototype.

Signed-off-by: Jens Axboe

Matias Bjørling
2014-11-05 04:18:52 +0800
1d0906246 NVMe: Mismatched host/device page size support

Adds support for devices with max page size smaller than the host's.
In the case we encounter such a host/device combination, the driver will
split a page into as many PRP entries as necessary for the device's page
size capabilities. If the device's reported minimum page size is greater
than the host's, the driver will not attempt to enable the device and
return an error instead.

Signed-off-by: Keith Busch
Signed-off-by: Matthew Wilcox
Signed-off-by: Jens Axboe

Keith Busch
2014-11-05 04:17:07 +0800
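
The two cases in the commit above (clamp to the device's maximum page size, refuse if the device's minimum is larger than the host page) can be sketched with page sizes expressed as shifts. The function names are hypothetical and the sketch leaves out how the device's limits are read from its capability register.

```c
#include <assert.h>

/* Choose the page shift to program into the device, or return -1 if the
 * device's minimum page size is larger than the host page. */
static int pick_page_shift(unsigned host_shift, unsigned dev_min_shift,
                           unsigned dev_max_shift)
{
    assert(dev_min_shift <= dev_max_shift);
    if (dev_min_shift > host_shift)
        return -1;                      /* device cannot be enabled */
    if (host_shift > dev_max_shift)
        return (int)dev_max_shift;      /* host page gets split */
    return (int)host_shift;
}

/* Number of PRP entries needed to describe one host page. */
static unsigned prp_entries_per_host_page(unsigned host_shift,
                                          unsigned dev_shift)
{
    assert(dev_shift <= host_shift);
    return 1u << (host_shift - dev_shift);
}
```

For instance, a host with 64 KiB pages (shift 16) and a device limited to 4 KiB pages (shift 12) ends up with 16 PRP entries per host page.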
6fccf9383 NVMe: Async event request

Submits NVMe asynchronous event requests, one event up to the controller
maximum or number of possible different event types (8), whichever is
smaller. Events successfully returned by the controller are logged.

Signed-off-by: Keith Busch
Signed-off-by: Matthew Wilcox
Signed-off-by: Jens Axboe

Keith Busch
2014-11-05 04:17:07 +0800

13 Jun, 2014

1 commit

f3db22feb NVMe: Fix hot cpu notification dead lock

There is a potential deadlock if a cpu event occurs during nvme probe
since it registered with hot cpu notification. This fixes the race by
having the module register with notification outside of probe rather
than have each device register.

The actual work is done in a scheduled work queue instead of in the
notifier since assigning IO queues has the potential to block if the
driver creates additional queues.

Signed-off-by: Keith Busch
Signed-off-by: Matthew Wilcox

Keith Busch
2014-06-13 22:43:34 +0800

04 Jun, 2014

1 commit

bd67608a6 NVMe: Rename io_timeout to nvme_io_timeout

It's positively immoral to have a global variable called 'io_timeout'.
Keep the module parameter called io_timeout, though.

Signed-off-by: Matthew Wilcox

Matthew Wilcox
2014-06-04 11:04:30 +0800

05 May, 2014

3 commits

53562be74 NVMe: Flush with data support

It is possible a filesystem may send a flush flagged bio with write
data. There is no such composite NVMe command, so the driver sends flush
and write separately.

The device is allowed to execute these commands in any order, so it was
possible the driver ends the bio after the write completes, but while the
flush is still active. We don't want to let a filesystem believe flush
succeeded before it really has; this could cause data corruption on a
power loss between these events. To fix, this patch splits the flush
and write into chained bios.

Signed-off-by: Keith Busch
Signed-off-by: Matthew Wilcox

Keith Busch
2014-05-05 22:54:02 +0800
a7d2ce283 NVMe: Configure support for block flush

This configures an nvme request_queue as flush capable if the device
has a volatile write cache present.

Signed-off-by: Keith Busch
Signed-off-by: Matthew Wilcox

Keith Busch
2014-05-05 22:53:53 +0800
8757ad65d NVMe: Update copyright headers

Make the copyright dates accurate and remove the final paragraph that
includes the address of the FSF.

Signed-off-by: Matthew Wilcox

Matthew Wilcox
2014-05-05 22:41:25 +0800

12 Apr, 2014

1 commit

3e8072d48 Merge git://git.infradead.org/users/willy/linux-nvme

Pull NVMe driver updates from Matthew Wilcox:
"Various updates to the NVMe driver. The most user-visible change is
that drive hotplugging now works and CPU hotplug while an NVMe drive
is installed should also work better"

* git://git.infradead.org/users/willy/linux-nvme:
NVMe: Retry failed commands with non-fatal errors
NVMe: Add getgeo to block ops
NVMe: Start-stop nvme_thread during device add-remove.
NVMe: Make I/O timeout a module parameter
NVMe: CPU hot plug notification
NVMe: per-cpu io queues
NVMe: Replace DEFINE_PCI_DEVICE_TABLE
NVMe: Fix divide-by-zero in nvme_trans_io_get_num_cmds
NVMe: IOCTL path RCU protect queue access
NVMe: RCU protected access to io queues
NVMe: Initialize device reference count earlier
NVMe: Add CONFIG_PM_SLEEP to suspend/resume functions

Linus Torvalds
2014-04-12 07:45:59 +0800

11 Apr, 2014

4 commits

edd10d332 NVMe: Retry failed commands with non-fatal errors

For commands returned with failed status, queue these for resubmission
and continue retrying them until success or for a limited amount of
time. The final timeout was arbitrarily chosen so requests can't be
retried indefinitely.

Since these are requeued on the nvmeq that submitted the command, the
callbacks have to take an nvmeq instead of an nvme_dev as a parameter
so that we can use the locked queue to append the iod to retry later.

The nvme_iod conveniently can be used to track how long we've been trying
to successfully complete an iod request. The nvme_iod also provides the
nvme prp dma mappings, so I had to move a few things around so we can
keep those mappings.

Signed-off-by: Keith Busch
[fixed checkpatch issue with long line]
Signed-off-by: Matthew Wilcox

Keith Busch
2014-04-11 05:11:59 +0800
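
The retry rule from the commit above (retry failed commands unless the error is fatal, and only for a limited time) can be sketched as a predicate. Several things here are assumptions: "non-fatal" is modelled as the status not carrying a "do not retry" flag, that flag is placed at bit 14, the retry window is an arbitrary 60 seconds, and time is passed in as milliseconds. None of the names are the driver's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SC_DNR           0x4000u   /* assumed "do not retry" flag in the status */
#define RETRY_WINDOW_MS  60000u    /* hypothetical upper bound on retrying */

/* Decide whether a completed command should be queued for resubmission. */
static bool should_retry(uint16_t status, uint64_t first_submit_ms,
                         uint64_t now_ms)
{
    assert(now_ms >= first_submit_ms);
    if (status == 0)
        return false;                   /* success: nothing to retry */
    if (status & SC_DNR)
        return false;                   /* device says retrying is pointless */
    return (now_ms - first_submit_ms) < RETRY_WINDOW_MS;
}
```

Tracking the time of the first submission, rather than of the latest attempt, is what keeps a command from being retried indefinitely.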
b355084a8 NVMe: Make I/O timeout a module parameter

Increase the default timeout to 30 seconds to match SCSI.

Signed-off-by: Keith Busch
[use byte instead of ushort]
Signed-off-by: Matthew Wilcox

Keith Busch
2014-04-11 05:04:38 +0800
33b1e95c9 NVMe: CPU hot plug notification

Registers with hot cpu notification to rebalance, and potentially allocate
additional, io queues.

Signed-off-by: Keith Busch
Signed-off-by: Matthew Wilcox

Keith Busch
2014-04-11 05:03:42 +0800
42f614201 NVMe: per-cpu io queues

The device's IO queues are associated with CPUs, so we can use a per-cpu
variable to map a qid to a cpu. This provides a convenient way
to optimally assign queues to multiple cpus when the device supports
fewer queues than the host has cpus. The previous implementation may
have assigned these poorly in these situations. This patch addresses
this by sharing queues among cpus that are "close" together and should
have a lower lock contention penalty.

Signed-off-by: Keith Busch
Signed-off-by: Matthew Wilcox

Keith Busch
2014-04-11 05:03:15 +0800
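
To illustrate sharing queues among neighbouring CPUs when there are fewer queues than CPUs, here is a deliberately simplified sketch. It does not use CPU topology as the commit describes; instead it substitutes contiguous blocks of CPU numbers, on the assumption that adjacent CPU numbers are "close". The function name is hypothetical. IO queue ids start at 1 because queue 0 is the admin queue.

```c
#include <assert.h>

/* Map a CPU number to an IO queue id in the range 1..nr_queues. */
static unsigned cpu_to_qid(unsigned cpu, unsigned nr_cpus, unsigned nr_queues)
{
    assert(nr_cpus > 0 && nr_queues > 0 && cpu < nr_cpus);
    if (nr_queues >= nr_cpus)
        return cpu + 1;                      /* one queue per CPU */
    return (cpu * nr_queues) / nr_cpus + 1;  /* contiguous CPU blocks share */
}
```

With 8 CPUs and 2 queues, CPUs 0 to 3 share queue 1 and CPUs 4 to 7 share queue 2.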

24 Mar, 2014

2 commits

4f5099af4 NVMe: IOCTL path RCU protect queue access

This adds rcu protected access to a queue in the nvme IOCTL path
to fix potential races between a surprise removal and queue usage in
nvme_submit_sync_cmd. The fix holds the rcu_read_lock() here to prevent
the nvme_queue from being freed while this path is executing, so it can't
sleep; this path will therefore no longer wait for an available command
id should they all be in use at the time a passthrough IOCTL request
is received.

Signed-off-by: Keith Busch
Signed-off-by: Matthew Wilcox

Keith Busch
2014-03-24 20:54:40 +0800
5a92e700a NVMe: RCU protected access to io queues

This adds rcu protected access to nvme_queue to fix a race between a
surprise removal freeing the queue and a thread with open reference on
a NVMe block device using that queue.

The queues do not need to be rcu protected during the initialization or
shutdown parts, so I've added a helper function for raw dereferencing
to get around the sparse errors.

There is still a hole in the IOCTL path for the same problem, which is
fixed in a subsequent patch.

Signed-off-by: Keith Busch
Signed-off-by: Matthew Wilcox

Keith Busch
2014-03-24 20:45:57 +0800

07 Mar, 2014

1 commit

9ca973744 nvme: don't use PREPARE_WORK

PREPARE_[DELAYED_]WORK() are being phased out. They have few users
and a nasty surprise in terms of reentrancy guarantee as workqueue
considers work items to be different if they don't have the same work
function.

nvme_dev->reset_work is multiplexed with multiple work functions.
Introduce nvme_reset_workfn() which invokes nvme_dev->reset_workfn and
always use it as the work function and update the users to set the
->reset_workfn field instead of overriding the work function using
PREPARE_WORK().

It would probably be best to route this with other related updates
through the workqueue tree.

Compile tested.

Signed-off-by: Tejun Heo
Cc: Matthew Wilcox
Cc: linux-nvme@lists.infradead.org

Tejun Heo
2014-03-07 23:24:49 +0800
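
The indirection described in the commit above, one fixed work function that dispatches through a field the users update, can be shown in plain C without any workqueue machinery. The struct and function names below are hypothetical stand-ins, and the "work" is reduced to recording which step ran.

```c
#include <assert.h>
#include <stddef.h>

struct dev_sketch;
typedef void (*workfn_t)(struct dev_sketch *);

/* Hypothetical device: reset_workfn selects what the work item does next. */
struct dev_sketch {
    workfn_t reset_workfn;
    int log;               /* records the order in which steps ran */
};

/* The one function that is always registered as the work function. */
static void reset_trampoline(struct dev_sketch *dev)
{
    assert(dev->reset_workfn != NULL);
    dev->reset_workfn(dev);
}

/* Two example steps that users can select. */
static void do_reset(struct dev_sketch *dev)  { dev->log = dev->log * 10 + 1; }
static void do_remove(struct dev_sketch *dev) { dev->log = dev->log * 10 + 2; }
```

Because the registered function never changes, the work item keeps a single identity; only the field it reads is updated between runs.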

28 Jan, 2014

2 commits

c30341dc3 NVMe: Abort timed out commands

Send nvme abort command to io requests that have timed out on an
initialized device. If the command is not returned after another timeout,
schedule the controller for reset.

Signed-off-by: Keith Busch
[fix endianness issues]
Signed-off-by: Matthew Wilcox

Keith Busch
2014-01-28 08:27:53 +0800
d4b4ff8e2 NVMe: Schedule reset for failed controllers

Schedules a controller reset when it indicates it has a failed status. If
the device does not become ready after a reset, the pci device will be
scheduled for removal.

Signed-off-by: Keith Busch
[fixed checkpatch issue]
Signed-off-by: Matthew Wilcox

Keith Busch
2014-01-28 08:20:02 +0800

17 Dec, 2013

2 commits

9a6b94584 NVMe: Device resume error handling

Adds controller error handling on resume power management. If the device
fails to initialize, the device is queued for a reset. If the reset fails,
a thread is spawned to remove the pci device.

If the device resumes as "busy", the device is responding to admin
commands but will not create IO queues. In this case, we need to remove
the gendisks and free the IO queues since they can't be used and may be
holding bios in their lists.

From testing, the dma pools require a pci device so this had to change
the pci driver 'remove' to release the dma resources in line with that
call instead of after all references to the device are released.

Signed-off-by: Keith Busch
Signed-off-by: Matthew Wilcox

Keith Busch
2013-12-17 04:54:39 +0800
320a38274 NVMe: compat SG_IO ioctl

For 32-bit versions of sg3-utils running on a 64-bit system. This is
mostly a copy from the relevant portions of fs/compat_ioctl.c, with
slight modifications for going through block_device_operations.

Signed-off-by: Keith Busch
Reviewed-by: Vishal Verma
[fixed up CONFIG_COMPAT=n build problems]
Signed-off-by: Matthew Wilcox

Keith Busch
2013-12-17 04:49:40 +0800