20 Jan, 2017
1 commit
-
commit b5a10c5f7532b7473776da87e67f8301bbc32693 upstream.
Commit 54adc01055b7 ("nvme/quirk: Add a delay before checking for adapter
readiness") introduced a quirk to adapters that cannot read the bit
NVME_CSTS_RDY right after register NVME_REG_CC is set; these adapters
need a delay or else the action of reading the bit NVME_CSTS_RDY could
somehow corrupt adapter's registers state and it never recovers.When this quirk was added, we checked ctrl->tagset in order to avoid
quirking in probe time, supposing we would never require such delay
during probe. Well, it was too optimistic; we in fact need this quirk
at probe time in some cases, like after a kexec.In some experiments, after abnormal shutdown of machine (aka power cord
unplug), we booted into our bootloader in Power, which is a Linux kernel,
and kexec'ed into another distro. If this kexec is too quick, we end up
reaching the probe of NVMe adapter in that distro when adapter is in
bad state (not fully initialized on our bootloader). What happens next
is that nvme_wait_ready() is unable to complete, except if the quirk is
enabled.So, this patch removes the original ctrl->tagset verification in order
to enable the quirk even on probe time.Fixes: 54adc01055b7 ("nvme/quirk: Add a delay before checking for adapter readiness")
Reported-by: Andrew Byrne
Reported-by: Jaime A. H. Gomez
Reported-by: Zachary D. Myers
Signed-off-by: Guilherme G. Piccoli
Acked-by: Jeffrey Lien
Signed-off-by: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman
06 Jan, 2017
1 commit
-
commit e4fcf07cca6a3b6c4be00df16f08be894325eaa3 upstream.
When removing a namespace we delete it from the subsystem namespaces
list with list_del_init which allows us to know if it is enabled or
not.The problem is that list_del_init initialize the list next and does
not respect the RCU list-traversal we do on the IO path for locating
a namespace. Instead we need to use list_del_rcu which is allowed to
run concurrently with the _rcu list-traversal primitives (keeps list
next intact) and guarantees concurrent nvmet_find_naespace forward
progress.By changing that, we cannot rely on ns->dev_link for knowing if the
namspace is enabled, so add enabled indicator entry to nvmet_ns for
that.Signed-off-by: Sagi Grimberg
Signed-off-by: Solganik Alexander
Signed-off-by: Greg Kroah-Hartman
17 Nov, 2016
1 commit
-
The nvme_remove function tears down all allocated resources in the correct
order, so no need to free queues on error during initialization. This
fixes possible use-after-free errors when queues are still associated
with a blk-mq hctx.Reported-by: Scott Bauer
Tested-by: Scott Bauer
Signed-off-by: Keith Busch
Reviewed-by: Sagi Grimberg
Reviewed-by: Christoph Hellwig
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe
14 Nov, 2016
6 commits
-
draining the qp right after disconnect might not suffice because
the nvmet sq is not fully drained (in nvmet_sq_destroy) and we might
see completions after the drain. Instead, drain right before the
qp destroy which comes after the sq destruction and we can be sure
that no posts come after the drain.Tested-by: Steve Wise
Signed-off-by: Sagi Grimberg -
While testing nvme-rdma with the spdk nvmf target over iw_cxgb4, I
configured the target (mistakenly) to generate an error creating the
NVMF IO queues. This resulted a "Invalid SQE Parameter" error sent back
to the host on the first IO queue connect:[ 9610.928182] nvme nvme1: queue_size 128 > ctrl maxcmd 120, clamping down
[ 9610.938745] nvme nvme1: creating 32 I/O queues.So nvmf_connect_io_queue() returns an error to
nvmf_connect_io_queue() / nvmf_connect_io_queues(), and that
is returned to nvme_rdma_create_io_queues(). In the error path,
nvmf_rdma_create_io_queues() frees the queue tagset memory _before_
stopping and freeing the IB queues, which causes yet another
touch-after-free crash due to SQ CQEs being flushed after the ib_cqe
structs pointed-to by the flushed WRs have been freed (since they are
part of the nvme_rdma_request struct).The fix is to stop and free the queues in nvmf_connect_io_queues()
if there is an error connecting any of the queues.Signed-off-by: Steve Wise
Signed-off-by: Sagi Grimberg -
In case we accepted a queue connection and it failed, we might not
remove the queue from the list until we unload and clean it up.
We should delete it from the queue list on the relevant handler.Signed-off-by: Sagi Grimberg
-
In the transport, in case of an interal queue error like
error completion in rdma we trigger a fatal error. However,
multiple queues in the same controller can serr error completions
and we don't want to trigger fatal error work more than once.Reviewed-by: Christoph Hellwig
Signed-off-by: Sagi Grimberg -
If we reconncect we might have command queue up that get resent as soon
as the queue is restarted. But until the connect command succeeded we
can't send other command. Add a new flag that marks a queue as live when
connect finishes, and delay any non-connect command until the queue is
live based on it.Signed-off-by: Christoph Hellwig
Reported-by: Steve Wise
Tested-by: Steve Wise
[sagig: fixes admin queue LIVE setting]
Signed-off-by: Sagi Grimberg -
When we initiate queue teardown sequence we call rdma_destroy_qp
which clears cm_id->qp, afterwards we call rdma_destroy_id, but
we might see a rdma_cm event in between with a cleared cm_id->qp
so watch out for that and silently ignore the event because this
means that the queue teardown sequence is in progress.Signed-off-by: Bart Van Assche
Signed-off-by: Sagi Grimberg
12 Nov, 2016
1 commit
-
The ns->lba_shift assumes its value to be the logarithmic of the
LA size. A previous patch duplicated the lba_shift calculation into
lightnvm. It prematurely also subtracted a 512byte shift, which commonly
is applied per-command. The 512byte shift being subtracted twice led to
data loss when restoring the logical to physical mapping table from
device and when issuing I/O commands using rrpc.Fix offset by removing the 512byte shift subtraction when calculating
lba_shift.Fixes: b0b4e09c1ae7 "lightnvm: control life of nvm_dev in driver"
Reported-by: Javier González
Signed-off-by: Matias Bjørling
Signed-off-by: Jens Axboe
22 Oct, 2016
1 commit
-
Pull block fixes from Jens Axboe:
"A set of fixes that missed the merge window, mostly due to me being
away around that time.Nothing major here, a mix of nvme cleanups and fixes, and one fix for
the badblocks handling"* 'for-linus' of git://git.kernel.dk/linux-block:
nvmet: use symbolic constants for CNS values
nvme: use symbolic constants for CNS values
nvme.h: add an enum for cns values
nvme.h: don't use uuid_be
nvme.h: resync with nvme-cli
nvme: Add tertiary number to NVME_VS
nvme : Add sysfs entry for NVMe CMBs when appropriate
nvme: don't schedule multiple resets
nvme: Delete created IO queues on reset
nvme: Stop probing a removed device
badblocks: fix overlapping check for clearing
20 Oct, 2016
4 commits
-
Signed-off-by: Christoph Hellwig
Reviewed-by: Gabriel Krisman Bertazi
Reviewed-by: Jay Freyensee
Signed-off-by: Jens Axboe -
Signed-off-by: Christoph Hellwig
Reviewed-by: Gabriel Krisman Bertazi
Reviewed-by: Keith Busch
Signed-off-by: Jens Axboe -
Import a few updates to nvme.h from nvme-cli. This mostly includes a few
new fields and error codes, but also a few renames that so far are only
used in user space. Also one field is moved from an array of two le64
values to one of 16 u8 values so that we can more easily access it.Signed-off-by: Christoph Hellwig
Reviewed-by: Keith Busch
Reviewed-by: Gabriel Krisman Bertazi
Signed-off-by: Jens Axboe -
NVMe 1.2.1 specification adds a tertiary element to the version number.
This updates the macro and its callers to include the final number and
fixup a single place in nvmet where the version was generated manually.Signed-off-by: Gabriel Krisman Bertazi
Reviewed-by: Sagi Grimberg
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe
13 Oct, 2016
1 commit
-
Add a sysfs attribute that contains salient information about the NVMe
Controller Memory Buffer when one is present. For now, just display the
information about the CMB available from the control registers. We attach
the CMB attribute file to the existing nvme_ctrl sysfs group so it can
handle the sysfs teardown.Reviewed-by: Sagi Grimberg
Reviewed-by: Jay Freyensee
Signed-off-by: Stephen Bates
Acked-by Jon Derrick:
Signed-off-by: Jens Axboe
12 Oct, 2016
4 commits
-
The queue_work only fails if the work is pending, but not yet running. If
the work is running, the work item would get requeued, triggering a
double reset. If the first reset fails for any reason, the second
reset triggers:WARN_ON(dev->ctrl.state == NVME_CTRL_RESETTING)
Hitting that schedules controller deletion for a second time, which
potentially takes a reference on the device that is being deleted.
If the reset occurs at the same time as a hot removal event, this causes
a double-free.This patch has the reset helper function check if the work is busy
prior to queueing, and changes all places that schedule resets to use
this function. Since most users don't want to sync with that work, the
"flush_work" is moved to the only caller that wants to sync.Signed-off-by: Keith Busch
Reviewed-by: Sagi Grimberg
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
The driver was decrementing the online_queues prior to attempting to
delete those IO queues, so the driver ended up not requesting the
controller delete any. This patch saves the online_queues prior to
suspending them, and adds that parameter for deleting io queues.Fixes: c21377f8 ("nvme: Suspend all queues before deletion")
Signed-off-by: Keith Busch
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
There is no reason the nvme controller can ever return all 1's from
reading the CSTS register. This patch returns an error if we observe
that status. Without this, we may incorrectly proceed with controller
initialization and unnecessarilly rely on error handling to clean this.Signed-off-by: Keith Busch
Signed-off-by: Jens Axboe -
Use the DMA_ATTR_NO_WARN attribute for the dma_map_sg() call of the nvme
driver that returns BLK_MQ_RQ_QUEUE_BUSY (not for BLK_MQ_RQ_QUEUE_ERROR).Link: http://lkml.kernel.org/r/1470092390-25451-4-git-send-email-mauricfo@linux.vnet.ibm.com
Signed-off-by: Mauricio Faria de Oliveira
Reviewed-by: Gabriel Krisman Bertazi
Cc: Keith Busch
Cc: Jens Axboe
Cc: Benjamin Herrenschmidt
Cc: Michael Ellerman
Cc: Krzysztof Kozlowski
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
10 Oct, 2016
2 commits
-
Pull blk-mq irq/cpu mapping updates from Jens Axboe:
"This is the block-irq topic branch for 4.9-rc. It's mostly from
Christoph, and it allows drivers to specify their own mappings, and
more importantly, to share the blk-mq mappings with the IRQ affinity
mappings. It's a good step towards making this work better out of the
box"* 'for-4.9/block-irq' of git://git.kernel.dk/linux-block:
blk_mq: linux/blk-mq.h does not include all the headers it depends on
blk-mq: kill unused blk_mq_create_mq_map()
blk-mq: get rid of the cpumask in struct blk_mq_tags
nvme: remove the post_scan callout
nvme: switch to use pci_alloc_irq_vectors
blk-mq: provide a default queue mapping for PCI device
blk-mq: allow the driver to pass in a queue mapping
blk-mq: remove ->map_queue
blk-mq: only allocate a single mq_map per tag_set
blk-mq: don't redistribute hardware queues on a CPU hotplug event -
Pull main rdma updates from Doug Ledford:
"This is the main pull request for the rdma stack this release. The
code has been through 0day and I had it tagged for linux-next testing
for a couple days.Summary:
- updates to mlx5
- updates to mlx4 (two conflicts, both minor and easily resolved)
- updates to iw_cxgb4 (one conflict, not so obvious to resolve,
proper resolution is to keep the code in cxgb4_main.c as it is in
Linus' tree as attach_uld was refactored and moved into
cxgb4_uld.c)- improvements to uAPI (moved vendor specific API elements to uAPI
area)- add hns-roce driver and hns and hns-roce ACPI reset support
- conversion of all rdma code away from deprecated
create_singlethread_workqueue- security improvement: remove unsafe ib_get_dma_mr (breaks lustre in
staging)"* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (75 commits)
staging/lustre: Disable InfiniBand support
iw_cxgb4: add fast-path for small REG_MR operations
cxgb4: advertise support for FR_NSMR_TPTE_WR
IB/core: correctly handle rdma_rw_init_mrs() failure
IB/srp: Fix infinite loop when FMR sg[0].offset != 0
IB/srp: Remove an unused argument
IB/core: Improve ib_map_mr_sg() documentation
IB/mlx4: Fix possible vl/sl field mismatch in LRH header in QP1 packets
IB/mthca: Move user vendor structures
IB/nes: Move user vendor structures
IB/ocrdma: Move user vendor structures
IB/mlx4: Move user vendor structures
IB/cxgb4: Move user vendor structures
IB/cxgb3: Move user vendor structures
IB/mlx5: Move and decouple user vendor structures
IB/{core,hw}: Add constant for node_desc
ipoib: Make ipoib_warn ratelimited
IB/mlx4/alias_GUID: Remove deprecated create_singlethread_workqueue
IB/ipoib_verbs: Remove deprecated create_singlethread_workqueue
IB/ipoib: Remove deprecated create_singlethread_workqueue
...
08 Oct, 2016
1 commit
-
Pull block layer updates from Jens Axboe:
"This is the main pull request for block layer changes in 4.9.As mentioned at the last merge window, I've changed things up and now
do just one branch for core block layer changes, and driver changes.
This avoids dependencies between the two branches. Outside of this
main pull request, there are two topical branches coming as well.This pull request contains:
- A set of fixes, and a conversion to blk-mq, of nbd. From Josef.
- Set of fixes and updates for lightnvm from Matias, Simon, and Arnd.
Followup dependency fix from Geert.- General fixes from Bart, Baoyou, Guoqing, and Linus W.
- CFQ async write starvation fix from Glauber.
- Add supprot for delayed kick of the requeue list, from Mike.
- Pull out the scalable bitmap code from blk-mq-tag.c and make it
generally available under the name of sbitmap. Only blk-mq-tag uses
it for now, but the blk-mq scheduling bits will use it as well.
From Omar.- bdev thaw error progagation from Pierre.
- Improve the blk polling statistics, and allow the user to clear
them. From Stephen.- Set of minor cleanups from Christoph in block/blk-mq.
- Set of cleanups and optimizations from me for block/blk-mq.
- Various nvme/nvmet/nvmeof fixes from the various folks"
* 'for-4.9/block' of git://git.kernel.dk/linux-block: (54 commits)
fs/block_dev.c: return the right error in thaw_bdev()
nvme: Pass pointers, not dma addresses, to nvme_get/set_features()
nvme/scsi: Remove power management support
nvmet: Make dsm number of ranges zero based
nvmet: Use direct IO for writes
admin-cmd: Added smart-log command support.
nvme-fabrics: Add host_traddr options field to host infrastructure
nvme-fabrics: revise host transport option descriptions
nvme-fabrics: rework nvmf_get_address() for variable options
nbd: use BLK_MQ_F_BLOCKING
blkcg: Annotate blkg_hint correctly
cfq: fix starvation of asynchronous writes
blk-mq: add flag for drivers wanting blocking ->queue_rq()
blk-mq: remove non-blocking pass in blk_mq_map_request
blk-mq: get rid of manual run of queue with __blk_mq_run_hw_queue()
block: export bio_free_pages to other modules
lightnvm: propagate device_add() error code
lightnvm: expose device geometry through sysfs
lightnvm: control life of nvm_dev in driver
blk-mq: register device instead of disk
...
25 Sep, 2016
2 commits
-
Any user I can imagine that needs a buffer at all will want to pass
a pointer directly. There are no currently callers that use
buffers, so this change is painless, and it will make it much easier
to start using features that use buffers (e.g. APST).Signed-off-by: Andy Lutomirski
Reviewed-by: Christoph Hellwig
Acked-by: Jay Freyensee
Tested-by: Jay Freyensee
Signed-off-by: Jens Axboe -
As far as I can tell, there is basically nothing correct about this
code. It misinterprets npss (off-by-one). It hardcodes a bunch of
power states, which is nonsense, because they're all just indices
into a table that software needs to parse. It completely ignores
the distinction between operational and non-operational states.
And, until 4.8, if all of the above magically succeeded, it would
dereference a NULL pointer and OOPS.Since this code appears to be useless, just delete it.
Signed-off-by: Andy Lutomirski
Reviewed-by: Christoph Hellwig
Acked-by: Jay Freyensee
Tested-by: Jay Freyensee
Signed-off-by: Jens Axboe
24 Sep, 2016
8 commits
-
This caused the nvmet request data length to be
incorrect.Signed-off-by: Alexander Solganik
Signed-off-by: Sagi Grimberg
Reviewed-by: Christoph Hellwig -
We're designed to work with high-end devices where
direct IO makes perfect sense. We noticed that we
context switch by scheduling kblockd instead of going
directly to the device without REQ_SYNC for writes.Signed-off-by: Sagi Grimberg
Reviewed-by: Jens Axboe -
This patch implements the support for smart-log command
(NVM Express 1.2.1-section 5.10.1.2 SMART / Health Information
(Log Identifier 02h)) on the target for NVMe over Fabric.In current implementation host can retrieve following statistics:-
1. Data Units Read.
2. Data Units Written.
3. Host Read Commands.
4. Host Write Commands.Signed-off-by: Chaitanya Kulkarni
Signed-off-by: Sagi Grimberg -
Add the host_traddr field to allow specification of the host-port
connection info for the transport. Will be used by FC transport.Signed-off-by: James Smart
Acked-by: Johannes Thumshirn
Reviewed-by: Christoph Hellwig
Signed-off-by: Sagi Grimberg -
Revise some of the comments so not so ethernet-network centric
Signed-off-by: James Smart
Acked-by: Johannes Thumshirn
Reviewed-by: Christoph Hellwig
Signed-off-by: Sagi Grimberg -
Revise nvmf_get_address() string to account for not all options being
present.Signed-off-by: James Smart
Acked-by: Johannes Thumshirn
Reviewed-by: Christoph Hellwig
Signed-off-by: Sagi Grimberg -
Signed-off-by: Christoph Hellwig
Reviewed-by: Sagi Grimberg
Reviewed-by: Jason Gunthorpe
Reviewed-by: Steve Wise
Signed-off-by: Doug Ledford -
Instead of exposing ib_get_dma_mr to ULPs and letting them use it more or
less unchecked, this moves the capability of creating a global rkey into
the RDMA core, where it can be easily audited. It also prints a warning
everytime this feature is used as well.Signed-off-by: Christoph Hellwig
Reviewed-by: Sagi Grimberg
Reviewed-by: Jason Gunthorpe
Reviewed-by: Steve Wise
Signed-off-by: Doug Ledford
23 Sep, 2016
1 commit
-
Otherwise, nvme_rdma_stop_and_clear_queue() will incorrectly
try to stop/free rdma qps/cm_ids that are already freed.Fixes: e89ca58f9c90 ("nvme-rdma: add DELETING queue flag")
Reported-by: Steve Wise
Tested-by: Steve Wise
Signed-off-by: Sagi Grimberg
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe
21 Sep, 2016
3 commits
-
For a host to access an Open-Channel SSD, it has to know its geometry,
so that it writes and reads at the appropriate device bounds.Currently, the geometry information is kept within the kernel, and not
exported to user-space for consumption. This patch exposes the
configuration through sysfs and enables user-space libraries, such as
liblightnvm, to use the sysfs implementation to get the geometry of an
Open-Channel SSD.The sysfs entries are stored within the device hierarchy, and can be
found using the "lightnvm" device type.An example configuration looks like this:
/sys/class/nvme/
└── nvme0n1
├── capabilities: 3
├── device_mode: 1
├── erase_max: 1000000
├── erase_typ: 1000000
├── flash_media_type: 0
├── media_capabilities: 0x00000001
├── media_type: 0
├── multiplane: 0x00010101
├── num_blocks: 1022
├── num_channels: 1
├── num_luns: 4
├── num_pages: 64
├── num_planes: 1
├── page_size: 4096
├── prog_max: 100000
├── prog_typ: 100000
├── read_max: 10000
├── read_typ: 10000
├── sector_oob_size: 0
├── sector_size: 4096
├── media_manager: gennvm
├── ppa_format: 0x380830082808001010102008
├── vendor_opcode: 0
├── max_phys_secs: 64
└── version: 1Signed-off-by: Simon A. F. Lund
Signed-off-by: Matias Bjørling
Signed-off-by: Jens Axboe -
LightNVM compatible device drivers does not have a method to expose
LightNVM specific sysfs entries.To enable LightNVM sysfs entries to be exposed, lightnvm device
drivers require a struct device to attach it to. To allow both the
actual device driver and lightnvm sysfs entries to coexist, the device
driver tracks the lifetime of the nvm_dev structure.This patch refactors NVMe and null_blk to handle the lifetime of struct
nvm_dev, which eliminates the need for struct gendisk when a lightnvm
compatible device is provided.Signed-off-by: Matias Bjørling
Signed-off-by: Jens Axboe -
With LightNVM enabled namespaces, the gendisk structure is not exposed
to the user. This prevents LightNVM users from accessing the NVMe device
driver specific sysfs entries, and LightNVM namespace geometry.Refactor the revalidation process, so that a namespace, instead of a
gendisk, is revalidated. This later allows patches to wire up the
sysfs entries up to a non-gendisk namespace.Signed-off-by: Matias Bjørling
Signed-off-by: Jens Axboe
16 Sep, 2016
1 commit
-
Pull block fixes from Jens Axboe:
"A set of fixes for the current series in the realm of block.Like the previous pull request, the meat of it are fixes for the nvme
fabrics/target code. Outside of that, just one fix from Gabriel for
not doing a queue suspend if we didn't get the admin queue setup in
the first place"* 'for-linus' of git://git.kernel.dk/linux-block:
nvme-rdma: add back dependency on CONFIG_BLOCK
nvme-rdma: fix null pointer dereference on req->mr
nvme-rdma: use ib_client API to detect device removal
nvme-rdma: add DELETING queue flag
nvme/quirk: Add a delay before checking device ready for memblaze device
nvme: Don't suspend admin queue that wasn't created
nvme-rdma: destroy nvme queue rdma resources on connect failure
nvme_rdma: keep a ref on the ctrl during delete/flush
iw_cxgb4: block module unload until all ep resources are released
iw_cxgb4: call dev_put() on l2t allocation failure
15 Sep, 2016
2 commits
-
No need now that we don't have to reverse engineer the irq affinity.
Signed-off-by: Christoph Hellwig
Reviewed-by: Keith Busch
Signed-off-by: Jens Axboe -
Use the new helper to automatically select the right interrupt type, as
well as to use the automatic interupt affinity assignment.Signed-off-by: Christoph Hellwig
Reviewed-by: Keith Busch
Signed-off-by: Jens Axboe