Eric Lee / smarc-fsl-linux-kernel

21 Mar, 2020

1 commit

98fd5c723 nvmet-tcp: set MSG_MORE only if we actually have more to send

When we send PDU data, we want to optimize the tcp stack
operation if we have more data to send. So we set MSG_MORE
when:
- We have more fragments coming in the batch, or
- We have more data to send in this PDU
- We don't have a data digest trailer
- We optimize with the SUCCESS flag and omit the NVMe completion
(used if sq_head pointer update is disabled)

This addresses a regression in QD=1 with SUCCESS flag optimization
as we unconditionally set MSG_MORE when we didn't actually have
more data to send.
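
The decision described above can be sketched as a small userspace predicate. This is a minimal model, not the nvmet-tcp code: the struct, its field names and want_msg_more() are hypothetical, and it assumes that a data digest trailer or an NVMe completion still to be sent counts as "more to send".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one send step; not the nvmet-tcp structures. */
struct send_step {
	bool last_in_batch;      /* no further fragments queued in this batch */
	size_t pdu_bytes_left;   /* PDU data bytes still unsent after this send */
	bool digest_follows;     /* assumption: a data digest trailer is sent next */
	bool completion_follows; /* assumption: an NVMe completion is sent next */
};

/* Return true only when something else will be pushed to the socket next. */
static bool want_msg_more(const struct send_step *s)
{
	assert(s != NULL);
	return !s->last_in_batch || s->pdu_bytes_left > 0 ||
	       s->digest_follows || s->completion_follows;
}
```

In the QD=1 case with the SUCCESS flag optimization, every field is at its "nothing follows" value, so the predicate returns false and the data is not held back waiting for more.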

Fixes: 70583295388a ("nvmet-tcp: implement C2HData SUCCESS optimization")
Reported-by: Mark Wunderlich
Tested-by: Mark Wunderlich
Signed-off-by: Sagi Grimberg
Signed-off-by: Keith Busch

Sagi Grimberg
2020-03-21 03:37:53 +0800

11 Mar, 2020

1 commit

9134ae2a2 nvme-rdma: Avoid double freeing of async event data

The timeout of the identify cmd, which is invoked as part of admin queue
creation, can result in freeing of async event data both in the
nvme_rdma_timeout handler and in the error handling path of
nvme_rdma_configure_admin_queue, thus causing a NULL pointer dereference.
Call Trace:
? nvme_rdma_setup_ctrl+0x223/0x800 [nvme_rdma]
nvme_rdma_create_ctrl+0x2ba/0x3f7 [nvme_rdma]
nvmf_dev_write+0xa54/0xcc6 [nvme_fabrics]
__vfs_write+0x1b/0x40
vfs_write+0xb2/0x1b0
ksys_write+0x61/0xd0
__x64_sys_write+0x1a/0x20
do_syscall_64+0x60/0x1e0
entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reviewed-by: Roland Dreier
Reviewed-by: Max Gurtovoy
Reviewed-by: Christoph Hellwig
Signed-off-by: Prabhath Sajeepa
Signed-off-by: Keith Busch

Prabhath Sajeepa
2020-03-11 04:57:39 +0800

28 Feb, 2020

1 commit

9515743bf nvme-pci: Hold cq_poll_lock while completing CQEs

Completions need to be consumed in the same order the controller submitted
them, otherwise future completion entries may overwrite ones we haven't
handled yet. Hold the nvme queue's poll lock while completing new CQEs to
prevent another thread from freeing command tags for reuse out-of-order.

Fixes: dabcefab45d3 ("nvme: provide optimized poll function for separate poll queues")
Signed-off-by: Bijan Mottahedeh
Reviewed-by: Sagi Grimberg
Reviewed-by: Jens Axboe
Signed-off-by: Keith Busch

Bijan Mottahedeh
2020-02-28 00:32:14 +0800

23 Feb, 2020

1 commit

f6c69b7f5 Merge tag 'block-5.6-2020-02-22' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
"Just a set of NVMe fixes via Keith"

* tag 'block-5.6-2020-02-22' of git://git.kernel.dk/linux-block:
nvme-multipath: Fix memory leak with ana_log_buf
nvme: Fix uninitialized-variable warning
nvme-pci: Use single IRQ vector for old Apple models
nvme/pci: Add sleep quirk for Samsung and Toshiba drives

Linus Torvalds
2020-02-23 03:09:06 +0800

21 Feb, 2020

1 commit

3b7830904 nvme-multipath: Fix memory leak with ana_log_buf

kmemleak reports a memory leak with the ana_log_buf allocated by
nvme_mpath_init():

unreferenced object 0xffff888120e94000 (size 8208):
comm "nvme", pid 6884, jiffies 4295020435 (age 78786.312s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................
01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[] kmalloc_order+0x97/0xc0
[] kmalloc_order_trace+0x24/0x100
[] __kmalloc+0x24c/0x2d0
[] nvme_mpath_init+0x23c/0x2b0
[] nvme_init_identify+0x75f/0x1600
[] nvme_loop_configure_admin_queue+0x26d/0x280
[] nvme_loop_create_ctrl+0x2a7/0x710
[] nvmf_dev_write+0xc66/0x10b9
[] __vfs_write+0x50/0xa0
[] vfs_write+0xf3/0x280
[] ksys_write+0xc6/0x160
[] __x64_sys_write+0x43/0x50
[] do_syscall_64+0x77/0x2f0
[] entry_SYSCALL_64_after_hwframe+0x49/0xbe

nvme_mpath_init() is called by nvme_init_identify() which is called in
multiple places (nvme_reset_work(), nvme_passthru_end(), etc). This
means nvme_mpath_init() may be called multiple times before
nvme_mpath_uninit() (which is only called on nvme_free_ctrl()).

When nvme_mpath_init() is called multiple times, it overwrites the
ana_log_buf pointer with a new allocation, thus leaking the previous
allocation.

To fix this, free ana_log_buf before allocating a new one.
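
The free-before-reallocate pattern can be sketched in userspace C as follows. The names mpath_ctrl_model, mpath_init_model() and mpath_uninit_model() are hypothetical, and malloc()/free() stand in for the kernel allocator.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the controller; only the buffer discussed above. */
struct mpath_ctrl_model {
	void *ana_log_buf;
	size_t ana_log_size;
};

/* May run several times (for example after every reset) before the uninit. */
static int mpath_init_model(struct mpath_ctrl_model *c, size_t size)
{
	assert(c != NULL && size > 0);
	free(c->ana_log_buf);          /* free(NULL) is a no-op on the first call */
	c->ana_log_buf = malloc(size);
	if (!c->ana_log_buf) {
		c->ana_log_size = 0;
		return -1;
	}
	c->ana_log_size = size;
	return 0;
}

/* Runs once, when the controller is freed. */
static void mpath_uninit_model(struct mpath_ctrl_model *c)
{
	free(c->ana_log_buf);
	c->ana_log_buf = NULL;
	c->ana_log_size = 0;
}
```

Without the free() at the top of the init function, the second call would overwrite the only pointer to the first allocation, which is the leak kmemleak reported.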

Fixes: 0d0b660f214dc490 ("nvme: add ANA support")
Cc:
Reviewed-by: Sagi Grimberg
Reviewed-by: Christoph Hellwig
Signed-off-by: Logan Gunthorpe
Signed-off-by: Keith Busch

Logan Gunthorpe
2020-02-21 22:52:25 +0800

20 Feb, 2020

1 commit

15755854d nvme: Fix uninitialized-variable warning

gcc may detect a false positive on nvme using an uninitialized variable
if setting features fails. Since this is not a fast path, explicitly
initialize this variable to suppress the warning.

Reported-by: Arnd Bergmann
Reviewed-by: Christoph Hellwig
Signed-off-by: Keith Busch

Keith Busch
2020-02-20 00:40:57 +0800

19 Feb, 2020

2 commits

98f7b86a0 nvme-pci: Use single IRQ vector for old Apple models

People reported that old Apple machines are not working properly
if the non-first IRQ vector is in use.

Set a quirk for those models to limit IRQ use to the first vector only.

Based on original patch by GitHub user npx001.

Link: https://github.com/Dunedan/mbp-2016-linux/issues/9
Cc: Benjamin Herrenschmidt
Cc: Leif Liddy
Signed-off-by: Andy Shevchenko
Signed-off-by: Keith Busch

Andy Shevchenko
2020-02-19 23:30:58 +0800

1fae37acc nvme/pci: Add sleep quirk for Samsung and Toshiba drives

The Samsung SSD SM981/PM981 and Toshiba SSD KBG40ZNT256G on the Lenovo
C640 platform experience runtime resume issues when the SSDs are kept in
sleep/suspend mode for a long time.

This patch applies the 'Simple Suspend' quirk to these configurations.
With this patch, the issue was not observed in a 1+ day test.

Reviewed-by: Jon Derrick
Reviewed-by: Christoph Hellwig
Signed-off-by: Shyjumon N
Signed-off-by: Keith Busch

Shyjumon N
2020-02-19 23:29:39 +0800

17 Feb, 2020

1 commit

e29c6a13d Merge tag 'block-5.6-2020-02-16' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
"Not a lot here, which is great, basically just three small bcache
fixes from Coly, and four NVMe fixes via Keith"

* tag 'block-5.6-2020-02-16' of git://git.kernel.dk/linux-block:
nvme: fix the parameter order for nvme_get_log in nvme_get_fw_slot_info
nvme/pci: move cqe check after device shutdown
nvme: prevent warning triggered by nvme_stop_keep_alive
nvme/tcp: fix bug on double requeue when send fails
bcache: remove macro nr_to_fifo_front()
bcache: Revert "bcache: shrink btree node cache after bch_btree_check()"
bcache: ignore pending signals when creating gc and allocator thread

Linus Torvalds
2020-02-17 04:35:52 +0800

15 Feb, 2020

4 commits

f25372ffc nvme: fix the parameter order for nvme_get_log in nvme_get_fw_slot_info

The nvme fw-activate operation produces the warning log below;
fix it by correcting the parameter order.

[ 113.231513] nvme nvme0: Get FW SLOT INFO log error

Fixes: 0e98719b0e4b ("nvme: simplify the API for getting log pages")
Reported-by: Sujith Pandel
Reviewed-by: David Milburn
Signed-off-by: Yi Zhang
Signed-off-by: Keith Busch
Signed-off-by: Jens Axboe

Yi Zhang
2020-02-15 01:12:04 +0800

fa46c6fb5 nvme/pci: move cqe check after device shutdown

Many users have reported nvme triggered irq_startup() warnings during
shutdown. The driver uses the nvme queue's irq to synchronize scanning
for completions, and enabling an interrupt affined to only offline CPUs
triggers the alarming warning.

Move the final CQE check to after disabling the device and all
registered interrupts have been torn down so that we do not have any
IRQ to synchronize.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=206509
Reviewed-by: Sagi Grimberg
Reviewed-by: Christoph Hellwig
Signed-off-by: Keith Busch
Signed-off-by: Jens Axboe

Keith Busch
2020-02-15 01:12:04 +0800

97b2512ad nvme: prevent warning triggered by nvme_stop_keep_alive

Delayed keep alive work is queued on system workqueue and may be cancelled
via nvme_stop_keep_alive from nvme_reset_wq, nvme_fc_wq or nvme_wq.

check_flush_dependency() detects mismatched attributes between the work-queue
context used to cancel the keep alive work and system-wq. Specifically
system-wq does not have the WQ_MEM_RECLAIM flag, whereas the contexts used
to cancel keep alive work have WQ_MEM_RECLAIM flag.

Example warning:

workqueue: WQ_MEM_RECLAIM nvme-reset-wq:nvme_fc_reset_ctrl_work [nvme_fc]
is flushing !WQ_MEM_RECLAIM events:nvme_keep_alive_work [nvme_core]

To avoid the flags mismatch, delayed keep alive work is queued on nvme_wq.

However this creates a secondary concern where work and a request to cancel
that work may be in the same work queue - namely err_work in the rdma and
tcp transports, which will want to flush/cancel the keep alive work which
will now be on nvme_wq.

After reviewing the transports, it looks like err_work can be moved to
nvme_reset_wq. In fact that aligns them better with transition into
RESETTING and performing related reset work in nvme_reset_wq.

Change nvme-rdma and nvme-tcp to perform err_work in nvme_reset_wq.

Signed-off-by: Nigel Kirkland
Signed-off-by: James Smart
Reviewed-by: Sagi Grimberg
Reviewed-by: Christoph Hellwig
Signed-off-by: Keith Busch
Signed-off-by: Jens Axboe

Nigel Kirkland
2020-02-15 01:12:04 +0800

2d570a7c0 nvme/tcp: fix bug on double requeue when send fails

When nvme_tcp_io_work() fails to send to socket due to
connection close/reset, error_recovery work is triggered
from nvme_tcp_state_change() socket callback.
This cancels all the active requests in the tagset,
which requeues them.

The failed request, however, was ended and thus requeued
individually as well unless send returned -EPIPE.
Another return code to be treated the same way is -ECONNRESET.

Double requeue caused BUG_ON(blk_queued_rq(rq))
in blk_mq_requeue_request() from either the individual requeue
of the failed request or the bulk requeue from
blk_mq_tagset_busy_iter(, nvme_cancel_request, );
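
The return-code handling described above can be sketched as a single predicate. This is a userspace illustration only; fail_request_individually() is a hypothetical helper, not a function from the nvme-tcp driver.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper: decide whether a failed send should end (and so
 * requeue) the request on its own. For -EPIPE and -ECONNRESET the socket
 * state change already triggers error recovery, which cancels and requeues
 * every active request, so ending it individually would requeue it twice.
 */
static bool fail_request_individually(int send_result)
{
	assert(send_result < 0);
	return send_result != -EPIPE && send_result != -ECONNRESET;
}
```

Any other negative result still ends the single request, since no bulk cancel is expected to pick it up.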

Signed-off-by: Anton Eidelman
Reviewed-by: Sagi Grimberg
Signed-off-by: Keith Busch
Signed-off-by: Jens Axboe

Anton Eidelman
2020-02-15 01:12:04 +0800

06 Feb, 2020

1 commit

ed535f2c9 Merge tag 'block-5.6-2020-02-05' of git://git.kernel.dk/linux-block

Pull more block updates from Jens Axboe:
"Some later arrivals, but all fixes at this point:

- bcache fix series (Coly)

- Series of BFQ fixes (Paolo)

- NVMe pull request from Keith with a few minor NVMe fixes

- Various little tweaks"

* tag 'block-5.6-2020-02-05' of git://git.kernel.dk/linux-block: (23 commits)
nvmet: update AEN list and array at one place
nvmet: Fix controller use after free
nvmet: Fix error print message at nvmet_install_queue function
brd: check and limit max_part par
nvme-pci: remove nvmeq->tags
nvmet: fix dsm failure when payload does not match sgl descriptor
nvmet: Pass lockdep expression to RCU lists
block, bfq: clarify the goal of bfq_split_bfqq()
block, bfq: get a ref to a group when adding it to a service tree
block, bfq: remove ifdefs from around gets/puts of bfq groups
block, bfq: extend incomplete name of field on_st
block, bfq: get extra ref to prevent a queue from being freed during a group move
block, bfq: do not insert oom queue into position tree
block, bfq: do not plug I/O for bfq_queues with no proc refs
bcache: check return value of prio_read()
bcache: fix incorrect data type usage in btree_flush_write()
bcache: add readahead cache policy options via sysfs interface
bcache: explicity type cast in bset_bkey_last()
bcache: fix memory corruption in bch_cache_accounting_clear()
xen/blkfront: limit allocated memory size to actual use case
...

Linus Torvalds
2020-02-06 14:15:23 +0800

05 Feb, 2020

3 commits

0f5be6a4f nvmet: update AEN list and array at one place

All async events are enqueued via nvmet_add_async_event() which
updates the ctrl->async_event_cmds[] array and additionally a struct
nvmet_async_event is added to the ctrl->async_events list.

Under normal operations the nvmet_async_event_work() updates again
the ctrl->async_event_cmds and removes the corresponding struct
nvmet_async_event from the list again. Though nvmet_sq_destroy() could
be called which calls nvmet_async_events_free() which only updates the
ctrl->async_event_cmds[] array.

Add new functions nvmet_async_events_process() and
nvmet_async_events_free() to process async events, update an array and
the list.

When we destroy the submission queue, after clearing the AENs present on
the ctrl->async_events list we also loop over ctrl->async_event_cmds[] for
any requests posted by the host for which we don't have an AEN in
the ctrl->async_events list, by calling nvmet_async_events_process()
and nvmet_async_events_free().

Reviewed-by: Christoph Hellwig
Signed-off-by: Daniel Wagner
[chaitanya.kulkarni@wdc.com
* Loop over and clear out outstanding requests
* Update changelog
]
Signed-off-by: Chaitanya Kulkarni
Signed-off-by: Keith Busch

Daniel Wagner
2020-02-05 00:56:10 +0800

1a3f540d6 nvmet: Fix controller use after free

After nvmet_install_queue() sets sq->ctrl, calling nvmet_sq_destroy()
reduces the controller refcount. If nvmet_install_queue() fails,
nvmet_ctrl_put() is called twice (in nvmet_sq_destroy() and in
nvmet_execute_io_connect()/nvmet_execute_admin_connect()) instead of once for
the queue, which leads to a use after free of the controller. Fix this by
setting sq->ctrl to NULL if nvmet_install_queue() fails.
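
The reference counting involved can be modelled in a few lines of userspace C. All names below (target_ctrl, target_sq and the *_model functions) are hypothetical, and a plain int stands in for the kernel refcount.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct target_ctrl { int refs; };
struct target_sq { struct target_ctrl *ctrl; };

static void ctrl_put_model(struct target_ctrl *c)
{
	assert(c->refs > 0);   /* an underflow here is the reported bug */
	c->refs--;
}

/* Install step: on failure the queue must not keep the controller pointer. */
static bool install_queue_model(struct target_sq *sq, struct target_ctrl *c,
				bool fail)
{
	sq->ctrl = c;
	if (fail) {
		sq->ctrl = NULL;   /* the fix: destroy will not drop a reference */
		return false;
	}
	return true;
}

/* Queue teardown drops the reference only if the queue still owns one. */
static void sq_destroy_model(struct target_sq *sq)
{
	if (sq->ctrl) {
		ctrl_put_model(sq->ctrl);
		sq->ctrl = NULL;
	}
}

/* Connect handler: takes one reference and drops it itself on failure. */
static bool connect_model(struct target_sq *sq, struct target_ctrl *c, bool fail)
{
	c->refs++;
	if (!install_queue_model(sq, c, fail)) {
		ctrl_put_model(c);
		return false;
	}
	return true;
}
```

With the pointer cleared on failure, the failed connect path drops exactly one reference; without it, sq_destroy_model() would drop a second one.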

The bug leads to the following Call Trace:

[65857.994862] refcount_t: underflow; use-after-free.
[65858.108304] Workqueue: events nvmet_rdma_release_queue_work [nvmet_rdma]
[65858.115557] RIP: 0010:refcount_warn_saturate+0xe5/0xf0
[65858.208141] Call Trace:
[65858.211203] nvmet_sq_destroy+0xe1/0xf0 [nvmet]
[65858.216383] nvmet_rdma_release_queue_work+0x37/0xf0 [nvmet_rdma]
[65858.223117] process_one_work+0x167/0x370
[65858.227776] worker_thread+0x49/0x3e0
[65858.232089] kthread+0xf5/0x130
[65858.235895] ? max_active_store+0x80/0x80
[65858.240504] ? kthread_bind+0x10/0x10
[65858.244832] ret_from_fork+0x1f/0x30
[65858.249074] ---[ end trace f82d59250b54beb7 ]---

Fixes: bb1cc74790eb ("nvmet: implement valid sqhd values in completions")
Fixes: 1672ddb8d691 ("nvmet: Add install_queue callout")
Signed-off-by: Israel Rukshin
Reviewed-by: Max Gurtovoy
Reviewed-by: Christoph Hellwig
Signed-off-by: Keith Busch

Israel Rukshin
2020-02-05 00:13:09 +0800

0b87a2b79 nvmet: Fix error print message at nvmet_install_queue function

Place the arguments in the correct order.

Fixes: 1672ddb8d691 ("nvmet: Add install_queue callout")
Signed-off-by: Israel Rukshin
Reviewed-by: Max Gurtovoy
Reviewed-by: Christoph Hellwig
Signed-off-by: Keith Busch

Israel Rukshin
2020-02-05 00:13:06 +0800

04 Feb, 2020

3 commits

cfa27356f nvme-pci: remove nvmeq->tags

There is no real need to have a pointer to the tagset in
struct nvme_queue, as we only need it in a single place, and that place
can derive the used tagset from the device and qid trivially. This
fixes a problem with stale pointer exposure when tagsets are reset,
and also shrinks the nvme_queue structure. It also matches what most
other transports have done since day 1.

Reported-by: Edmund Nadolski
Signed-off-by: Christoph Hellwig
Signed-off-by: Keith Busch

Christoph Hellwig
2020-02-04 02:00:25 +0800

b716e6889 nvmet: fix dsm failure when payload does not match sgl descriptor

The host is allowed to pass the controller an sgl describing a buffer
that is larger than the dsm payload itself; allow it when executing
dsm.

Reported-by: Dakshaja Uppalapati
Reviewed-by: Christoph Hellwig
Reviewed-by: Max Gurtovoy
Signed-off-by: Sagi Grimberg
Signed-off-by: Keith Busch

Sagi Grimberg
2020-02-04 02:00:24 +0800

4ac76436a nvmet: Pass lockdep expression to RCU lists

ctrl->subsys->namespaces and subsys->namespaces are traversed with
list_for_each_entry_rcu outside an RCU read-side critical section but
under the protection of ctrl->subsys->lock and subsys->lock respectively.

Hence, add the corresponding lockdep expression to the list traversal
primitive to silence false-positive lockdep warnings, and harden RCU
lists.

Reported-by: kbuild test robot
Reviewed-by: Joel Fernandes (Google)
Signed-off-by: Amol Grover
Signed-off-by: Keith Busch

Amol Grover
2020-02-04 02:00:24 +0800

01 Feb, 2020

1 commit

7724cd2bf nvme: hwmon: switch to use <linux/units.h> helpers

This switches the nvme driver to use kelvin_to_millicelsius() and
millicelsius_to_kelvin() in <linux/units.h>.
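
The arithmetic behind such helpers can be sketched in userspace C. The sketch_ functions and the constant below are illustrative stand-ins written for this note, not the kernel definitions; the rounding to the nearest kelvin is an assumption.

```c
#include <assert.h>

/* Assumed constant: absolute zero expressed in millidegrees Celsius. */
#define SKETCH_ABSOLUTE_ZERO_MILLICELSIUS (-273150L)

static long sketch_kelvin_to_millicelsius(long kelvin)
{
	return kelvin * 1000L + SKETCH_ABSOLUTE_ZERO_MILLICELSIUS;
}

/* Rounds to the nearest kelvin; valid at or above absolute zero. */
static long sketch_millicelsius_to_kelvin(long millicelsius)
{
	long millikelvin = millicelsius - SKETCH_ABSOLUTE_ZERO_MILLICELSIUS;

	assert(millikelvin >= 0);
	return (millikelvin + 500L) / 1000L;
}
```

For example, 300 K maps to 26850 millidegrees Celsius, and converting 26850 back gives 300 K.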

Link: http://lkml.kernel.org/r/1576386975-7941-8-git-send-email-akinobu.mita@gmail.com
Signed-off-by: Akinobu Mita
Reviewed-by: Christoph Hellwig
Reviewed-by: Guenter Roeck
Reviewed-by: Keith Busch
Reviewed-by: Andy Shevchenko
Cc: Sujith Thomas
Cc: Darren Hart
Cc: Andy Shevchenko
Cc: Zhang Rui
Cc: Daniel Lezcano
Cc: Amit Kucheria
Cc: Jean Delvare
Cc: Guenter Roeck
Cc: Keith Busch
Cc: Jens Axboe
Cc: Christoph Hellwig
Cc: Sagi Grimberg
Cc: Emmanuel Grumbach
Cc: Hartmut Knaack
Cc: Johannes Berg
Cc: Jonathan Cameron
Cc: Jonathan Cameron
Cc: Kalle Valo
Cc: Lars-Peter Clausen
Cc: Luca Coelho
Cc: Peter Meerwald-Stadler
Cc: Stanislaw Gruszka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Akinobu Mita
2020-02-01 02:30:40 +0800

28 Jan, 2020

1 commit

48b4b4ff1 Merge tag 'for-5.6/block-2020-01-27' of git://git.kernel.dk/linux-block

Pull core block updates from Jens Axboe:
"This may be the most quiet round we've had in years. I'm not
complaining. Really not a lot to detail here, outside of spelling and
documentation improvements/fixes, we have:

- Allow t10-pi to be modular (Herbert)

- Remove dead code in bfq (Alex)

- Mark zone management requests with REQ_SYNC (Chaitanya)

- BFQ division improvement (Wen)

- Small series improving plugging (Pavel)"

* tag 'for-5.6/block-2020-01-27' of git://git.kernel.dk/linux-block:
partitions/ldm: fix spelling mistake "to" -> "too"
block, bfq: improve arithmetic division in bfq_delta()
block/bfq: remove unused bfq_class_rt which never used
block: mark zone-mgmt bios with REQ_SYNC
blk-mq: Document functions for sending request
block: Allow t10-pi to be modular
blk-mq: optimise blk_mq_flush_plug_list()
list: introduce list_for_each_continue()
blk-mq: optimise rq sort function

Linus Torvalds
2020-01-28 04:38:25 +0800

10 Jan, 2020

2 commits

e17016f6d nvmet: fix per feat data len for get_feature

The existing implementation for the get_feature admin-cmd does not
use per-feature data len. This patch introduces a new helper function
nvmet_feat_data_len(), which is used to calculate per feature data len.
Right now we only set data len for fid 0x81 (NVME_FEAT_HOST_ID).
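
A per-feature length lookup of this kind could look roughly like the sketch below. It is a userspace illustration: sketch_feat_data_len() and both macros are hypothetical, and the 16-byte length assumes a 128-bit host identifier.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_FEAT_HOST_ID 0x81u   /* fid named in the text above */
#define SKETCH_HOST_ID_LEN  16u     /* assumption: 128-bit host identifier */

/* Return the number of data bytes a Get Features reply carries for fid. */
static size_t sketch_feat_data_len(uint8_t fid)
{
	switch (fid) {
	case SKETCH_FEAT_HOST_ID:
		return SKETCH_HOST_ID_LEN;
	default:
		return 0;   /* other features carry no data payload here */
	}
}
```

Keeping the lookup in one helper means a feature with a data payload only needs a new case label.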

Fixes: commit e9061c397839 ("nvmet: Remove the data_len field from the nvmet_req struct")

Reviewed-by: Christoph Hellwig
Signed-off-by: Amit Engel
[endianness, naming, and kernel style fixes]
Signed-off-by: Chaitanya Kulkarni
Signed-off-by: Keith Busch
Signed-off-by: Jens Axboe

Amit Engel
2020-01-10 23:55:50 +0800

35038bffa nvme: Translate more status codes to blk_status_t

Decode interrupted command and not ready namespace nvme status codes to
BLK_STS_TARGET. These are not generic IO errors and should use a non-path
specific error so that it can use the non-failover retry path.

Reported-by: John Meneghini
Cc: Hannes Reinecke
Reviewed-by: Christoph Hellwig
Signed-off-by: Keith Busch
Signed-off-by: Jens Axboe

Keith Busch
2020-01-10 23:55:50 +0800

07 Jan, 2020

1 commit

a754bd5f1 block: Allow t10-pi to be modular

Currently t10-pi can only be built into the block layer which via
crc-t10dif pulls in a whole chunk of the Crypto API. In fact all
users of t10-pi work as modules and there is no reason for it to
always be built-in.

This patch adds a new hidden option for t10-pi that is selected
automatically based on BLK_DEV_INTEGRITY and whether the users
of t10-pi are built-in or not.

Signed-off-by: Herbert Xu
Signed-off-by: Jens Axboe

Herbert Xu
2020-01-07 11:59:04 +0800

14 Dec, 2019

1 commit

f1fcd7786 Merge tag 'for-linus-20191212' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

- stable fix for the bi_size overflow. Not a corruption issue, but a
case where we could merge but disallowed it (Andreas)

- NVMe pull request via Keith, with various fixes.

- MD pull request from Song.

- Merge window regression fix for the rq passthrough stats (Logan)

- Remove unused blkcg_drain_queue() function (Guoqing)

* tag 'for-linus-20191212' of git://git.kernel.dk/linux-block:
blk-cgroup: remove blkcg_drain_queue
block: fix NULL pointer dereference in account statistics with IDE
md: make sure desc_nr less than MD_SB_DISKS
md: raid1: check rdev before reference in raid1_sync_request func
raid5: need to set STRIPE_HANDLE for batch head
block: fix "check bi_size overflow before merge"
nvme/pci: Fix read queue count
nvme/pci Limit write queue sizes to possible cpus
nvme/pci: Fix write and poll queue types
nvme/pci: Remove last_cq_head
nvme: Namepace identification descriptor list is optional
nvme-fc: fix double-free scenarios on hw queues
nvme: else following return is not needed
nvme: add error message on mismatching controller ids
nvme_fc: add module to ops template to allow module references
nvmet-loop: Avoid preallocating big SGL for data
nvme-fc: Avoid preallocating big SGL for data
nvme-rdma: Avoid preallocating big SGL for data

Linus Torvalds
2019-12-14 06:27:19 +0800

07 Dec, 2019

4 commits

dc3ecfc98 Merge branch 'nvme/for-5.5' of git://git.infradead.org/nvme into for-linus

Pull NVMe fixes from Keith

* 'nvme/for-5.5' of git://git.infradead.org/nvme:
nvme/pci: Fix read queue count
nvme/pci Limit write queue sizes to possible cpus
nvme/pci: Fix write and poll queue types
nvme/pci: Remove last_cq_head
nvme: Namepace identification descriptor list is optional
nvme-fc: fix double-free scenarios on hw queues
nvme: else following return is not needed
nvme: add error message on mismatching controller ids
nvme_fc: add module to ops template to allow module references
nvmet-loop: Avoid preallocating big SGL for data
nvme-fc: Avoid preallocating big SGL for data
nvme-rdma: Avoid preallocating big SGL for data

Jens Axboe
2019-12-07 08:27:56 +0800

7e4c6b9a5 nvme/pci: Fix read queue count

If nvme.write_queues equals the number of CPUs, the driver had decreased
the number of interrupts available such that there could only be one read
queue even if the controller could support more. Remove the interrupt
count reduction in this case. The driver wouldn't request more IRQs than
it wants queues anyway.

Reviewed-by: Jens Axboe
Signed-off-by: Keith Busch

Keith Busch
2019-12-07 01:52:47 +0800

17c331673 nvme/pci Limit write queue sizes to possible cpus

The driver can never use more queues of any type than the number of
possible CPUs, so a higher value causes the driver to allocate more
memory for IO queues than it could ever use. Limit the parameter at
module load time to the number of possible cpus.
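
The limit amounts to a simple clamp, sketched here in userspace C. clamp_queue_count() is a hypothetical helper; the driver itself applies the limit to a module parameter.

```c
#include <assert.h>

/* Cap a requested queue count at the number of possible CPUs. */
static unsigned int clamp_queue_count(unsigned int requested,
				      unsigned int possible_cpus)
{
	assert(possible_cpus > 0);
	return requested > possible_cpus ? possible_cpus : requested;
}
```

Using an unsigned type for the count also rules out negative values, so the clamp needs only an upper bound.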

Reviewed-by: Jens Axboe
Signed-off-by: Keith Busch

Keith Busch
2019-12-07 01:52:42 +0800

3f68baf70 nvme/pci: Fix write and poll queue types

The number of poll or write queues should never be negative. Use unsigned
types so that it's not possible to have the driver not allocate
any queues.

Reviewed-by: Jens Axboe
Signed-off-by: Keith Busch

Keith Busch
2019-12-07 01:52:24 +0800

04 Dec, 2019

1 commit

c3bed3b20 Merge tag 'pci-v5.5-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull PCI updates from Bjorn Helgaas:
"Enumeration:

- Warn if a host bridge has no NUMA info (Yunsheng Lin)

- Add PCI_STD_NUM_BARS for the number of standard BARs (Denis
Efremov)

Resource management:

- Fix boot-time Embedded Controller GPE storm caused by incorrect
resource assignment after ACPI Bus Check Notification (Mika
Westerberg)

- Protect pci_reassign_bridge_resources() against concurrent
addition/removal (Benjamin Herrenschmidt)

- Fix bridge dma_ranges resource list cleanup (Rob Herring)

- Add "pci=hpmmiosize" and "pci=hpmmioprefsize" parameters to control
the MMIO and prefetchable MMIO window sizes of hotplug bridges
independently (Nicholas Johnson)

- Fix MMIO/MMIO_PREF window assignment that assigned more space than
desired (Nicholas Johnson)

- Only enforce bus numbers from bridge EA if the bridge has EA
devices downstream (Subbaraya Sundeep)

- Consolidate DT "dma-ranges" parsing and convert all host drivers to
use shared parsing (Rob Herring)

Error reporting:

- Restore AER capability after resume (Mayurkumar Patel)

- Add PoisonTLPBlocked AER counter (Rajat Jain)

- Use for_each_set_bit() to simplify AER code (Andy Shevchenko)

- Fix AER kernel-doc (Andy Shevchenko)

- Add "pcie_ports=dpc-native" parameter to allow native use of DPC
even if platform didn't grant control over AER (Olof Johansson)

Hotplug:

- Avoid returning prematurely from sysfs requests to enable or
disable a PCIe hotplug slot (Lukas Wunner)

- Don't disable interrupts twice when suspending hotplug ports (Mika
Westerberg)

- Fix deadlocks when PCIe ports are hot-removed while suspended (Mika
Westerberg)

Power management:

- Remove unnecessary ASPM locking (Bjorn Helgaas)

- Add support for disabling L1 PM Substates (Heiner Kallweit)

- Allow re-enabling Clock PM after it has been disabled (Heiner
Kallweit)

- Add sysfs attributes for controlling ASPM link states (Heiner
Kallweit)

- Remove CONFIG_PCIEASPM_DEBUG, including "link_state" and "clk_ctl"
sysfs files (Heiner Kallweit)

- Avoid AMD FCH XHCI USB PME# from D0 defect that prevents wakeup on
USB 2.0 or 1.1 connect events (Kai-Heng Feng)

- Move power state check out of pci_msi_supported() (Bjorn Helgaas)

- Fix incorrect MSI-X masking on resume and revert related nvme quirk
for Kingston NVME SSD running FW E8FK11.T (Jian-Hong Pan)

- Always return devices to D0 when thawing to fix hibernation with
drivers like mlx4 that used legacy power management (previously we
only did it for drivers with new power management ops) (Dexuan Cui)

- Clear PCIe PME Status even for legacy power management (Bjorn
Helgaas)

- Fix PCI PM documentation errors (Bjorn Helgaas)

- Use dev_printk() for more power management messages (Bjorn Helgaas)

- Apply D2 delay as milliseconds, not microseconds (Bjorn Helgaas)

- Convert xen-platform from legacy to generic power management (Bjorn
Helgaas)

- Removed unused .resume_early() and .suspend_late() legacy power
management hooks (Bjorn Helgaas)

- Rearrange power management code for clarity (Rafael J. Wysocki)

- Decode power states more clearly ("4" or "D4" really refers to
"D3cold") (Bjorn Helgaas)

- Notice when reading PM Control register returns an error (~0)
instead of interpreting it as being in D3hot (Bjorn Helgaas)

- Add missing link delays required by the PCIe spec (Mika Westerberg)

Virtualization:

- Move pci_prg_resp_pasid_required() to CONFIG_PCI_PRI (Bjorn
Helgaas)

- Allow VFs to use PRI (the PF PRI is shared by the VFs, but the code
previously didn't recognize that) (Kuppuswamy Sathyanarayanan)

- Allow VFs to use PASID (the PF PASID capability is shared by the
VFs, but the code previously didn't recognize that) (Kuppuswamy
Sathyanarayanan)

- Disconnect PF and VF ATS enablement, since ATS in PFs and
associated VFs can be enabled independently (Kuppuswamy
Sathyanarayanan)

- Cache PRI and PASID capability offsets (Kuppuswamy Sathyanarayanan)

- Cache the PRI PRG Response PASID Required bit (Bjorn Helgaas)

- Consolidate ATS declarations in linux/pci-ats.h (Krzysztof
Wilczynski)

- Remove unused PRI and PASID stubs (Bjorn Helgaas)

- Removed unnecessary EXPORT_SYMBOL_GPL() from ATS, PRI, and PASID
interfaces that are only used by built-in IOMMU drivers (Bjorn
Helgaas)

- Hide PRI and PASID state restoration functions used only inside the
PCI core (Bjorn Helgaas)

- Add a DMA alias quirk for the Intel VCA NTB (Slawomir Pawlowski)

- Serialize sysfs sriov_numvfs reads vs writes (Pierre Crégut)

- Update Cavium ACS quirk for ThunderX2 and ThunderX3 (George
Cherian)

- Fix the UPDCR register address in the Intel ACS quirk (Steffen
Liebergeld)

- Unify ACS quirk implementations (Bjorn Helgaas)

Amlogic Meson host bridge driver:

- Fix meson PERST# GPIO polarity problem (Remi Pommarel)

- Add DT bindings for Amlogic Meson G12A (Neil Armstrong)

- Fix meson clock names to match DT bindings (Neil Armstrong)

- Add meson support for Amlogic G12A SoC with separate shared PHY
(Neil Armstrong)

- Add meson extended PCIe PHY functions for Amlogic G12A USB3+PCIe
combo PHY (Neil Armstrong)

- Add arm64 DT for Amlogic G12A PCIe controller node (Neil Armstrong)

- Add commented-out description of VIM3 USB3/PCIe mux in arm64 DT
(Neil Armstrong)

Broadcom iProc host bridge driver:

- Invalidate iProc PAXB address mapping before programming it
(Abhishek Shah)

- Fix iproc-msi and mvebu __iomem annotations (Ben Dooks)

Cadence host bridge driver:

- Refactor Cadence PCIe host controller to use as a library for both
host and endpoint (Tom Joseph)

Freescale Layerscape host bridge driver:

- Add layerscape LS1028a support (Xiaowei Bao)

Intel VMD host bridge driver:

- Add VMD bus 224-255 restriction decode (Jon Derrick)

- Add VMD 8086:9A0B device ID (Jon Derrick)

- Remove Keith from VMD maintainer list (Keith Busch)

Marvell ARMADA 3700 / Aardvark host bridge driver:

- Use LTSSM state to build link training flag since Aardvark doesn't
implement the Link Training bit (Remi Pommarel)

- Delay before training Aardvark link in case PERST# was asserted
before the driver probe (Remi Pommarel)

- Fix Aardvark issues with Root Control reads and writes (Remi
Pommarel)

- Don't rely on jiffies in Aardvark config access path since
interrupts may be disabled (Remi Pommarel)

- Fix Aardvark big-endian support (Grzegorz Jaszczyk)

Marvell ARMADA 370 / XP host bridge driver:

- Make mvebu_pci_bridge_emul_ops static (Ben Dooks)

Microsoft Hyper-V host bridge driver:

- Add hibernation support for Hyper-V virtual PCI devices (Dexuan
Cui)

- Track Hyper-V pci_protocol_version per-hbus, not globally (Dexuan
Cui)

- Avoid kmemleak false positive on hv hbus buffer (Dexuan Cui)

Mobiveil host bridge driver:

- Change mobiveil csr_read()/write() function names that conflict
with riscv arch functions (Kefeng Wang)

NVIDIA Tegra host bridge driver:

- Fix Tegra CLKREQ dependency programming (Vidya Sagar)

Renesas R-Car host bridge driver:

- Remove unnecessary header include from rcar (Andrew Murray)

- Tighten register index checking for rcar inbound range programming
(Marek Vasut)

- Fix rcar inbound range alignment calculation to improve packing of
multiple entries (Marek Vasut)

- Update rcar MACCTLR setting to match documentation (Yoshihiro
Shimoda)

- Clear bit 0 of MACCTLR before PCIETCTLR.CFINIT per manual
(Yoshihiro Shimoda)

- Add Marek Vasut and Yoshihiro Shimoda as R-Car maintainers (Simon
Horman)

Rockchip host bridge driver:

- Make rockchip 0V9 and 1V8 power regulators non-optional (Robin
Murphy)

Socionext UniPhier host bridge driver:

- Set uniphier to host (RC) mode always (Kunihiko Hayashi)

Endpoint drivers:

- Fix endpoint driver sign extension problem when shifting page
number to phys_addr_t (Alan Mikhak)

Misc:

- Add NumaChip SPDX header (Krzysztof Wilczynski)

- Replace EXTRA_CFLAGS with ccflags-y (Krzysztof Wilczynski)

- Remove unused includes (Krzysztof Wilczynski)

- Removed unused sysfs attribute groups (Ben Dooks)

- Remove PTM and ASPM dependencies on PCIEPORTBUS (Bjorn Helgaas)

- Add PCIe Link Control 2 register field definitions to replace magic
numbers in AMDGPU and Radeon CIK/SI (Bjorn Helgaas)

- Fix incorrect Link Control 2 Transmit Margin usage in AMDGPU and
Radeon CIK/SI PCIe Gen3 link training (Bjorn Helgaas)

- Use pcie_capability_read_word() instead of pci_read_config_word()
in AMDGPU and Radeon CIK/SI (Frederick Lawler)

- Remove unused pci_irq_get_node() (Greg Kroah-Hartman)

- Make asm/msi.h mandatory and simplify PCI_MSI_IRQ_DOMAIN Kconfig
(Palmer Dabbelt, Michal Simek)

- Read all 64 bits of Switchtec part_event_bitmap (Logan Gunthorpe)

- Fix erroneous intel-iommu dependency on CONFIG_AMD_IOMMU (Bjorn
Helgaas)

- Fix bridge emulation big-endian support (Grzegorz Jaszczyk)

- Fix dwc find_next_bit() usage (Niklas Cassel)

- Fix pcitest.c fd leak (Hewenliang)

- Fix typos and comments (Bjorn Helgaas)

- Fix Kconfig whitespace errors (Krzysztof Kozlowski)"

* tag 'pci-v5.5-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (160 commits)
PCI: Remove PCI_MSI_IRQ_DOMAIN architecture whitelist
asm-generic: Make msi.h a mandatory include/asm header
Revert "nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T"
PCI/MSI: Fix incorrect MSI-X masking on resume
PCI/MSI: Move power state check out of pci_msi_supported()
PCI/MSI: Remove unused pci_irq_get_node()
PCI: hv: Avoid a kmemleak false positive caused by the hbus buffer
PCI: hv: Change pci_protocol_version to per-hbus
PCI: hv: Add hibernation support
PCI: hv: Reorganize the code in preparation of hibernation
MAINTAINERS: Remove Keith from VMD maintainer
PCI/ASPM: Remove PCIEASPM_DEBUG Kconfig option and related code
PCI/ASPM: Add sysfs attributes for controlling ASPM link states
PCI: Fix indentation
drm/radeon: Prefer pcie_capability_read_word()
drm/radeon: Replace numbers with PCI_EXP_LNKCTL2 definitions
drm/radeon: Correct Transmit Margin masks
drm/amdgpu: Prefer pcie_capability_read_word()
PCI: uniphier: Set mode register to host mode
drm/amdgpu: Replace numbers with PCI_EXP_LNKCTL2 definitions
...

Linus Torvalds
2019-12-04 05:58:22 +0800

03 Dec, 2019

2 commits

f6c4d97b0 nvme/pci: Remove last_cq_head

We had been saving the last_cq_head seen from an interrupt so that a
polled queue wouldn't mistakenly trigger spurious interrupt detection. We
don't poll interrupt driven queues any more, so saving this value is
pointless.

Reviewed-by: Christoph Hellwig
Signed-off-by: Keith Busch

Keith Busch
2019-12-03 23:38:06 +0800
22802bf74 nvme: Namespace identification descriptor list is optional

Although NVM Express specification 1.3 requires a controller claiming to
be 1.3 or higher to implement Identify CNS 03h (Namespace Identification
Descriptor list), the driver doesn't really need this identification in
order to use a namespace. The code comments had already documented that
a failure of this command is not to be considered an error.

Return success if the controller provided any response to a
namespace identification descriptors command.
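
The rule above can be sketched in user space as follows. This is an
illustration, not the driver code: report_ns_ids() and the identify_*()
stand-ins are hypothetical names, and the status convention (negative for
a host/transport error, zero for success, positive for a status returned
by the controller) is an assumption made for this sketch.

```c
#include <stdio.h>

/*
 * Hypothetical stand-in for issuing Identify CNS 03h.
 * Assumed convention: < 0 is a host/transport error (no response),
 * 0 is success, > 0 is a status code returned by the controller.
 */
typedef int (*identify_fn)(void);

int identify_ok(void)          { return 0; }
int identify_unsupported(void) { return 2; }   /* made-up controller status */
int identify_no_response(void) { return -5; }  /* made-up host-side error */

/* Any controller response is non-fatal; only host errors propagate. */
int report_ns_ids(identify_fn identify)
{
    int status = identify();

    if (status > 0) {
        fprintf(stderr, "Identify Descriptors failed (%d), continuing\n",
                status);
        status = 0;
    }
    return status;
}
```

With this shape, a controller that rejects the command still yields a
usable namespace, while a command that never got a response is still
reported to the caller.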

Fixes: 538af88ea7d9de24 ("nvme: make nvme_report_ns_ids propagate error back")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=205679
Reported-by: Ingo Brunberg
Cc: Sagi Grimberg
Cc: stable@vger.kernel.org # 5.4+
Reviewed-by: Christoph Hellwig
Signed-off-by: Keith Busch

Keith Busch
2019-12-03 05:10:00 +0800

02 Dec, 2019

1 commit

0da522107 Merge tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground

Pull removal of most of fs/compat_ioctl.c from Arnd Bergmann:
"As part of the cleanup of some remaining y2038 issues, I came to
fs/compat_ioctl.c, which still has a couple of commands that need
support for time64_t.

In completely unrelated work, I spent time on cleaning up parts of
this file in the past, moving things out into drivers instead.

After Al Viro reviewed an earlier version of this series and did a lot
more of that cleanup, I decided to try to completely eliminate the
rest of it and move it all into drivers.

This series incorporates some of Al's work and many patches of my own,
but in the end stops short of actually removing the last part, which
is the scsi ioctl handlers. I have patches for those as well, but they
need more testing or possibly a rewrite"

* tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (42 commits)
scsi: sd: enable compat ioctls for sed-opal
pktcdvd: add compat_ioctl handler
compat_ioctl: move SG_GET_REQUEST_TABLE handling
compat_ioctl: ppp: move simple commands into ppp_generic.c
compat_ioctl: handle PPPIOCGIDLE for 64-bit time_t
compat_ioctl: move PPPIOCSCOMPRESS to ppp_generic
compat_ioctl: unify copy-in of ppp filters
tty: handle compat PPP ioctls
compat_ioctl: move SIOCOUTQ out of compat_ioctl.c
compat_ioctl: handle SIOCOUTQNSD
af_unix: add compat_ioctl support
compat_ioctl: reimplement SG_IO handling
compat_ioctl: move WDIOC handling into wdt drivers
fs: compat_ioctl: move FITRIM emulation into file systems
gfs2: add compat_ioctl support
compat_ioctl: remove unused convert_in_user macro
compat_ioctl: remove last RAID handling code
compat_ioctl: remove /dev/raw ioctl translation
compat_ioctl: remove PCI ioctl translation
compat_ioctl: remove joystick ioctl translation
...

Linus Torvalds
2019-12-02 05:46:15 +0800

27 Nov, 2019

6 commits

655e7aee1 Revert "nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T"

Since e045fa29e893 ("PCI/MSI: Fix incorrect MSI-X masking on resume") is
merged, we can revert the previous quirk now.

This reverts commit 19ea025e1d28c629b369c3532a85b3df478cc5c6.

Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=204887
Fixes: 19ea025e1d28 ("nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T")
Link: https://lore.kernel.org/r/20191031093408.9322-1-jian-hong@endlessm.com
Signed-off-by: Jian-Hong Pan
Signed-off-by: Bjorn Helgaas
Acked-by: Christoph Hellwig
Cc: stable@vger.kernel.org

Jian-Hong Pan
2019-11-27 03:13:14 +0800
c869e494e nvme-fc: fix double-free scenarios on hw queues

If an error occurs on one of the ios used for creating an
association, the creating routine has error paths that are
invoked by the command failure and the error paths will free
up the controller resources created to that point.

But... the io was ultimately determined by an asynchronous
completion routine that detected the error and which
unconditionally invokes the error_recovery path which calls
delete_association. Delete association deletes all outstanding
io then tears down the controller resources. So the
create_association thread can be running in parallel with
the error_recovery thread. What was seen was the LLDD received
a call to delete a queue, causing the LLDD to do a free of a
resource, then the transport called the delete queue again
causing the driver to repeat the free call. The second free
routine corrupted the allocator. The transport shouldn't be
making the duplicate call, and the delete queue is just one
of the resources being freed.

The fix relies on the fact that the create_association path is
completely serialized, with one command at a time. The failed io
completion will therefore always be seen by the create_association
path and, at the time of the failure, there are no ios to terminate
and there is no reason to be manipulating queue freeze states, etc.
The serialized condition stays true until the controller is
transitioned to the LIVE state. Thus the fix is to change the
error recovery path to check the controller state and only
invoke the teardown path if not already in the CONNECTING state.
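
A minimal sketch of that state check, assuming a simplified controller
model. The enum values, struct ctrl, and error_recovery() below are
hypothetical names for illustration and do not mirror the nvme-fc
structures.

```c
#include <stdbool.h>

/* Simplified, hypothetical controller states. */
enum ctrl_state { CTRL_NEW, CTRL_CONNECTING, CTRL_LIVE, CTRL_RESETTING };

struct ctrl {
    enum ctrl_state state;
    int teardown_calls;  /* how many times resources were torn down */
};

/*
 * Invoked from the asynchronous completion path when an io fails.
 * Returns true if this call performed the teardown.
 */
bool error_recovery(struct ctrl *c)
{
    /*
     * While CONNECTING, the serialized create path sees the failed io
     * itself and frees what it created, so a second teardown from here
     * would free the same resources twice.
     */
    if (c->state == CTRL_CONNECTING)
        return false;

    c->teardown_calls++;
    return true;
}
```

The design choice is to give each phase a single owner of cleanup: the
creating thread before the controller goes LIVE, the error recovery path
afterwards.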

Reviewed-by: Himanshu Madhani
Reviewed-by: Ewan D. Milne
Signed-off-by: James Smart
Signed-off-by: Keith Busch

James Smart
2019-11-27 02:00:13 +0800
c80b36cd9 nvme: else following return is not needed

Remove unnecessary keyword in nvme_create_queue().

Reviewed-by: Christoph Hellwig
Signed-off-by: Edmund Nadolski
Signed-off-by: Keith Busch

Edmund Nadolski
2019-11-27 01:48:33 +0800
a8157ff36 nvme: add error message on mismatching controller ids

We've seen a few devices that return different controller ids to
the Fabric Connect command vs the Identify(controller) command. It's
currently hard to identify this failure by existing error messages. It
comes across as a (re)connect attempt in the transport that fails with
a -22 (-EINVAL) status. The issue is compounded by older kernels that
either lacked the controller id check or had the identify command
overwrite the fabrics controller id value before it was checked. Both
resulted in cases where the devices appeared fine until more recent kernels.

Clarify the reject by adding an error message on controller id mismatches.
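
As a rough user-space illustration of such a check, assuming the two ids
are already available as plain integers: check_cntlid() and the message
wording are hypothetical and are not taken from the driver.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Compare the controller id returned by Fabric Connect with the one
 * reported by Identify(controller). On mismatch, say so explicitly
 * instead of only returning -EINVAL.
 */
int check_cntlid(uint16_t connect_cntlid, uint16_t identify_cntlid)
{
    if (connect_cntlid != identify_cntlid) {
        fprintf(stderr,
                "mismatching cntlid: Connect %u vs Identify %u, rejecting\n",
                (unsigned)connect_cntlid, (unsigned)identify_cntlid);
        return -EINVAL;
    }
    return 0;
}
```

The return value is unchanged from the silent case; only the log line is
new, which is what makes the failed (re)connect diagnosable.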

Reviewed-by: Christoph Hellwig
Reviewed-by: Hannes Reinecke
Reviewed-by: Ewan D. Milne
Signed-off-by: James Smart
Signed-off-by: Keith Busch

James Smart
2019-11-27 01:48:33 +0800
863fbae92 nvme_fc: add module to ops template to allow module references

In nvme-fc, it's possible to have connected active controllers
and as no references are taken on the LLDD, the LLDD can be
unloaded. The controller would enter a reconnect state and as
long as the LLDD resumed within the reconnect timeout, the
controller would resume. But if a namespace on the controller
is the root device, allowing the driver to unload can be problematic.
To reload the driver, it may require new io to the boot device,
and as it's no longer connected we get into a catch-22 that
eventually fails, and the system locks up.

Fix this issue by taking a module reference for every connected
controller (which is what the core layer did to the transport
module). Reference is cleared when the controller is removed.
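
The reference scheme can be modeled in a few lines of user-space C. This
is only a sketch of the idea: struct lldd_module, ctrl_connect(),
ctrl_remove(), and lldd_can_unload() are hypothetical names, and a plain
counter stands in for the kernel's module reference counting.

```c
#include <stdbool.h>

/* Hypothetical model of an LLDD module with a reference count. */
struct lldd_module {
    int refs;  /* one reference per connected controller */
};

/* Take a reference when a controller connects through the LLDD. */
void ctrl_connect(struct lldd_module *m)
{
    m->refs++;
}

/* Drop the reference when the controller is removed. */
void ctrl_remove(struct lldd_module *m)
{
    m->refs--;
}

/* Unloading is only permitted once no controller holds a reference. */
bool lldd_can_unload(const struct lldd_module *m)
{
    return m->refs == 0;
}
```

Tying the reference to controller lifetime, rather than to individual
ios, keeps the LLDD pinned even while a controller is idle, which is the
case that matters for a root device.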

Acked-by: Himanshu Madhani
Reviewed-by: Christoph Hellwig
Signed-off-by: James Smart
Signed-off-by: Keith Busch

James Smart
2019-11-27 01:48:27 +0800
52e6d8ed1 nvmet-loop: Avoid preallocating big SGL for data

nvme_loop_create_io_queues() preallocates a big buffer for the IO SGL based
on SG_CHUNK_SIZE.

Modern DMA engines are often capable of dealing with very big segments so
the SG_CHUNK_SIZE is often too big. SG_CHUNK_SIZE results in a static 4KB
SGL allocation per command.

If a controller has lots of deep queues, preallocation for the sg list can
consume substantial amounts of memory. For nvmet-loop, nr_hw_queues can be
128 and each queue's depth 128. This means the resulting preallocation
for the data SGL is 128*128*4K = 64MB per controller.

Switch to runtime allocation for SGL for lists longer than 2 entries. This
is the approach used by NVMe PCI so it should be reasonable for NVMeOF as
well. Runtime SGL allocation has always been the case for the legacy I/O
path so this is nothing new.
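
The arithmetic and the inline threshold above can be checked with a small
sketch. The function names and the INLINE_SG_ENTRIES constant are
hypothetical; the values 128, 128, 4K, and 2 come from the description
above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Entries kept inline per command; longer lists are allocated at runtime. */
#define INLINE_SG_ENTRIES 2

/* Total bytes preallocated when every command gets a static SGL. */
size_t sgl_prealloc_bytes(size_t nr_hw_queues, size_t queue_depth,
                          size_t sgl_bytes_per_cmd)
{
    return nr_hw_queues * queue_depth * sgl_bytes_per_cmd;
}

/* True if a request with nents entries needs a runtime SGL allocation. */
bool needs_runtime_sgl(size_t nents)
{
    return nents > INLINE_SG_ENTRIES;
}
```

For 128 queues of depth 128 with a 4KB SGL per command,
sgl_prealloc_bytes(128, 128, 4096) evaluates to 67108864 bytes, i.e. the
64MB per controller quoted above.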

Tested-by: Chaitanya Kulkarni
Reviewed-by: Christoph Hellwig
Reviewed-by: Chaitanya Kulkarni
Reviewed-by: Max Gurtovoy
Signed-off-by: Israel Rukshin
Signed-off-by: Keith Busch

Israel Rukshin
2019-11-27 01:14:19 +0800