09 Oct, 2018

23 commits

  • Introduce trace points for tracking chunk states in pblk - this is
    useful for inspecting the entire state of the drive, and really
    handy for both firmware and pblk debugging. A usage sketch follows
    this entry.

    Signed-off-by: Hans Holmberg
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
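
    A hedged sketch of how the new trace point might be invoked when a
    chunk changes state (every name below is a placeholder, not
    necessarily what the patch uses):

        /* dev, ppa and chunk are placeholders for illustration. */
        trace_pblk_chunk_state(dev->name, &ppa, chunk->state);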
     
  • Remove the debug-only iteration within __pblk_down_page, which
    allows us to reduce the arguments of the functions that call it
    down to pblk and the parallel unit, simplifying the callers' logic
    considerably.

    Also, rename the functions pblk_[down/up]_page to
    pblk_[down/up]_chunk, to communicate that they manage the write
    pointer of the chunk. Note that they also protect the parallel
    unit such that at most one chunk is active per parallel unit.

    Signed-off-by: Matias Bjørling
    Reviewed-by: Javier González
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • When the user data counter exceeds 32 bits, the write amplification
    calculation does not produce the right value. Fix this by using
    div64_u64 instead of div64. A sketch of the difference follows this
    entry.

    Fixes: 76758390f83e ("lightnvm: pblk: export write amplification counters to sysfs")
    Signed-off-by: Hans Holmberg
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
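
    A minimal sketch of the difference (illustrative, not the pblk
    source; the counters stand in for pblk's per-class sector counts):

        #include <linux/math64.h>

        static u64 calc_wa(u64 user, u64 gc, u64 pad)
        {
                /* div_u64() only takes a 32-bit divisor, so the old
                 * code truncated 'user' once it exceeded 32 bits;
                 * div64_u64() divides a u64 by a full u64 divisor. */
                return div64_u64(user + gc + pad, user);
        }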
     
  • The prefix when printing ppas in pblk_read_check_rand should be
    "rnd", not "seq", so fix this so we can differentiate between lba
    mismatches in random and sequential reads. Also change the print
    order so that we align with pblk_read_check_seq, printing the read
    lba first.

    Signed-off-by: Hans Holmberg
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
     
  • The parameters nr_ppas and ppa_list are not used, so remove them.

    Signed-off-by: Hans Holmberg
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
     
  • Line map bitmap allocations are fairly large and can fail. Allocation
    failures are fatal to pblk, stopping the write pipeline. To avoid this,
    allocate the bitmaps using a mempool instead.

    Mempool allocations never fail if called from a process context,
    and pblk *should* only allocate map bitmaps in process context,
    but keep the failure handling for robustness' sake. A sketch of
    the mempool pattern follows this entry.

    Signed-off-by: Hans Holmberg
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
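
    A hedged sketch of the mempool pattern (the pool name and the
    minimum of 16 reserved elements are made up):

        mempool_t *pool = mempool_create_kmalloc_pool(16, bitmap_len);

        /* From process context, a GFP_KERNEL mempool allocation
         * sleeps until an element is available instead of failing. */
        unsigned long *bitmap = mempool_alloc(pool, GFP_KERNEL);
        if (!bitmap)            /* kept anyway, for robustness */
                return -ENOMEM;

        /* ... use the bitmap ... */

        mempool_free(bitmap, pool);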
     
  • There are a number of places in the lightnvm subsystem where the
    user iterates over the ppa list. Before iterating, the user must
    know whether it is a single or multiple LBAs, because vector
    commands use either the nvm_rq ->ppa_addr or ->ppa_list field on
    command submission, which leads to open-coding the if/else
    statement.

    Instead of having multiple if/else's, move it into a function that
    can be called by its users; see the sketch after this entry.

    A nice side effect of this cleanup is that this patch fixes up a
    bunch of cases where we don't consider the single-ppa case in pblk.

    Signed-off-by: Hans Holmberg
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
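
    The helper's shape is roughly the following (a sketch; consult the
    patch for the exact name and placement):

        static inline struct ppa_addr *nvm_rq_to_ppa_list(struct nvm_rq *rqd)
        {
                /* Vector commands carry a list; single-sector
                 * commands embed the address directly. */
                return (rqd->nr_ppas > 1) ? rqd->ppa_list : &rqd->ppa_addr;
        }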
     
  • If a line is recovered from open chunks, the memory structures for
    emeta have not necessarily been properly set on line initialization.
    When closing a line, make sure that emeta is consistent so that the
    line can be recovered on the fast path on the next reboot.

    Also, remove a couple of empty lines at the end of the function.

    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     
  • Remove an unused struct ppa_addr variable.

    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     
  • Fix a comment typo: Decrese -> Decrease.

    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     
  • The current helper to obtain a line from a ppa returns the line id,
    which requires its users to explicitly retrieve the pointer to the
    line from the id.

    Make two different helpers: one returning the line id and one
    returning the line directly; a sketch follows this entry.

    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
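
    A sketch of the pair (bodies and field names illustrative):

        static inline int pblk_ppa_to_line_id(struct ppa_addr p)
        {
                return p.a.blk;
        }

        static inline struct pblk_line *pblk_ppa_to_line(struct pblk *pblk,
                                                         struct ppa_addr p)
        {
                return &pblk->lines[pblk_ppa_to_line_id(p)];
        }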
     
  • Implement helpers to go from ppas to a chunk within a line and an
    address within a chunk.

    These helpers will be used in the patches adding trace support to
    pblk, which will be sent in this window.

    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     
  • The read completion path uses the put_line variable to decide whether
    the reference on a line should be released. The function name used for
    that is pblk_read_put_rqd_kref, which could lead one to believe that it
    is the rqd that is releasing the reference, while it is the line
    reference that is put.

    Rename it, and also split the function in two to account for either
    rqd or single-ppa callers, and move it to core such that it can
    later be used in the write path as well.

    Signed-off-by: Matias Bjørling
    Reviewed-by: Javier González
    Reviewed-by: Heiner Litz
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • The I/O size and capacity checks are already done by the block layer.

    Signed-off-by: Matias Bjørling
    Reviewed-by: Javier González
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • The calculation of pblk->min_write_pgs should only use the optimal
    write size attribute provided by the drive; it does not correlate
    with the memory page size of the system, which can be smaller or
    larger than the LBA size reported.

    Signed-off-by: Matias Bjørling
    Reviewed-by: Javier González
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • Both NVM_MAX_VLBA and PBLK_MAX_REQ_ADDRS define how many LBAs are
    available in a vector command. pblk uses them interchangeably in
    its implementation. Use NVM_MAX_VLBA as the main one and remove
    the usages of PBLK_MAX_REQ_ADDRS.

    Also remove the power representation that only has one user, and
    instead calculate it at runtime; see the sketch after this entry.

    Signed-off-by: Matias Bjørling
    Reviewed-by: Javier González
    Signed-off-by: Jens Axboe

    Matias Bjørling
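
    Where a power-of-two exponent is still needed, it can be derived at
    runtime rather than kept as a parallel define, e.g. (illustrative):

        /* NVM_MAX_VLBA is a power of two, so this equals ilog2(). */
        int max_vlba_order = get_count_order(NVM_MAX_VLBA);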
     
  • pblk implements two data paths for recovering line state: one for
    1.2 and another for 2.0. Instead of having pblk implement these,
    combine them in the core to reduce complexity and make them
    available to other targets.

    The new interface will adhere to the 2.0 chunk definition,
    including managing open chunks with an active write pointer. To provide
    this interface, a 1.2 device recovers the state of the chunks by
    manually detecting if a chunk is either free/open/close/offline, and if
    open, scanning the flash pages sequentially to find the next writeable
    page. This process takes on average ~10 seconds on a device with 64 dies,
    1024 blocks and 60us read access time. The process can be parallelized
    but is left out for maintenance simplicity, as the 1.2 specification is
    deprecated. For 2.0 devices, the logic is maintained internally in the
    drive and retrieved through the 2.0 interface.

    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • In pblk, when a new line is allocated, metadata for the previously
    written line is scheduled. This is done through a fixed memory
    region that is shared over time and across contexts by different
    lines and is therefore protected by a lock. Unfortunately, this
    lock does not properly cover all the metadata used for sharing this
    memory region, resulting in a race condition.

    This patch fixes this race condition by protecting this metadata
    properly.

    Fixes: dd2a43437337 ("lightnvm: pblk: sched. metadata on write thread")
    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     
  • A 1.2 device is able to manage the logical-to-physical mapping
    table internally or leave it to the host.

    A target only supports one of those approaches, and therefore must
    check on initialization. Move this check to core to avoid having
    each target implement it.

    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • rqd.error is masked by the return value of pblk_submit_io_sync.
    The rqd structure is then passed on to the end_io function, which
    assumes that any error should lead to a chunk being marked
    offline/bad. Since pblk_submit_io_sync can fail before the
    command is issued to the device, the error value may not correspond
    to a media failure, leading to chunks being prematurely retired.

    Also, the pblk_blk_erase_sync function prints an error message in case
    the erase fails. Since the caller prints an error message by itself,
    remove the error message in this function.

    Signed-off-by: Matias Bjørling
    Reviewed-by: Javier González
    Reviewed-by: Hans Holmberg
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • Add an nvm_set_flags helper to enable the core to appropriately
    set the command flags for read/write/erase depending on which
    version a drive supports.

    The flags argument can be distilled into the access hint,
    scrambling, and program/erase suspend. Replace the access hint with
    an "is_seq" parameter. The rest of the flags are dependent on the
    command opcode, which is trivial to detect and set; a sketch
    follows this entry.

    Signed-off-by: Matias Bjørling
    Reviewed-by: Javier González
    Signed-off-by: Jens Axboe

    Matias Bjørling
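
    A hedged sketch of the helper's logic (flag and field names
    approximate the lightnvm 2.0 definitions; treat the details as
    illustrative):

        static u32 nvm_set_flags(struct nvm_geo *geo, struct nvm_rq *rqd)
        {
                u32 flags = 0;

                if (rqd->is_seq)        /* replaces the access hint */
                        flags |= geo->pln_mode >> 1;

                if (rqd->opcode == NVM_OP_PREAD)
                        flags |= (NVM_IO_SCRAMBLE_ENABLE | NVM_IO_SUSPEND);
                else if (rqd->opcode == NVM_OP_PWRITE)
                        flags |= NVM_IO_SCRAMBLE_ENABLE;

                return flags;
        }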
     
  • No need to force the NVMe device driver to be compiled in if the
    lightnvm subsystem is selected. There is also no need for PCI to be
    selected, as it would be selected by the device driver that hooks
    into the subsystem.

    Signed-off-by: Matias Bjørling
    Reviewed-by: Javier González
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • Lots of controllers may have only one irq vector for completing IO
    requests. Usually the affinity of that one irq vector spans all
    possible CPUs; however, on most architectures there may be only one
    specific CPU handling this interrupt.

    So if all IOs are completed in hardirq context, it is inevitable
    that IO performance degrades because of increased irq latency.

    This patch tries to address this issue by allowing requests to be
    completed in softirq context, like the legacy IO path.

    IOPS improves by ~13% in the following randread test on raid0 over
    virtio-scsi.

    mdadm --create --verbose /dev/md0 --level=0 --chunk=1024 \
        --raid-devices=8 /dev/sdb /dev/sdc /dev/sdd /dev/sde \
        /dev/sdf /dev/sdg /dev/sdh /dev/sdi

    fio --time_based --name=benchmark --runtime=30 --filename=/dev/md0 \
        --nrfiles=1 --ioengine=libaio --iodepth=32 --direct=1 \
        --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=32 \
        --rw=randread --blocksize=4k

    Cc: Dongli Zhang
    Cc: Zach Marano
    Cc: Christoph Hellwig
    Cc: Bart Van Assche
    Cc: Jianchao Wang
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

08 Oct, 2018

15 commits

  • When the nbuckets of a cache device is smaller than 1024, making
    the cache device triggers a BUG_ON in the kernel. Add a condition
    to avoid this; a sketch of the guard follows this entry.

    Reported-by: nitroxis
    Signed-off-by: Dongbo Cao
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Dongbo Cao
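
    The added guard presumably looks something like this (the 1024
    threshold is from the description above; the error string is
    hypothetical):

        err = "Not enough buckets";
        if (sb->nbuckets < 1024)
                goto err;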
     
  • Split the combined '||' statements in the if() check, to make the
    code easier to debug; see the illustration after this entry.

    Signed-off-by: Dongbo Cao
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Dongbo Cao
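
    For example (conditions illustrative), the transformation turns

        if (cond_a || cond_b || cond_c)
                goto err;       /* cannot tell which one fired */

    into

        if (cond_a)
                goto err;
        if (cond_b)
                goto err;
        if (cond_c)
                goto err;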
     
  • A cache_set holds at most MAX_CACHES_PER_SET caches, and the macro
    is used in the definition of struct cache_set:

        struct cache *cache_by_alloc[MAX_CACHES_PER_SET];

    Use MAX_CACHES_PER_SET instead of the magic number 8 in
    __bch_bucket_alloc_set.

    Signed-off-by: Shenghui Wang
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Shenghui Wang
     
  • In extents.c:bch_extent_bad(), the number 96 is used as a parameter
    to call btree_bug_on(). The purpose is to check whether the stale
    gen value exceeds BUCKET_GC_GEN_MAX, so it is better to use the
    macro BUCKET_GC_GEN_MAX to make the code more understandable.

    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Coly Li
     
  • The parameter "struct kobject *kobj" in bch_debug_init() is unused;
    remove it in this patch.

    Signed-off-by: Dongbo Cao
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Dongbo Cao
     
  • struct kmem_cache *bch_passthrough_cache is not used in
    bcache code. Remove it.

    Signed-off-by: Shenghui Wang
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Shenghui Wang
     
  • Recalculate cached_dev_sectors on cached_dev detach, as is done on
    cached_dev attach.

    Update cached_dev_sectors before bcache_device_detach is called,
    as bcache_device_detach will set bcache_device->c to NULL.

    Signed-off-by: Shenghui Wang
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Shenghui Wang
     
  • refill->end records the last key of the writeback. For example, the
    first time around, keys (1,128K) to (1,1024K) are flushed to the
    backing device, but the end key (1,1024K) is not included, because
    of the code below:

        if (bkey_cmp(k, refill->end) >= 0) {
                ret = MAP_DONE;
                goto out;
        }

    The next time we refill the writeback keybuf, we search for keys
    starting from (1,1024K) and get a key bigger than it, so the key
    (1,1024K) is missed.

    This patch modifies the above code so that the end key is included
    in the writeback key buffer; see the sketch after this entry.

    Signed-off-by: Tang Junhui
    Cc: stable@vger.kernel.org
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Tang Junhui
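
    The fix, as described, amounts to making the comparison strict so
    that the end key itself is still refilled (sketch):

        if (bkey_cmp(k, refill->end) > 0) {     /* was: >= 0 */
                ret = MAP_DONE;
                goto out;
        }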
     
  • Somewhere between Michael Lyle's original
    "bcache: PI controller for writeback rate V2" patch dated 07 Sep
    2017 and commit 1d316e6 ("bcache: implement PI controller for
    writeback rate"), the mapping of the writeback_rate_minimum
    attribute was dropped.

    Re-add the missing sysfs writeback_rate_minimum attribute mapping to
    "allow the user to specify a minimum rate at which dirty blocks are
    retired."

    Fixes: 1d316e6 ("bcache: implement PI controller for writeback rate")
    Signed-off-by: Ben Peddell
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Ben Peddell
     
  • When a bcache device is clean, dirty keys may still exist after
    journal replay, so we need to count these dirty keys even when the
    device is in a clean state; otherwise, after writeback, the amount
    of dirty data would be incorrect.

    Signed-off-by: Tang Junhui
    Cc: stable@vger.kernel.org
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Tang Junhui
     
  • The code comment of closure_return_with_destructor() in closure.h
    marks the function name as closure_return(). This patch fixes the
    typo with the correct name, closure_return_with_destructor.

    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Coly Li
     
  • When doing an ioctl on a flash device, ioctl_dev() in super.c is
    called; we should not try to get the cached device there, since a
    flash-only device has no backing device. This patch moves the
    dc->io_disable judgement to cached_dev_ioctl() to make ioctls on
    flash devices work correctly.

    Fixes: 0f0709e6bfc3c ("bcache: stop bcache device when backing device is offline")
    Signed-off-by: Tang Junhui
    Cc: stable@vger.kernel.org
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Tang Junhui
     
  • In cached_dev_cache_miss() and check_should_bypass(), REQ_META is
    used to check whether a bio is a metadata request. But REQ_META is
    used for blktrace; the correct REQ_ flag is REQ_PRIO. This flag
    means the bio should take priority over other bios, and it is
    frequently used to indicate metadata IO in file system code.

    This patch replaces REQ_META with the correct flag, REQ_PRIO.

    CC Adam Manzanares because he explained to me what REQ_PRIO is for.

    Signed-off-by: Coly Li
    Cc: Adam Manzanares
    Signed-off-by: Jens Axboe

    Coly Li
     
  • Missed read IOs are identified by s->cache_missed, not
    s->cache_miss, so use s->cache_missed in trace_bcache_read() to
    identify whether the IO missed or not; a sketch follows this entry.

    Signed-off-by: Tang Junhui
    Cc: stable@vger.kernel.org
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Tang Junhui
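
    The call presumably becomes something like this (arguments
    illustrative):

        /* The hit/miss flag comes from s->cache_missed. */
        trace_bcache_read(s->orig_bio, !s->cache_missed, s->iop.bypass);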
     
  • UUIDs are considered metadata. __uuid_write should add the number
    of buckets (in sectors) written to disk to ca->meta_sectors_written.
    Currently only 1 bucket is used in the uuid write.

    Steps to test:
    1) Create a fresh backing device and a fresh cache device
       separately; the backing device is not attached to any cache set.
    2) cd /sys/block//bcache
       cat metadata_written    (record the output value)
       cat bucket_size
    3) Attach the backing device to the cache set.
    4) cat metadata_written
       Before the change, the output value is almost the same as the
       value in step 2; after the change, it is bigger by about 1
       bucket size.

    Signed-off-by: Shenghui Wang
    Reviewed-by: Tang Junhui
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Shenghui Wang
     

05 Oct, 2018

2 commits

  • When debugging e.g. the SCSI timeout handler, it is important that
    requests that have not yet been started or that have already
    completed are also reported through debugfs.

    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Martin K. Petersen
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Pull NVMe updates from Christoph:

    "A relatively boring merge window:

    - better AEN tracing (Chaitanya)
    - NUMA aware PCIe multipathing (me)
    - RDMA workqueue fixes (Sagi)
    - better bio usage in the target (Sagi)
    - FC rework for target removal (James)
    - better multipath handling of ->queue_rq failures (James)
    - various cleanups (Milan)"

    * 'nvme-4.20' of git://git.infradead.org/nvme:
    nvmet-rdma: use a private workqueue for delete
    nvme: take node locality into account when selecting a path
    nvmet: don't split large I/Os unconditionally
    nvme: call nvme_complete_rq when nvmf_check_ready fails for mpath I/O
    nvme-core: add async event trace helper
    nvme_fc: add 'nvme_discovery' sysfs attribute to fc transport device
    nvmet_fc: support target port removal with nvmet layer
    nvme-fc: fix for a minor typos
    nvmet: remove redundant module prefix
    nvme: fix typo in nvme_identify_ns_descs

    Jens Axboe