10 Jul, 2019

1 commit

  • Pull block updates from Jens Axboe:
    "This is the main block updates for 5.3. Nothing earth shattering or
    major in here, just fixes, additions, and improvements all over the
    map. This contains:

    - Series of documentation fixes (Bart)

    - Optimization of the blk-mq ctx get/put (Bart)

    - null_blk removal race condition fix (Bob)

    - req/bio_op() cleanups (Chaitanya)

    - Series cleaning up the segment accounting, and request/bio mapping
    (Christoph)

    - Series cleaning up the page getting/putting for bios (Christoph)

    - block cgroup cleanups and moving it to where it is used (Christoph)

    - block cgroup fixes (Tejun)

    - Series of fixes and improvements to bcache, most notably a write
    deadlock fix (Coly)

    - blk-iolatency STS_AGAIN and accounting fixes (Dennis)

    - Series of improvements and fixes to BFQ (Douglas, Paolo)

    - debugfs_create() return value check removal for drbd (Greg)

    - Use struct_size(), where appropriate (Gustavo)

    - Two lightnvm fixes (Heiner, Geert)

    - MD fixes, including a read balance and corruption fix (Guoqing,
    Marcos, Xiao, Yufen)

    - block opal shadow mbr additions (Jonas, Revanth)

    - sbitmap compare-and-exchange improvements (Pavel)

    - Fix for potential bio->bi_size overflow (Ming)

    - NVMe pull requests:
        - improved PCIe suspend support (Keith Busch)
        - error injection support for the admin queue (Akinobu Mita)
        - Fibre Channel discovery improvements (James Smart)
        - tracing improvements including nvmet tracing support (Minwoo Im)
        - misc fixes and cleanups (Anton Eidelman, Minwoo Im, Chaitanya
          Kulkarni)

    - Various little fixes and improvements to drivers and core"

    * tag 'for-5.3/block-20190708' of git://git.kernel.dk/linux-block: (153 commits)
    blk-iolatency: fix STS_AGAIN handling
    block: nr_phys_segments needs to be zero for REQ_OP_WRITE_ZEROES
    blk-mq: simplify blk_mq_make_request()
    blk-mq: remove blk_mq_put_ctx()
    sbitmap: Replace cmpxchg with xchg
    block: fix .bi_size overflow
    block: sed-opal: check size of shadow mbr
    block: sed-opal: ioctl for writing to shadow mbr
    block: sed-opal: add ioctl for done-mark of shadow mbr
    block: never take page references for ITER_BVEC
    direct-io: use bio_release_pages in dio_bio_complete
    block_dev: use bio_release_pages in bio_unmap_user
    block_dev: use bio_release_pages in blkdev_bio_end_io
    iomap: use bio_release_pages in iomap_dio_bio_end_io
    block: use bio_release_pages in bio_map_user_iov
    block: use bio_release_pages in bio_unmap_user
    block: optionally mark pages dirty in bio_release_pages
    block: move the BIO_NO_PAGE_REF check into bio_release_pages
    block: skd_main.c: Remove call to memset after dma_alloc_coherent
    block: mtip32xx: Remove call to memset after dma_alloc_coherent
    ...

    Linus Torvalds
     

07 Jul, 2019

1 commit

  • When the blk-mq debugfs file creation logic was "cleaned up", it was
    cleaned up too much, causing the queue file to not be created in the
    correct location. It turns out the check for the directory being
    present is needed: if the directory has not been created yet, the
    files should not be created either, and the function will be called
    again later in the initialization code so that the files end up in
    the correct location.

    Fixes: 6cfc0081b046 ("blk-mq: no need to check return value of debugfs_create functions")
    Reported-by: Stephen Rothwell
    Cc: linux-block@vger.kernel.org
    Signed-off-by: Greg Kroah-Hartman
    Signed-off-by: Jens Axboe

    Greg Kroah-Hartman
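
    A minimal sketch of the restored pattern, with hypothetical function
    and field names (the real fix lives in the blk-mq debugfs
    registration code):

        /* Hypothetical names; illustrates the guard described above. */
        void example_register_queue_debugfs(struct example_queue *q)
        {
                /*
                 * The parent directory may not exist yet; in that case do
                 * nothing. This function is invoked again later in the
                 * initialization code, once the directory has been
                 * created, and the files are created then.
                 */
                if (!q->debugfs_dir)
                        return;

                debugfs_create_file("queue", 0400, q->debugfs_dir, q,
                                    &example_queue_fops);
        }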
     

06 Jul, 2019

1 commit

  • The iolatency controller is based on rq_qos. It increments on
    rq_qos_throttle() and decrements on either rq_qos_cleanup() or
    rq_qos_done_bio(). a3fb01ba5af0 fixes the double accounting issue where
    blk_mq_make_request() may call both rq_qos_cleanup() and
    rq_qos_done_bio() on REQ_NOWAIT. So checking STS_AGAIN prevents the
    double decrement.

    The above works upstream, as the only way we can get STS_AGAIN is from
    blk_mq_get_request() failing. The STS_AGAIN handling isn't a real
    problem, as the bio_endio() skipping only happens on reserved tag
    allocation failures, which can only be caused by driver bugs and
    already trigger a WARN.

    However, the fix creates a not so great dependency on how STS_AGAIN can
    be propagated. Internally, we (Facebook) carry a patch that kills
    readahead if a cgroup is io congested or a fatal signal is pending.
    This, combined with the fact that chained bios propagate their
    bi_status to the parent if it is not already set, can cause the parent
    bio to not clean up properly even though it was successful. This
    consequently leaks the inflight counter and can hang all IOs under that
    blkg.

    To nip the adverse interaction early, this removes the rq_qos_cleanup()
    callback in iolatency in favor of cleaning up always on the
    rq_qos_done_bio() path.

    Fixes: a3fb01ba5af0 ("blk-iolatency: only account submitted bios")
    Debugged-by: Tejun Heo
    Debugged-by: Josef Bacik
    Signed-off-by: Dennis Zhou
    Signed-off-by: Jens Axboe

    Dennis Zhou
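
    A sketch of the shape of the change (hook names as in blk-iolatency.c;
    the ops table is condensed here, so treat it as illustrative):

        static struct rq_qos_ops blkcg_iolatency_ops = {
                .throttle = blkcg_iolatency_throttle,
                /*
                 * No .cleanup hook anymore: the inflight counter is now
                 * decremented only in done_bio, which runs exactly once
                 * per bio regardless of how STS_AGAIN propagates.
                 */
                .done_bio = blkcg_iolatency_done_bio,
                .exit = blkcg_iolatency_exit,
        };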
     

03 Jul, 2019

3 commits

  • Fix a regression introduced when removing bi_phys_segments for Write Zeroes
    requests, which need to have a segment count of zero, as they don't have a
    payload.

    Fixes: 14ccb66b3f58 ("block: remove the bi_phys_segments field in struct bio")
    Reported-by: Jens Axboe
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
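
    A condensed sketch of the restored invariant (the real logic lives in
    the request segment accounting; the default arm is a placeholder):

        static unsigned int example_nr_phys_segments(struct bio *bio)
        {
                switch (bio_op(bio)) {
                case REQ_OP_DISCARD:
                case REQ_OP_SECURE_ERASE:
                case REQ_OP_WRITE_ZEROES:
                        return 0;  /* no payload, so nothing to map */
                default:
                        return 1;  /* placeholder: real code walks the bvecs */
                }
        }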
     
  • Move the blk_mq_bio_to_request() call in front of the if-statement.

    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Reviewed-by: Minwoo Im
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • No code that occurs between blk_mq_get_ctx() and blk_mq_put_ctx() depends
    on preemption being disabled for its correctness. Since removing the CPU
    preemption calls does not measurably affect performance, simplify the
    blk-mq code by removing the blk_mq_put_ctx() function and also by not
    disabling preemption in blk_mq_get_ctx().

    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
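
    A sketch of the simplified helper after the change, close to the
    upstream shape in blk-mq.h:

        static inline struct blk_mq_ctx *blk_mq_get_ctx(struct request_queue *q)
        {
                /*
                 * raw_smp_processor_id() may race with migration, but any
                 * CPU's ctx is correct to use, so preemption need not be
                 * disabled and no matching blk_mq_put_ctx() is required.
                 */
                return __blk_mq_get_ctx(q, raw_smp_processor_id());
        }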
     

01 Jul, 2019

2 commits

  • 'bio->bi_iter.bi_size' is 'unsigned int', which can hold at most 4G - 1
    bytes.

    Before 07173c3ec276 ("block: enable multipage bvecs"), one bio could
    include only a very limited number of pages, usually at most 256, so
    the fs bio size rarely exceeded 1M bytes.

    Since we support multi-page bvecs, in theory one fs bio really can hold
    more than 1M pages, especially in case of hugepages, or big writeback
    with too many dirty pages. There is then a chance that .bi_size
    overflows.

    Fix this issue by using bio_full() to check whether the segment being
    added would overflow .bi_size.

    Cc: Liu Yiding
    Cc: kernel test robot
    Cc: "Darrick J. Wong"
    Cc: linux-xfs@vger.kernel.org
    Cc: linux-fsdevel@vger.kernel.org
    Cc: stable@vger.kernel.org
    Fixes: 07173c3ec276 ("block: enable multipage bvecs")
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
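
    A sketch of the overflow-aware bio_full() check, close to the upstream
    shape reconstructed from the description:

        static inline bool bio_full(struct bio *bio, unsigned int len)
        {
                if (bio->bi_vcnt >= bio->bi_max_vecs)
                        return true;  /* no room for another bvec */

                if (bio->bi_iter.bi_size > UINT_MAX - len)
                        return true;  /* adding len would overflow .bi_size */

                return false;
        }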
     
  • Merge 5.2-rc6 into for-5.3/block, so we get the same page merge leak
    fix. Otherwise we end up having conflicts with future patches between
    for-5.3/block and master that touch this area. In particular, it makes
    the bio_full() fix hard to backport to stable.

    * tag 'v5.2-rc6': (482 commits)
    Linux 5.2-rc6
    Revert "iommu/vt-d: Fix lock inversion between iommu->lock and device_domain_lock"
    Bluetooth: Fix regression with minimum encryption key size alignment
    tcp: refine memory limit test in tcp_fragment()
    x86/vdso: Prevent segfaults due to hoisted vclock reads
    SUNRPC: Fix a credential refcount leak
    Revert "SUNRPC: Declare RPC timers as TIMER_DEFERRABLE"
    net :sunrpc :clnt :Fix xps refcount imbalance on the error path
    NFS4: Only set creation opendata if O_CREAT
    ARM: 8867/1: vdso: pass --be8 to linker if necessary
    KVM: nVMX: reorganize initial steps of vmx_set_nested_state
    KVM: PPC: Book3S HV: Invalidate ERAT when flushing guest TLB entries
    habanalabs: use u64_to_user_ptr() for reading user pointers
    nfsd: replace Jeff by Chuck as nfsd co-maintainer
    inet: clear num_timeout reqsk_alloc()
    PCI/P2PDMA: Ignore root complex whitelist when an IOMMU is present
    net: mvpp2: debugfs: Add pmap to fs dump
    ipv6: Default fib6_type to RTN_UNICAST when not set
    net: hns3: Fix inconsistent indenting
    net/af_iucv: always register net_device notifier
    ...

    Jens Axboe
     

30 Jun, 2019

3 commits

  • Check whether the shadow mbr fits into the space provided by the
    target. While a proper firmware should handle this case and return an
    error itself, checking up front may prevent problems or even damage
    with crappy firmwares.

    Signed-off-by: Jonas Rabenstein
    Signed-off-by: David Kozub
    Reviewed-by: Scott Bauer
    Reviewed-by: Jon Derrick
    Signed-off-by: Jens Axboe

    Jonas Rabenstein
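
    A minimal sketch of the idea with a hypothetical helper (the real
    check sits in the sed-opal write path):

        static int example_check_shadow_mbr_write(u64 offset, u64 len,
                                                  u64 mbr_size)
        {
                /* reject writes that would run past the shadow mbr area */
                if (offset > mbr_size || len > mbr_size - offset)
                        return -ENOSPC;
                return 0;
        }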
     
  • Allow modification of the shadow mbr. If the shadow mbr is not marked
    as done, this data will be presented read-only as the device content.
    Only after the shadow mbr is marked as done and a locking range is
    unlocked does the actual content become accessible.

    Co-authored-by: David Kozub
    Signed-off-by: Jonas Rabenstein
    Signed-off-by: David Kozub
    Reviewed-by: Scott Bauer
    Reviewed-by: Jon Derrick
    Signed-off-by: Jens Axboe

    Jonas Rabenstein
     
  • Enable users to mark the shadow mbr as done without completely
    deactivating the shadow mbr feature. This may be useful on reboots
    where the power to the disk is not disconnected in between and the
    shadow mbr stores the required boot files. Of course, this also saves
    the (few) commands required to enable the feature if it is already
    enabled and one only wants to mark the shadow mbr as done.

    Co-authored-by: David Kozub
    Signed-off-by: Jonas Rabenstein
    Signed-off-by: David Kozub
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Scott Bauer
    Reviewed-by: Jon Derrick
    Signed-off-by: Jens Axboe

    Jonas Rabenstein
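
    A sketch of driving the new ioctl from userspace; the names follow the
    uapi added by this series (include/uapi/linux/sed-opal.h), but treat
    the exact fields as illustrative:

        #include <fcntl.h>
        #include <string.h>
        #include <unistd.h>
        #include <sys/ioctl.h>
        #include <linux/sed-opal.h>

        static int mark_shadow_mbr_done(const char *dev, const char *password)
        {
                struct opal_mbr_done req = { 0 };
                int fd, ret;

                fd = open(dev, O_RDWR);
                if (fd < 0)
                        return -1;

                req.key.key_len = strlen(password);
                memcpy(req.key.key, password, req.key.key_len);
                /* mark done without deactivating the shadow mbr feature */
                req.done_flag = OPAL_MBR_DONE;

                ret = ioctl(fd, IOC_OPAL_MBR_DONE, &req);
                close(fd);
                return ret;
        }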
     

28 Jun, 2019

1 commit

  • In reboot tests on several devices we were seeing a "use after free"
    when slub_debug or KASAN was enabled. The kernel complained about:

    Unable to handle kernel paging request at virtual address 6b6b6c2b

    ...which is a classic sign of use after free under slub_debug. The
    stack crawl in kgdb looked like:

    0 test_bit (addr=, nr=)
    1 bfq_bfqq_busy (bfqq=)
    2 bfq_select_queue (bfqd=)
    3 __bfq_dispatch_request (hctx=)
    4 bfq_dispatch_request (hctx=)
    5 0xc056ef00 in blk_mq_do_dispatch_sched (hctx=0xed249440)
    6 0xc056f728 in blk_mq_sched_dispatch_requests (hctx=0xed249440)
    7 0xc0568d24 in __blk_mq_run_hw_queue (hctx=0xed249440)
    8 0xc0568d94 in blk_mq_run_work_fn (work=)
    9 0xc024c5c4 in process_one_work (worker=0xec6d4640, work=0xed249480)
    10 0xc024cff4 in worker_thread (__worker=0xec6d4640)

    Digging in kgdb, it could be found that, though bfqq looked fine,
    bfqq->bic had been freed.

    Through further digging, I postulated that perhaps it is illegal to
    access a "bic" (AKA an "icq") after bfq_exit_icq() had been called
    because the "bic" can be freed at some point in time after this call
    is made. I confirmed that there certainly were cases where the exact
    crashing code path would access the "bic" after bfq_exit_icq() had
    been called. Specifically, I set "bfqq->bic" to (void *)0x7 and
    saw that the bic was 0x7 at the time of the crash.

    To understand a bit more about why this crash was fairly uncommon (I
    saw it only once in a few hundred reboots), you can see that much of
    the time bfq_exit_icq_bfqq() fully frees the bfqq and thus it can't
    access the ->bic anymore. The only case it doesn't is if
    bfq_put_queue() sees a reference still held.

    However, even in the case when bfqq isn't freed, the crash is still
    rare. Why? I tracked what happened to the "bic" after the exit
    routine. It doesn't get freed right away. Rather,
    put_io_context_active() eventually called put_io_context() which
    queued up freeing on a workqueue. The freeing then actually happened
    later than that through call_rcu(). Despite all these delays, some
    extra debugging showed that all the hoops could be jumped through in
    time and the memory could be freed causing the original crash. Phew!

    To make a long story short, assuming it truly is illegal to access an
    icq after the "exit_icq" callback is finished, this patch is needed.

    Cc: stable@vger.kernel.org
    Reviewed-by: Paolo Valente
    Signed-off-by: Douglas Anderson
    Signed-off-by: Jens Axboe

    Douglas Anderson
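
    A condensed sketch of the fix's core idea (the patch clears the
    pointer in the exit path; names and context are simplified here):

        static void example_exit_icq_bfqq(struct bfq_queue *bfqq, bool is_sync)
        {
                /*
                 * The bic may be freed (after an RCU grace period) once
                 * exit_icq returns; clear the queue's back-pointer so
                 * later dispatch paths cannot chase it into freed memory.
                 */
                if (bfqq && is_sync)
                        bfqq->bic = NULL;
        }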
     

27 Jun, 2019

2 commits

  • bio_flush_dcache_pages() is unused. Remove it.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • Some debug code suggested by Paolo was tripping when I did reboot
    stress tests. Specifically in bfq_bfqq_resume_state()
    "bic->saved_wr_start_at_switch_to_srt" was later than the current
    value of "jiffies". A bit of debugging showed that
    "bic->saved_wr_start_at_switch_to_srt" was actually 0 and a bit more
    debugging showed that was because we had run through the "unlikely"
    case in the bfq_bfqq_save_state() function.

    Let's init "saved_wr_start_at_switch_to_srt" in the unlikely case to
    something sane.

    NOTE: this fixes no known real-world errors.

    Reviewed-by: Paolo Valente
    Reviewed-by: Guenter Roeck
    Signed-off-by: Douglas Anderson
    Signed-off-by: Jens Axboe

    Douglas Anderson
     

26 Jun, 2019

1 commit

  • By mistake, there is a '&' instead of a '==' in the definition of the
    macro BFQQ_TOTALLY_SEEKY. This commit replaces the wrong operator with
    the correct one.

    Fixes: 7074f076ff15 ("block, bfq: do not tag totally seeky queues as soft rt")
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
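
    A sketch of the one-character bug (the real macro tests
    bfqq->seek_history; condensed here):

        /* '&' is true whenever ANY tracked request was seeky ... */
        #define BFQQ_TOTALLY_SEEKY_BUGGY(hist)  ((hist) & -1)
        /* ... '==' is true only when ALL tracked requests were seeky */
        #define BFQQ_TOTALLY_SEEKY_FIXED(hist)  ((hist) == -1)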
     

25 Jun, 2019

7 commits

  • Consider, on one side, a bfq_queue Q that remains empty while in
    service, and, on the other side, the pending I/O of bfq_queues that,
    according to their timestamps, have to be served after Q. If an
    uncontrolled amount of I/O from the latter bfq_queues were dispatched
    while Q is waiting for its new I/O to arrive, then Q's bandwidth
    guarantees would be violated. To prevent this, I/O dispatch is plugged
    until Q receives new I/O (except for a properly controlled amount of
    injected I/O). Unfortunately, preemption breaks I/O-dispatch plugging,
    for the following reason.

    Preemption is performed in two steps. First, Q is expired and
    re-scheduled. Second, the new bfq_queue to serve is chosen. The first
    step is needed by the second, as the second can be performed only
    after Q's timestamps have been properly updated (done in the
    expiration step), and Q has been re-queued for service. This
    dependency is a consequence of the way BFQ's scheduling algorithm
    is currently implemented.

    But Q is not re-scheduled at all in the first step, because Q is
    empty. As a consequence, an uncontrolled amount of I/O may be
    dispatched until Q becomes non-empty again. This breaks Q's service
    guarantees.

    This commit addresses this issue by re-scheduling Q even if it is
    empty. This in turn breaks the assumption that all scheduled queues
    are non-empty, so a few extra checks are now needed.

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • BFQ enqueues the I/O coming from each process into a separate
    bfq_queue, and serves bfq_queues one at a time. Each bfq_queue may be
    served for at most timeout_sync milliseconds (default: 125 ms). This
    service scheme is prone to the following inaccuracy.

    While a bfq_queue Q1 is in service, some empty bfq_queue Q2 may
    receive I/O, and, according to BFQ's scheduling policy, may become the
    right bfq_queue to serve, in place of the currently in-service
    bfq_queue. In this respect, postponing the service of Q2 to after the
    service of Q1 finishes may delay the completion of Q2's I/O, compared
    with an ideal service in which all non-empty bfq_queues are served in
    parallel, and every non-empty bfq_queue is served at a rate
    proportional to the bfq_queue's weight. This additional delay is equal
    at most to the time Q1 may unjustly remain in service before switching
    to Q2.

    If Q1 and Q2 have the same weight, then this time is most likely
    negligible compared with the completion time to be guaranteed to Q2's
    I/O. In addition, first, one of the reasons why BFQ may want to serve
    Q1 for a while is that this boosts throughput and, second, serving Q1
    longer reduces BFQ's overhead. As a conclusion, it is usually better
    not to preempt Q1 if both Q1 and Q2 have the same weight.

    In contrast, as Q2's weight or priority becomes higher and higher
    compared with that of Q1, the above delay becomes larger and larger,
    compared with the I/O completion times that have to be guaranteed to
    Q2 according to Q2's weight. So reducing this delay may be more
    important than avoiding the costs of preempting Q1.

    Accordingly, this commit preempts Q1 if Q2 has a higher weight or a
    higher priority than Q1. Preemption causes Q1 to be re-scheduled, and
    triggers a new choice of the next bfq_queue to serve. If Q2 really is
    the next bfq_queue to serve, then Q2 will be set in service
    immediately.

    This change reduces the component of the I/O latency caused by the
    above delay by about 80%. For example, on an (old) PLEXTOR PX-256M5
    SSD, the maximum latency reported by fio drops from 15.1 to 3.2 ms for
    a process doing sporadic random reads while another process is doing
    continuous sequential reads.

    Signed-off-by: Nicola Bottura
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • A bfq_queue Q may happen to be synchronized with another
    bfq_queue Q2, i.e., the I/O of Q2 may need to be completed for Q to
    receive new I/O. We call Q2 "waker queue".

    If I/O plugging is being performed for Q, and Q is not receiving any
    more I/O because of the above synchronization, then, thanks to BFQ's
    injection mechanism, the waker queue is likely to get served before
    the I/O-plugging timeout fires.

    Unfortunately, this fact may not be sufficient to guarantee a high
    throughput during the I/O plugging, because the inject limit for Q may
    be too low to guarantee a lot of injected I/O. In addition, the
    duration of the plugging, i.e., the time before Q finally receives new
    I/O, may not be minimized, because the waker queue may happen to be
    served only after other queues.

    To address these issues, this commit introduces the explicit detection
    of the waker queue, and the unconditional injection of a pending I/O
    request of the waker queue on each invocation of
    bfq_dispatch_request().

    One may be concerned that this systematic injection of I/O from the
    waker queue delays the service of Q's I/O. Fortunately, it doesn't. On
    the contrary, Q's next I/O is brought forward dramatically, because it
    is not blocked for milliseconds.

    Reported-by: Srivatsa S. Bhat (VMware)
    Tested-by: Srivatsa S. Bhat (VMware)
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • Until the base value for request service times gets finally computed
    for a bfq_queue, the inject limit for that queue does depend on the
    think-time state (short|long) of the queue. A timely update of the
    think time then guarantees a quicker activation or deactivation of the
    injection. Fortunately, the think time of a bfq_queue is updated in
    the same code path as the inject limit; but after the inject limit.

    This commit moves the update of the think time before the update of
    the inject limit. For coherence, it moves the update of the seek time
    too.

    Reported-by: Srivatsa S. Bhat (VMware)
    Tested-by: Srivatsa S. Bhat (VMware)
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • I/O injection gets reduced if it increases the request service times
    of the victim queue beyond a certain threshold. The threshold, in its
    turn, is computed as a function of the base service time enjoyed by
    the queue when it undergoes no injection.

    As a consequence, for injection to work properly, the above base value
    has to be accurate. In this respect, such a value may vary over
    time. For example, it varies if the size or the spatial locality of
    the I/O requests in the queue change. It is then important to update
    this value whenever possible. This commit performs this update.

    Reported-by: Srivatsa S. Bhat (VMware)
    Tested-by: Srivatsa S. Bhat (VMware)
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • One of the cases where the parameters for injection may be updated is
    when there are no more in-flight I/O requests. The number of in-flight
    requests is stored in the field bfqd->rq_in_driver of the descriptor
    bfqd of the device. So, the controlled condition is
    bfqd->rq_in_driver == 0.

    Unfortunately, this is wrong because the instruction that checks this
    condition is in the code path that handles the completion of a
    request, and, in particular, it is executed before bfqd->rq_in_driver
    is decremented in that code path.

    This commit fixes this issue by just replacing 0 with 1 in the
    comparison.

    Reported-by: Srivatsa S. Bhat (VMware)
    Tested-by: Srivatsa S. Bhat (VMware)
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
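
    A condensed sketch of the off-by-one (the surrounding code is
    illustrative, and the update helper is hypothetical):

        static void example_completed_request(struct bfq_data *bfqd)
        {
                /*
                 * This runs before rq_in_driver is decremented for the
                 * completing request, so "no other request in flight"
                 * means the counter still reads 1 here, not 0.
                 */
                if (bfqd->rq_in_driver == 1)  /* was: == 0, which never hit */
                        example_update_inject_params(bfqd);

                bfqd->rq_in_driver--;
        }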
     
  • Until the base value of the request service times gets finally
    computed for a bfq_queue, the inject limit does depend on the
    think-time state (short|long). The limit must be 0 or 1 if the think
    time is deemed, respectively, as short or long. However, such a check
    and possible limit update is performed only periodically, once per
    second. So, to make the injection mechanism much more reactive, this
    commit performs the update also every time the think-time state
    changes.

    In addition, in the following special case, this commit lets the
    inject limit of a bfq_queue bfqq remain equal to 1 even if bfqq's
    think time is short: bfqq's I/O is synchronized with that of some
    other queue, i.e., bfqq may receive new I/O only after the I/O of the
    other queue is completed. Keeping the inject limit to 1 allows the
    blocking I/O to be served while bfqq is in service. And this is very
    convenient both for bfqq and for the total throughput, as explained
    in detail in the comments in bfq_update_has_short_ttime().

    Reported-by: Srivatsa S. Bhat (VMware)
    Tested-by: Srivatsa S. Bhat (VMware)
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

21 Jun, 2019

10 commits

  • Improve the print_req_error with additional request fields which are
    helpful for debugging. Use newly introduced blk_op_str() to print the
    REQ_OP_XXX in the string format.

    Reviewed-by: Chao Yu
    Signed-off-by: Chaitanya Kulkarni
    Signed-off-by: Jens Axboe

    Chaitanya Kulkarni
     
  • Now that we have a helper function blk_op_str() to convert a
    REQ_OP_XXX value to the string XXX, adjust the code to use it. Get rid
    of the duplicate op_name array, which is now present in blk-core.c
    (renamed to "blk_op_name"), and of the open coding in
    blk-mq-debugfs.c.

    Reviewed-by: Bart Van Assche
    Signed-off-by: Chaitanya Kulkarni
    Signed-off-by: Jens Axboe

    Chaitanya Kulkarni
     
  • In order to centralize the REQ_OP_XXX to string conversion so that it
    can be used in the block layer and in different places in the kernel,
    like f2fs, this patch adds a new helper function along with an array
    similar to the one present in blk-mq-debugfs.c.

    We keep this helper functionality centralized in blk-core.c instead of
    blk-mq-debugfs.c, since blk-core.c is configured using CONFIG_BLOCK
    and will not depend on blk-mq-debugfs.c, which is configured using
    CONFIG_BLK_DEBUG_FS.

    The next patch adjusts the code in blk-mq-debugfs.c to use the newly
    introduced helper.

    Reviewed-by: Bart Van Assche
    Signed-off-by: Chaitanya Kulkarni
    Signed-off-by: Jens Axboe

    Chaitanya Kulkarni
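
    A sketch of the helper, close to the shape described (the table is
    condensed; the real one in blk-core.c covers every REQ_OP_XXX):

        static const char *const blk_op_name[] = {
                [REQ_OP_READ]           = "READ",
                [REQ_OP_WRITE]          = "WRITE",
                [REQ_OP_FLUSH]          = "FLUSH",
                [REQ_OP_DISCARD]        = "DISCARD",
                [REQ_OP_WRITE_ZEROES]   = "WRITE_ZEROES",
        };

        /* map a REQ_OP_XXX value to "XXX"; holes fall back to "UNKNOWN" */
        const char *blk_op_str(unsigned int op)
        {
                const char *op_str = "UNKNOWN";

                if (op < ARRAY_SIZE(blk_op_name) && blk_op_name[op])
                        op_str = blk_op_name[op];

                return op_str;
        }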
     
  • Print the calling function instead of print_req_error as a prefix, and
    print the operation and op_flags separately instead of the whole field.

    Reviewed-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Chaitanya Kulkarni
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • This option is entirely bfq specific, so give it an appropriate name.

    Also make it depend on CONFIG_BFQ_GROUP_IOSCHED in Kconfig, as all
    the functionality already does so anyway.

    Acked-by: Tejun Heo
    Acked-by: Paolo Valente
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • This function was moved from core block code and is way too generic.
    Fold it into the only caller and simplify it based on the arguments
    actually passed.

    Acked-by: Tejun Heo
    Acked-by: Paolo Valente
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • This structure and assorted infrastructure is only used by the bfq I/O
    scheduler. Move it there instead of bloating the common code.

    Acked-by: Tejun Heo
    Acked-by: Paolo Valente
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • When sampling the blkcg counts we don't need atomics or per-cpu
    variables. Introduce a new structure just containing plain u64
    counters.

    Acked-by: Tejun Heo
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
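
    A sketch of the new snapshot type, matching the described design (see
    the blkcg rwstat definitions for the real one):

        /* a plain snapshot: no atomics, no per-cpu storage */
        struct blkg_rwstat_sample {
                /* indexed by BLKG_RWSTAT_{READ,WRITE,SYNC,ASYNC,DISCARD} */
                u64 cnt[BLKG_RWSTAT_NR];
        };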
     
  • Returning a structure generates rather bad code, so switch to passing
    by reference. Also don't require the structure to be zeroed and add
    to the 0-initialized counters, but actually set the counters to the
    calculated value.

    Acked-by: Tejun Heo
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
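
    A sketch of the calling-convention change (signatures condensed from
    the description; the "_old" name is just a label for the prior form):

        /* before: every call copies a multi-word struct out by value */
        struct blkg_rwstat_sample blkg_rwstat_recursive_sum_old(
                        struct blkcg_gq *blkg, struct blkcg_policy *pol,
                        int off);

        /* after: the caller passes the destination; the counters are set
         * to the computed value, so the sample need not be zeroed first */
        void blkg_rwstat_recursive_sum(struct blkcg_gq *blkg,
                        struct blkcg_policy *pol, int off,
                        struct blkg_rwstat_sample *sum);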
     
  • Break up the crazy statements into something readable. Also switch to
    an unsigned counter, as it can't ever turn negative.

    Reviewed-by: Chaitanya Kulkarni
    Acked-by: Tejun Heo
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig