12 Dec, 2018

1 commit

  • null_blk_zoned creation fails if the number of zones specified is
    equal to or smaller than 64, due to a memory allocation failure in
    blk_alloc_zones(). With such a small number of zones, the required
    memory for all zone descriptors fits in a single page, so the page
    order for alloc_pages_node() is zero. Allow this value in
    blk_alloc_zones() so that the allocation succeeds.

    Fixes: bf5054569653 ("block: Introduce blk_revalidate_disk_zones()")
    Reviewed-by: Damien Le Moal
    Signed-off-by: Shin'ichiro Kawasaki
    Signed-off-by: Jens Axboe

    Shin'ichiro Kawasaki
     

11 Dec, 2018

1 commit

  • We don't need to zero fill the bio if not using kernel allocated pages.

    Fixes: f3587d76da05 ("block: Clear kernel memory before copying to user") # v4.20-rc2
    Reported-by: Todd Aiken
    Cc: Laurence Oberman
    Cc: stable@vger.kernel.org
    Cc: Bart Van Assche
    Tested-by: Laurence Oberman
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Keith Busch
     

07 Dec, 2018

2 commits

  • After the direct dispatch corruption fix, we permanently disallow direct
    dispatch of non read/write requests. This works fine off the normal IO
    path, as they will be retried like any other failed direct dispatch
    request. But for the blk_insert_cloned_request() that only DM uses to
    bypass the bottom level scheduler, we always first attempt direct
    dispatch. For some types of requests, that's now a permanent failure,
    and no amount of retrying will make that succeed. This results in a
    livelock.

    Instead of making special cases for what we can direct issue, and now
    having to deal with DM solving the livelock while still retaining a BUSY
    condition feedback loop, always just add a request that has been through
    ->queue_rq() to the hardware queue dispatch list. These are safe to use
    as no merging can take place there. Additionally, if requests do have
    prepped data from drivers, we aren't dependent on them not sharing space
    in the request structure to safely add them to the IO scheduler lists.

    This basically reverts ffe81d45322c and is based on a patch from Ming,
    but with the list insert case covered as well.

    Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
    Cc: stable@vger.kernel.org
    Suggested-by: Ming Lei
    Reported-by: Bart Van Assche
    Tested-by: Ming Lei
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Since commit 2d29c9f89fcd ("block, bfq: improve asymmetric scenarios
    detection"), if there are process groups with I/O requests waiting for
    completion, then BFQ tags the scenario as 'asymmetric'. This detection
    is needed for preserving service guarantees (for details, see the
    comments on the computation of the variable asymmetric_scenario in
    the function bfq_better_to_idle).

    Unfortunately, commit '2d29c9f89fcd ("block, bfq: improve asymmetric
    scenarios detection")' contains an error exactly in the updating of
    the number of groups with I/O requests waiting for completion: if a
    group has more than one descendant process, then the above number of
    groups, which is renamed from num_active_groups to a more appropriate
    num_groups_with_pending_reqs by this commit, may happen to be wrongly
    decremented multiple times, namely every time one of the descendant
    processes gets all its pending I/O requests completed.

    A correct, complete solution should work as follows. Consider a group
    that is inactive, i.e., that has no descendant process with pending
    I/O inside BFQ queues. Then suppose that num_groups_with_pending_reqs
    is still accounting for this group, because the group still has some
    descendant process with some I/O request still in
    flight. num_groups_with_pending_reqs should be decremented when the
    in-flight request of the last descendant process is finally completed
    (assuming that nothing else has changed for the group in the meantime,
    in terms of composition of the group and active/inactive state of
    child groups and processes). To accomplish this, an additional
    pending-request counter must be added to entities, and must be
    updated correctly.

    To avoid this additional field and operations, this commit resorts to
    the following tradeoff between simplicity and accuracy: for an
    inactive group that is still counted in num_groups_with_pending_reqs,
    this commit decrements num_groups_with_pending_reqs when the first
    descendant process of the group remains with no request waiting for
    completion.

    This simplified scheme provides a fix to the unbalanced decrements
    introduced by 2d29c9f89fcd. Since this error was also caused by lack
    of comments on this non-trivial issue, this commit also adds related
    comments.

    Fixes: 2d29c9f89fcd ("block, bfq: improve asymmetric scenarios detection")
    Reported-by: Steven Barrett
    Tested-by: Steven Barrett
    Tested-by: Lucjan Lucjanov
    Reviewed-by: Federico Motta
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

05 Dec, 2018

1 commit

  • If we attempt a direct issue to a SCSI device, and it returns BUSY, then
    we queue the request up normally. However, the SCSI layer may have
    already setup SG tables etc for this particular command. If we later
    merge with this request, then the old tables are no longer valid. Once
    we issue the IO, we only read/write the original part of the request,
    not the new state of it.

    This causes data corruption, and is most often noticed with the file
    system complaining about the just read data being invalid:

    [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256)

    because most of it is garbage...

    This doesn't happen from the normal issue path, as we will simply defer
    the request to the hardware queue dispatch list if we fail. Once it's on
    the dispatch list, we never merge with it.

    Fix this from the direct issue path by flagging the request as
    REQ_NOMERGE so we don't change the size of it before issue.

    See also:
    https://bugzilla.kernel.org/show_bug.cgi?id=201685

    Tested-by: Guenter Roeck
    Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe

    Jens Axboe
     

01 Dec, 2018

1 commit

  • There are actually two kinds of discard merge:

    - one is the normal discard merge, which behaves just like a normal
    read/write request; call it single-range discard

    - the other is the multi-range discard, where
    queue_max_discard_segments(rq->q) > 1

    In the former case, queue_max_discard_segments(rq->q) is 1, and this
    kind of discard merge should be handled like a normal read/write
    request.

    This patch fixes the following kernel panic issue[1], which is caused by
    not removing the single-range discard request from elevator queue.

    Guangwu has a RAID discard test case in which this issue is a bit
    easier to trigger, and I verified that this patch fixes the kernel
    panic in that test case.

    [1] kernel panic log from Jens's report

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000148
    PGD 0 P4D 0.
    Oops: 0000 [#1] SMP PTI
    CPU: 37 PID: 763 Comm: kworker/37:1H Not tainted \
    4.20.0-rc3-00649-ge64d9a554a91-dirty #14 Hardware name: Wiwynn \
    Leopard-Orv2/Leopard-DDR BW, BIOS LBM08 03/03/2017 Workqueue: kblockd \
    blk_mq_run_work_fn RIP: \
    0010:blk_mq_get_driver_tag+0x81/0x120 Code: 24 \
    10 48 89 7c 24 20 74 21 83 fa ff 0f 95 c0 48 8b 4c 24 28 65 48 33 0c 25 28 00 00 00 \
    0f 85 96 00 00 00 48 83 c4 30 5b 5d c3 8b 87 48 01 00 00 8b 40 04 39 43 20 72 37 \
    f6 87 b0 00 00 00 02 RSP: 0018:ffffc90004aabd30 EFLAGS: 00010246 \
    RAX: 0000000000000003 RBX: ffff888465ea1300 RCX: ffffc90004aabde8
    RDX: 00000000ffffffff RSI: ffffc90004aabde8 RDI: 0000000000000000
    RBP: 0000000000000000 R08: ffff888465ea1348 R09: 0000000000000000
    R10: 0000000000001000 R11: 00000000ffffffff R12: ffff888465ea1300
    R13: 0000000000000000 R14: ffff888465ea1348 R15: ffff888465d10000
    FS: 0000000000000000(0000) GS:ffff88846f9c0000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000148 CR3: 000000000220a003 CR4: 00000000003606e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    blk_mq_dispatch_rq_list+0xec/0x480
    ? elv_rb_del+0x11/0x30
    blk_mq_do_dispatch_sched+0x6e/0xf0
    blk_mq_sched_dispatch_requests+0xfa/0x170
    __blk_mq_run_hw_queue+0x5f/0xe0
    process_one_work+0x154/0x350
    worker_thread+0x46/0x3c0
    kthread+0xf5/0x130
    ? process_one_work+0x350/0x350
    ? kthread_destroy_worker+0x50/0x50
    ret_from_fork+0x1f/0x30
    Modules linked in: sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel \
    kvm switchtec irqbypass iTCO_wdt iTCO_vendor_support efivars cdc_ether usbnet mii \
    cdc_acm i2c_i801 lpc_ich mfd_core ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq \
    button sch_fq_codel nfsd nfs_acl lockd grace auth_rpcgss oid_registry sunrpc nvme \
    nvme_core fuse sg loop efivarfs autofs4 CR2: 0000000000000148 \

    ---[ end trace 340a1fb996df1b9b ]---
    RIP: 0010:blk_mq_get_driver_tag+0x81/0x120
    Code: 24 10 48 89 7c 24 20 74 21 83 fa ff 0f 95 c0 48 8b 4c 24 28 65 48 33 0c 25 28 \
    00 00 00 0f 85 96 00 00 00 48 83 c4 30 5b 5d c3 8b 87 48 01 00 00 8b 40 04 39 43 \
    20 72 37 f6 87 b0 00 00 00 02

    Fixes: 445251d0f4d329a ("blk-mq: fix discard merge with scheduler attached")
    Reported-by: Jens Axboe
    Cc: Guangwu Zhang
    Cc: Christoph Hellwig
    Cc: Jianchao Wang
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

14 Nov, 2018

2 commits

  • c2856ae2f315d ("blk-mq: quiesce queue before freeing queue") already
    fixed this race, but the implied synchronize_rcu() in
    blk_mq_quiesce_queue() can slow down LUN probing a lot, causing a
    performance regression.

    Then 1311326cf4755c7 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()")
    quiesced the queue, to avoid an unnecessary synchronize_rcu(), only
    when queue initialization is done, because it is common to see lots
    of nonexistent LUNs that need to be probed.

    However, it turns out it isn't safe to quiesce the queue only when
    queue initialization is done. When one SCSI command is completed, the
    user that sent the command can be woken up immediately, and the scsi
    device may then be removed while the run queue in scsi_end_request()
    is still in progress, so a kernel panic can be caused.

    In Red Hat QE lab, there are several reports about this kind of kernel
    panic triggered during kernel booting.

    This patch addresses the issue by grabbing one queue usage counter
    while freeing one request and during the following run queue.

    Fixes: 1311326cf4755c7 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()")
    Cc: Andrew Jones
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K. Petersen
    Cc: Christoph Hellwig
    Cc: James E.J. Bottomley
    Cc: stable
    Cc: jianchao.wang
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • A discard cleanup merged into 4.20-rc2 causes fstests xfs/259 to
    fall into an endless loop in the discard code. The test is creating
    a device that is exactly 2^32 sectors in size to test mkfs boundary
    conditions around the 32 bit sector overflow region.

    mkfs issues a discard for the entire device size by default, and
    hence this throws a sector count of 2^32 into
    blkdev_issue_discard(). It takes the number of sectors to discard as
    a sector_t - a 64 bit value.

    Commit ba5d73851e71 ("block: cleanup __blkdev_issue_discard") takes
    this sector count and casts it to a 32 bit value before comparing it
    against the maximum allowed discard size of the device. This
    truncates away the upper 32 bits, so if the lower 32 bits of the
    sector count are zero, it starts issuing discards of length 0. This
    causes the code to fall into an endless loop, issuing zero-length
    discards over and over again on the same sector.

    Fixes: ba5d73851e71 ("block: cleanup __blkdev_issue_discard")
    Tested-by: Darrick J. Wong
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Dave Chinner

    Killed pointless WARN_ON().

    Signed-off-by: Jens Axboe

    Dave Chinner
     

13 Nov, 2018

1 commit

  • We need to copy the io priority, too; otherwise the clone will run
    with a different priority than the original one.

    Fixes: 43b62ce3ff0a ("block: move bio io prio to a new field")
    Signed-off-by: Hannes Reinecke
    Signed-off-by: Jean Delvare

    Fixed up subject, and ordered stores.

    Signed-off-by: Jens Axboe

    Hannes Reinecke
     

10 Nov, 2018

1 commit

  • Pull block layer fixes from Jens Axboe:

    - Two fixes for an ubd regression, one for missing locking, and one for
    a missing initialization of a field. The latter was an old latent
    bug, but it's now visible and triggers (Me, Anton Ivanov)

    - Set of NVMe fixes via Christoph, but applied manually due to a git
    tree mixup (Christoph, Sagi)

    - Fix for a discard split regression, in three patches (Ming)

    - Update libata git trees (Geert)

    - SPDX identifier for sata_rcar (Kuninori Morimoto)

    - Virtual boundary merge fix (Johannes)

    - Preemptively clear memory we are going to pass to userspace, in case
    the driver does a short read (Keith)

    * tag 'for-linus-20181109' of git://git.kernel.dk/linux-block:
    block: make sure writesame bio is aligned with logical block size
    block: cleanup __blkdev_issue_discard()
    block: make sure discard bio is aligned with logical block size
    Revert "nvmet-rdma: use a private workqueue for delete"
    nvme: make sure ns head inherits underlying device limits
    nvmet: don't try to add ns to p2p map unless it actually uses it
    sata_rcar: convert to SPDX identifiers
    ubd: fix missing initialization of io_req
    block: Clear kernel memory before copying to user
    MAINTAINERS: Fix remaining pointers to obsolete libata.git
    ubd: fix missing lock around request issue
    block: respect virtual boundary mask in bvecs

    Linus Torvalds
     

09 Nov, 2018

3 commits

  • Obviously the created writesame bio has to be aligned to the logical
    block size; use bio_allowed_max_sectors() to retrieve this limit.

    Cc: stable@vger.kernel.org
    Cc: Mike Snitzer
    Cc: Christoph Hellwig
    Cc: Xiao Ni
    Cc: Mariusz Dabrowski
    Fixes: b49a0871be31a745b2ef ("block: remove split code in blkdev_issue_{discard,write_same}")
    Tested-by: Rui Salvaterra
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Cleanup __blkdev_issue_discard() a bit:

    - remove local variable of 'end_sect'
    - remove code block of 'fail'

    Cc: Mike Snitzer
    Cc: Christoph Hellwig
    Cc: Xiao Ni
    Cc: Mariusz Dabrowski
    Tested-by: Rui Salvaterra
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Obviously the created discard bio has to be aligned to the logical
    block size.

    This patch introduces the bio_allowed_max_sectors() helper for this
    purpose.

    Cc: stable@vger.kernel.org
    Cc: Mike Snitzer
    Cc: Christoph Hellwig
    Cc: Xiao Ni
    Cc: Mariusz Dabrowski
    Fixes: 744889b7cbb56a6 ("block: don't deal with discard limit in blkdev_issue_discard()")
    Fixes: a22c4d7e34402cc ("block: re-add discard_granularity and alignment checks")
    Reported-by: Rui Salvaterra
    Tested-by: Rui Salvaterra
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

08 Nov, 2018

2 commits

  • If the kernel allocates a bounce buffer for user read data, this memory
    needs to be cleared before copying it to the user, otherwise it may leak
    kernel memory to user space.

    Laurence Oberman
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Keith Busch
     
  • With drivers that are setting a virtual boundary constraint, we are
    seeing a lot of bio splitting and smaller I/Os being submitted to the
    driver.

    This happens because the bio gap detection code does not account for
    cases where PAGE_SIZE - 1 is bigger than queue_virt_boundary() and
    thus splits the bio unnecessarily.

    Cc: Jan Kara
    Cc: Bart Van Assche
    Cc: Ming Lei
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Johannes Thumshirn
    Acked-by: Keith Busch
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Johannes Thumshirn
     

03 Nov, 2018

1 commit

  • Pull block layer fixes from Jens Axboe:
    "The biggest part of this pull request is the revert of the blkcg
    cleanup series. It had one fix earlier for a stacked device issue, but
    another one was reported. Rather than play whack-a-mole with this,
    revert the entire series and try again for the next kernel release.

    Apart from that, only small fixes/changes.

    Summary:

    - Indentation fixup for mtip32xx (Colin Ian King)

    - The blkcg cleanup series revert (Dennis Zhou)

    - Two NVMe fixes. One fixing a regression in the nvme request
    initialization in this merge window, causing nvme-fc to not work.
    The other is a suspend/resume p2p resource issue (James, Keith)

    - Fix sg discard merge, allowing us to merge in cases where we didn't
    before (Jianchao Wang)

    - Call rq_qos_exit() after the queue is frozen, preventing a hang
    (Ming)

    - Fix brd queue setup, fixing an oops if we fail setting up all
    devices (Ming)"

    * tag 'for-linus-20181102' of git://git.kernel.dk/linux-block:
    nvme-pci: fix conflicting p2p resource adds
    nvme-fc: fix request private initialization
    blkcg: revert blkcg cleanups series
    block: brd: associate with queue until adding disk
    block: call rq_qos_exit() after queue is frozen
    mtip32xx: clean an indentation issue, remove extraneous tabs
    block: fix the DISCARD request merge

    Linus Torvalds
     

02 Nov, 2018

2 commits

  • Pull AFS updates from Al Viro:
    "AFS series, with some iov_iter bits included"

    * 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
    missing bits of "iov_iter: Separate type from direction and use accessor functions"
    afs: Probe multiple fileservers simultaneously
    afs: Fix callback handling
    afs: Eliminate the address pointer from the address list cursor
    afs: Allow dumping of server cursor on operation failure
    afs: Implement YFS support in the fs client
    afs: Expand data structure fields to support YFS
    afs: Get the target vnode in afs_rmdir() and get a callback on it
    afs: Calc callback expiry in op reply delivery
    afs: Fix FS.FetchStatus delivery from updating wrong vnode
    afs: Implement the YFS cache manager service
    afs: Remove callback details from afs_callback_break struct
    afs: Commit the status on a new file/dir/symlink
    afs: Increase to 64-bit volume ID and 96-bit vnode ID for YFS
    afs: Don't invoke the server to read data beyond EOF
    afs: Add a couple of tracepoints to log I/O errors
    afs: Handle EIO from delivery function
    afs: Fix TTL on VL server and address lists
    afs: Implement VL server rotation
    afs: Improve FS server rotation error handling
    ...

    Linus Torvalds
     
  • This reverts a series committed earlier due to the null pointer
    exception bug reported in [1]. It seems there are edge case
    interactions that I did not consider, and it will take some time to
    understand what causes the adverse interactions.

    The original series can be found in [2] with a follow up series in [3].

    [1] https://www.spinics.net/lists/cgroups/msg20719.html
    [2] https://lore.kernel.org/lkml/20180911184137.35897-1-dennisszhou@gmail.com/
    [3] https://lore.kernel.org/lkml/20181020185612.51587-1-dennis@kernel.org/

    This reverts the following commits:
    d459d853c2ed, b2c3fa546705, 101246ec02b5, b3b9f24f5fcc, e2b0989954ae,
    f0fcb3ec89f3, c839e7a03f92, bdc2491708c4, 74b7c02a9bc1, 5bf9a1f3b4ef,
    a7b39b4e961c, 07b05bcc3213, 49f4c2dc2b50, 27e6fa996c53

    Signed-off-by: Dennis Zhou
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

31 Oct, 2018

2 commits

  • Move remaining definitions and declarations from include/linux/bootmem.h
    into include/linux/memblock.h and remove the redundant header.

    The includes were replaced with the semantic patch below, followed by
    semi-automated removal of duplicated '#include <linux/memblock.h>':

    @@
    @@
    - #include <linux/bootmem.h>
    + #include <linux/memblock.h>

    [sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h]
    Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au
    [sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h]
    Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au
    [sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal]
    Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au
    Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Signed-off-by: Stephen Rothwell
    Acked-by: Michal Hocko
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Kroah-Hartman
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: "James E.J. Bottomley"
    Cc: Jonas Bonn
    Cc: Jonathan Corbet
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Serge Semin
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • rq_qos_exit() removes the current q->rq_qos, and this has to be done
    after the queue is frozen; otherwise the IO queue path may never be
    woken up, causing an IO hang.

    So fix this issue by moving rq_qos_exit() to after the queue is
    frozen.

    Cc: Josef Bacik
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

29 Oct, 2018

1 commit

  • There are two cases when handling a DISCARD merge.
    If max_discard_segments == 1, the bios/requests need to be contiguous
    to merge. If max_discard_segments > 1, every bio is taken as a range
    and different ranges need not be contiguous.

    But now attempt_merge screws this up: it always considers contiguity
    for DISCARD in the case max_discard_segments > 1, and cannot merge
    contiguous DISCARDs in the case max_discard_segments == 1, because
    rq_attempt_discard_merge always returns false in that case.
    This patch fixes both of the two cases above.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Jianchao Wang
    Signed-off-by: Jens Axboe

    Jianchao Wang
     

27 Oct, 2018

2 commits

  • Merge updates from Andrew Morton:

    - a few misc things

    - ocfs2 updates

    - most of MM

    * emailed patches from Andrew Morton : (132 commits)
    hugetlbfs: dirty pages as they are added to pagecache
    mm: export add_swap_extent()
    mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS
    tools/testing/selftests/vm/map_fixed_noreplace.c: add test for MAP_FIXED_NOREPLACE
    mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page()
    mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()
    mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition
    mm/kasan/quarantine.c: make quarantine_lock a raw_spinlock_t
    mm/gup: cache dev_pagemap while pinning pages
    Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved"
    mm: return zero_resv_unavail optimization
    mm: zero remaining unavailable struct pages
    tools/testing/selftests/vm/gup_benchmark.c: add MAP_HUGETLB option
    tools/testing/selftests/vm/gup_benchmark.c: add MAP_SHARED option
    tools/testing/selftests/vm/gup_benchmark.c: allow user specified file
    tools/testing/selftests/vm/gup_benchmark.c: fix 'write' flag usage
    mm/gup_benchmark.c: add additional pinning methods
    mm/gup_benchmark.c: time put_page()
    mm: don't raise MEMCG_OOM event due to failed high-order allocation
    mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock
    ...

    Linus Torvalds
     
  • There are several definitions of those functions/macros in places that
    mess with fixed-point load averages. Provide an official version.

    [akpm@linux-foundation.org: fix missed conversion in block/blk-iolatency.c]
    Link: http://lkml.kernel.org/r/20180828172258.3185-5-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Peter Zijlstra (Intel)
    Tested-by: Suren Baghdasaryan
    Tested-by: Daniel Drake
    Cc: Christopher Lameter
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Mike Galbraith
    Cc: Peter Enderborg
    Cc: Randy Dunlap
    Cc: Shakeel Butt
    Cc: Tejun Heo
    Cc: Vinayak Menon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

26 Oct, 2018

10 commits

  • Since commit 2d29c9f89fcd ("block, bfq: improve asymmetric scenarios
    detection"), a scenario is defined asymmetric when one of the
    following conditions holds:
    - active bfq_queues have different weights
    - one or more group of entities (bfq_queue or other groups of entities)
    are active
    bfq grants fairness and low latency also in such asymmetric scenarios,
    by plugging the dispatching of I/O if the bfq_queue in service happens
    to be temporarily idle. This plugging may lower throughput, so it is
    important to do it only when strictly needed.

    By mistake, in commit '2d29c9f89fcd' ("block, bfq: improve asymmetric
    scenarios detection") the num_active_groups counter was firstly
    incremented and subsequently decremented at any entity (group or
    bfq_queue) weight change.

    This is useless, because only transitions from active to inactive and
    vice versa matter for that counter. Unfortunately this is also
    incorrect in the following case: the entity at issue is a bfq_queue
    and it is under weight raising. In fact in this case there is a
    spurious increment of the num_active_groups counter.

    This spurious increment may cause scenarios to be wrongly detected as
    asymmetric, thus causing useless plugging and loss of throughput.

    This commit fixes this issue by simply removing the above useless and
    wrong increments and decrements.

    Fixes: 2d29c9f89fcd ("block, bfq: improve asymmetric scenarios detection")
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Federico Motta
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Federico Motta
     
  • trace_block_getrq() is meant to indicate that a request struct has
    been allocated for the queue, so call it in the right place.

    Reviewed-by: Jianchao Wang
    Signed-off-by: Xiaoguang Wang
    Signed-off-by: Jens Axboe

    Xiaoguang Wang
     
  • Drivers exposing zoned block devices have to initialize and maintain
    correctness (i.e. revalidate) of the device zone bitmaps attached to
    the device request queue (seq_zones_bitmap and seq_zones_wlock).

    To simplify coding this, introduce a generic helper function
    blk_revalidate_disk_zones() suitable for most (and likely all) cases.
    This new function always updates the seq_zones_bitmap and seq_zones_wlock
    bitmaps as well as the queue nr_zones field when called for a disk
    using a request-based queue. For a disk using a BIO-based queue, only
    the number of zones is updated since these queues do not have
    schedulers and so do not need the zone bitmaps.

    With this change, the zone bitmap initialization code in sd_zbc.c can be
    replaced with a call to this function in sd_zbc_read_zones(), which is
    called from the disk revalidate block operation method.

    A call to blk_revalidate_disk_zones() is also added to the null_blk
    driver for devices created with the zoned mode enabled.

    Finally, to ensure that zoned devices created with dm-linear or
    dm-flakey expose the correct number of zones through sysfs, a call to
    blk_revalidate_disk_zones() is added to dm_table_set_restrictions().

    The zone bitmaps allocated and initialized with
    blk_revalidate_disk_zones() are freed automatically from
    __blk_release_queue() using the block internal function
    blk_queue_free_zone_bitmaps().

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Mike Snitzer
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • Dispatching a report zones command through the request queue is a major
    pain due to the command reply payload rewriting necessary. Given that
    blkdev_report_zones() is executing everything synchronously, implement
    report zones as a block device file operation instead, allowing major
    simplification of the code in many places.

    sd, null-blk, dm-linear and dm-flakey being the only block device
    drivers supporting exposing zoned block devices, these drivers are
    modified to provide the device side implementation of the
    report_zones() block device file operation.

    For device mappers, a new report_zones() target type operation is
    defined so that the upper block layer calls blkdev_report_zones() can
    be propagated down to the underlying devices of the dm targets.
    Implementation for this new operation is added to the dm-linear and
    dm-flakey targets.

    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig
    [Damien]
    * Changed method block_device argument to gendisk
    * Various bug fixes and improvements
    * Added support for null_blk, dm-linear and dm-flakey.
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Mike Snitzer
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Expose through sysfs the nr_zones field of struct request_queue.
    Exposing this value helps in debugging disk issues as well as
    facilitating scripts based use of the disk (e.g. blktests).

    For zoned block devices, the nr_zones field indicates the total number
    of zones of the device calculated using the known disk capacity and
    zone size. This number of zones is always 0 for regular block devices.

    Since nr_zones is defined conditionally with CONFIG_BLK_DEV_ZONED,
    introduce the blk_queue_nr_zones() function to return the correct
    value for any device, regardless of whether CONFIG_BLK_DEV_ZONED is
    set.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • There is no need to synchronously execute all REQ_OP_ZONE_RESET BIOs
    necessary to reset a range of zones. Similarly to what is done for
    discard BIOs in blk-lib.c, all zone reset BIOs can be chained and
    executed asynchronously and a synchronous call done only for the last
    BIO of the chain.

    Modify blkdev_reset_zones() to operate similarly to
    blkdev_issue_discard(), using the next_bio() helper for chaining BIOs. To
    avoid duplicating that function in blk-zoned.c, rename next_bio() to
    blk_next_bio() and declare it as a block-internal function in blk.h.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
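The submission pattern can be sketched with counters standing in for real bio submission. submit_async(), submit_wait() and reset_zones_sketch() are invented names; in the kernel, the chaining is done with bio chaining plus submit_bio(), and only the final BIO is waited on.

```c
#include <assert.h>

static int async_submits, sync_waits;

static void submit_async(int zone) { (void)zone; async_submits++; }
static void submit_wait(int zone)  { (void)zone; sync_waits++; }

/* Reset zones [first, first + nr): every reset but the last is fired
 * asynchronously as part of the chain; the caller blocks only on the
 * tail, as blkdev_issue_discard() does for discard BIOs. */
static void reset_zones_sketch(int first, int nr)
{
	for (int z = first; z < first + nr; z++) {
		if (z == first + nr - 1)
			submit_wait(z);		/* only the tail blocks */
		else
			submit_async(z);	/* completes in the background */
	}
}
```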
  • Get a zoned block device total number of zones. The device can be a
    partition of the whole device. The number of zones is always 0 for
    regular block devices.

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • Get a zoned block device zone size in number of 512 B sectors.
    The zone size is always 0 for regular block devices.

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
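From user space, the two values the commits above expose are read with the BLKGETNRZONES and BLKGETZONESZ ioctls. A sketch, guarded so it only compiles where the Linux UAPI headers define them; the zone_bytes() helper is an invented convenience:

```c
#include <stdio.h>
#ifdef __linux__
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/blkzoned.h>
#endif

/* Invented convenience: BLKGETZONESZ reports 512B sectors. */
static unsigned long long zone_bytes(unsigned int zone_sectors)
{
	return (unsigned long long)zone_sectors * 512;
}

/* Query a block device; both ioctls read back 0 on a regular disk. */
static int print_zone_info(const char *dev)
{
#if defined(__linux__) && defined(BLKGETNRZONES) && defined(BLKGETZONESZ)
	unsigned int nr_zones = 0, zone_sectors = 0;
	int fd = open(dev, O_RDONLY);

	if (fd < 0)
		return -1;
	if (ioctl(fd, BLKGETNRZONES, &nr_zones) == 0 &&
	    ioctl(fd, BLKGETZONESZ, &zone_sectors) == 0)
		printf("%s: %u zones, %llu bytes each\n",
		       dev, nr_zones, zone_bytes(zone_sectors));
	close(fd);
	return 0;
#else
	(void)dev;
	return -1;	/* zoned ioctls unavailable on this platform */
#endif
}
```

For example, `print_zone_info("/dev/sdb")` would print the zone geometry of a host-managed SMR disk, given sufficient privileges.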
  • There is no point in allocating more zone descriptors than the number of
    zones a block device has when doing a zone report. Avoid this in
    blkdev_report_zones_ioctl() by limiting the number of zone descriptors
    allocated internally to process the user request.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • Introduce the blkdev_nr_zones() helper function to get the total
    number of zones of a zoned block device. This number is always 0 for a
    regular block device (q->limits.zoned == BLK_ZONED_NONE case).

    Replace hard-coded number of zones calculation in dmz_get_zoned_device()
    with a call to this helper.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
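The arithmetic such a helper wraps can be sketched as a round-up division of capacity by zone size, both in 512B sectors, so a smaller trailing zone is still counted. nr_zones_sketch() is an illustrative stand-in, not the kernel function:

```c
#include <assert.h>

/* Zone count = capacity / zone size, rounded up; both in 512B sectors.
 * A zone size of 0 marks a regular (non-zoned) device. */
static unsigned int nr_zones_sketch(unsigned long long capacity_sectors,
				    unsigned int zone_sectors)
{
	if (zone_sectors == 0)
		return 0;	/* regular device: no zones */
	return (unsigned int)((capacity_sectors + zone_sectors - 1) /
			      zone_sectors);
}
```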

24 Oct, 2018

1 commit

  • Use accessor functions to access an iterator's type and direction. This
    allows for the possibility of using some other method of determining the
    type of iterator than if-chains with bitwise-AND conditions.

    Signed-off-by: David Howells

    David Howells
     
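The accessor pattern can be sketched stand-alone: instead of open-coded `type & FLAG` checks scattered through callers, small helpers report the iterator's kind and direction, so the underlying encoding can change later. The flag values and names below are invented for illustration, not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>

#define ITER_DIR_WRITE	0x1	/* direction bit: data flows into the iter */
#define ITER_KIND_MASK	0x6	/* which backing store the iter walks */
#define ITER_KIND_IOVEC	0x0
#define ITER_KIND_BVEC	0x2

struct iov_iter_sketch { unsigned int type; };

static inline unsigned int iov_iter_type_sketch(const struct iov_iter_sketch *i)
{
	return i->type & ITER_KIND_MASK;	/* the only place that masks */
}

static inline bool iov_iter_is_bvec_sketch(const struct iov_iter_sketch *i)
{
	return iov_iter_type_sketch(i) == ITER_KIND_BVEC;
}

static inline bool iov_iter_is_write_sketch(const struct iov_iter_sketch *i)
{
	return (i->type & ITER_DIR_WRITE) != 0;
}
```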

23 Oct, 2018

1 commit

  • Pull block layer updates from Jens Axboe:
    "This is the main pull request for block changes for 4.20. This
    contains:

    - Series enabling runtime PM for blk-mq (Bart).

    - Two pull requests from Christoph for NVMe, with items such as;
    - Better AEN tracking
    - Multipath improvements
    - RDMA fixes
    - Rework of FC for target removal
    - Fixes for issues identified by static checkers
    - Fabric cleanups, as prep for TCP transport
    - Various cleanups and bug fixes

    - Block merging cleanups (Christoph)

    - Conversion of drivers to generic DMA mapping API (Christoph)

    - Series fixing ref count issues with blkcg (Dennis)

    - Series improving BFQ heuristics (Paolo, et al)

    - Series improving heuristics for the Kyber IO scheduler (Omar)

    - Removal of dangerous bio_rewind_iter() API (Ming)

    - Apply single queue IPI redirection logic to blk-mq (Ming)

    - Set of fixes and improvements for bcache (Coly et al)

    - Series closing a hotplug race with sysfs group attributes (Hannes)

    - Set of patches for lightnvm:
    - pblk trace support (Hans)
    - SPDX license header update (Javier)
    - Tons of refactoring patches to cleanly abstract the 1.2 and 2.0
    specs behind a common core interface. (Javier, Matias)
    - Enable pblk to use a common interface to retrieve chunk metadata
    (Matias)
    - Bug fixes (Various)

    - Set of fixes and updates to the blk IO latency target (Josef)

    - blk-mq queue number updates fixes (Jianchao)

    - Convert a bunch of drivers from the old legacy IO interface to
    blk-mq. This will conclude with the removal of the legacy IO
    interface itself in 4.21, with the rest of the drivers (me, Omar)

    - Removal of the DAC960 driver. The SCSI tree will introduce two
    replacement drivers for this (Hannes)"

    * tag 'for-4.20/block-20181021' of git://git.kernel.dk/linux-block: (204 commits)
    block: setup bounce bio_sets properly
    blkcg: reassociate bios when make_request() is called recursively
    blkcg: fix edge case for blk_get_rl() under memory pressure
    nvme-fabrics: move controller options matching to fabrics
    nvme-rdma: always have a valid trsvcid
    mtip32xx: fully switch to the generic DMA API
    rsxx: switch to the generic DMA API
    umem: switch to the generic DMA API
    sx8: switch to the generic DMA API
    sx8: remove dead IF_64BIT_DMA_IS_POSSIBLE code
    skd: switch to the generic DMA API
    ubd: remove use of blk_rq_map_sg
    nvme-pci: remove duplicate check
    drivers/block: Remove DAC960 driver
    nvme-pci: fix hot removal during error handling
    nvmet-fcloop: suppress a compiler warning
    nvme-core: make implicit seed truncation explicit
    nvmet-fc: fix kernel-doc headers
    nvme-fc: rework the request initialization code
    nvme-fc: introduce struct nvme_fcp_op_w_sgl
    ...

    Linus Torvalds
     

22 Oct, 2018

1 commit

  • We're only setting up the bounce bio sets if we happen
    to need bouncing for regular HIGHMEM, not if we only need
    it for ISA devices.

    Protect the ISA bounce setup with a mutex, since it's
    being invoked from driver init functions and can thus be
    called in parallel.

    Cc: stable@vger.kernel.org
    Reported-by: Ondrej Zary
    Tested-by: Ondrej Zary
    Signed-off-by: Jens Axboe

    Jens Axboe
     
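The locking shape of the fix can be sketched in user space, with a pthread mutex standing in for the kernel mutex and invented names throughout; the real code initializes bio sets and mempools where the counter below increments.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t bounce_lock = PTHREAD_MUTEX_INITIALIZER;
static bool isa_bounce_ready;
static int setup_calls;	/* counts how often real setup actually ran */

/* Called from driver init paths that may run in parallel: the mutex
 * plus the "already done" flag guarantee the setup runs exactly once. */
static void isa_init_bounce(void)
{
	pthread_mutex_lock(&bounce_lock);
	if (!isa_bounce_ready) {	/* first caller does the work */
		setup_calls++;		/* stands in for bioset setup */
		isa_bounce_ready = true;
	}
	pthread_mutex_unlock(&bounce_lock);	/* later callers return fast */
}
```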

21 Oct, 2018

1 commit

  • When submitting a bio, multiple recursive calls to make_request() may
    occur. This causes the initial association done in blkcg_bio_issue_check()
    to be incorrect, referencing the prior request_queue. Introduce a helper
    to reassociate the bio when make_request() is called recursively.

    Fixes: a7b39b4e961c ("blkcg: always associate a bio with a blkg")
    Reported-by: Valdis Kletnieks
    Signed-off-by: Dennis Zhou
    Tested-by: Valdis Kletnieks
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

18 Oct, 2018

1 commit

  • blk_queue_split() already respects this limit via bio splitting, so
    there is no need to do it in blkdev_issue_discard(); this also aligns
    discard with the normal bio submission path (bio_add_page() &
    submit_bio()).

    More importantly, this patch fixes an issue introduced in a22c4d7e34402cc
    ("block: re-add discard_granularity and alignment checks"), in which
    a zero-sized discard bio may be generated when the alignment is zero.

    Fixes: a22c4d7e34402ccdf3 ("block: re-add discard_granularity and alignment checks")
    Cc: stable@vger.kernel.org
    Cc: Ming Lin
    Cc: Mike Snitzer
    Cc: Christoph Hellwig
    Cc: Xiao Ni
    Tested-by: Mariusz Dabrowski
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
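A simplified model of the arithmetic involved: trimming each discard chunk to a granularity boundary can collapse the chunk to zero, for instance when the granularity exceeds the per-bio maximum. The helpers and numbers below are illustrative only; the commit's actual fix is to drop the trimming from blkdev_issue_discard() entirely and rely on blk_queue_split().

```c
#include <assert.h>

/* Old-style chunking (simplified): trim each chunk so it ends on a
 * granularity boundary. The subtraction can remove the entire chunk,
 * yielding a zero-sized discard bio. */
static unsigned long long old_chunk(unsigned long long sector,
				    unsigned long long nr_sects,
				    unsigned long long granularity,
				    unsigned long long max_sects)
{
	unsigned long long req = nr_sects < max_sects ? nr_sects : max_sects;

	if (req < nr_sects && granularity)
		req -= (sector + req) % granularity;	/* can hit zero */
	return req;
}

/* New-style: only cap at the maximum; granularity and alignment are
 * left to bio splitting at submission time. */
static unsigned long long new_chunk(unsigned long long nr_sects,
				    unsigned long long max_sects)
{
	return nr_sects < max_sects ? nr_sects : max_sects;
}
```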

16 Oct, 2018

1 commit


15 Oct, 2018

1 commit