Eric Lee / smarc-fsl-linux-kernel

13 Dec, 2019

2 commits

f41afbe8d bdev: Refresh bdev size for disks without partitioning ... Browse Code »

commit cba22d86e0a10b7070d2e6a7379dbea51aa0883c upstream.

Currently, block device size in not updated on second and further open
for block devices where partition scan is disabled. This is particularly
annoying for example for DVD drives as that means block device size does
not get updated once the media is inserted into a drive if the device is
already open when inserting the media. This is actually always the case
for example when pktcdvd is in use.

Fix the problem by revalidating block device size on every open even for
devices with partition scan disabled.

Signed-off-by: Jan Kara
Signed-off-by: Jens Axboe
Cc: Laura Abbott
Signed-off-by: Greg Kroah-Hartman

Jan Kara
2019-12-13 15:43:19 +0800
f1f3ba27c bdev: Factor out bdev revalidation into a common helper ... Browse Code »

commit 731dc4868311ee097757b8746eaa1b4f8b2b4f1c upstream.

Factor out code handling revalidation of bdev on disk change into a
common helper.

Signed-off-by: Jan Kara
Signed-off-by: Jens Axboe
Cc: Laura Abbott
Signed-off-by: Greg Kroah-Hartman

Jan Kara
2019-12-13 15:43:18 +0800

19 Sep, 2019

1 commit

e6bc9de71 Merge tag 'vfs-5.4-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux ... Browse Code »

Pull swap access updates from Darrick Wong:
"Prohibit writing to active swap files and swap partitions.

There's no non-malicious use case for allowing userspace to scribble
on storage that the kernel thinks it owns"

* tag 'vfs-5.4-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
vfs: don't allow writes to swap files
mm: set S_SWAPFILE on blockdev swap devices

Linus Torvalds
2019-09-19 08:35:20 +0800

20 Aug, 2019

1 commit

dc617f29d vfs: don't allow writes to swap files ... Browse Code »

Don't let userspace write to an active swap file because the kernel
effectively has a long term lease on the storage and things could get
seriously corrupted if we let this happen.

Signed-off-by: Darrick J. Wong
Reviewed-by: Christoph Hellwig

Darrick J. Wong
2019-08-20 22:55:16 +0800

16 Aug, 2019

1 commit

7b6620d7d block: remove REQ_NOWAIT_INLINE ... Browse Code »

We had a few issues with this code, and there's still a problem around
how we deal with error handling for chained/split bios. For now, just
revert the code and we'll try again with a thoroug solution. This
reverts commits:

e15c2ffa1091 ("block: fix O_DIRECT error handling for bio fragments")
0eb6ddfb865c ("block: Fix __blkdev_direct_IO() for bio fragments")
6a43074e2f46 ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
893a1c97205a ("blk-mq: allow REQ_NOWAIT to return an error inline")

Signed-off-by: Jens Axboe

Jens Axboe
2019-08-16 01:09:16 +0800

08 Aug, 2019

2 commits

e91455bad bdev: Fixup error handling in blkdev_get() ... Browse Code »

Commit 89e524c04fa9 ("loop: Fix mount(2) failure due to race with
LOOP_SET_FD") converted blkdev_get() to use the new helpers for
finishing claiming of a block device. However the conversion botched the
error handling in blkdev_get() and thus the bdev has been marked as held
even in case __blkdev_get() returned error. This led to occasional
warnings with block/001 test from blktests like:

kernel: WARNING: CPU: 5 PID: 907 at fs/block_dev.c:1899 __blkdev_put+0x396/0x3a0

Correct the error handling.

CC: stable@vger.kernel.org
Fixes: 89e524c04fa9 ("loop: Fix mount(2) failure due to race with LOOP_SET_FD")
Signed-off-by: Jan Kara
Signed-off-by: Jens Axboe

Jan Kara
2019-08-08 21:37:03 +0800
e15c2ffa1 block: fix O_DIRECT error handling for bio fragments ... Browse Code »

0eb6ddfb865c tried to fix this up, but introduced a use-after-free
of dio. Additionally, we still had an issue with error handling,
as reported by Darrick:

"I noticed a regression in xfs/747 (an unreleased xfstest for the
xfs_scrub media scanning feature) on 5.3-rc3. I'll condense that down
to a simpler reproducer:

error-test: 0 209 linear 8:48 0
error-test: 209 1 error
error-test: 210 6446894 linear 8:48 210

Basically we have a ~3G /dev/sdd and we set up device mapper to fail IO
for sector 209 and to pass the io to the scsi device everywhere else.

On 5.3-rc3, performing a directio pread of this range with a < 1M buffer
(in other words, a request for fewer than MAX_BIO_PAGES bytes) yields
EIO like you'd expect:

pread64(3, 0x7f880e1c7000, 1048576, 0) = -1 EIO (Input/output error)
pread: Input/output error
+++ exited with 0 +++

But doing it with a larger buffer succeeds(!):

pread64(3, "XFSB\0\0\20\0\0\0\0\0\0\fL\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1146880, 0) = 1146880
read 1146880/1146880 bytes at offset 0
1 MiB, 1 ops; 0.0009 sec (1.124 GiB/sec and 1052.6316 ops/sec)
+++ exited with 0 +++

(Note that the part of the buffer corresponding to the dm-error area is
uninitialized)

On 5.3-rc2, both commands would fail with EIO like you'd expect. The
only change between rc2 and rc3 is commit 0eb6ddfb865c ("block: Fix
__blkdev_direct_IO() for bio fragments").

AFAICT we end up in __blkdev_direct_IO with a 1120K buffer, which gets
split into two bios: one for the first BIO_MAX_PAGES worth of data (1MB)
and a second one for the 96k after that."

Fix this by noting that it's always safe to dereference dio if we get
BLK_QC_T_EAGAIN returned, as end_io hasn't been run for that case. So
we can safely increment the dio size before calling submit_bio(), and
then decrement it on failure (not that it really matters, as the bio
and dio are going away).

For error handling, return to the original method of just using 'ret'
for tracking the error, and the size tracking in dio->size.

Fixes: 0eb6ddfb865c ("block: Fix __blkdev_direct_IO() for bio fragments")
Fixes: 6a43074e2f46 ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
Reported-by: Darrick J. Wong
Signed-off-by: Jens Axboe

Jens Axboe
2019-08-08 02:19:43 +0800

02 Aug, 2019

1 commit

0eb6ddfb8 block: Fix __blkdev_direct_IO() for bio fragments ... Browse Code »

The recent fix to properly handle IOCB_NOWAIT for async O_DIRECT IO
(patch 6a43074e2f46) introduced two problems with BIO fragment handling
for direct IOs:
1) The dio size processed is calculated by incrementing the ret variable
by the size of the bio fragment issued for the dio. However, this size
is obtained directly from bio->bi_iter.bi_size AFTER the bio submission
which may result in referencing the bi_size value after the bio
completed, resulting in an incorrect value use.
2) The ret variable is not incremented by the size of the last bio
fragment issued for the bio, leading to an invalid IO size being
returned to the user.

Fix both problem by using dio->size (which is incremented before the bio
submission) to update the value of ret after bio submissions, including
for the last bio fragment issued.

Fixes: 6a43074e2f46 ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
Reported-by: Masato Suzuki
Signed-off-by: Damien Le Moal
Signed-off-by: Jens Axboe

Damien Le Moal
2019-08-02 03:51:18 +0800

31 Jul, 2019

1 commit

89e524c04 loop: Fix mount(2) failure due to race with LOOP_SET_FD ... Browse Code »

Commit 33ec3e53e7b1 ("loop: Don't change loop device under exclusive
opener") made LOOP_SET_FD ioctl acquire exclusive block device reference
while it updates loop device binding. However this can make perfectly
valid mount(2) fail with EBUSY due to racing LOOP_SET_FD holding
temporarily the exclusive bdev reference in cases like this:

for i in {a..z}{a..z}; do
dd if=/dev/zero of=$i.image bs=1k count=0 seek=1024
mkfs.ext2 $i.image
mkdir mnt$i
done

echo "Run"
for i in {a..z}{a..z}; do
mount -o loop -t ext2 $i.image mnt$i &
done

Fix the problem by not getting full exclusive bdev reference in
LOOP_SET_FD but instead just mark the bdev as being claimed while we
update the binding information. This just blocks new exclusive openers
instead of failing them with EBUSY thus fixing the problem.

Fixes: 33ec3e53e7b1 ("loop: Don't change loop device under exclusive opener")
Cc: stable@vger.kernel.org
Tested-by: Kai-Heng Feng
Signed-off-by: Jan Kara
Signed-off-by: Jens Axboe

Jan Kara
2019-07-31 03:16:57 +0800

27 Jul, 2019

1 commit

044128196 Merge tag 'for-linus-20190726' of git://git.kernel.dk/linux-block ... Browse Code »

Pull block fixes from Jens Axboe:

- Several io_uring fixes/improvements:
- Blocking fix for O_DIRECT (me)
- Latter page slowness for registered buffers (me)
- Fix poll hang under certain conditions (me)
- Defer sequence check fix for wrapped rings (Zhengyuan)
- Mismatch in async inc/dec accounting (Zhengyuan)
- Memory ordering issue that could cause stall (Zhengyuan)
- Track sequential defer in bytes, not pages (Zhengyuan)

- NVMe pull request from Christoph

- Set of hang fixes for wbt (Josef)

- Redundant error message kill for libahci (Ding)

- Remove unused blk_mq_sched_started_request() and related ops (Marcos)

- drbd dynamic alloc shash descriptor to reduce stack use (Arnd)

- blkcg ->pd_stat() non-debug print (Tejun)

- bcache memory leak fix (Wei)

- Comment fix (Akinobu)

- BFQ perf regression fix (Paolo)

* tag 'for-linus-20190726' of git://git.kernel.dk/linux-block: (24 commits)
io_uring: ensure ->list is initialized for poll commands
Revert "nvme-pci: don't create a read hctx mapping without read queues"
nvme: fix multipath crash when ANA is deactivated
nvme: fix memory leak caused by incorrect subsystem free
nvme: ignore subnqn for ADATA SX6000LNP
drbd: dynamically allocate shash descriptor
block: blk-mq: Remove blk_mq_sched_started_request and started_request
bcache: fix possible memory leak in bch_cached_dev_run()
io_uring: track io length in async_list based on bytes
io_uring: don't use iov_iter_advance() for fixed buffers
block: properly handle IOCB_NOWAIT for async O_DIRECT IO
blk-mq: allow REQ_NOWAIT to return an error inline
io_uring: add a memory barrier before atomic_read
rq-qos: use a mb for got_token
rq-qos: set ourself TASK_UNINTERRUPTIBLE after we schedule
rq-qos: don't reset has_sleepers on spurious wakeups
rq-qos: fix missed wake-ups in rq_qos_throttle
wait: add wq_has_single_sleeper helper
block, bfq: check also in-flight I/O in dispatch plugging
block: fix sysfs module parameters directory path in comment
...

Linus Torvalds
2019-07-27 01:32:12 +0800

22 Jul, 2019

1 commit

6a43074e2 block: properly handle IOCB_NOWAIT for async O_DIRECT IO ... Browse Code »

A caller is supposed to pass in REQ_NOWAIT if we can't block for any
given operation, but O_DIRECT for block devices just ignore this. Hence
we'll block for various resource shortages on the block layer side,
like having to wait for requests.

Use the new REQ_NOWAIT_INLINE to ask for this error to be returned
inline, so we can handle it appropriately and return -EAGAIN to the
caller.

Signed-off-by: Jens Axboe

Jens Axboe
2019-07-22 11:46:29 +0800

20 Jul, 2019

1 commit

933a90bf4 Merge branch 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull vfs mount updates from Al Viro:
"The first part of mount updates.

Convert filesystems to use the new mount API"

* 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
mnt_init(): call shmem_init() unconditionally
constify ksys_mount() string arguments
don't bother with registering rootfs
init_rootfs(): don't bother with init_ramfs_fs()
vfs: Convert smackfs to use the new mount API
vfs: Convert selinuxfs to use the new mount API
vfs: Convert securityfs to use the new mount API
vfs: Convert apparmorfs to use the new mount API
vfs: Convert openpromfs to use the new mount API
vfs: Convert xenfs to use the new mount API
vfs: Convert gadgetfs to use the new mount API
vfs: Convert oprofilefs to use the new mount API
vfs: Convert ibmasmfs to use the new mount API
vfs: Convert qib_fs/ipathfs to use the new mount API
vfs: Convert efivarfs to use the new mount API
vfs: Convert configfs to use the new mount API
vfs: Convert binfmt_misc to use the new mount API
convenience helper: get_tree_single()
convenience helper get_tree_nodev()
vfs: Kill sget_userns()
...

Linus Torvalds
2019-07-20 01:42:02 +0800

29 Jun, 2019

2 commits

9fec4a218 block_dev: use bio_release_pages in bio_unmap_user ... Browse Code »

Use bio_release_pages instead of duplicating it.

Reviewed-by: Chaitanya Kulkarni
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2019-06-29 23:47:31 +0800
57dfe3ce1 block_dev: use bio_release_pages in blkdev_bio_end_io ... Browse Code »

Use bio_release_pages instead of duplicating it.

Reviewed-by: Minwoo Im
Reviewed-by: Chaitanya Kulkarni
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2019-06-29 23:47:31 +0800

27 May, 2019

1 commit

31cb1d64d block: Don't revalidate bdev of hidden gendisk ... Browse Code »

When hidden gendisk is revalidated, there's no point in revalidating
associated block device as there's none. We would thus just create new
bdev inode, report "detected capacity change from 0 to XXX" message and
evict the bdev inode again. Avoid this pointless dance and confusing
message in the kernel log.

Reviewed-by: Hannes Reinecke
Reviewed-by: Christoph Hellwig
Signed-off-by: Jan Kara
Signed-off-by: Jens Axboe

Jan Kara
2019-05-27 21:35:02 +0800

26 May, 2019

2 commits

9030d16eb vfs: Convert bdev to use the new mount API ... Browse Code »

Convert the bdev filesystem to the new internal mount API as the old
one will be obsoleted and removed. This allows greater flexibility in
communication of mount parameters between userspace, the VFS and the
filesystem.

See Documentation/filesystems/mount_api.txt for more information.

Signed-off-by: David Howells
cc: Jens Axboe
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Al Viro

David Howells
2019-05-26 06:00:05 +0800
1f58bb18f mount_pseudo(): drop 'name' argument, switch to d_make_root() ... Browse Code »

Once upon a time we used to set ->d_name of e.g. pipefs root
so that d_path() on pipes would work. These days it's
completely pointless - dentries of pipes are not even connected
to pipefs root. However, mount_pseudo() had set the root
dentry name (passed as the second argument) and callers
kept inventing names to pass to it. Including those that
didn't *have* any non-root dentries to start with...

All of that had been pointless for about 8 years now; it's
time to get rid of that cargo-culting...

Signed-off-by: Al Viro

Al Viro
2019-05-26 05:59:24 +0800

21 May, 2019

1 commit

457c89965 treewide: Add SPDX license identifier for missed files ... Browse Code »

Add SPDX license identifiers to all files which:

- Have no license information of any form

- Have EXPORT_.*_SYMBOL_GPL inside which was used in the
initial scan/conversion to ignore the file

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

GPL-2.0-only

Signed-off-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman

Thomas Gleixner
2019-05-21 16:50:45 +0800

15 May, 2019

1 commit

3813393f5 fs/block_dev.c: Remove duplicate header ... Browse Code »

linux/dax.h is included more than once.

Link: http://lkml.kernel.org/r/5c867e95.1c69fb81.4f15a.e5e4@mx.google.com
Signed-off-by: Sabyasachi Gupta
Acked-by: Souptick Joarder
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sabyasachi Gupta
2019-05-15 10:52:52 +0800

08 May, 2019

3 commits

149e703cb Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull misc vfs updates from Al Viro:
"Assorted stuff, with no common topic whatsoever..."

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
libfs: document simple_get_link()
Documentation/filesystems/Locking: fix ->get_link() prototype
Documentation/filesystems/vfs.txt: document how ->i_link works
Documentation/filesystems/vfs.txt: remove bogus "Last updated" date
fs: use timespec64 in relatime_need_update
fs/block_dev.c: remove unused include

Linus Torvalds
2019-05-08 11:50:27 +0800
67a242223 Merge tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-block ... Browse Code »

Pull block updates from Jens Axboe:
"Nothing major in this series, just fixes and improvements all over the
map. This contains:

- Series of fixes for sed-opal (David, Jonas)

- Fixes and performance tweaks for BFQ (via Paolo)

- Set of fixes for bcache (via Coly)

- Set of fixes for md (via Song)

- Enabling multi-page for passthrough requests (Ming)

- Queue release fix series (Ming)

- Device notification improvements (Martin)

- Propagate underlying device rotational status in loop (Holger)

- Removal of mtip32xx trim support, which has been disabled for years
(Christoph)

- Improvement and cleanup of nvme command handling (Christoph)

- Add block SPDX tags (Christoph)

- Cleanup/hardening of bio/bvec iteration (Christoph)

- A few NVMe pull requests (Christoph)

- Removal of CONFIG_LBDAF (Christoph)

- Various little fixes here and there"

* tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-block: (164 commits)
block: fix mismerge in bvec_advance
block: don't drain in-progress dispatch in blk_cleanup_queue()
blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release
blk-mq: always free hctx after request queue is freed
blk-mq: split blk_mq_alloc_and_init_hctx into two parts
blk-mq: free hw queue's resource in hctx's release handler
blk-mq: move cancel of requeue_work into blk_mq_release
blk-mq: grab .q_usage_counter when queuing request from plug code path
block: fix function name in comment
nvmet: protect discovery change log event list iteration
nvme: mark nvme_core_init and nvme_core_exit static
nvme: move command size checks to the core
nvme-fabrics: check more command sizes
nvme-pci: check more command sizes
nvme-pci: remove an unneeded variable initialization
nvme-pci: unquiesce admin queue on shutdown
nvme-pci: shutdown on timeout during deletion
nvme-pci: fix psdt field for single segment sgls
nvme-multipath: don't print ANA group state by default
nvme-multipath: split bios with the ns_head bio_set before submitting
...

Linus Torvalds
2019-05-08 09:14:36 +0800
168e153d5 Merge branch 'work.icache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull vfs inode freeing updates from Al Viro:
"Introduction of separate method for RCU-delayed part of
->destroy_inode() (if any).

Pretty much as posted, except that destroy_inode() stashes
->free_inode into the victim (anon-unioned with ->i_fops) before
scheduling i_callback() and the last two patches (sockfs conversion
and folding struct socket_wq into struct socket) are excluded - that
pair should go through netdev once davem reopens his tree"

* 'work.icache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (58 commits)
orangefs: make use of ->free_inode()
shmem: make use of ->free_inode()
hugetlb: make use of ->free_inode()
overlayfs: make use of ->free_inode()
jfs: switch to ->free_inode()
fuse: switch to ->free_inode()
ext4: make use of ->free_inode()
ecryptfs: make use of ->free_inode()
ceph: use ->free_inode()
btrfs: use ->free_inode()
afs: switch to use of ->free_inode()
dax: make use of ->free_inode()
ntfs: switch to ->free_inode()
securityfs: switch to ->free_inode()
apparmor: switch to ->free_inode()
rpcpipe: switch to ->free_inode()
bpf: switch to ->free_inode()
mqueue: switch to ->free_inode()
ufs: switch to ->free_inode()
coda: switch to ->free_inode()
...

Linus Torvalds
2019-05-08 01:57:05 +0800

02 May, 2019

1 commit

41149cb08 bdev: switch to ->free_inode() ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2019-05-02 10:43:24 +0800

01 May, 2019

1 commit

60a27b906 block: fix handling for BIO_NO_PAGE_REF ... Browse Code »

Commit 399254aaf489211 ("block: add BIO_NO_PAGE_REF flag") introduces
BIO_NO_PAGE_REF, and once this flag is set for one bio, all pages
in the bio won't be get/put during IO.

However, if one bio is submitted via __blkdev_direct_IO_simple(),
even though BIO_NO_PAGE_REF is set, pages still may be put.

Fixes this issue by avoiding to put pages if BIO_NO_PAGE_REF is
set.

Fixes: 399254aaf489211 ("block: add BIO_NO_PAGE_REF flag")
Reviewed-by: Christoph Hellwig
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe

Ming Lei
2019-05-01 22:38:47 +0800

30 Apr, 2019

1 commit

2b070cfe5 block: remove the i argument to bio_for_each_segment_all ... Browse Code »

We only have two callers that need the integer loop iterator, and they
can easily maintain it themselves.

Suggested-by: Matthew Wilcox
Reviewed-by: Johannes Thumshirn
Acked-by: David Sterba
Reviewed-by: Hannes Reinecke
Acked-by: Coly Li
Reviewed-by: Matthew Wilcox
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2019-04-30 23:26:13 +0800

12 Apr, 2019

1 commit

a89afe58f block: fix the return errno for direct IO ... Browse Code »

If the last bio returned is not dio->bio, the status of the bio will
not assigned to dio->bio if it is error. This will cause the whole IO
status wrong.

ksoftirqd/21-117 [021] ..s. 4017.966090: 8,0 C N 4883648 [0]
-0 [018] ..s. 4017.970888: 8,0 C WS 4924800 + 1024 [0]
-0 [018] ..s. 4017.970909: 8,0 D WS 4935424 + 1024 []
-0 [018] ..s. 4017.970924: 8,0 D WS 4936448 + 321 []
ksoftirqd/21-117 [021] ..s. 4017.995033: 8,0 C R 4883648 + 336 [65475]
ksoftirqd/21-117 [021] d.s. 4018.001988: myprobe1: (blkdev_bio_end_io+0x0/0x168) bi_status=7
ksoftirqd/21-117 [021] d.s. 4018.001992: myprobe: (aio_complete_rw+0x0/0x148) x0=0xffff802f2595ad80 res=0x12a000 res2=0x0

We always have to assign bio->bi_status to dio->bio.bi_status because we
will only check dio->bio.bi_status when we return the whole IO to
the upper layer.

Fixes: 542ff7bf18c6 ("block: new direct I/O implementation")
Cc: stable@vger.kernel.org
Cc: Christoph Hellwig
Cc: Jens Axboe
Reviewed-by: Ming Lei
Signed-off-by: Jason Yan
Signed-off-by: Jens Axboe

Jason Yan
2019-04-12 11:22:21 +0800

10 Apr, 2019

1 commit

6d46d2934 fs/block_dev.c: remove unused include ... Browse Code »

Just remove unused include from
fs/block_dev.c.

Signed-off-by: Chengguang Xu
Reviewed-by: Bart Van Assche
Signed-off-by: Al Viro

Chengguang Xu
2019-04-10 07:15:39 +0800

19 Mar, 2019

1 commit

399254aaf block: add BIO_NO_PAGE_REF flag ... Browse Code »

If bio_iov_iter_get_pages() is called on an iov_iter that is flagged
with NO_REF, then we don't need to add a page reference for the pages
that we add.

Add BIO_NO_PAGE_REF to track this in the bio, so IO completion knows
not to drop a reference to these pages.

Signed-off-by: Jens Axboe

Jens Axboe
2019-03-19 00:44:48 +0800

24 Feb, 2019

2 commits

0bbb280d7 block: add bio_set_polled() helper ... Browse Code »

For the upcoming async polled IO, we can't sleep allocating requests.
If we do, then we introduce a deadlock where the submitter already
has async polled IO in-flight, but can't wait for them to complete
since polled requests must be active found and reaped.

Utilize the helper in the blockdev DIRECT_IO code.

Reviewed-by: Hannes Reinecke
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Jens Axboe
2019-02-24 23:20:17 +0800
eae83ce10 block: wire up block device iopoll method ... Browse Code »

Just call blk_poll on the iocb cookie, we can derive the block device
from the inode trivially.

Reviewed-by: Hannes Reinecke
Reviewed-by: Johannes Thumshirn
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2019-02-24 23:20:17 +0800

15 Feb, 2019

1 commit

6dc4f100c block: allow bio_for_each_segment_all() to iterate over multi-page bvec ... Browse Code »

This patch introduces one extra iterator variable to bio_for_each_segment_all(),
then we can allow bio_for_each_segment_all() to iterate over multi-page bvec.

Given it is just one mechannical & simple change on all bio_for_each_segment_all()
users, this patch does tree-wide change in one single patch, so that we can
avoid to use a temporary helper for this conversion.

Reviewed-by: Omar Sandoval
Reviewed-by: Christoph Hellwig
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe

Ming Lei
2019-02-15 23:40:11 +0800

15 Jan, 2019

1 commit

04906b2f5 blockdev: Fix livelocks on loop device ... Browse Code »

bd_set_size() updates also block device's block size. This is somewhat
unexpected from its name and at this point, only blkdev_open() uses this
functionality. Furthermore, this can result in changing block size under
a filesystem mounted on a loop device which leads to livelocks inside
__getblk_gfp() like:

Sending NMI from CPU 0 to CPUs 1:
NMI backtrace for cpu 1
CPU: 1 PID: 10863 Comm: syz-executor0 Not tainted 4.18.0-rc5+ #151
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google
01/01/2011
RIP: 0010:__sanitizer_cov_trace_pc+0x3f/0x50 kernel/kcov.c:106
...
Call Trace:
init_page_buffers+0x3e2/0x530 fs/buffer.c:904
grow_dev_page fs/buffer.c:947 [inline]
grow_buffers fs/buffer.c:1009 [inline]
__getblk_slow fs/buffer.c:1036 [inline]
__getblk_gfp+0x906/0xb10 fs/buffer.c:1313
__bread_gfp+0x2d/0x310 fs/buffer.c:1347
sb_bread include/linux/buffer_head.h:307 [inline]
fat12_ent_bread+0x14e/0x3d0 fs/fat/fatent.c:75
fat_ent_read_block fs/fat/fatent.c:441 [inline]
fat_alloc_clusters+0x8ce/0x16e0 fs/fat/fatent.c:489
fat_add_cluster+0x7a/0x150 fs/fat/inode.c:101
__fat_get_block fs/fat/inode.c:148 [inline]
...

Trivial reproducer for the problem looks like:

truncate -s 1G /tmp/image
losetup /dev/loop0 /tmp/image
mkfs.ext4 -b 1024 /dev/loop0
mount -t ext4 /dev/loop0 /mnt
losetup -c /dev/loop0
l /mnt

Fix the problem by moving initialization of a block device block size
into a separate function and call it when needed.

Thanks to Tetsuo Handa for help with
debugging the problem.

Reported-by: syzbot+9933e4476f365f5d5a1b@syzkaller.appspotmail.com
Signed-off-by: Jan Kara
Signed-off-by: Jens Axboe

Jan Kara
2019-01-15 22:30:56 +0800

03 Jan, 2019

1 commit

1ac5cd497 block: don't use un-ordered __set_current_state(TASK_UNINTERRUPTIBLE) ... Browse Code »

This mostly reverts commit 849a370016a5 ("block: avoid ordered task
state change for polled IO"). It was wrongly claiming that the ordering
wasn't necessary. The memory barrier _is_ necessary.

If something is truly polling and not going to sleep, it's the whole
state setting that is unnecessary, not the memory barrier. Whenever you
set your state to a sleeping state, you absolutely need the memory
barrier.

Note that sometimes the memory barrier can be elsewhere. For example,
the ordering might be provided by an external lock, or by setting the
process state to sleeping before adding yourself to the wait queue list
that is used for waking up (where the wait queue lock itself will
guarantee that any wakeup will correctly see the sleeping state).

But none of those cases were true here.

NOTE! Some of the polling paths may indeed be able to drop the state
setting entirely, at which point the memory barrier also goes away.

(Also note that this doesn't revert the TASK_RUNNING cases: there is no
race between a wakeup and setting the process state to TASK_RUNNING,
since the end result doesn't depend on ordering).

Cc: Jens Axboe
Cc: Christoph Hellwig
Signed-off-by: Linus Torvalds

Linus Torvalds
2019-01-03 02:46:03 +0800

29 Dec, 2018

2 commits

f346b0bec Merge branch 'akpm' (patches from Andrew) ... Browse Code »

Merge misc updates from Andrew Morton:

- large KASAN update to use arm's "software tag-based mode"

- a few misc things

- sh updates

- ocfs2 updates

- just about all of MM

* emailed patches from Andrew Morton : (167 commits)
kernel/fork.c: mark 'stack_vm_area' with __maybe_unused
memcg, oom: notify on oom killer invocation from the charge path
mm, swap: fix swapoff with KSM pages
include/linux/gfp.h: fix typo
mm/hmm: fix memremap.h, move dev_page_fault_t callback to hmm
hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race
hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
memory_hotplug: add missing newlines to debugging output
mm: remove __hugepage_set_anon_rmap()
include/linux/vmstat.h: remove unused page state adjustment macro
mm/page_alloc.c: allow error injection
mm: migrate: drop unused argument of migrate_page_move_mapping()
blkdev: avoid migration stalls for blkdev pages
mm: migrate: provide buffer_migrate_page_norefs()
mm: migrate: move migrate_page_lock_buffers()
mm: migrate: lock buffers before migrate_page_move_mapping()
mm: migration: factor out code to compute expected number of page references
mm, page_alloc: enable pcpu_drain with zone capability
kmemleak: add config to select auto scan
mm/page_alloc.c: don't call kasan_free_pages() at deferred mem init
...

Linus Torvalds
2018-12-29 08:55:46 +0800
88dbcbb3a blkdev: avoid migration stalls for blkdev pages ... Browse Code »

Currently, block device pages don't provide a ->migratepage callback and
thus fallback_migrate_page() is used for them. This handler cannot deal
with dirty pages in async mode and also with the case a buffer head is in
the LRU buffer head cache (as it has elevated b_count). Thus such page
can block memory offlining.

Fix the problem by using buffer_migrate_page_norefs() for migrating block
device pages. That function takes care of dropping bh LRU in case
migration would fail due to elevated buffer refcount to avoid stalls and
can also migrate dirty pages without writing them.

Link: http://lkml.kernel.org/r/20181211172143.7358-6-jack@suse.cz
Signed-off-by: Jan Kara
Acked-by: Mel Gorman
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jan Kara
2018-12-29 04:11:51 +0800

30 Nov, 2018

1 commit

531724abc block: avoid extra bio reference for async O_DIRECT ... Browse Code »

The bio referencing has a trick that doesn't do any actual atomic
inc/dec on the reference count until we have to elevator to > 1. For the
async IO O_DIRECT case, we can't use the simple DIO variants, so we use
__blkdev_direct_IO(). It always grabs an extra reference to the bio
after allocation, which means we then enter the slower path of actually
having to do atomic_inc/dec on the count.

We don't need to do that for the async case, unless we end up going
multi-bio, in which case we're already doing huge amounts of IO. For the
smaller IO case (< BIO_MAX_PAGES), we can do without the extra ref.

Based on an earlier patch (and commit log) from Jens Axboe.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2018-11-30 23:28:51 +0800

26 Nov, 2018

1 commit

0a1b8b87d block: make blk_poll() take a parameter on whether to spin or not ... Browse Code »

blk_poll() has always kept spinning until it found an IO. This is
fine for SYNC polling, since we need to find one request we have
pending, but in preparation for ASYNC polling it can be beneficial
to just check if we have any entries available or not.

Existing callers are converted to pass in 'spin == true', to retain
the old behavior.

Signed-off-by: Jens Axboe

Jens Axboe
2018-11-26 23:25:53 +0800

19 Nov, 2018

1 commit

849a37001 block: avoid ordered task state change for polled IO ... Browse Code »

For the core poll helper, the task state setting don't need to imply any
atomics, as it's the current task itself that is being modified and
we're not going to sleep.

For IRQ driven, the wakeup path have the necessary barriers to not need
us using the heavy handed version of the task state setting.

Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Jens Axboe
2018-11-19 23:34:49 +0800

16 Nov, 2018

2 commits

cb700eb3f block: don't plug for aio/O_DIRECT HIPRI IO ... Browse Code »

Those will go straight to issue inside blk-mq, so don't bother
setting up a block plug for them.

Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Jens Axboe
2018-11-16 23:35:10 +0800
d34513d38 block: for async O_DIRECT, mark us as polling if asked to ... Browse Code »

Inherit the iocb IOCB_HIPRI flag, and pass on REQ_HIPRI for
those kinds of requests.

Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Jens Axboe
2018-11-16 23:34:59 +0800