Eric Lee / smarc-fsl-linux-kernel

13 May, 2017

1 commit

0fcc3ab23 Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm ... Browse Code »

Pull libnvdimm fixes from Dan Williams:
"Incremental fixes and a small feature addition on top of the main
libnvdimm 4.12 pull request:

- Geert noticed that tinyconfig was bloated by BLOCK selecting DAX.
The size regression is fixed by moving all dax helpers into the
dax-core and only specifying "select DAX" for FS_DAX and
dax-capable drivers. He also asked for clarification of the
NR_DEV_DAX config option which, on closer look, does not need to be
a config option at all. Mike also throws in a DEV_DAX_PMEM fixup
for good measure.

- Ben's attention to detail on -stable patch submissions caught a
case where the recent fixes to arch_copy_from_iter_pmem() missed a
condition where we strand dirty data in the cache. This is tagged
for -stable and will also be included in the rework of the pmem api
to a proposed {memcpy,copy_user}_flushcache() interface for 4.13.

- Vishal adds a feature that missed the initial pull due to pending
review feedback. It allows the kernel to clear media errors when
initializing a BTT (atomic sector update driver) instance on a pmem
namespace.

- Ross noticed that the dax_device + dax_operations conversion broke
__dax_zero_page_range(). The nvdimm unit tests fail to check this
path, but xfstests immediately trips over it. No excuse for missing
this before submitting the 4.12 pull request.

These all pass the nvdimm unit tests and an xfstests spot check. The
set has received a build success notification from the kbuild robot"

* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
filesystem-dax: fix broken __dax_zero_page_range() conversion
libnvdimm, btt: ensure that initializing metadata clears poison
libnvdimm: add an atomic vs process context flag to rw_bytes
x86, pmem: Fix cache flushing for iovec write < 8 bytes
device-dax: kill NR_DEV_DAX
block, dax: move "select DAX" from BLOCK to FS_DAX
device-dax: Tell kbuild DEV_DAX_PMEM depends on DEV_DAX

Linus Torvalds
2017-05-13 06:43:10 +0800

09 May, 2017

1 commit

ef5104247 block, dax: move "select DAX" from BLOCK to FS_DAX ... Browse Code »

For configurations that do not enable DAX filesystems or drivers, do not
require the DAX core to be built.

Given that the 'direct_access' method has been removed from
'block_device_operations', we can also go ahead and remove the
block-related dax helper functions from fs/block_dev.c to
drivers/dax/super.c. This keeps dax details out of the block layer and
lets the DAX core be built as a module in the FS_DAX=n case.

Filesystems need to include dax.h to call bdev_dax_supported().

Cc: linux-xfs@vger.kernel.org
Cc: Jens Axboe
Cc: "Theodore Ts'o"
Cc: Matthew Wilcox
Cc: Alexander Viro
Cc: "Darrick J. Wong"
Cc: Ross Zwisler
Reviewed-by: Jan Kara
Reported-by: Geert Uytterhoeven
Signed-off-by: Dan Williams

Dan Williams
2017-05-09 01:55:27 +0800

06 May, 2017

1 commit

53ef7d0e2 Merge tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm ... Browse Code »

Pull libnvdimm updates from Dan Williams:
"The bulk of this has been in multiple -next releases. There were a few
late breaking fixes and small features that got added in the last
couple days, but the whole set has received a build success
notification from the kbuild robot.

Change summary:

- Region media error reporting: A libnvdimm region device is the
parent to one or more namespaces. To date, media errors have been
reported via the "badblocks" attribute attached to pmem block
devices for namespaces in "raw" or "memory" mode. Given that
namespaces can be in "device-dax" or "btt-sector" mode this new
interface reports media errors generically, i.e. independent of
namespace modes or state.

This subsequently allows userspace tooling to craft "ACPI 6.1
Section 9.20.7.6 Function Index 4 - Clear Uncorrectable Error"
requests and submit them via the ioctl path for NVDIMM root bus
devices.

- Introduce 'struct dax_device' and 'struct dax_operations': Prompted
by a request from Linus and feedback from Christoph this allows for
dax capable drivers to publish their own custom dax operations.
This fixes the broken assumption that all dax operations are
related to a persistent memory device, and makes it easier for
other architectures and platforms to add customized persistent
memory support.

- 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is
available for storage appliance applications to manually trigger
memory controllers to drain write-pending buffers that would
otherwise be flushed automatically by the platform ADR
(asynchronous-DRAM-refresh) mechanism at a power loss event.
Support for "locked" DIMMs is included to prevent namespaces from
surfacing when the namespace label data area is locked. Finally,
fixes for various reported deadlocks and crashes, also tagged for
-stable.

- ACPI / nfit driver updates: General updates of the nfit driver to
add DSM command overrides, ACPI 6.1 health state flags support, DSM
payload debug available by default, and various fixes.

Acknowledgements that came after the branch was pushed:

- commmit 565851c972b5 "device-dax: fix sysfs attribute deadlock":
Tested-by: Yi Zhang

- commit 23f498448362 "libnvdimm: rework region badblocks clearing"
Tested-by: Toshi Kani "

* tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (52 commits)
libnvdimm, pfn: fix 'npfns' vs section alignment
libnvdimm: handle locked label storage areas
libnvdimm: convert NDD_ flags to use bitops, introduce NDD_LOCKED
brd: fix uninitialized use of brd->dax_dev
block, dax: use correct format string in bdev_dax_supported
device-dax: fix sysfs attribute deadlock
libnvdimm: restore "libnvdimm: band aid btt vs clear poison locking"
libnvdimm: fix nvdimm_bus_lock() vs device_lock() ordering
libnvdimm: rework region badblocks clearing
acpi, nfit: kill ACPI_NFIT_DEBUG
libnvdimm: fix clear length of nvdimm_forget_poison()
libnvdimm, pmem: fix a NULL pointer BUG in nd_pmem_notify
libnvdimm, region: sysfs trigger for nvdimm_flush()
libnvdimm: fix phys_addr for nvdimm_clear_poison
x86, dax, pmem: remove indirection around memcpy_from_pmem()
block: remove block_device_operations ->direct_access()
block, dax: convert bdev_dax_supported() to dax_direct_access()
filesystem-dax: convert to dax_direct_access()
Revert "block: use DAX for partition table reads"
ext2, ext4, xfs: retrieve dax_device for iomap operations
...

Linus Torvalds
2017-05-06 09:49:20 +0800

04 May, 2017

1 commit

a5f6a6a9c fs/block_dev: always invalidate cleancache in invalidate_bdev() ... Browse Code »

invalidate_bdev() calls cleancache_invalidate_inode() iff ->nrpages != 0
which doen't make any sense.

Make sure that invalidate_bdev() always calls cleancache_invalidate_inode()
regardless of mapping->nrpages value.

Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache")
Link: http://lkml.kernel.org/r/20170424164135.22350-3-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin
Reviewed-by: Jan Kara
Acked-by: Konrad Rzeszutek Wilk
Cc: Alexander Viro
Cc: Ross Zwisler
Cc: Jens Axboe
Cc: Johannes Weiner
Cc: Alexey Kuznetsov
Cc: Christoph Hellwig
Cc: Nikolay Borisov
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrey Ryabinin
2017-05-04 06:52:12 +0800

02 May, 2017

1 commit

67fd38973 block, dax: use correct format string in bdev_dax_supported ... Browse Code »

The new message has an incorrect format string, causing a warning in some
configurations:

fs/block_dev.c: In function 'bdev_dax_supported':
fs/block_dev.c:779:5: error: format '%d' expects argument of type 'int', but argument 2 has type 'long int' [-Werror=format=]
"error: dax access failed (%d)", len);

This changes it to use the correct %ld instead of %d.

Fixes: 2093f2e9dfec ("block, dax: convert bdev_dax_supported() to dax_direct_access()")
Signed-off-by: Arnd Bergmann
Signed-off-by: Dan Williams

Arnd Bergmann
2017-05-02 04:16:29 +0800

26 Apr, 2017

2 commits

d4b29fd78 block: remove block_device_operations ->direct_access() ... Browse Code »

Now that all the producers and consumers of dax interfaces have been
converted to using dax_operations on a dax_device, remove the block
device direct_access enabling.

Signed-off-by: Dan Williams

Dan Williams
2017-04-26 04:20:46 +0800
2093f2e9d block, dax: convert bdev_dax_supported() to dax_direct_access() ... Browse Code »

Kill of the final user of bdev_direct_access() and struct blk_dax_ctl.

Signed-off-by: Dan Williams

Dan Williams
2017-04-26 04:20:46 +0800

22 Apr, 2017

1 commit

19b7ccf86 block: get rid of blk_integrity_revalidate() ... Browse Code »

Commit 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk")
introduced blk_integrity_revalidate(), which seems to assume ownership
of the stable pages flag and unilaterally clears it if no blk_integrity
profile is registered:

if (bi->profile)
disk->queue->backing_dev_info->capabilities |=
BDI_CAP_STABLE_WRITES;
else
disk->queue->backing_dev_info->capabilities &=
~BDI_CAP_STABLE_WRITES;

It's called from revalidate_disk() and rescan_partitions(), making it
impossible to enable stable pages for drivers that support partitions
and don't use blk_integrity: while the call in revalidate_disk() can be
trivially worked around (see zram, which doesn't support partitions and
hence gets away with zram_revalidate_disk()), rescan_partitions() can
be triggered from userspace at any time. This breaks rbd, where the
ceph messenger is responsible for generating/verifying CRCs.

Since blk_integrity_{un,}register() "must" be used for (un)registering
the integrity profile with the block layer, move BDI_CAP_STABLE_WRITES
setting there. This way drivers that call blk_integrity_register() and
use integrity infrastructure won't interfere with drivers that don't
but still want stable pages.

Fixes: 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk")
Cc: "Martin K. Petersen"
Cc: Christoph Hellwig
Cc: Mike Snitzer
Cc: stable@vger.kernel.org # 4.4+, needs backporting
Tested-by: Dan Williams
Signed-off-by: Ilya Dryomov
Signed-off-by: Jens Axboe

Ilya Dryomov
2017-04-22 04:17:27 +0800

21 Apr, 2017

2 commits

b0686260f dax: introduce dax_direct_access() ... Browse Code »

Replace bdev_direct_access() with dax_direct_access() that uses
dax_device and dax_operations instead of a block_device and
block_device_operations for dax. Once all consumers of the old api have
been converted bdev_direct_access() will be deleted.

Given that block device partitioning decisions can cause dax page
alignment constraints to be violated this also introduces the
bdev_dax_pgoff() helper. It handles calculating a logical pgoff relative
to the dax_device and also checks for page alignment.

Signed-off-by: Dan Williams

Dan Williams
2017-04-21 02:57:52 +0800
d8f07aee3 block: kill bdev_dax_capable() ... Browse Code »

This is leftover dead code that has since been replaced by
bdev_dax_supported().

Signed-off-by: Dan Williams

Dan Williams
2017-04-21 02:57:52 +0800

09 Apr, 2017

2 commits

34045129b block_dev: use blkdev_issue_zerout for hole punches ... Browse Code »

This gets us support for non-discard efficient write of zeroes (e.g. NVMe)
and prepares for removing the discard_zeroes_data flag.

Also remove a pointless discard support check, which is done in
blkdev_issue_discard already.

Signed-off-by: Christoph Hellwig
Reviewed-by: Martin K. Petersen

Reviewed-by: Hannes Reinecke
Signed-off-by: Jens Axboe

Christoph Hellwig
2017-04-09 01:25:38 +0800
ee472d835 block: add a flags argument to (__)blkdev_issue_zeroout ... Browse Code »

Turn the existing discard flag into a new BLKDEV_ZERO_UNMAP flag with
similar semantics, but without referring to diѕcard.

Signed-off-by: Christoph Hellwig
Reviewed-by: Martin K. Petersen
Reviewed-by: Hannes Reinecke
Signed-off-by: Jens Axboe

Christoph Hellwig
2017-04-09 01:25:38 +0800

23 Mar, 2017

2 commits

f759741d9 block: Fix oops in locked_inode_to_wb_and_lock_list() ... Browse Code »

When block device is closed, we call inode_detach_wb() in __blkdev_put()
which sets inode->i_wb to NULL. That is contrary to expectations that
inode->i_wb stays valid once set during the whole inode's lifetime and
leads to oops in wb_get() in locked_inode_to_wb_and_lock_list() because
inode_to_wb() returned NULL.

The reason why we called inode_detach_wb() is not valid anymore though.
BDI is guaranteed to stay along until we call bdi_put() from
bdev_evict_inode() so we can postpone calling inode_detach_wb() to that
moment.

Also add a warning to catch if someone uses inode_detach_wb() in a
dangerous way.

Reported-by: Thiago Jung Bauermann
Acked-by: Tejun Heo
Signed-off-by: Jan Kara
Signed-off-by: Jens Axboe

Jan Kara
2017-03-23 10:11:33 +0800
03e262798 block: Fix bdi assignment to bdev inode when racing with disk delete ... Browse Code »

When disk->fops->open() in __blkdev_get() returns -ERESTARTSYS, we
restart the process of opening the block device. However we forget to
switch bdev->bd_bdi back to noop_backing_dev_info and as a result bdev
inode will be pointing to a stale bdi. Fix the problem by setting
bdev->bd_bdi later when __blkdev_get() is already guaranteed to succeed.

Acked-by: Tejun Heo
Reviewed-by: Hannes Reinecke
Signed-off-by: Jan Kara
Signed-off-by: Jens Axboe

Jan Kara
2017-03-23 10:11:22 +0800

02 Mar, 2017

1 commit

a5a79d000 block: Initialize bd_bdi on inode initialization ... Browse Code »

So far we initialized bd_bdi only in bdget(). That is fine for normal
bdev inodes however for the special case of the root inode of
blockdev_superblock that function is never called and thus bd_bdi is
left uninitialized. As a result bdev_evict_inode() may oops doing
bdi_put(root->bd_bdi) on that inode as can be seen when doing:

mount -t bdev none /mnt

Fix the problem by initializing bd_bdi when first allocating the inode
and then reinitializing bd_bdi in bdev_evict_inode().

Thanks to syzkaller team for finding the problem.

Reported-by: Dmitry Vyukov
Fixes: b1d2dc5659b4 ("block: Make blk_get_backing_dev_info() safe without open bdev")
Signed-off-by: Jan Kara
Signed-off-by: Jens Axboe

Jan Kara
2017-03-02 23:56:59 +0800

28 Feb, 2017

1 commit

93407472a fs: add i_blocksize() ... Browse Code »

Replace all 1 << inode->i_blkbits and (1 << inode->i_blkbits) in fs
branch.

This patch also fixes multiple checkpatch warnings: WARNING: Prefer
'unsigned int' to bare use of 'unsigned'

Thanks to Andrew Morton for suggesting more appropriate function instead
of macro.

[geliangtang@gmail.com: truncate: use i_blocksize()]
Link: http://lkml.kernel.org/r/9c8b2cd83c8f5653805d43debde9fa8817e02fc4.1484895804.git.geliangtang@gmail.com
Link: http://lkml.kernel.org/r/1481319905-10126-1-git-send-email-fabf@skynet.be
Signed-off-by: Fabian Frederick
Signed-off-by: Geliang Tang
Cc: Alexander Viro
Cc: Ross Zwisler
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Fabian Frederick
2017-02-28 10:43:46 +0800

22 Feb, 2017

1 commit

cccd9fb9e block: Revalidate i_bdev reference in bd_aquire() ... Browse Code »

When a device gets removed, block device inode unhashed so that it is not
used anymore (bdget() will not find it anymore). Later when a new device
gets created with the same device number, we create new block device
inode. However there may be file system device inodes whose i_bdev still
points to the original block device inode and thus we get two active
block device inodes for the same device. They will share the same
gendisk so the only visible differences will be that page caches will
not be coherent and BDIs will be different (the old block device inode
still points to unregistered BDI).

Fix the problem by checking in bd_acquire() whether i_bdev still points
to active block device inode and re-lookup the block device if not. That
way any open of a block device happening after the old device has been
removed will get correct block device inode.

Tested-by: Lekshmi Pillai
Acked-by: Tejun Heo
Signed-off-by: Jan Kara
Signed-off-by: Jens Axboe

Jan Kara
2017-02-22 03:51:54 +0800

18 Feb, 2017

1 commit

818551e2b Merge branch 'for-4.11/next' into for-4.11/linus-merge ... Browse Code »

Signed-off-by: Jens Axboe

Jens Axboe
2017-02-18 05:08:19 +0800

02 Feb, 2017

2 commits

b1d2dc565 block: Make blk_get_backing_dev_info() safe without open bdev ... Browse Code »

Currenly blk_get_backing_dev_info() is not safe to be called when the
block device is not open as bdev->bd_disk is NULL in that case. However
inode_to_bdi() uses this function and may be call called from flusher
worker or other writeback related functions without bdev being open
which leads to crashes such as:

[113031.075540] Unable to handle kernel paging request for data at address 0x00000000
[113031.075614] Faulting instruction address: 0xc0000000003692e0
0:mon> t
[c0000000fb65f900] c00000000036cb6c writeback_sb_inodes+0x30c/0x590
[c0000000fb65fa10] c00000000036ced4 __writeback_inodes_wb+0xe4/0x150
[c0000000fb65fa70] c00000000036d33c wb_writeback+0x30c/0x450
[c0000000fb65fb40] c00000000036e198 wb_workfn+0x268/0x580
[c0000000fb65fc50] c0000000000f3470 process_one_work+0x1e0/0x590
[c0000000fb65fce0] c0000000000f38c8 worker_thread+0xa8/0x660
[c0000000fb65fd80] c0000000000fc4b0 kthread+0x110/0x130
[c0000000fb65fe30] c0000000000098f0 ret_from_kernel_thread+0x5c/0x6c

Signed-off-by: Jens Axboe

Jan Kara
2017-02-02 23:20:53 +0800
f44f1ab5a block: Unhash block device inodes on gendisk destruction ... Browse Code »

Currently, block device inodes stay around after corresponding gendisk
hash died until memory reclaim finds them and frees them. Since we will
make block device inode pin the bdi, we want to free the block device
inode as soon as the device goes away so that bdi does not stay around
unnecessarily. Furthermore we need to avoid issues when new device with
the same major,minor pair gets created since reusing the bdi structure
would be rather difficult in this case.

Unhashing block device inode on gendisk destruction nicely deals with
these problems. Once last block device inode reference is dropped (which
may be directly in del_gendisk()), the inode gets evicted. Furthermore if
the major,minor pair gets reallocated, we are guaranteed to get new
block device inode even if old block device inode is not yet evicted and
thus we avoid issues with possible reuse of bdi.

Reviewed-by: Christoph Hellwig
Signed-off-by: Jan Kara
Signed-off-by: Jens Axboe

Jan Kara
2017-02-02 23:18:41 +0800

24 Jan, 2017

1 commit

690e5325b block: fix use after free in __blkdev_direct_IO ... Browse Code »

We can't dereference the dio structure after submitting the last bio for
this request, as I/O completion might have happened before the code is
run. Introduce a local is_sync variable instead.

Fixes: 542ff7bf ("block: new direct I/O implementation")
Signed-off-by: Christoph Hellwig
Reported-by: Matias Bjørling
Tested-by: Matias Bjørling
Signed-off-by: Jens Axboe

Christoph Hellwig
2017-01-24 22:55:53 +0800

05 Jan, 2017

1 commit

62f8c4059 Merge branch 'for-linus' of git://git.kernel.dk/linux-block ... Browse Code »

Pull block layer fixes from Jens Axboe:
"A set of fixes for the current series, one fixing a regression with
block size < page cache size in the alias series from Jan. Outside of
that, two small cleanups for wbt from Bart, a nvme pull request from
Christoph, and a few small fixes of documentation updates"

* 'for-linus' of git://git.kernel.dk/linux-block:
block: fix up io_poll documentation
block: Avoid that sparse complains about context imbalance in __wbt_wait()
block: Make wbt_wait() definition consistent with declaration
clean_bdev_aliases: Prevent cleaning blocks that are not in block range
genhd: remove dead and duplicated scsi code
block: add back plugging in __blkdev_direct_IO
nvmet/fcloop: remove some logically dead code performing redundant ret checks
nvmet: fix KATO offset in Set Features
nvme/fc: simplify error handling of nvme_fc_create_hw_io_queues
nvme/fc: correct some printk information
nvme/scsi: Remove START STOP emulation
nvme/pci: Delete misleading queue-wrap comment
nvme/pci: Fix whitespace problem
nvme: simplify stripe quirk
nvme: update maintainers information

Linus Torvalds
2017-01-05 01:03:37 +0800

25 Dec, 2016

1 commit

7c0f6ba68 Replace <asm/uaccess.h> with <linux/uaccess.h> globally ... Browse Code »

This was entirely automated, using the script by Al:

PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*'
sed -i -e "s!$PATT!#include !" \
$(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)

to do the replacement at the end of the merge window.

Requested-by: Al Viro
Signed-off-by: Linus Torvalds

Linus Torvalds
2016-12-25 03:46:01 +0800

23 Dec, 2016

1 commit

64d656a16 block: add back plugging in __blkdev_direct_IO ... Browse Code »

This allows sending larger than 1 MB requests to devices that support
large I/O sizes.

Signed-off-by: Christoph Hellwig
Reported-by: Laurence Oberman
Signed-off-by: Jens Axboe

Christoph Hellwig
2016-12-23 02:48:20 +0800

14 Dec, 2016

2 commits

7a62a5233 block_dev: don't update file access position for sync direct IO ... Browse Code »

For sync direct IO, generic_file_direct_write/generic_file_read_iter
will update file access position. Don't duplicate the update in
.direct_IO. This cause my raid array can't assemble.

Cc: Christoph Hellwig
Cc: Jens Axboe
Signed-off-by: Shaohua Li
Signed-off-by: Jens Axboe

Shaohua Li
2016-12-14 12:07:08 +0800
bcc7f5b4b block_dev: don't test bdev->bd_contains when it is not stable ... Browse Code »

bdev->bd_contains is not stable before calling __blkdev_get().
When __blkdev_get() is called on a parition with ->bd_openers == 0
it sets
bdev->bd_contains = bdev;
which is not correct for a partition.
After a call to __blkdev_get() succeeds, ->bd_openers will be > 0
and then ->bd_contains is stable.

When FMODE_EXCL is used, blkdev_get() calls
bd_start_claiming() -> bd_prepare_to_claim() -> bd_may_claim()

This call happens before __blkdev_get() is called, so ->bd_contains
is not stable. So bd_may_claim() cannot safely use ->bd_contains.
It currently tries to use it, and this can lead to a BUG_ON().

This happens when a whole device is already open with a bd_holder (in
use by dm in my particular example) and two threads race to open a
partition of that device for the first time, one opening with O_EXCL and
one without.

The thread that doesn't use O_EXCL gets through blkdev_get() to
__blkdev_get(), gains the ->bd_mutex, and sets bdev->bd_contains = bdev;

Immediately thereafter the other thread, using FMODE_EXCL, calls
bd_start_claiming() from blkdev_get(). This should fail because the
whole device has a holder, but because bdev->bd_contains == bdev
bd_may_claim() incorrectly reports success.
This thread continues and blocks on bd_mutex.

The first thread then sets bdev->bd_contains correctly and drops the mutex.
The thread using FMODE_EXCL then continues and when it calls bd_may_claim()
again in:
BUG_ON(!bd_may_claim(bdev, whole, holder));
The BUG_ON fires.

Fix this by removing the dependency on ->bd_contains in
bd_may_claim(). As bd_may_claim() has direct access to the whole
device, it can simply test if the target bdev is the whole device.

Fixes: 6b4517a7913a ("block: implement bd_claiming and claiming block")
Cc: stable@vger.kernel.org (v2.6.35+)
Signed-off-by: NeilBrown
Signed-off-by: Jens Axboe

NeilBrown
2016-12-14 12:07:08 +0800

01 Dec, 2016

1 commit

af309226d block: protect iterate_bdevs() against concurrent close ... Browse Code »

If a block device is closed while iterate_bdevs() is handling it, the
following NULL pointer dereference occurs because bdev->b_disk is NULL
in bdev_get_queue(), which is called from blk_get_backing_dev_info() (in
turn called by the mapping_cap_writeback_dirty() call in
__filemap_fdatawrite_range()):

BUG: unable to handle kernel NULL pointer dereference at 0000000000000508
IP: [] blk_get_backing_dev_info+0x10/0x20
PGD 9e62067 PUD 9ee8067 PMD 0
Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in:
CPU: 1 PID: 2422 Comm: sync Not tainted 4.5.0-rc7+ #400
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
task: ffff880009f4d700 ti: ffff880009f5c000 task.ti: ffff880009f5c000
RIP: 0010:[] [] blk_get_backing_dev_info+0x10/0x20
RSP: 0018:ffff880009f5fe68 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88000ec17a38 RCX: ffffffff81a4e940
RDX: 7fffffffffffffff RSI: 0000000000000000 RDI: ffff88000ec176c0
RBP: ffff880009f5fe68 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: ffff88000ec17860
R13: ffffffff811b25c0 R14: ffff88000ec178e0 R15: ffff88000ec17a38
FS: 00007faee505d700(0000) GS:ffff88000fb00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000508 CR3: 0000000009e8a000 CR4: 00000000000006e0
Stack:
ffff880009f5feb8 ffffffff8112e7f5 0000000000000000 7fffffffffffffff
0000000000000000 0000000000000000 7fffffffffffffff 0000000000000001
ffff88000ec178e0 ffff88000ec17860 ffff880009f5fec8 ffffffff8112e81f
Call Trace:
[] __filemap_fdatawrite_range+0x85/0x90
[] filemap_fdatawrite+0x1f/0x30
[] fdatawrite_one_bdev+0x16/0x20
[] iterate_bdevs+0xf2/0x130
[] sys_sync+0x63/0x90
[] entry_SYSCALL_64_fastpath+0x12/0x76
Code: 0f 1f 44 00 00 48 8b 87 f0 00 00 00 55 48 89 e5 8b 80 08 05 00 00 5d
RIP [] blk_get_backing_dev_info+0x10/0x20
RSP
CR2: 0000000000000508
---[ end trace 2487336ceb3de62d ]---

The crash is easily reproducible by running the following command, if an
msleep(100) is inserted before the call to func() in iterate_devs():

while :; do head -c1 /dev/nullb0; done > /dev/null & while :; do sync; done

Fix it by holding the bd_mutex across the func() call and only calling
func() if the bdev is opened.

Cc: stable@vger.kernel.org
Fixes: 5c0d6b60a0ba ("vfs: Create function for iterating over block devices")
Reported-and-tested-by: Wei Fang
Signed-off-by: Rabin Vincent
Signed-off-by: Jan Kara
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Rabin Vincent
2016-12-01 23:26:39 +0800

22 Nov, 2016

3 commits

3a83f4677 block: bio: pass bvec table to bio_init() ... Browse Code »

Some drivers often use external bvec table, so introduce
this helper for this case. It is always safe to access the
bio->bi_io_vec in this way for this case.

After converting to this usage, it will becomes a bit easier
to evaluate the remaining direct access to bio->bi_io_vec,
so it can help to prepare for the following multipage bvec
support.

Signed-off-by: Ming Lei
Reviewed-by: Christoph Hellwig

Fixed up the new O_DIRECT cases.

Signed-off-by: Jens Axboe

Ming Lei
2016-11-22 23:57:21 +0800
9a794fb9b block_dev: get rid of blksize bits calculation ... Browse Code »

We store the bits in the bdev sector size locally, but we don't use
the calculation anymore. All we do with it is shift it back up to
the bdev sector size. So let's just use that directly and kill the
variable and bits calculation.

Signed-off-by: Jens Axboe

Jens Axboe
2016-11-22 23:56:25 +0800
4d1a47654 block_dev: Fixed direct I/O bio sector calculation ... Browse Code »

A direct I/O alignment must be always checked against the device blocks size,
but the I/O offset (bio->bi_iter.bi_sector must always use 512B sector unit, and
not the actual logical block size.

Signed-off-by: Damien Le Moal
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Damien Le Moal
2016-11-22 23:09:09 +0800

18 Nov, 2016

4 commits

542ff7bf1 block: new direct I/O implementation ... Browse Code »

Similar to the simple fast path, but we now need a dio structure to
track multiple-bio completions. It's basically a cut-down version
of the new iomap-based direct I/O code for filesystems, but without
all the logic to call into the filesystem for extent lookup or
allocation, and without the complex I/O completion workqueue handler
for AIO - instead we just use the FUA bit on the bios to ensure
data is flushed to stable storage.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2016-11-18 04:35:11 +0800
78250c02d block: make __blkdev_direct_IO_sync() support O_SYNC/DSYNC ... Browse Code »

Split the op setting code into a helper, use it in both places.

Signed-off-by: Jens Axboe

Jens Axboe
2016-11-18 04:35:05 +0800
72ecad22d block: support a full bio worth of IO for simplified bdev direct-io ... Browse Code »

Just alloc the bio_vec array if we exceed the inline limit.

Signed-off-by: Jens Axboe

Jens Axboe
2016-11-18 04:35:02 +0800
189ce2b9d block: fast-path for small and simple direct I/O requests ... Browse Code »

This patch adds a small and simple fast patch for small direct I/O
requests on block devices that don't use AIO. Between the neat
bio_iov_iter_get_pages helper that avoids allocating a page array
for get_user_pages and the on-stack bio and biovec this avoid memory
allocations and atomic operations entirely in the direct I/O code
(lower levels might still do memory allocations and will usually
have at least some atomic operations, though).

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
Tested-By: Stephen Bates
Reviewed-By: Stephen Bates

Christoph Hellwig
2016-11-18 04:34:45 +0800

12 Oct, 2016

1 commit

25f4c4141 block: implement (some of) fallocate for block devices ... Browse Code »

After much discussion, it seems that the fallocate feature flag
FALLOC_FL_ZERO_RANGE maps nicely to SCSI WRITE SAME; and the feature
FALLOC_FL_PUNCH_HOLE maps nicely to the devices that have been whitelisted
for zeroing SCSI UNMAP. Punch still requires that FALLOC_FL_KEEP_SIZE is
set. A length that goes past the end of the device will be clamped to the
device size if KEEP_SIZE is set; or will return -EINVAL if not. Both
start and length must be aligned to the device's logical block size.

Since the semantics of fallocate are fairly well established already, wire
up the two pieces. The other fallocate variants (collapse range, insert
range, and allocate blocks) are not supported.

Link: http://lkml.kernel.org/r/147518379992.22791.8849838163218235007.stgit@birch.djwong.org
Signed-off-by: Darrick J. Wong
Reviewed-by: Hannes Reinecke
Reviewed-by: Bart Van Assche
Cc: Theodore Ts'o
Cc: Martin K. Petersen
Cc: Mike Snitzer # tweaked header
Cc: Brian Foster
Cc: Christoph Hellwig
Cc: Hannes Reinecke
Cc: Jens Axboe
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Darrick J. Wong
2016-10-12 06:06:30 +0800

06 Oct, 2016

1 commit

997198ba1 fs/block_dev.c: return the right error in thaw_bdev() ... Browse Code »

When triggering thaw-filesystems via magic sysrq, the system enters a
loop in do_thaw_one(), as thaw_bdev() still returns success if
bd_fsfreeze_count == 0. To fix this, let thaw_bdev() always return
error (and simplify the code a bit at the same time).

Reviewed-by: Eric Farman
Reviewed-by: Cornelia Huck
Signed-off-by: Pierre Morel
Reviewed-by: Jan Kara
Signed-off-by: Jens Axboe

Pierre Morel
2016-10-06 04:35:13 +0800

14 Sep, 2016

1 commit

223757016 block_dev: remove DAX leftovers ... Browse Code »

DAX support for block devices was removed in commits 03cdad
("block: disable block device DAX by default") and 99a01cd
("block: remove BLK_DEV_DAX config option"), but we still kept a call to
dax_do_io and some uneeded i_flags manipulations introduced in commit
bbab37 ("block: Add support for DAX reads/writes to block devices").

Remove those leftovers.

Signed-off-by: Christoph Hellwig
Reviewed-by: Johannes Thumshirn
Acked-by: Dan Williams
Signed-off-by: Jens Axboe

Christoph Hellwig
2016-09-14 22:41:59 +0800

25 Aug, 2016

1 commit

5bb53c0fb fs/block_dev: fix potential NULL ptr deref in freeze_bdev() ... Browse Code »

Calling freeze_bdev() twice on the same block device without mounted
filesystem get_super() will return NULL, which will lead to NULL-ptr
dereference later in drop_super().

Check get_super() result to fix that.

Note, that this is a purely theoretical issue. We have only 3
freeze_bdev() callers. 2 of them are in filesystem code and used on a
device with mounted fs. The third one in lock_fs() has protection in
upper-layer code against freezing block device the second time without
thawing it first.

Signed-off-by: Andrey Ryabinin
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Andrey Ryabinin
2016-08-25 22:38:26 +0800

22 Aug, 2016

1 commit

e9e5e3fae bdev: fix NULL pointer dereference ... Browse Code »

I got this:

kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
Dumping ftrace buffer:
(ftrace buffer empty)
CPU: 0 PID: 5505 Comm: syz-executor Not tainted 4.8.0-rc2+ #161
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
task: ffff880113415940 task.stack: ffff880118350000
RIP: 0010:[] [] bd_mount+0x52/0xa0
RSP: 0018:ffff880118357ca0 EFLAGS: 00010207
RAX: dffffc0000000000 RBX: ffffffffffffffff RCX: ffffc90000bb6000
RDX: 0000000000000018 RSI: ffffffff846d6b20 RDI: 00000000000000c7
RBP: ffff880118357cb0 R08: ffff880115967c68 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801188211e8
R13: ffffffff847baa20 R14: ffff8801139cb000 R15: 0000000000000080
FS: 00007fa3ff6c0700(0000) GS:ffff88011aa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc1d8cc7e78 CR3: 0000000109f20000 CR4: 00000000000006f0
DR0: 000000000000001e DR1: 000000000000001e DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
Stack:
ffff880112cfd6c0 ffff8801188211e8 ffff880118357cf0 ffffffff8167f207
ffffffff816d7a1e ffff880112a413c0 ffffffff847baa20 ffff8801188211e8
0000000000000080 ffff880112cfd6c0 ffff880118357d38 ffffffff816dce0a
Call Trace:
[] mount_fs+0x97/0x2e0
[] ? alloc_vfsmnt+0x55e/0x760
[] vfs_kern_mount+0x7a/0x300
[] ? _raw_read_unlock+0x2c/0x50
[] do_mount+0x3d7/0x2730
[] ? trace_do_page_fault+0x1f4/0x3a0
[] ? copy_mount_string+0x40/0x40
[] ? memset+0x31/0x40
[] ? copy_mount_options+0x1ee/0x320
[] SyS_mount+0xb2/0x120
[] ? copy_mnt_ns+0x970/0x970
[] do_syscall_64+0x1c4/0x4e0
[] entry_SYSCALL64_slow_path+0x25/0x25
Code: 83 e8 63 1b fc ff 48 85 c0 48 89 c3 74 4c e8 56 35 d1 ff 48 8d bb c8 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 3c 02 00 75 36 4c 8b a3 c8 00 00 00 48 b8 00 00 00 00 00 fc
RIP [] bd_mount+0x52/0xa0
RSP
---[ end trace 13690ad962168b98 ]---

mount_pseudo() returns ERR_PTR(), not NULL, on error.

Fixes: 3684aa7099e0 ("block-dev: enable writeback cgroup support")
Cc: Shaohua Li
Cc: Tejun Heo
Cc: Jens Axboe
Cc: stable@vger.kernel.org
Signed-off-by: Vegard Nossum
Signed-off-by: Jens Axboe

Vegard Nossum
2016-08-22 22:06:15 +0800

08 Aug, 2016

1 commit

c11f0c0b5 block/mm: make bdev_ops->rw_page() take a bool for read/write ... Browse Code »

Commit abf545484d31 changed it from an 'rw' flags type to the
newer ops based interface, but now we're effectively leaking
some bdev internals to the rest of the kernel. Since we only
care about whether it's a read or a write at that level, just
pass in a bool 'is_write' parameter instead.

Then we can also move op_is_write() and friends back under
CONFIG_BLOCK protection.

Reviewed-by: Mike Christie
Signed-off-by: Jens Axboe

Jens Axboe
2016-08-08 04:41:02 +0800