Eric Lee / smarc-fsl-linux-kernel

22 Apr, 2019

1 commit

6fcc44d1d block: fix use-after-free on gendisk

commit 2da78092dda "block: Fix dev_t minor allocation lifetime"
specifically moved blk_free_devt(dev->devt) call to part_release()
to avoid reallocating device number before the device is fully
shutdown.

However, it can cause a use-after-free on the gendisk in get_gendisk().
We use an md device as an example to show the race:

Process1                Worker                  Process2
md_free
                                                blkdev_open
del_gendisk
  add delete_partition_work_fn() to wq
                                                __blkdev_get
                                                get_gendisk
put_disk
  disk_release
    kfree(disk)
                                                find part from ext_devt_idr
                                                get_disk_and_module(disk)
                                                cause use after free

                        delete_partition_work_fn
                        put_device(part)
                        part_release
                        remove part from ext_devt_idr

Before the partition's entry (devt mapped to its hd_struct pointer) is
removed from ext_devt_idr by delete_partition_work_fn(), we can still
find the devt and then reach the gendisk through the hd_struct pointer.
If we access the gendisk after it has been freed, the result is a
use-after-free on the gendisk in get_gendisk().

We fix this by adding a new helper blk_invalidate_devt() in
delete_partition() and del_gendisk(). It replaces hd_struct
pointer in idr with value 'NULL', and deletes the entry from
idr in part_release() as we do now.

Thanks to Jan Kara for providing the solution and clearer comments
for the code.

Fixes: 2da78092dda1 ("block: Fix dev_t minor allocation lifetime")
Cc: Al Viro
Reviewed-by: Bart Van Assche
Reviewed-by: Keith Busch
Reviewed-by: Jan Kara
Suggested-by: Jan Kara
Signed-off-by: Yufen Yu
Signed-off-by: Jens Axboe

Yufen Yu
2019-04-22 23:48:12 +0800
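
The invalidate-then-free idea described in 6fcc44d1d can be sketched in user space as follows. This is only an illustration under stated assumptions: `devt_table`, `devt_reserved`, and the `devt_*` helpers are hypothetical stand-ins for ext_devt_idr, blk_invalidate_devt() and the idr removal in part_release(), and the sketch is single-threaded, so it omits the locking the kernel needs.

```c
#include <stddef.h>

#define MAX_MINORS 8

struct disk { int users; };
struct part { struct disk *disk; };

/* Hypothetical stand-in for ext_devt_idr: the slot index plays the minor. */
static struct part *devt_table[MAX_MINORS];
/* A slot stays reserved until release, even while its pointer is NULL. */
static int devt_reserved[MAX_MINORS];

/* Reserve the lowest free minor and publish the partition under it. */
static int devt_alloc(struct part *p)
{
    for (int i = 0; i < MAX_MINORS; i++) {
        if (!devt_reserved[i]) {
            devt_reserved[i] = 1;
            devt_table[i] = p;
            return i;
        }
    }
    return -1;
}

/* Models the invalidate step: keep the number reserved, hide the object. */
static void devt_invalidate(int minor)
{
    devt_table[minor] = NULL;
}

/* Models the later removal from the table, done at release time. */
static void devt_free(int minor)
{
    devt_table[minor] = NULL;
    devt_reserved[minor] = 0;
}

/* Models the lookup: NULL means the device is going away. */
static struct disk *devt_lookup_disk(int minor)
{
    struct part *p = devt_table[minor];

    return p ? p->disk : NULL;
}
```

After `devt_invalidate()`, a lookup can no longer reach the disk, yet the minor cannot be handed out again until `devt_free()` runs, which preserves the lifetime rule from 2da78092dda.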

10 Dec, 2018

3 commits

e016b7820 block: return just one value from part_in_flight

The previous patches deleted all the code that needed the second value
returned from part_in_flight - now the kernel only uses the first value.

Consequently, part_in_flight (and blk_mq_in_flight) may be changed so that
it only returns one value.

This patch just refactors the code; there's no functional change.

Signed-off-by: Mikulas Patocka
Signed-off-by: Mike Snitzer
Signed-off-by: Jens Axboe

Mikulas Patocka
2018-12-10 23:30:38 +0800

5b18b5a73 block: delete part_round_stats and switch to less precise counting

We want to convert to per-cpu in_flight counters.

The function part_round_stats needs the in_flight counter every jiffy.
It would be too costly to sum all the percpu variables every jiffy, so
it must be deleted. part_round_stats is used to calculate two counters:
time_in_queue and io_ticks.

time_in_queue can be calculated without part_round_stats, by adding the
duration of the I/O when the I/O ends (the value is almost as exact as the
previously calculated value, except that time for in-progress I/Os is not
counted).

io_ticks can be approximated by increasing the value when I/O is started
or ended and the jiffies value has changed. If the I/Os take less than a
jiffy, the value is as exact as the previously calculated value. If the
I/Os take more than a jiffy, io_ticks can drift behind the previously
calculated value.

Signed-off-by: Mikulas Patocka
Signed-off-by: Mike Snitzer
Signed-off-by: Jens Axboe

Mikulas Patocka
2018-12-10 23:30:37 +0800
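
The two approximations described in 5b18b5a73 can be sketched in user space as follows. This is a minimal illustration, not the kernel code: `struct part_stats`, `io_start()` and `io_end()` are hypothetical names, time is passed in as a plain jiffy count, and no per-cpu or atomic handling is modelled.

```c
struct part_stats {
    unsigned long stamp;         /* jiffy of the last io_ticks update */
    unsigned long io_ticks;      /* approximate busy time, in jiffies */
    unsigned long time_in_queue; /* summed duration of completed I/Os */
};

/* Bump io_ticks at most once per distinct jiffy value. */
static void update_io_ticks(struct part_stats *s, unsigned long now)
{
    if (s->stamp != now) {
        s->stamp = now;
        s->io_ticks++;
    }
}

/* Called when an I/O is started. */
static void io_start(struct part_stats *s, unsigned long now)
{
    update_io_ticks(s, now);
}

/* Called when an I/O ends; the whole duration is added at completion. */
static void io_end(struct part_stats *s, unsigned long start,
                   unsigned long now)
{
    update_io_ticks(s, now);
    s->time_in_queue += now - start;
}
```

In this sketch an I/O that starts and ends within one jiffy adds one tick, while an I/O spanning jiffies 20 to 25 adds only two ticks (one at start, one at end), which is the drift the commit message mentions.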

112f158f6 block: stop passing 'cpu' to all percpu stats methods

All of part_stat_* and related methods are used with preempt disabled,
so there is no need to pass cpu around to all of them. Just call
smp_processor_id() as needed.

Suggested-by: Jens Axboe
Signed-off-by: Mike Snitzer
Signed-off-by: Jens Axboe

Mike Snitzer
2018-12-10 23:30:37 +0800

29 Nov, 2018

1 commit

94a2c3a32 block: use rcu_work instead of call_rcu to avoid sleep in softirq

Syzkaller recently reported a stack trace like this:

BUG: sleeping function called from invalid context at mm/slab.h:361
in_atomic(): 1, irqs_disabled(): 0, pid: 6644, name: blkid
INFO: lockdep is turned off.
CPU: 1 PID: 6644 Comm: blkid Not tainted 4.4.163-514.55.6.9.x86_64+ #76
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
0000000000000000 5ba6a6b879e50c00 ffff8801f6b07b10 ffffffff81cb2194
0000000041b58ab3 ffffffff833c7745 ffffffff81cb2080 5ba6a6b879e50c00
0000000000000000 0000000000000001 0000000000000004 0000000000000000
Call Trace:
[] __dump_stack lib/dump_stack.c:15 [inline]
[] dump_stack+0x114/0x1a0 lib/dump_stack.c:51
[] ___might_sleep+0x291/0x490 kernel/sched/core.c:7675
[] __might_sleep+0xb3/0x270 kernel/sched/core.c:7637
[] slab_pre_alloc_hook mm/slab.h:361 [inline]
[] slab_alloc_node mm/slub.c:2610 [inline]
[] slab_alloc mm/slub.c:2692 [inline]
[] kmem_cache_alloc_trace+0x2c3/0x5c0 mm/slub.c:2709
[] kmalloc include/linux/slab.h:479 [inline]
[] kzalloc include/linux/slab.h:623 [inline]
[] kobject_uevent_env+0x2c7/0x1150 lib/kobject_uevent.c:227
[] kobject_uevent+0x1f/0x30 lib/kobject_uevent.c:374
[] kobject_cleanup lib/kobject.c:633 [inline]
[] kobject_release+0x229/0x440 lib/kobject.c:675
[] kref_sub include/linux/kref.h:73 [inline]
[] kref_put include/linux/kref.h:98 [inline]
[] kobject_put+0x72/0xd0 lib/kobject.c:692
[] put_device+0x25/0x30 drivers/base/core.c:1237
[] delete_partition_rcu_cb+0x1d4/0x2f0 block/partition-generic.c:232
[] __rcu_reclaim kernel/rcu/rcu.h:118 [inline]
[] rcu_do_batch kernel/rcu/tree.c:2705 [inline]
[] invoke_rcu_callbacks kernel/rcu/tree.c:2973 [inline]
[] __rcu_process_callbacks kernel/rcu/tree.c:2940 [inline]
[] rcu_process_callbacks+0x59c/0x1c70 kernel/rcu/tree.c:2957
[] __do_softirq+0x299/0xe20 kernel/softirq.c:273
[] invoke_softirq kernel/softirq.c:350 [inline]
[] irq_exit+0x216/0x2c0 kernel/softirq.c:391
[] exiting_irq arch/x86/include/asm/apic.h:652 [inline]
[] smp_apic_timer_interrupt+0x8b/0xc0 arch/x86/kernel/apic/apic.c:926
[] apic_timer_interrupt+0xa5/0xb0 arch/x86/entry/entry_64.S:746
[] ? audit_kill_trees+0x180/0x180
[] fd_install+0x57/0x80 fs/file.c:626
[] do_sys_open+0x45e/0x550 fs/open.c:1043
[] SYSC_open fs/open.c:1055 [inline]
[] SyS_open+0x32/0x40 fs/open.c:1050
[] entry_SYSCALL_64_fastpath+0x1e/0x9a

In softirq context, we call the RCU callback function
delete_partition_rcu_cb(), which may allocate memory with kzalloc using
the GFP_KERNEL flag. If the allocation cannot be satisfied immediately,
it may sleep. However, that is not allowed in softirq context.

Although we found this problem on linux 4.4, the latest kernel version
seems to have this problem as well. And it is very similar to the
previous one:
https://lkml.org/lkml/2018/7/9/391

Fix it by using RCU workqueue, which allows sleep.

Reviewed-by: Paul E. McKenney
Signed-off-by: Yufen Yu
Signed-off-by: Jens Axboe

Yufen Yu
2018-11-29 00:08:27 +0800

22 Sep, 2018

1 commit

b57e99b4b block: use nanosecond resolution for iostat

Klaus Kusche reported that the I/O busy time in /proc/diskstats was not
updating properly on 4.18. This is because we started using ktime to
track elapsed time, and we convert nanoseconds to jiffies when we update
the partition counter. However, this gets rounded down, so any I/Os that
take less than a jiffy are not accounted for. Previously in this case,
the value of jiffies would sometimes increment while we were doing I/O,
so at least some I/Os were accounted for.

Let's convert the stats to use nanoseconds internally. We still report
milliseconds as before, now more accurately than ever. The value is
still truncated to 32 bits for backwards compatibility.

Fixes: 522a777566f5 ("block: consolidate struct request timestamp fields")
Cc: stable@vger.kernel.org
Reported-by: Klaus Kusche
Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe

Omar Sandoval
2018-09-22 10:26:59 +0800
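
The rounding problem described in b57e99b4b can be reproduced with a small user-space sketch. The tick rate `HZ 250` and the helper names are assumptions made for illustration; the kernel's own accounting code is structured differently.

```c
#include <stdint.h>

#define HZ 250                               /* assumed: 4 ms per jiffy */
#define NSEC_PER_SEC 1000000000ULL
#define NSEC_PER_MSEC 1000000ULL
#define NSEC_PER_JIFFY (NSEC_PER_SEC / HZ)

/* Old scheme: convert every I/O to jiffies first, which rounds down. */
static uint64_t account_jiffies(uint64_t total_jiffies, uint64_t io_nsecs)
{
    return total_jiffies + io_nsecs / NSEC_PER_JIFFY;
}

/* New scheme: keep the running total in nanoseconds. */
static uint64_t account_nsecs(uint64_t total_nsecs, uint64_t io_nsecs)
{
    return total_nsecs + io_nsecs;
}

/* Report milliseconds, truncated to 32 bits as the commit describes. */
static uint32_t report_msecs(uint64_t total_nsecs)
{
    return (uint32_t)(total_nsecs / NSEC_PER_MSEC);
}
```

With these assumptions, a thousand I/Os of 1 ms each contribute nothing under the per-I/O jiffy conversion, whereas the nanosecond total reports the full 1000 ms.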

18 Jul, 2018

2 commits

bdca3c87f block: Track DISCARD statistics and output them in stat and diskstat

Add tracking of REQ_OP_DISCARD ios to the partition statistics and
append them to the various stat files in /sys as well as
/proc/diskstats. These are tracked with the same four stats as reads
and writes:

Number of discard ios completed.
Number of discard ios merged
Number of discard sectors completed
Milliseconds spent on discard requests

This is done via adding a new STAT_DISCARD define to genhd.h and then
using it to index that stat field for discard requests.

tj: Refreshed on top of v4.17 and other previous updates.

Signed-off-by: Michael Callahan
Signed-off-by: Tejun Heo
Cc: Andy Newell
Signed-off-by: Jens Axboe

Michael Callahan
2018-07-18 22:44:22 +0800
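
Indexing the per-partition counters by a stat group, as bdca3c87f and dbae2c551 describe, can be sketched like this. The `enum req_op` values, `struct disk_stats`, `NR_STAT_GROUPS` and the `account_*` helpers are assumptions for this user-space illustration; only the STAT_READ, STAT_WRITE and STAT_DISCARD names come from the commit messages.

```c
enum stat_group { STAT_READ, STAT_WRITE, STAT_DISCARD, NR_STAT_GROUPS };

/* Hypothetical request operations for this sketch. */
enum req_op { OP_READ, OP_WRITE, OP_DISCARD };

/* The same four counters exist once per stat group. */
struct disk_stats {
    unsigned long ios[NR_STAT_GROUPS];
    unsigned long merges[NR_STAT_GROUPS];
    unsigned long sectors[NR_STAT_GROUPS];
    unsigned long msecs[NR_STAT_GROUPS];
};

/* Map a request operation to the index of its counters. */
static enum stat_group op_stat_group(enum req_op op)
{
    switch (op) {
    case OP_DISCARD:
        return STAT_DISCARD;
    case OP_WRITE:
        return STAT_WRITE;
    default:
        return STAT_READ;
    }
}

/* Account one completed request of the given size and duration. */
static void account_completion(struct disk_stats *s, enum req_op op,
                               unsigned long sectors, unsigned long msecs)
{
    enum stat_group g = op_stat_group(op);

    s->ios[g]++;
    s->sectors[g] += sectors;
    s->msecs[g] += msecs;
}

/* Account one merged request. */
static void account_merge(struct disk_stats *s, enum req_op op)
{
    s->merges[op_stat_group(op)]++;
}
```

Because every counter array has one slot per group, adding discards needs no new fields, only a new index.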

dbae2c551 block: Define and use STAT_READ and STAT_WRITE

Add defines for STAT_READ and STAT_WRITE for indexing the partition
stat entries. This clarifies some fs/ code which has hardcoded 1 for
STAT_WRITE and will make it easier to extend the stats with additional
fields.

tj: Refreshed on top of v4.17.

Signed-off-by: Michael Callahan
Signed-off-by: Tejun Heo
Cc: "Theodore Ts'o"
Cc: Jaegeuk Kim
Signed-off-by: Jens Axboe

Michael Callahan
2018-07-18 22:44:18 +0800

29 May, 2018

1 commit

5afb78356 block: don't print a message when the device went away

The information about a size change in this case just creates confusion.

Signed-off-by: Christoph Hellwig
Reviewed-by: Johannes Thumshirn
Signed-off-by: Jens Axboe

Christoph Hellwig
2018-05-29 22:59:21 +0800

25 May, 2018

1 commit

5657a819a block drivers/block: Use octal not symbolic permissions

Convert the S_ symbolic permissions to their octal equivalents as
using octal and not symbolic permissions is preferred by many as more
readable.

see: https://lkml.org/lkml/2016/8/2/1945

Done with automated conversion via:
$ ./scripts/checkpatch.pl -f --types=SYMBOLIC_PERMS --fix-inplace

Miscellanea:

o Wrapped modified multi-line calls to a single line where appropriate
o Realign modified multi-line calls to open parenthesis

Signed-off-by: Joe Perches
Signed-off-by: Jens Axboe

Joe Perches
2018-05-25 03:38:59 +0800
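
The equivalence behind the conversion in 5657a819a can be checked in user space. The kernel-only combination S_IRUGO is not available in user-space headers, so this sketch rebuilds it from the POSIX bits in `<sys/stat.h>`; the two helper function names are hypothetical.

```c
#include <sys/stat.h>

/* Read permission for user, group and others (kernel: S_IRUGO). */
static unsigned int perm_read_all(void)
{
    return S_IRUSR | S_IRGRP | S_IROTH;
}

/* Read for everyone plus write for the owner (kernel: S_IRUGO | S_IWUSR). */
static unsigned int perm_read_all_write_owner(void)
{
    return S_IRUSR | S_IRGRP | S_IROTH | S_IWUSR;
}
```

The first helper evaluates to octal 0444 and the second to 0644, which are the literal forms the commit switches to.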

26 Apr, 2018

1 commit

bf0ddaba6 blk-mq: fix sysfs inflight counter

When the blk-mq inflight implementation was added, /proc/diskstats was
converted to use it, but /sys/block/$dev/inflight was not. Fix it by
adding another helper to count in-flight requests by data direction.

Fixes: f299b7c7a9de ("blk-mq: provide internal in-flight variant")
Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe

Omar Sandoval
2018-04-26 23:02:01 +0800

01 Mar, 2018

1 commit

9c0fb1e31 block: display the correct diskname for bio

bio_devname uses __bdevname to display the device name, and can
only show the major and minor of part0.
Fix this by using disk_name to display the correct name.

Fixes: 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer and partitions index")
Reviewed-by: Omar Sandoval
Reviewed-by: Christoph Hellwig
Signed-off-by: Jiufei Xue
Signed-off-by: Jens Axboe

Jiufei Xue
2018-03-01 23:41:25 +0800

02 Nov, 2017

1 commit

b24413180 License cleanup: add SPDX GPL-2.0 license identifier to files with no license

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if
Reviewed-by: Philippe Ombredanne
Reviewed-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman

Greg Kroah-Hartman
2017-11-02 18:10:55 +0800

25 Sep, 2017

1 commit

f5c156c4c block: fix a crash caused by wrong API

part_stat_show takes a part device not a disk, so we should use
part_to_disk.

Fixes: d62e26b3ffd2 ("block: pass in queue to inflight accounting")
Cc: Bart Van Assche
Cc: Omar Sandoval
Signed-off-by: Shaohua Li
Signed-off-by: Jens Axboe

Shaohua Li
2017-09-25 22:56:05 +0800

24 Aug, 2017

1 commit

47570848f block: remove blk_free_devt in add_partition

put_device(pdev) eventually calls pdev->type->release, and blk_free_devt
is already called in part_release(), so remove the extra call.

Signed-off-by: weiping zhang
Signed-off-by: Jens Axboe

weiping zhang
2017-08-24 22:26:57 +0800

18 Aug, 2017

1 commit

6d2cf6f2b genhd: Annotate all part and part_tbl pointer dereferences

Annotate gendisk.part_tbl and disk_part_tbl.part dereferences with
rcu_dereference_protected(). This patch does not change the behavior
of the modified code but ensures that sparse does not complain about
disk->part_tbl manipulations nor about part_tbl->part accesses.
Additionally, improve documentation of the locking requirements of
the modified functions.

Signed-off-by: Bart Van Assche
Reviewed-by: Hannes Reinecke
Cc: Tejun Heo
Cc: Jan Kara
Cc: Dan Williams
Cc: Christoph Hellwig
Signed-off-by: Jens Axboe

Bart Van Assche
2017-08-18 22:36:58 +0800

10 Aug, 2017

2 commits

0609e0efc block: make part_in_flight() take an array of two ints

Instead of returning the count that matches the partition, pass
in an array of two ints. Index 0 will be filled with the inflight
count for the partition in question, and index 1 will be filled
with the root inflight count, if the partition passed in is not the
root.

This is in preparation for being able to calculate both in one
go.

Reviewed-by: Bart Van Assche
Reviewed-by: Omar Sandoval
Signed-off-by: Jens Axboe

Jens Axboe
2017-08-10 03:09:20 +0800
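
The calling convention introduced by 0609e0efc can be sketched in user space as follows. The `struct part` layout, including the `root` back-pointer, is a hypothetical simplification, and zeroing index 1 for the root partition is a choice made for this sketch rather than something the commit message states.

```c
struct part {
    int partno;             /* 0 denotes the whole-disk "root" partition */
    unsigned int in_flight; /* requests currently outstanding */
    struct part *root;      /* hypothetical back-pointer to partition 0 */
};

/*
 * Fill inflight[0] with the count for this partition and, if it is not
 * the root, inflight[1] with the count for the root.
 */
static void part_in_flight(const struct part *part, unsigned int inflight[2])
{
    inflight[0] = part->in_flight;
    inflight[1] = 0;
    if (part->partno)
        inflight[1] = part->root->in_flight;
}
```

Passing the array lets a caller obtain both counts from one call, which is the preparation step the commit message refers to.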

d62e26b3f block: pass in queue to inflight accounting

No functional change in this patch, just in preparation for
basing the inflight mechanism on the queue in question.

Reviewed-by: Bart Van Assche
Reviewed-by: Omar Sandoval
Signed-off-by: Jens Axboe

Jens Axboe
2017-08-10 03:09:16 +0800

23 May, 2017

1 commit

7bd897cfc block: fix an error code in add_partition()

We don't set an error code on this path. It means that we return NULL
instead of an error pointer and the caller does a NULL dereference.

Fixes: 6d1d8050b4bc ("block, partition: add partition_meta_info to hd_struct")
Signed-off-by: Dan Carpenter
Signed-off-by: Jens Axboe

Dan Carpenter
2017-05-23 22:41:59 +0800
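
The class of bug fixed in 7bd897cfc can be illustrated with a user-space sketch of the error-pointer convention. The lowercase helpers `err_ptr()`, `ptr_err()` and `is_err()` imitate the kernel's ERR_PTR(), PTR_ERR() and IS_ERR(); the two `add_part_*` functions are hypothetical and only mirror the shape of the problem, not the real add_partition().

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static void *err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

/* Decode the errno value from an error pointer. */
static long ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* True only for pointers in the top MAX_ERRNO addresses; NULL is not one. */
static int is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct part { int partno; };

/* Buggy shape: err is never set on the failure path, so NULL escapes. */
static struct part *add_part_buggy(struct part *storage, int fail)
{
    int err = 0;

    if (fail)
        goto out; /* missing: err = -ENOMEM */
    return storage;
out:
    return err_ptr(err);
}

/* Fixed shape: the failure path sets an error code before bailing out. */
static struct part *add_part_fixed(struct part *storage, int fail)
{
    int err = 0;

    if (fail) {
        err = -ENOMEM;
        goto out;
    }
    return storage;
out:
    return err_ptr(err);
}
```

A caller that only tests `is_err()` accepts the NULL returned by the buggy variant and then dereferences it, which matches the failure the commit message describes.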

06 May, 2017

1 commit

53ef7d0e2 Merge tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:
"The bulk of this has been in multiple -next releases. There were a few
late breaking fixes and small features that got added in the last
couple days, but the whole set has received a build success
notification from the kbuild robot.

Change summary:

- Region media error reporting: A libnvdimm region device is the
parent to one or more namespaces. To date, media errors have been
reported via the "badblocks" attribute attached to pmem block
devices for namespaces in "raw" or "memory" mode. Given that
namespaces can be in "device-dax" or "btt-sector" mode this new
interface reports media errors generically, i.e. independent of
namespace modes or state.

This subsequently allows userspace tooling to craft "ACPI 6.1
Section 9.20.7.6 Function Index 4 - Clear Uncorrectable Error"
requests and submit them via the ioctl path for NVDIMM root bus
devices.

- Introduce 'struct dax_device' and 'struct dax_operations': Prompted
by a request from Linus and feedback from Christoph this allows for
dax capable drivers to publish their own custom dax operations.
This fixes the broken assumption that all dax operations are
related to a persistent memory device, and makes it easier for
other architectures and platforms to add customized persistent
memory support.

- 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is
available for storage appliance applications to manually trigger
memory controllers to drain write-pending buffers that would
otherwise be flushed automatically by the platform ADR
(asynchronous-DRAM-refresh) mechanism at a power loss event.
Support for "locked" DIMMs is included to prevent namespaces from
surfacing when the namespace label data area is locked. Finally,
fixes for various reported deadlocks and crashes, also tagged for
-stable.

- ACPI / nfit driver updates: General updates of the nfit driver to
add DSM command overrides, ACPI 6.1 health state flags support, DSM
payload debug available by default, and various fixes.

Acknowledgements that came after the branch was pushed:

- commit 565851c972b5 "device-dax: fix sysfs attribute deadlock":
Tested-by: Yi Zhang

- commit 23f498448362 "libnvdimm: rework region badblocks clearing"
Tested-by: Toshi Kani "

* tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (52 commits)
libnvdimm, pfn: fix 'npfns' vs section alignment
libnvdimm: handle locked label storage areas
libnvdimm: convert NDD_ flags to use bitops, introduce NDD_LOCKED
brd: fix uninitialized use of brd->dax_dev
block, dax: use correct format string in bdev_dax_supported
device-dax: fix sysfs attribute deadlock
libnvdimm: restore "libnvdimm: band aid btt vs clear poison locking"
libnvdimm: fix nvdimm_bus_lock() vs device_lock() ordering
libnvdimm: rework region badblocks clearing
acpi, nfit: kill ACPI_NFIT_DEBUG
libnvdimm: fix clear length of nvdimm_forget_poison()
libnvdimm, pmem: fix a NULL pointer BUG in nd_pmem_notify
libnvdimm, region: sysfs trigger for nvdimm_flush()
libnvdimm: fix phys_addr for nvdimm_clear_poison
x86, dax, pmem: remove indirection around memcpy_from_pmem()
block: remove block_device_operations ->direct_access()
block, dax: convert bdev_dax_supported() to dax_direct_access()
filesystem-dax: convert to dax_direct_access()
Revert "block: use DAX for partition table reads"
ext2, ext4, xfs: retrieve dax_device for iomap operations
...

Linus Torvalds
2017-05-06 09:49:20 +0800

26 Apr, 2017

1 commit

a41fe02b6 Revert "block: use DAX for partition table reads"

commit d1a5f2b4d8a1 ("block: use DAX for partition table reads") was
part of a stalled effort to allow dax mappings of block devices. Since
then the device-dax mechanism has filled the role of dax-mapping static
device ranges.

Now that we are moving ->direct_access() from a block_device operation
to a dax_inode operation we would need block devices to map and carry
their own dax_inode reference.

Unless / until we decide to revive dax mapping of raw block devices
through the dax_inode scheme, there is no need to carry
read_dax_sector(). Its removal in turn allows for the removal of
bdev_direct_access() and should have been included in commit
223757016837 ("block_dev: remove DAX leftovers").

Cc: Jeff Moyer
Signed-off-by: Dan Williams

Dan Williams
2017-04-26 04:20:46 +0800

22 Apr, 2017

1 commit

19b7ccf86 block: get rid of blk_integrity_revalidate()

Commit 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk")
introduced blk_integrity_revalidate(), which seems to assume ownership
of the stable pages flag and unilaterally clears it if no blk_integrity
profile is registered:

    if (bi->profile)
        disk->queue->backing_dev_info->capabilities |=
            BDI_CAP_STABLE_WRITES;
    else
        disk->queue->backing_dev_info->capabilities &=
            ~BDI_CAP_STABLE_WRITES;

It's called from revalidate_disk() and rescan_partitions(), making it
impossible to enable stable pages for drivers that support partitions
and don't use blk_integrity: while the call in revalidate_disk() can be
trivially worked around (see zram, which doesn't support partitions and
hence gets away with zram_revalidate_disk()), rescan_partitions() can
be triggered from userspace at any time. This breaks rbd, where the
ceph messenger is responsible for generating/verifying CRCs.

Since blk_integrity_{un,}register() "must" be used for (un)registering
the integrity profile with the block layer, move BDI_CAP_STABLE_WRITES
setting there. This way drivers that call blk_integrity_register() and
use integrity infrastructure won't interfere with drivers that don't
but still want stable pages.

Fixes: 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk")
Cc: "Martin K. Petersen"
Cc: Christoph Hellwig
Cc: Mike Snitzer
Cc: stable@vger.kernel.org # 4.4+, needs backporting
Tested-by: Dan Williams
Signed-off-by: Ilya Dryomov
Signed-off-by: Jens Axboe

Ilya Dryomov
2017-04-22 04:17:27 +0800

12 Jan, 2017

1 commit

f99e86485 block: Rename blk_queue_zone_size and bdev_zone_size

All block device data fields and functions returning a number of 512B
sectors are by convention named xxx_sectors while names in the form
xxx_size are generally used for a number of bytes. The blk_queue_zone_size
and bdev_zone_size functions were not following this convention so rename
them.

No functional change is introduced by this patch.

Signed-off-by: Damien Le Moal

Collapsed the two patches, they were nonsensically split and broke
bisection.

Signed-off-by: Jens Axboe

Damien Le Moal
2017-01-12 22:58:32 +0800

01 Dec, 2016

1 commit

b02d8aaea block: Check partition alignment on zoned block devices

Both blkdev_report_zones and blkdev_reset_zones can operate on a partition of
a zoned block device. However, the first and last zones reported for a
partition make sense only if the partition start sector and size are aligned
on the device zone size. The same applies for zone reset. Resetting the first
or the last zone of a partition straddling zones may impact neighboring
partitions. Finally, if a partition start sector is not at the beginning of a
sequential zone, it will be impossible to write to the first sectors of the
partition on a host-managed device.
Avoid all these problems and incoherencies by ignoring partitions that are not
zone aligned.

Note: Even with CONFIG_BLK_DEV_ZONED disabled, bdev_is_zoned() will report the
correct disk zoning type (host-aware, host-managed or none) but
bdev_zone_size() will always return 0 for zoned block devices (i.e. the zone
size is unknown). So test this as a way to ensure that a zoned block device is
being handled as such. As a result, for host-aware devices, unaligned zone
partitions will be accepted with CONFIG_BLK_DEV_ZONED disabled. That is, the
disk will be treated as a regular block device (as it should). If zoned block
device support is enabled, only aligned partitions will be accepted.

Signed-off-by: Damien Le Moal
Reviewed-by: Hannes Reinecke
Signed-off-by: Jens Axboe

Damien Le Moal
2016-12-01 22:56:53 +0800
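
The alignment rule described in b02d8aaea can be sketched as a small predicate. This is an illustration under assumptions: the function name and signature are hypothetical, sizes are in 512-byte sectors, and a zone size of 0 stands for "unknown", in which case the sketch accepts the partition as for a regular block device.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Accept a partition only if both its start sector and its size are
 * multiples of the zone size. A zone size of 0 means "unknown".
 */
static bool part_zone_aligned(uint64_t start_sect, uint64_t nr_sects,
                              uint64_t zone_sectors)
{
    if (zone_sectors == 0)
        return true;
    return start_sect % zone_sectors == 0 &&
           nr_sects % zone_sectors == 0;
}
```

For example, with 524288-sector zones a partition starting at sector 2048 is rejected, while one starting at sector 0 and spanning two whole zones is accepted.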

14 Jun, 2016

1 commit

aa8d15bfe block/partition-generic.c: Remove a set-but-not-used variable

A value is assigned to the variable 'info' but that value is never
used. Hence remove the variable 'info'.

Signed-off-by: Bart Van Assche
Signed-off-by: Jens Axboe

Bart Van Assche
2016-06-14 23:09:15 +0800

16 Apr, 2016

1 commit

2e5725991 Merge branch 'for-linus' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
"A few fixes for the current series. This contains:

- Two fixes for NVMe:

One fixes a reset race that can be triggered by repeated
insert/removal of the module.

The other fixes an issue on some platforms, where we get probe
timeouts since legacy interrupts aren't working. This used not to
be a problem since we had the worker thread poll for completions,
but since that was killed off, it means those poor souls can't
successfully probe their NVMe device. Use a proper IRQ check and
probe (msi-x -> msi -> legacy), like most other drivers to work
around this. Both from Keith.

- A loop corruption issue with offset in iters, from Ming Lei.

- A fix for not having the partition stat per cpu ref count
initialized before sending out the KOBJ_ADD, which could cause user
space to access the counter prior to initialization. Also from
Ming Lei.

- A fix for using the wrong congestion state, from Kaixu Xia"

* 'for-linus' of git://git.kernel.dk/linux-block:
block: loop: fix filesystem corruption in case of aio/dio
NVMe: Always use MSI/MSI-x interrupts
NVMe: Fix reset/remove race
writeback: fix the wrong congested state variable definition
block: partition: initialize percpuref before sending out KOBJ_ADD

Linus Torvalds
2016-04-16 06:44:10 +0800

05 Apr, 2016

1 commit

09cbfeaf1 mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros

PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
ago with the promise that one day it would be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized, and it is unlikely that it ever will.

We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE. And it's a constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.

Let's stop pretending that pages in page cache are special. They are
not.

The changes are pretty straight-forward:

- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

- page_cache_get() -> get_page();

- page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
the script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.

There are a few places in the code that coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation will
also be addressed in a separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov
Acked-by: Michal Hocko
Signed-off-by: Linus Torvalds

Kirill A. Shutemov
2016-04-05 01:41:08 +0800

30 Mar, 2016

1 commit

b30a337ca block: partition: initialize percpuref before sending out KOBJ_ADD

The partition's percpu_ref should be initialized before the KOBJ_ADD
uevent is sent, because that uevent may cause userspace to read the
partition table. Otherwise the uninitialized percpu_ref may be accessed
in the data path.

This patch fixes this issue reported by Naveen.

Reported-by: Naveen Kaje
Tested-by: Naveen Kaje
Fixes: 6c71013ecb7e2(block: partition: convert percpu ref)
Cc: # v4.3+
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe

Ming Lei
2016-03-30 09:18:14 +0800

16 Mar, 2016

1 commit

0d9c51a6e block: partition: add partition specific uevent callbacks for partition info

This patch has been carried in the Android tree for quite some time and
is one of the few patches required to get a mainline kernel up and
running with an existing Android userspace. So I wanted to submit it
for review and consideration if it should be merged.

For partitions, add new uevent parameters 'PARTN', which specifies the
partition's index in the table, and 'PARTNAME', which specifies the
partition name of a partition device.

Android's userspace uses this for creating device node links from the
partition name and number, ie:

/dev/block/platform/soc/by-name/system
or
/dev/block/platform/soc/by-num/p1

One can see its usage here:
https://android.googlesource.com/platform/system/core/+/master/init/devices.cpp#355
and
https://android.googlesource.com/platform/system/core/+/master/init/devices.cpp#494

[john.stultz@linaro.org: dropped NPARTS and reworded commit message for context]
Signed-off-by: Dima Zavin
Signed-off-by: John Stultz
Cc: Jens Axboe
Cc: Rom Lemarchand
Cc: Android Kernel Team
Cc: Jeff Moyer
Cc:
Cc: Kees Cook
Cc: Kay Sievers
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

San Mehat
2016-03-16 07:55:16 +0800

31 Jan, 2016

1 commit

d1a5f2b4d block: use DAX for partition table reads

Avoid populating pagecache when the block device is in DAX mode.
Otherwise these page cache entries collide with the fsync/msync
implementation and break data durability guarantees.

Cc: Jan Kara
Cc: Jeff Moyer
Cc: Christoph Hellwig
Cc: Dave Chinner
Cc: Andrew Morton
Reported-by: Ross Zwisler
Tested-by: Ross Zwisler
Reviewed-by: Matthew Wilcox
Signed-off-by: Dan Williams

Dan Williams
2016-01-31 05:35:32 +0800

26 Nov, 2015

1 commit

77032ca66 Return EBUSY from BLKRRPART for mounted whole-dev fs

Today, blockdev --rereadpt /dev/sda will fail with EBUSY if any
partition of sda is mounted (and will fail with EINVAL if pointed
at a partition). But it will pass if the entire block device is
formatted with a filesystem and mounted. I don't think this makes
sense; partitioning should surely not ever change out from under
a mounted device.

So check for bdev->bd_super, and fail that with -EBUSY as well.

Signed-off-by: Eric Sandeen
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Eric Sandeen
2015-11-26 11:49:24 +0800

22 Oct, 2015

1 commit

25520d55c block: Inline blk_integrity in struct gendisk

Up until now the integrity profile has been dynamically allocated and
attached to struct gendisk after the disk has been made active.

This causes problems because NVMe devices need to register the profile
prior to the partition table being read due to a mandatory metadata
buffer requirement. In addition, DM goes through hoops to deal with
preallocating, but not initializing integrity profiles.

Since the integrity profile is small (4 bytes + a pointer), Christoph
suggested moving it to struct gendisk proper. This requires several
changes:

- Moving the blk_integrity definition to genhd.h.

- Inlining blk_integrity in struct gendisk.

- Removing the dynamic allocation code.

- Adding helper functions which allow gendisk to set up and tear down
the integrity sysfs dir when a disk is added/deleted.

- Adding a blk_integrity_revalidate() callback for updating the stable
pages bdi setting.

- The calls that depend on whether a device has an integrity profile or
not now key off of the bi->profile pointer.

- Simplifying the integrity support routines in DM (Mike Snitzer).

Signed-off-by: Martin K. Petersen
Reported-by: Christoph Hellwig
Reviewed-by: Sagi Grimberg
Signed-off-by: Mike Snitzer
Cc: Dan Williams
Signed-off-by: Dan Williams
Signed-off-by: Jens Axboe

Martin K. Petersen
2015-10-22 04:42:42 +0800
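
A minimal userspace sketch of the idea in 25520d55c: the integrity data is embedded in the disk structure, and "has a profile" is decided by the profile pointer alone. All `*_model` names and the field layout are assumptions made for illustration (four one-byte fields plus a pointer, following the "4 bytes + a pointer" remark); they are not the kernel's definitions.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical profile descriptor. */
struct integrity_profile_model {
    const char *name;
};

/* Small enough to embed: four bytes plus one pointer. */
struct blk_integrity_model {
    const struct integrity_profile_model *profile; /* NULL: no profile */
    unsigned char flags;
    unsigned char tuple_size;
    unsigned char interval_exp;
    unsigned char tag_size;
};

struct gendisk_model {
    int first_minor;
    struct blk_integrity_model integrity; /* inline, no separate allocation */
};

/* Registration just fills in the embedded structure; it cannot fail. */
static void integrity_register_model(struct gendisk_model *disk,
                                     const struct integrity_profile_model *p,
                                     unsigned char tuple_size)
{
    disk->integrity.profile = p;
    disk->integrity.tuple_size = tuple_size;
}

/* Unregistration clears the embedded structure, including the pointer. */
static void integrity_unregister_model(struct gendisk_model *disk)
{
    memset(&disk->integrity, 0, sizeof(disk->integrity));
}

/* Callers key off the profile pointer, not off an allocation. */
static struct blk_integrity_model *
get_integrity_model(struct gendisk_model *disk)
{
    return disk->integrity.profile ? &disk->integrity : NULL;
}
```

Because nothing is allocated, a driver can register the profile before the disk is made active, which is the ordering the commit message says NVMe needs.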

17 Jul, 2015

2 commits

6c71013ec block: partition: convert percpu ref ... Browse Code »

Percpu refcount is the perfect match for partition's case,
and the conversion is quite straightforward.

With the conversion, one pair of atomic inc/dec can be saved
for accounting block I/O, which runs in the hot path of block I/O.

Signed-off-by: Ming Lei
Acked-by: Tejun Heo
Signed-off-by: Jens Axboe

Ming Lei
2015-07-17 22:41:53 +0800

b54e5ed8f block: partition: introduce hd_free_part() ... Browse Code »

So the helper can be used in both generic partition
case and part0 case.

Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe

Ming Lei
2015-07-17 22:41:53 +0800

04 Sep, 2014

1 commit

2da78092d block: Fix dev_t minor allocation lifetime ... Browse Code »

Releases the dev_t minor when all references are closed to prevent
another device from acquiring the same major/minor.

Since the partition's release may be invoked from call_rcu's soft-irq
context, the ext_dev_idr's mutex had to be replaced with a spinlock so
as not to sleep.

Signed-off-by: Keith Busch
Cc: stable@kernel.org
Signed-off-by: Jens Axboe

Keith Busch
2014-09-04 05:01:02 +0800

08 Apr, 2013

1 commit

c2fccc1c9 Revert "loop: cleanup partitions when detaching loop device" ... Browse Code »

This reverts commit 8761a3dc1f07b163414e2215a2cadbb4cfe2a107.

There are situations where the destruction path is called
with the bdev->bd_mutex already held, which then deadlocks in
loop_clr_fd(). The normal partition cleanup does a trylock()
on the mutex, but it'd be nice to have a more bulletproof
method in loop. So punt this more involved fix to the next
merge window, and just back out this buggy fix for now.

Signed-off-by: Jens Axboe

Jens Axboe
2013-04-08 16:12:11 +0800

23 Mar, 2013

1 commit

8761a3dc1 loop: cleanup partitions when detaching loop device ... Browse Code »

Any partitions added by user space to the loop device were being
left in place after detaching the loop device. This was because
the detach path issued a BLKRRPART to clean up partitions if
LO_FLAGS_PARTSCAN was set, meaning that the partitions were auto
scanned on attach. Replace this BLKRRPART with code that
unconditionally cleans up partitions on detach instead.

Signed-off-by: Phillip Susi

Modified by Jens to export delete_partition().

Signed-off-by: Jens Axboe

Phillip Susi
2013-03-23 02:21:53 +0800

28 Feb, 2013

2 commits

ac2e5327a block/partitions: optimize memory allocation in check_partition() ... Browse Code »

Currently, sizeof(struct parsed_partitions) may be 64KB on 32-bit
architectures, so check_partition() can easily trigger a page allocation
failure, especially when block devices are hotplugged (USB mass storage,
MMC cards, ...). Felipe Balbi has observed the failure.

This patch makes the following optimizations to the allocation of struct
parsed_partitions to try to address the issue:

- make parsed_partitions.parts a pointer so that the memory it points to
fits in a 32KB buffer, which saves approximately 32KB of memory

- vmalloc the buffer pointed to by parsed_partitions.parts, because 32KB is
still a bit big for kmalloc

- since many devices have a partition count limit, allocate only
disk_max_parts() partitions instead of always allocating 256

Signed-off-by: Ming Lei
Reported-by: Felipe Balbi
Cc: Jens Axboe
Reviewed-by: Yasuaki Ishimatsu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ming Lei
2013-02-28 11:10:21 +0800
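
The allocation strategy described in ac2e5327a can be sketched in userspace as follows. This is a simplified model, not the kernel code: the `*_model` types and functions are hypothetical, and `calloc()` stands in for the kernel allocators (a small allocation for the header, a separate larger one for the array).

```c
#include <stdlib.h>

/* Hypothetical per-partition record. */
struct part_entry_model {
    unsigned long long from;
    unsigned long long size;
    int has_info;
};

/* The array is no longer embedded; the header only keeps a pointer. */
struct parsed_partitions_model {
    int limit;                      /* number of entries in parts[] */
    struct part_entry_model *parts; /* separately allocated array */
};

/* Allocate only as many entries as the device can actually have. */
static struct parsed_partitions_model *
allocate_partitions_model(int max_parts)
{
    struct parsed_partitions_model *state;

    if (max_parts <= 0)
        return NULL;

    state = calloc(1, sizeof(*state));
    if (!state)
        return NULL;

    state->parts = calloc((size_t)max_parts, sizeof(state->parts[0]));
    if (!state->parts) {
        free(state);
        return NULL;
    }

    state->limit = max_parts;
    return state;
}

/* Free the array first, then the header. */
static void free_partitions_model(struct parsed_partitions_model *state)
{
    if (!state)
        return;
    free(state->parts);
    free(state);
}
```

Splitting the header from the array keeps the header allocation tiny, and sizing the array from the device's own limit avoids paying for 256 entries on devices that support far fewer.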

7b74e9127 block: fix ext_devt_idr handling ... Browse Code »

While adding and removing a lot of disks and partitions, this
sometimes shows up:

WARNING: at fs/sysfs/dir.c:512 sysfs_add_one+0xc9/0x130() (Not tainted)
Hardware name:
sysfs: cannot create duplicate filename '/dev/block/259:751'
Modules linked in: raid1 autofs4 bnx2fc cnic uio fcoe libfcoe libfc 8021q scsi_transport_fc scsi_tgt garp stp llc sunrpc cpufreq_ondemand powernow_k8 freq_table mperf ipv6 dm_mirror dm_region_hash dm_log power_meter microcode dcdbas serio_raw amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core k10temp bnx2 sg ixgbe dca mdio ext4 mbcache jbd2 dm_round_robin sr_mod cdrom sd_mod crc_t10dif ata_generic pata_acpi pata_atiixp ahci mptsas mptscsih mptbase scsi_transport_sas dm_multipath dm_mod [last unloaded: scsi_wait_scan]
Pid: 44103, comm: async/16 Not tainted 2.6.32-195.el6.x86_64 #1
Call Trace:
warn_slowpath_common+0x87/0xc0
warn_slowpath_fmt+0x46/0x50
sysfs_add_one+0xc9/0x130
sysfs_do_create_link+0x12b/0x170
sysfs_create_link+0x13/0x20
device_add+0x317/0x650
idr_get_new+0x13/0x50
add_partition+0x21c/0x390
rescan_partitions+0x32b/0x470
sd_open+0x81/0x1f0 [sd_mod]
__blkdev_get+0x1b6/0x3c0
blkdev_get+0x10/0x20
register_disk+0x155/0x170
add_disk+0xa6/0x160
sd_probe_async+0x13b/0x210 [sd_mod]
add_wait_queue+0x46/0x60
async_thread+0x102/0x250
default_wake_function+0x0/0x20
async_thread+0x0/0x250
kthread+0x96/0xa0
child_rip+0xa/0x20
kthread+0x0/0xa0
child_rip+0x0/0x20

This most likely happens because dev_t is freed while the number is
still used and idr_get_new() is not protected on every use. The fix
adds a mutex where it wasn't before and moves the dev_t free function so
it is called after device del.

Signed-off-by: Tomas Henzl
Cc: Jens Axboe
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Tomas Henzl
2013-02-28 11:10:12 +0800

01 Aug, 2012

1 commit

c83f6bf98 block: add partition resize function to blkpg ioctl ... Browse Code »

Add a new operation code (BLKPG_RESIZE_PARTITION) to the BLKPG ioctl that
allows altering the size of an existing partition, even if it is currently
in use.

This patch protects hd_struct->nr_sects with a sequence counter because
one might extend a partition while IO is happening to it, and the update of
nr_sects can be non-atomic on 32-bit machines with a 64-bit sector_t. This
can lead to issues like reading an inconsistent partition size. A sequence
counter has been used so that readers don't have to take the bdev mutex,
as we call sector_in_part() very frequently.

Now all accesses to hd_struct->nr_sects should go through the sequence
counter read/update helper functions part_nr_sects_read/part_nr_sects_write.
There is one exception though, set_capacity()/get_capacity(). I think
theoretically a race should exist there too, but this patch does not
modify set_capacity()/get_capacity() due to the sheer number of call sites,
and I am afraid that the change might break something. I have left that as a
TODO item. We can handle it later if need be. This patch does not introduce
any new races as such w.r.t. set_capacity()/get_capacity().

v2: Add CONFIG_LBDAF test to UP preempt case as suggested by Phillip.

Signed-off-by: Vivek Goyal
Signed-off-by: Phillip Susi
Signed-off-by: Jens Axboe

Vivek Goyal
2012-08-01 18:24:18 +0800
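
The sequence-counter idea in c83f6bf98 can be sketched with C11 atomics. This is a userspace model under stated assumptions, not the kernel's seqcount implementation: the `*_model` names are hypothetical, the 64-bit size is split into two 32-bit halves to mimic a non-atomic update on a 32-bit machine, and writers are assumed to be serialized externally (one writer at a time).

```c
#include <stdatomic.h>
#include <stdint.h>

struct part_model {
    atomic_uint seq;     /* odd while an update is in progress */
    _Atomic uint32_t lo; /* low half of the sector count */
    _Atomic uint32_t hi; /* high half of the sector count */
};

static void part_init_model(struct part_model *p, uint64_t size)
{
    atomic_init(&p->seq, 0u);
    atomic_init(&p->lo, (uint32_t)size);
    atomic_init(&p->hi, (uint32_t)(size >> 32));
}

/* Writer: make seq odd, update both halves, then make seq even again. */
static void part_nr_sects_write_model(struct part_model *p, uint64_t size)
{
    unsigned s = atomic_load_explicit(&p->seq, memory_order_relaxed);

    atomic_store_explicit(&p->seq, s + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&p->lo, (uint32_t)size, memory_order_relaxed);
    atomic_store_explicit(&p->hi, (uint32_t)(size >> 32),
                          memory_order_relaxed);
    atomic_store_explicit(&p->seq, s + 2, memory_order_release);
}

/* Reader: retry if an update was in progress or completed meanwhile. */
static uint64_t part_nr_sects_read_model(struct part_model *p)
{
    unsigned s0, s1;
    uint32_t lo, hi;

    do {
        s0 = atomic_load_explicit(&p->seq, memory_order_acquire);
        lo = atomic_load_explicit(&p->lo, memory_order_relaxed);
        hi = atomic_load_explicit(&p->hi, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        s1 = atomic_load_explicit(&p->seq, memory_order_relaxed);
    } while ((s0 & 1u) || s0 != s1);

    return ((uint64_t)hi << 32) | lo;
}
```

Readers never take a lock; they only pay for a retry when they overlap with a resize, which fits a value that is read very frequently and written rarely.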