28 Feb, 2016

2 commits

  • Previously calls to dax_writeback_mapping_range() for all DAX filesystems
    (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().

    dax_writeback_mapping_range() needs a struct block_device, and it used
    to get that from inode->i_sb->s_bdev. This is correct for normal inodes
    mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
    block devices and for XFS real-time files.

    Instead, call dax_writeback_mapping_range() directly from the filesystem
    ->writepages function so that it can supply us with a valid block
    device. This also fixes DAX code to properly flush caches in response
    to sync(2).

    Signed-off-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Cc: Al Viro
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Jens Axboe
    Cc: Matthew Wilcox
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
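    A rough illustration of the resulting call shape, assuming the 4.5-era
    three-argument dax_writeback_mapping_range(mapping, bdev, wbc) form; the
    hook name and the choice of s_bdev below are illustrative, not the exact
    upstream code (an XFS real-time file or a raw block device inode would
    pass a different bdev):

        /* illustrative ->writepages hook for a DAX-capable filesystem */
        static int example_dax_writepages(struct address_space *mapping,
                                          struct writeback_control *wbc)
        {
                struct inode *inode = mapping->host;

                /*
                 * The filesystem, not the generic pagecache code, knows
                 * which block device actually backs this inode.
                 */
                return dax_writeback_mapping_range(mapping,
                                                   inode->i_sb->s_bdev, wbc);
        }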
     
  • The recent *sync enabling discovered that we are inserting into the
    block_device page cache in a way that runs counter to the expectations
    of the dirty-data tracking for DAX mappings. This can lead to data
    corruption.

    We want to support DAX for block devices eventually, but it requires
    wider changes to properly manage the pagecache.

    dump_stack+0x85/0xc2
    dax_writeback_mapping_range+0x60/0xe0
    blkdev_writepages+0x3f/0x50
    do_writepages+0x21/0x30
    __filemap_fdatawrite_range+0xc6/0x100
    filemap_write_and_wait+0x4a/0xa0
    set_blocksize+0x70/0xd0
    sb_set_blocksize+0x1d/0x50
    ext4_fill_super+0x75b/0x3360
    mount_bdev+0x180/0x1b0
    ext4_mount+0x15/0x20
    mount_fs+0x38/0x170

    Mark the support as broken so it is disabled by default, but otherwise still
    available for testing.

    Signed-off-by: Dan Williams
    Signed-off-by: Ross Zwisler
    Reported-by: Ross Zwisler
    Suggested-by: Dave Chinner
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Cc: Matthew Wilcox
    Cc: Al Viro
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

06 Feb, 2016

1 commit

  • Previously the pfn_mkwrite() fault handler for raw block devices called
    blkdev_dax_fault() -> __dax_fault() to do a full DAX page fault.

    Really what the pfn_mkwrite() fault handler needs to do is call
    dax_pfn_mkwrite() to make sure that the radix tree entry for the given
    PTE is marked as dirty so that a follow-up fsync or msync call will
    flush it durably to media.

    Fixes: 5a023cdba50c ("block: enable dax for raw block devices")
    Signed-off-by: Ross Zwisler
    Cc: Alexander Viro
    Cc: Dan Williams
    Cc: Dave Chinner
    Reviewed-by: Jan Kara
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
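    A sketch of the shape of the fixed handler, assuming the 4.5-era
    dax_pfn_mkwrite(vma, vmf) prototype; the handler and vm_ops names are
    illustrative, and example_dax_fault stands in for the existing full
    fault handler:

        /* pfn_mkwrite: just mark the radix tree entry dirty, no full fault */
        static int example_dax_pfn_mkwrite(struct vm_area_struct *vma,
                                           struct vm_fault *vmf)
        {
                /*
                 * dax_pfn_mkwrite() tags the entry for the faulting offset
                 * as dirty so a later fsync()/msync() flushes it to media.
                 */
                return dax_pfn_mkwrite(vma, vmf);
        }

        static const struct vm_operations_struct example_dax_vm_ops = {
                .fault       = example_dax_fault,       /* -> __dax_fault() */
                .pfn_mkwrite = example_dax_pfn_mkwrite,
        };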
     

31 Jan, 2016

1 commit

  • Dynamically enabling DAX requires that the page cache first be flushed
    and invalidated. This must occur atomically with the change of DAX mode,
    otherwise we confuse the fsync/msync tracking and violate data
    durability guarantees. Eliminate the possibility of DAX-disabled to
    DAX-enabled transitions for now and revisit this for the next cycle.

    Cc: Jan Kara
    Cc: Jeff Moyer
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Matthew Wilcox
    Cc: Andrew Morton
    Cc: Ross Zwisler
    Signed-off-by: Dan Williams

    Dan Williams
     

24 Jan, 2016

1 commit

  • Pull final vfs updates from Al Viro:

    - The ->i_mutex wrappers (with small prereq in lustre)

    - a fix for too early freeing of symlink bodies on shmem (they need to
    be RCU-delayed) (-stable fodder)

    - followup to dedupe stuff merged this cycle

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    vfs: abort dedupe loop if fatal signals are pending
    make sure that freeing shmem fast symlinks is RCU-delayed
    wrappers for ->i_mutex access
    lustre: remove unused declaration

    Linus Torvalds
     

23 Jan, 2016

2 commits

  • Add support for tracking dirty DAX entries in the struct address_space
    radix tree. This tree is already used for dirty page writeback, and it
    already supports the use of exceptional (non struct page*) entries.

    In order to properly track dirty DAX pages we will insert new
    exceptional entries into the radix tree that represent dirty DAX PTE or
    PMD pages. These exceptional entries will also contain the writeback
    addresses for the PTE or PMD faults that we can use at fsync/msync time.

    There are currently two types of exceptional entries (shmem and shadow)
    that can be placed into the radix tree, and this adds a third. We rely
    on the fact that only one type of exceptional entry can be found in a
    given radix tree based on its usage. This happens for free with DAX vs
    shmem but we explicitly prevent shadow entries from being added to radix
    trees for DAX mappings.

    The only shadow entries that would be generated for DAX radix trees
    would be to track zero page mappings that were created for holes. These
    pages would receive minimal benefit from having shadow entries, and the
    choice to have only one type of exceptional entry in a given radix tree
    makes the logic simpler both in clear_exceptional_entry() and in the
    rest of DAX.

    Signed-off-by: Ross Zwisler
    Cc: "H. Peter Anvin"
    Cc: "J. Bruce Fields"
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Dave Chinner
    Cc: Ingo Molnar
    Cc: Jan Kara
    Cc: Jeff Layton
    Cc: Matthew Wilcox
    Cc: Thomas Gleixner
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
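    A rough sketch of the idea, using the 4.5-era radix tree API; the entry
    encoding below is illustrative and simpler than the RADIX_DAX_* layout
    actually used by fs/dax.c (radix_tree_preload() and -EEXIST handling are
    omitted for brevity):

        /*
         * Pack a writeback address (sector) into a non-pointer radix tree
         * entry; the RADIX_TREE_EXCEPTIONAL_ENTRY bit marks it as
         * exceptional so it is never mistaken for a struct page pointer.
         */
        static void *example_dax_radix_entry(sector_t sector)
        {
                return (void *)((unsigned long)sector << RADIX_TREE_EXCEPTIONAL_SHIFT |
                                RADIX_TREE_EXCEPTIONAL_ENTRY);
        }

        static int example_mark_dax_entry_dirty(struct address_space *mapping,
                                                pgoff_t index, sector_t sector)
        {
                int error;

                spin_lock_irq(&mapping->tree_lock);
                error = radix_tree_insert(&mapping->page_tree, index,
                                          example_dax_radix_entry(sector));
                if (!error)
                        /* tagged entries are what fsync/msync walk later */
                        radix_tree_tag_set(&mapping->page_tree, index,
                                           PAGECACHE_TAG_DIRTY);
                spin_unlock_irq(&mapping->tree_lock);
                return error;
        }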
     
  • Parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested}(), add
    inode_foo(inode) wrappers, each equivalent to mutex_foo(&inode->i_mutex).

    Please use those for access to ->i_mutex; over the coming cycle
    ->i_mutex will become an rwsem, with ->lookup() done with it held
    only shared.

    Signed-off-by: Al Viro

    Al Viro
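    The wrappers themselves are trivial one-liners following the
    inode_foo() = mutex_foo(&inode->i_mutex) scheme described above; a few
    of them, in essence:

        static inline void inode_lock(struct inode *inode)
        {
                mutex_lock(&inode->i_mutex);
        }

        static inline void inode_unlock(struct inode *inode)
        {
                mutex_unlock(&inode->i_mutex);
        }

        static inline int inode_trylock(struct inode *inode)
        {
                return mutex_trylock(&inode->i_mutex);
        }

    Callers then write inode_lock(inode) / inode_unlock(inode) instead of
    touching ->i_mutex directly, which is what allows the underlying lock to
    be switched to an rwsem later without touching every filesystem.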
     

20 Jan, 2016

1 commit

  • Pull core block updates from Jens Axboe:
    "We don't have a lot of core changes this time around, it's mostly in
    drivers, which will come in a subsequent pull.

    The core changes include:

    - blk-mq
        - Prep patch from Christoph, changing blk_mq_alloc_request() to
          take flags instead of just using gfp_t for sleep/nosleep.
        - Doc patch from me, clarifying the difference between legacy
          and blk-mq for timer usage.
        - Fixes from Raghavendra for memory-less numa nodes, and a reuse
          of CPU masks.

    - Cleanup from Geliang Tang, using offset_in_page() instead of open
    coding it.

    - From Ilya, rename the request_queue slab so it reflects what it holds,
    and a fix for proper use of bdgrab/put.

    - A real fix for the split across stripe boundaries from Keith. We
    yanked a broken version of this from 4.4-rc final, this one works.

    - From Mike Krinkin, emit a trace message when we split.

    - From Wei Tang, two small cleanups, not explicitly clearing memory
    that is already cleared"

    * 'for-4.5/core' of git://git.kernel.dk/linux-block:
    block: use bd{grab,put}() instead of open-coding
    block: split bios to max possible length
    block: add call to split trace point
    blk-mq: Avoid memoryless numa node encoded in hctx numa_node
    blk-mq: Reuse hardware context cpumask for tags
    blk-mq: add a flags parameter to blk_mq_alloc_request
    Revert "blk-flush: Queue through IO scheduler when flush not required"
    block: clarify blk_add_timer() use case for blk-mq
    bio: use offset_in_page macro
    block: do not initialise statics to 0 or NULL
    block: do not initialise globals to 0 or NULL
    block: rename request_queue slab cache

    Linus Torvalds
     

16 Jan, 2016

2 commits

  • The DAX implementation needs to protect new calls to ->direct_access()
    and usage of its return value against the driver for the underlying
    block device being disabled. Use blk_queue_enter()/blk_queue_exit() to
    hold off blk_cleanup_queue() from proceeding, or otherwise fail new
    mapping requests if the request_queue is being torn down.

    This also introduces blk_dax_ctl to simplify the interface from fs/dax.c
    through dax_map_atomic() to bdev_direct_access().

    [willy@linux.intel.com: fix read() of a hole]
    Signed-off-by: Dan Williams
    Reviewed-by: Jeff Moyer
    Cc: Jan Kara
    Cc: Jens Axboe
    Cc: Dave Chinner
    Cc: Ross Zwisler
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
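    A hedged sketch of the pattern, modeled on the dax_map_atomic() idea;
    the function names are illustrative and the second argument to
    blk_queue_enter() reflects the 4.4/4.5-era gfp_t-based prototype:

        /* map: take a queue reference so blk_cleanup_queue() is held off */
        static long example_dax_map(struct block_device *bdev,
                                    struct blk_dax_ctl *dax)
        {
                struct request_queue *q = bdev->bd_queue;
                long rc;

                if (blk_queue_enter(q, GFP_NOWAIT) != 0)
                        return -ENXIO;          /* queue is being torn down */

                rc = bdev_direct_access(bdev, dax);
                if (rc < 0)
                        blk_queue_exit(q);      /* nothing mapped, drop ref */
                return rc;
        }

        /* unmap: drop the reference once the returned address is unused */
        static void example_dax_unmap(struct block_device *bdev)
        {
                blk_queue_exit(bdev->bd_queue);
        }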
     
  • If a ->direct_access() implementation ever returns a map count less than
    PAGE_SIZE, catch the error in bdev_direct_access(). This simplifies
    error checking in upper layers.

    Signed-off-by: Dan Williams
    Reported-by: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

15 Jan, 2016

2 commits

  • bdev_write_page() is used by swapout and by writepage where we cannot
    use __GFP_FS or __GFP_IO. So it is misleading to mention GFP_KERNEL
    here.

    blk_queue_enter() only actually looks at __GFP_DIRECT_RECLAIM, so no
    bugs were harmed in the making of this patch.

    Cc: Dan Williams
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Mark those kmem allocations that are known to be easily triggered from
    userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
    memcg. For the list, see below:

    - threadinfo
    - task_struct
    - task_delay_info
    - pid
    - cred
    - mm_struct
    - vm_area_struct and vm_region (nommu)
    - anon_vma and anon_vma_chain
    - signal_struct
    - sighand_struct
    - fs_struct
    - files_struct
    - fdtable and fdtable->full_fds_bits
    - dentry and external_name
    - inode for all filesystems. This is the most tedious part, because
    most filesystems override the alloc_inode method.

    The list is far from complete, so feel free to add more objects.
    Nevertheless, it should be close to "account everything" approach and
    keep most workloads within bounds. Malevolent users will be able to
    breach the limit, but this was possible even with the former "account
    everything" approach (simply because it did not account everything in
    fact).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
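    For reference, opting an allocation into memcg accounting is a one-flag
    change at the allocation site; the struct, cache, and variable names
    below are placeholders:

        struct example_struct { int payload; };
        static struct kmem_cache *example_cachep;
        static void *example_buf;

        static int __init example_init(void)
        {
                /* objects from this cache are charged to the allocating memcg */
                example_cachep = kmem_cache_create("example_struct",
                                        sizeof(struct example_struct), 0,
                                        SLAB_ACCOUNT, NULL);
                if (!example_cachep)
                        return -ENOMEM;

                /* a one-off allocation can opt in per call site instead */
                example_buf = kmalloc(PAGE_SIZE, GFP_KERNEL | __GFP_ACCOUNT);
                return example_buf ? 0 : -ENOMEM;
        }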
     

14 Jan, 2016

2 commits

  • Pull libnvdimm updates from Dan Williams:
    "The bulk of this has appeared in -next and independently received a
    build success notification from the kbuild robot. The 'for-4.5/block-
    dax' topic branch was rebased over the weekend to drop the "block
    device end-of-life" rework that Al would like to see re-implemented
    with a notifier, and to address bug reports against the badblocks
    integration.

    There is pending feedback against "libnvdimm: Add a poison list and
    export badblocks" received last week. Linda identified some localized
    fixups that we will handle incrementally.

    Summary:

    - Media error handling: The 'badblocks' implementation that
    originated in md-raid is up-levelled to a generic capability of a
    block device. This initial implementation is limited to being
    consulted in the pmem block-i/o path. Later, 'badblocks' will be
    consulted when creating dax mappings.

    - Raw block device dax: For virtualization and other cases that want
    large contiguous mappings of persistent memory, add the capability
    to dax-mmap a block device directly.

    - Increased /dev/mem restrictions: Add an option to treat all
    io-memory as IORESOURCE_EXCLUSIVE, i.e. disable /dev/mem access
    while a driver is actively using an address range. This behavior
    is controlled via the new CONFIG_IO_STRICT_DEVMEM option and can be
    overridden by the existing "iomem=relaxed" kernel command line
    option.

    - Miscellaneous fixes include a 'pfn'-device huge page alignment fix,
    block device shutdown crash fix, and other small libnvdimm fixes"

    * tag 'libnvdimm-for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (32 commits)
    block: kill disk_{check|set|clear|alloc}_badblocks
    libnvdimm, pmem: nvdimm_read_bytes() badblocks support
    pmem, dax: disable dax in the presence of bad blocks
    pmem: fail io-requests to known bad blocks
    libnvdimm: convert to statically allocated badblocks
    libnvdimm: don't fail init for full badblocks list
    block, badblocks: introduce devm_init_badblocks
    block: clarify badblocks lifetime
    badblocks: rename badblocks_free to badblocks_exit
    libnvdimm, pmem: move definition of nvdimm_namespace_add_poison to nd.h
    libnvdimm: Add a poison list and export badblocks
    nfit_test: Enable DSMs for all test NFITs
    md: convert to use the generic badblocks code
    block: Add badblock management for gendisks
    badblocks: Add core badblock management code
    block: fix del_gendisk() vs blkdev_ioctl crash
    block: enable dax for raw block devices
    block: introduce bdev_file_inode()
    restrict /dev/mem to idle io memory ranges
    arch: consolidate CONFIG_STRICT_DEVM in lib/Kconfig.debug
    ...

    Linus Torvalds
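    A hedged sketch of how a block driver's I/O path can consult the new
    per-gendisk badblocks list; names other than badblocks_check() and
    disk->bb are illustrative, and the return-value convention is summarized
    in the comments rather than quoted from the source:

        /* returns -EIO if the requested range overlaps a known-bad region */
        static int example_check_badblocks(struct gendisk *disk,
                                           sector_t sector,
                                           unsigned int nr_sectors)
        {
                sector_t first_bad;
                int num_bad;

                if (!disk->bb)
                        return 0;       /* no badblocks list registered */

                if (badblocks_check(disk->bb, sector, nr_sectors,
                                    &first_bad, &num_bad))
                        return -EIO;    /* range touches a bad block */

                return 0;
        }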
     
  • - bd_acquire() and bd_forget() open-code bdgrab() and bdput()
    - raw driver uses igrab() but never checks its return value and always
    holds another ref from bind_set() while calling it, so it's
    equivalent to bdgrab()

    Signed-off-by: Ilya Dryomov
    Signed-off-by: Jens Axboe

    Ilya Dryomov
     

09 Jan, 2016

2 commits

  • If an application wants exclusive access to all of the persistent memory
    provided by an NVDIMM namespace it can use this raw-block-dax facility
    to forgo establishing a filesystem. This capability is targeted
    primarily to hypervisors wanting to provision persistent memory for
    guests. It can be disabled / enabled dynamically via the new BLKDAXSET
    ioctl.

    Cc: Jeff Moyer
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Andrew Morton
    Cc: Ross Zwisler
    Reported-by: kbuild test robot
    Reviewed-by: Jan Kara
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Similar to the file_inode() helper, provide a helper to look up the
    inode for a raw block device itself.

    Cc: Al Viro
    Suggested-by: Jan Kara
    Reviewed-by: Jan Kara
    Reviewed-by: Jeff Moyer
    Signed-off-by: Dan Williams

    Dan Williams
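    The helper is essentially a one-liner along these lines, mirroring what
    file_inode() does for regular files:

        static inline struct inode *bdev_file_inode(struct file *file)
        {
                return file->f_mapping->host;
        }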
     


05 Dec, 2015

1 commit

  • Since 52ebea749aae ("writeback: make backing_dev_info host
    cgroup-specific bdi_writebacks") an inode, at some point in its
    lifetime, gets attached to a wb (struct bdi_writeback). Detaching
    happens on evict, in inode_detach_wb() called from __destroy_inode(),
    and involves updating the wb.

    However, detaching an internal bdev inode from its wb in
    __destroy_inode() is too late. Its bdi, and by extension its root wb,
    are embedded into struct request_queue, which has different lifetime
    rules and can be freed long before the final bdput() is called (which
    can happen from __fput() of a corresponding /dev inode, through
    dput() -> evict() -> bd_forget()). bdevs hold onto the underlying
    disk/queue pair only while opened; as soon as the bdev is closed all
    bets are off. In fact, the disk/queue can be gone before
    __blkdev_put() even returns:

    1499 static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
    1500 {
    ...
    1518         if (bdev->bd_contains == bdev) {
    1519                 if (disk->fops->release)
    1520                         disk->fops->release(disk, mode);

                          [ Driver puts its references to disk/queue ]

    1521         }
    1522         if (!bdev->bd_openers) {
    1523                 struct module *owner = disk->fops->owner;
    1524
    1525                 disk_put_part(bdev->bd_part);
    1526                 bdev->bd_part = NULL;
    1527                 bdev->bd_disk = NULL;
    1528                 if (bdev != bdev->bd_contains)
    1529                         victim = bdev->bd_contains;
    1530                 bdev->bd_contains = NULL;
    1531
    1532                 put_disk(disk);

                          [ We put ours; the queue is gone.
                            The last bdput() would result in a write to invalid memory. ]

    1533                 module_put(owner);
    ...
    1539 }

    Since bdev inodes are special anyway, detach them in __blkdev_put()
    after clearing inode's dirty bits, turning the problematic
    inode_detach_wb() in __destroy_inode() into a noop.

    add_disk() grabs its disk->queue since 523e1d399ce0 ("block: make
    gendisk hold a reference to its queue"), so the old ->release comment
    is removed in favor of the new inode_detach_wb() comment.

    Cc: stable@vger.kernel.org # 4.2+, needs backporting
    Signed-off-by: Ilya Dryomov
    Acked-by: Tejun Heo
    Tested-by: Raghavendra K T
    Signed-off-by: Jens Axboe

    Ilya Dryomov
     


20 Nov, 2015

1 commit

  • Fix use after free crashes like the following:

    general protection fault: 0000 [#1] SMP
    Call Trace:
    [] ? pmem_do_bvec.isra.12+0xa6/0xf0 [nd_pmem]
    [] pmem_rw_page+0x42/0x80 [nd_pmem]
    [] bdev_read_page+0x50/0x60
    [] do_mpage_readpage+0x510/0x770
    [] ? I_BDEV+0x20/0x20
    [] ? lru_cache_add+0x1c/0x50
    [] mpage_readpages+0x107/0x170
    [] ? I_BDEV+0x20/0x20
    [] ? I_BDEV+0x20/0x20
    [] blkdev_readpages+0x1d/0x20
    [] __do_page_cache_readahead+0x28f/0x310
    [] ? __do_page_cache_readahead+0x169/0x310
    [] ? pagecache_get_page+0x2d/0x1d0
    [] filemap_fault+0x396/0x530
    [] __do_fault+0x4e/0xf0
    [] handle_mm_fault+0x11bd/0x1b50

    Cc:
    Cc: Jens Axboe
    Cc: Alexander Viro
    Reported-by: kbuild test robot
    Acked-by: Matthew Wilcox
    [willy: symmetry fixups]
    Signed-off-by: Dan Williams

    Dan Williams
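    The crash above is a device-teardown race against ->rw_page; the fix in
    this timeframe took a request_queue reference around the call. A hedged
    sketch of that pattern (the function name is illustrative, and the gfp
    argument to blk_queue_enter() reflects the 4.4-era prototype):

        int example_bdev_read_page(struct block_device *bdev, sector_t sector,
                                   struct page *page)
        {
                const struct block_device_operations *ops = bdev->bd_disk->fops;
                int rc;

                if (!ops->rw_page)
                        return -EOPNOTSUPP;

                /* hold off queue teardown while ->rw_page runs */
                rc = blk_queue_enter(bdev->bd_queue, GFP_KERNEL);
                if (rc)
                        return rc;

                rc = ops->rw_page(bdev, sector + get_start_sect(bdev),
                                  page, READ);
                blk_queue_exit(bdev->bd_queue);
                return rc;
        }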
     

12 Nov, 2015

1 commit

  • If a block device is hot removed and later last reference to device
    is put, we try to writeback the dirty inode. But device is gone and
    that writeback fails.

    Currently we do a WARN_ON() which does not seem to be the right thing.
    Convert it to a ratelimited kernel warning.

    Reported-by: Andi Kleen
    Signed-off-by: Vivek Goyal
    Acked-by: Tejun Heo
    [jmoyer@redhat.com: get rid of unnecessary name initialization, 80 cols]
    Signed-off-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Vivek Goyal
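    The resulting warning is a rate-limited printk rather than a WARN_ON();
    a minimal sketch, assuming an inode and bdev in scope at the writeback
    call site:

        int ret = write_inode_now(inode, true);

        if (ret) {
                char name[BDEVNAME_SIZE];

                /* device may already be gone; complain, but ratelimited */
                pr_warn_ratelimited("VFS: Dirty inode writeback failed for block device %s (err=%d).\n",
                                    bdevname(bdev, name), ret);
        }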
     

22 Oct, 2015

1 commit

  • Up until now the integrity profile has been dynamically allocated and
    attached to struct gendisk after the disk has been made active.

    This causes problems because NVMe devices need to register the profile
    prior to the partition table being read due to a mandatory metadata
    buffer requirement. In addition, DM goes through hoops to deal with
    preallocating, but not initializing integrity profiles.

    Since the integrity profile is small (4 bytes + a pointer), Christoph
    suggested moving it to struct gendisk proper. This requires several
    changes:

    - Moving the blk_integrity definition to genhd.h.

    - Inlining blk_integrity in struct gendisk.

    - Removing the dynamic allocation code.

    - Adding helper functions which allow gendisk to set up and tear down
    the integrity sysfs dir when a disk is added/deleted.

    - Adding a blk_integrity_revalidate() callback for updating the stable
    pages bdi setting.

    - The calls that depend on whether a device has an integrity profile or
    not now key off of the bi->profile pointer.

    - Simplifying the integrity support routines in DM (Mike Snitzer).

    Signed-off-by: Martin K. Petersen
    Reported-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Mike Snitzer
    Cc: Dan Williams
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     


09 Sep, 2015

3 commits

  • Merge second patch-bomb from Andrew Morton:
    "Almost all of the rest of MM. There was an unusually large amount of
    MM material this time"

    * emailed patches from Andrew Morton : (141 commits)
    zpool: remove no-op module init/exit
    mm: zbud: constify the zbud_ops
    mm: zpool: constify the zpool_ops
    mm: swap: zswap: maybe_preload & refactoring
    zram: unify error reporting
    zsmalloc: remove null check from destroy_handle_cache()
    zsmalloc: do not take class lock in zs_shrinker_count()
    zsmalloc: use class->pages_per_zspage
    zsmalloc: consider ZS_ALMOST_FULL as migrate source
    zsmalloc: partial page ordering within a fullness_list
    zsmalloc: use shrinker to trigger auto-compaction
    zsmalloc: account the number of compacted pages
    zsmalloc/zram: introduce zs_pool_stats api
    zsmalloc: cosmetic compaction code adjustments
    zsmalloc: introduce zs_can_compact() function
    zsmalloc: always keep per-class stats
    zsmalloc: drop unused variable `nr_to_migrate'
    mm/memblock.c: fix comment in __next_mem_range()
    mm/page_alloc.c: fix type information of memoryless node
    memory-hotplug: fix comments in zone_spanned_pages_in_node() and zone_spanned_pages_in_node()
    ...

    Linus Torvalds
     
  • In order to handle the !CONFIG_TRANSPARENT_HUGEPAGES case, we need to
    return VM_FAULT_FALLBACK from the inlined dax_pmd_fault();
    VM_FAULT_FALLBACK is defined in <linux/mm.h>. Given that we don't want
    to include <linux/mm.h> in <linux/fs.h>, the easiest solution is to
    move the DAX-related functions to a new header, <linux/dax.h>. We could
    also have moved the VM_FAULT_* definitions to a new header, or to a
    different header that isn't quite such a boil-the-ocean header as
    <linux/mm.h>, but this felt like the best option.

    Signed-off-by: Matthew Wilcox
    Cc: Hillf Danton
    Cc: "Kirill A. Shutemov"
    Cc: Theodore Ts'o
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Pull libnvdimm updates from Dan Williams:
    "This update has successfully completed a 0day-kbuild run and has
    appeared in a linux-next release. The changes outside of the typical
    drivers/nvdimm/ and drivers/acpi/nfit.[ch] paths are related to the
    removal of IORESOURCE_CACHEABLE, the introduction of memremap(), and
    the introduction of ZONE_DEVICE + devm_memremap_pages().

    Summary:

    - Introduce ZONE_DEVICE and devm_memremap_pages() as a generic
    mechanism for adding device-driver-discovered memory regions to the
    kernel's direct map.

    This facility is used by the pmem driver to enable pfn_to_page()
    operations on the page frames returned by DAX ('direct_access' in
    'struct block_device_operations').

    For now, the 'memmap' allocation for these "device" pages comes
    from "System RAM". Support for allocating the memmap from device
    memory will arrive in a later kernel.

    - Introduce memremap() to replace usages of ioremap_cache() and
    ioremap_wt(). memremap() drops the __iomem annotation for these
    mappings to memory that do not have i/o side effects. The
    replacement of ioremap_cache() with memremap() is limited to the
    pmem driver to ease merging the api change in v4.3.

    Completion of the conversion is targeted for v4.4.

    - Similar to the usage of memcpy_to_pmem() + wmb_pmem() in the pmem
    driver, update the VFS DAX implementation and PMEM api to provide
    persistence guarantees for kernel operations on a DAX mapping.

    - Convert the ACPI NFIT 'BLK' driver to map the block apertures as
    cacheable to improve performance.

    - Miscellaneous updates and fixes to libnvdimm including support for
    issuing "address range scrub" commands, clarifying the optimal
    'sector size' of pmem devices, a clarification of the usage of the
    ACPI '_STA' (status) property for DIMM devices, and other minor
    fixes"

    * tag 'libnvdimm-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (34 commits)
    libnvdimm, pmem: direct map legacy pmem by default
    libnvdimm, pmem: 'struct page' for pmem
    libnvdimm, pfn: 'struct page' provider infrastructure
    x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB
    add devm_memremap_pages
    mm: ZONE_DEVICE for "device memory"
    mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h
    dax: drop size parameter to ->direct_access()
    nd_blk: change aperture mapping from WC to WB
    nvdimm: change to use generic kvfree()
    pmem, dax: have direct_access use __pmem annotation
    dax: update I/O path to do proper PMEM flushing
    pmem: add copy_from_iter_pmem() and clear_pmem()
    pmem, x86: clean up conditional pmem includes
    pmem: remove layer when calling arch_has_wmb_pmem()
    pmem, x86: move x86 PMEM API to new pmem.h header
    libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option
    pmem: switch to devm_ allocations
    devres: add devm_memremap
    libnvdimm, btt: write and validate parent_uuid
    ...

    Linus Torvalds
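    For the memremap() piece, the conversion is mostly mechanical; a minimal
    sketch, assuming a struct resource describing the persistent-memory
    range (the function name is illustrative):

        static void *example_map_pmem(struct resource *res)
        {
                /*
                 * memremap() replaces ioremap_cache()/ioremap_wt() for
                 * ranges without I/O side effects; the returned pointer
                 * carries no __iomem annotation.
                 */
                return memremap(res->start, resource_size(res), MEMREMAP_WB);
        }

    The counterpart for page-frame-backed mappings is devm_memremap_pages(),
    which additionally arranges struct page coverage for the range
    (ZONE_DEVICE), as described above.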
     


21 Aug, 2015

1 commit

  • Update the annotation for the kaddr pointer returned by direct_access()
    so that it is a __pmem pointer. This is consistent with the PMEM driver
    and with how this direct_access() pointer is used in the DAX code.

    Signed-off-by: Ross Zwisler
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dan Williams

    Ross Zwisler
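    With that change the ->direct_access prototype in this era takes a
    void __pmem ** for the kernel address; a hedged sketch of a driver
    implementation (the example_dev structure and its fields are
    placeholders):

        struct example_dev {
                void __pmem     *virt_addr;
                phys_addr_t     phys_addr;
                size_t          size;
        };

        static long example_direct_access(struct block_device *bdev,
                                          sector_t sector,
                                          void __pmem **kaddr,
                                          unsigned long *pfn)
        {
                struct example_dev *dev = bdev->bd_disk->private_data;
                resource_size_t offset = sector << 9;

                *kaddr = dev->virt_addr + offset;
                *pfn = (dev->phys_addr + offset) >> PAGE_SHIFT;

                /* bytes usable via *kaddr from this offset onwards */
                return dev->size - offset;
        }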
     

18 Aug, 2015

1 commit

  • The process of reducing contention on per-superblock inode lists
    starts with moving the locking to match the per-superblock inode
    list. This takes the global lock out of the picture and reduces the
    contention problems to within a single filesystem. This doesn't get
    rid of contention as the locks still have global CPU scope, but it
    does isolate operations on different superblocks from each other.

    Signed-off-by: Dave Chinner
    Signed-off-by: Josef Bacik
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Tested-by: Dave Chinner

    Dave Chinner
     

05 Jul, 2015

3 commits

  • Pull more vfs updates from Al Viro:
    "Assorted VFS fixes and related cleanups (IMO the most interesting in
    that part are f_path-related things and Eric's descriptor-related
    stuff). UFS regression fixes (it got broken last cycle). 9P fixes.
    fs-cache series, DAX patches, Jan's file_remove_suid() work"

    [ I'd say this is much more than "fixes and related cleanups". The
    file_table locking rule change by Eric Dumazet is a rather big and
    fundamental update even if the patch isn't huge. - Linus ]

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
    9p: cope with bogus responses from server in p9_client_{read,write}
    p9_client_write(): avoid double p9_free_req()
    9p: forgetting to cancel request on interrupted zero-copy RPC
    dax: bdev_direct_access() may sleep
    block: Add support for DAX reads/writes to block devices
    dax: Use copy_from_iter_nocache
    dax: Add block size note to documentation
    fs/file.c: __fget() and dup2() atomicity rules
    fs/file.c: don't acquire files->file_lock in fd_install()
    fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
    vfs: avoid creation of inode number 0 in get_next_ino
    namei: make set_root_rcu() return void
    make simple_positive() public
    ufs: use dir_pages instead of ufs_dir_pages()
    pagemap.h: move dir_pages() over there
    remove the pointless include of lglock.h
    fs: cleanup slight list_entry abuse
    xfs: Correctly lock inode when removing suid and file capabilities
    fs: Call security_ops->inode_killpriv on truncate
    fs: Provide function telling whether file_remove_privs() will do anything
    ...

    Linus Torvalds
     
  • The brd driver is the only in-tree driver that may sleep currently.
    After some discussion on linux-fsdevel, we decided that any driver
    may choose to sleep in its ->direct_access method. To ensure that all
    callers of bdev_direct_access() are prepared for this, add a call
    to might_sleep().

    Signed-off-by: Matthew Wilcox
    Signed-off-by: Al Viro

    Matthew Wilcox
     
  • If a block device supports the ->direct_access methods, bypass the normal
    DIO path and use DAX to go straight to memcpy() instead of allocating
    a DIO and a BIO.

    Includes support for the DIO_SKIP_DIO_COUNT flag in DAX, as is done in
    do_blockdev_direct_IO().

    Signed-off-by: Matthew Wilcox
    Signed-off-by: Al Viro

    Matthew Wilcox
     

01 Jul, 2015

1 commit

  • Pull more block layer patches from Jens Axboe:
    "A few later arrivers that I didn't fold into the first pull request,
    so we had a chance to run some testing. This contains:

    - NVMe:
    - Set of fixes from Keith
    - 4.4 and earlier gcc build fix from Andrew

    - small set of xen-blk{back,front} fixes from Bob Liu.

    - warnings fix for bogus inline statement in I_BDEV() from Geert.

    - error code fixup for SG_IO ioctl from Paolo Bonzini"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    drivers/block/nvme-core.c: fix build with gcc-4.4.4
    bdi: Remove "inline" keyword from exported I_BDEV() implementation
    block: fix bogus EFAULT error from SG_IO ioctl
    NVMe: Fix filesystem deadlock on removal
    NVMe: Failed controller initialization fixes
    NVMe: Unify controller probe and resume
    NVMe: Don't use fake status on cancelled command
    NVMe: Fix device cleanup on initialization failure
    drivers: xen-blkfront: only talk_to_blkback() when in XenbusStateInitialising
    xen/block: add multi-page ring support
    driver: xen-blkfront: move talk_to_blkback to a more suitable place
    drivers: xen-blkback: delay pending_req allocation to connect_ring

    Linus Torvalds
     

30 Jun, 2015

1 commit

  • Pull libnvdimm subsystem from Dan Williams:
    "The libnvdimm sub-system introduces, in addition to the
    libnvdimm-core, 4 drivers / enabling modules:

    NFIT:
    Instantiates an "nvdimm bus" with the core and registers memory
    devices (NVDIMMs) enumerated by the ACPI 6.0 NFIT (NVDIMM Firmware
    Interface table).

    After registering NVDIMMs the NFIT driver then registers "region"
    devices. A libnvdimm-region defines an access mode and the
    boundaries of persistent memory media. A region may span multiple
    NVDIMMs that are interleaved by the hardware memory controller. In
    turn, a libnvdimm-region can be carved into a "namespace" device and
    bound to the PMEM or BLK driver which will attach a Linux block
    device (disk) interface to the memory.

    PMEM:
    Initially merged in v4.1 this driver for contiguous spans of
    persistent memory address ranges is re-worked to drive
    PMEM-namespaces emitted by the libnvdimm-core.

    In this update the PMEM driver, on x86, gains the ability to assert
    that writes to persistent memory have been flushed all the way
    through the caches and buffers in the platform to persistent media.
    See memcpy_to_pmem() and wmb_pmem().

    BLK:
    This new driver enables access to persistent memory media through
    "Block Data Windows" as defined by the NFIT. The primary difference
    of this driver to PMEM is that only a small window of persistent
    memory is mapped into system address space at any given point in
    time.

    Per-NVDIMM windows are reprogrammed at run time, per-I/O, to access
    different portions of the media. BLK-mode, by definition, does not
    support DAX.

    BTT:
    This is a library, optionally consumed by either PMEM or BLK, that
    converts a byte-accessible namespace into a disk with atomic sector
    update semantics (prevents sector tearing on crash or power loss).

    The sinister aspect of sector tearing is that most applications do
    not know they have an atomic-sector dependency. At least today's
    disks rarely ever tear sectors, and if they do, one almost certainly
    gets a CRC error on access. NVDIMMs will always tear, and always
    silently. Until an application is audited to be robust in the
    presence of sector tearing, the usage of BTT is recommended.

    Thanks to: Ross Zwisler, Jeff Moyer, Vishal Verma, Christoph Hellwig,
    Ingo Molnar, Neil Brown, Boaz Harrosh, Robert Elliott, Matthew Wilcox,
    Andy Rudoff, Linda Knippers, Toshi Kani, Nicholas Moulin, Rafael
    Wysocki, and Bob Moore"

    * tag 'libnvdimm-for-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm: (33 commits)
    arch, x86: pmem api for ensuring durability of persistent memory updates
    libnvdimm: Add sysfs numa_node to NVDIMM devices
    libnvdimm: Set numa_node to NVDIMM devices
    acpi: Add acpi_map_pxm_to_online_node()
    libnvdimm, nfit: handle unarmed dimms, mark namespaces read-only
    pmem: flag pmem block devices as non-rotational
    libnvdimm: enable iostat
    pmem: make_request cleanups
    libnvdimm, pmem: fix up max_hw_sectors
    libnvdimm, blk: add support for blk integrity
    libnvdimm, btt: add support for blk integrity
    fs/block_dev.c: skip rw_page if bdev has integrity
    libnvdimm: Non-Volatile Devices
    tools/testing/nvdimm: libnvdimm unit test infrastructure
    libnvdimm, nfit, nd_blk: driver for BLK-mode access persistent memory
    nd_btt: atomic sector updates
    libnvdimm: infrastructure for btt devices
    libnvdimm: write blk label set
    libnvdimm: write pmem label set
    libnvdimm: blk labels and namespace instantiation
    ...

    Linus Torvalds
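    The persistence API mentioned above pairs a flushing copy with a final
    barrier; a minimal sketch of the calling convention (the wrapper name is
    illustrative):

        #include <linux/pmem.h>

        static void example_write_pmem(void __pmem *dst, const void *src,
                                       size_t n)
        {
                memcpy_to_pmem(dst, src, n);    /* copy is flushed out of CPU caches */
                wmb_pmem();                     /* drain stores so the data is durable */
        }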
     

28 Jun, 2015

1 commit

  • With gcc 3.4.6/4.1.2/4.2.4 (not with 4.4.7/4.6.4/4.8.4):

    CC fs/block_dev.o
    include/linux/fs.h:804: warning: ‘I_BDEV’ declared inline after being called
    include/linux/fs.h:804: warning: previous declaration of ‘I_BDEV’ was here

    Commit a212b105b07d75b4 ("bdi: make inode_to_bdi() inline") added a
    caller of I_BDEV() in a header file, exposing the bogus "inline" on the
    exported implementation.

    Drop the "inline" keyword to fix this.

    Signed-off-by: Geert Uytterhoeven
    Signed-off-by: Jens Axboe

    Geert Uytterhoeven
     

26 Jun, 2015

1 commit

  • If a block device has bio integrity enabled, rw_page will bypass the
    integrity payload, which is undesirable. Skip rw_page if this is the
    case.

    Currently brd and zram provide rw_page, and the proposed 'nd' drivers
    will too.

    Cc: Jens Axboe
    Cc: Martin K. Petersen
    Suggested-by: Matthew Wilcox
    Signed-off-by: Vishal Verma
    Signed-off-by: Dan Williams

    Vishal Verma
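    The guard itself is a two-line check at the top of the rw_page wrappers;
    roughly, with ops being bdev->bd_disk->fops:

        /* fall back to the normal bio path if integrity metadata is in use */
        if (!ops->rw_page || bdev_get_integrity(bdev))
                return -EOPNOTSUPP;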
     

02 Jun, 2015

2 commits

  • Now that bdi definitions are moved to backing-dev-defs.h,
    backing-dev.h can include blkdev.h and inline inode_to_bdi() without
    worrying about introducing a circular include dependency. The function
    gets called from hot paths and is fairly trivial.

    This patch makes inode_to_bdi(), and the sb_is_blkdev_sb() helper that
    it calls, inline. blockdev_superblock and noop_backing_dev_info
    are EXPORT_GPL'd to allow the inline functions to be used from
    modules.

    While at it, make sb_is_blkdev_sb() return bool instead of int.

    v2: Fixed typo in description as suggested by Jan.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jens Axboe
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Tejun Heo
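    A sketch of the shape of the now-inline helpers (close to, but not
    necessarily byte-for-byte, the header version; the CONFIG_BLOCK=n case
    is omitted):

        static inline bool sb_is_blkdev_sb(struct super_block *sb)
        {
                return sb == blockdev_superblock;
        }

        static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
        {
                struct super_block *sb;

                if (!inode)
                        return &noop_backing_dev_info;

                sb = inode->i_sb;
                if (sb_is_blkdev_sb(sb))
                        return blk_get_backing_dev_info(I_BDEV(inode));

                return sb->s_bdi;
        }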
     
  • With the planned cgroup writeback support, backing-dev related
    declarations will be more widely used across block and cgroup;
    unfortunately, including backing-dev.h from include/linux/blkdev.h
    makes cyclic include dependency quite likely.

    This patch separates out backing-dev-defs.h which only has the
    essential definitions and updates blkdev.h to include it. c files
    which need access to more backing-dev details now include
    backing-dev.h directly. This takes backing-dev.h off the common
    include dependency chain making it a lot easier to use it across block
    and cgroup.

    v2: fs/fat build failure fixed.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Tejun Heo
     

25 Apr, 2015

1 commit

  • do_blockdev_direct_IO() increments and decrements the inode
    ->i_dio_count for each IO operation. It does this to protect against
    truncate of a file. Block devices don't need this sort of protection.

    For a capable multiqueue setup, this atomic int is the only shared
    state between applications accessing the device for O_DIRECT, and it
    presents a scaling wall for that. In my testing, as much as 30% of
    system time is spent incrementing and decrementing this value. A mixed
    read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
    better latencies too. Before:

    clat percentiles (usec):
    | 1.00th=[ 33], 5.00th=[ 34], 10.00th=[ 34], 20.00th=[ 34],
    | 30.00th=[ 34], 40.00th=[ 34], 50.00th=[ 35], 60.00th=[ 35],
    | 70.00th=[ 35], 80.00th=[ 35], 90.00th=[ 37], 95.00th=[ 80],
    | 99.00th=[ 98], 99.50th=[ 151], 99.90th=[ 155], 99.95th=[ 155],
    | 99.99th=[ 165]

    After:

    clat percentiles (usec):
    | 1.00th=[ 95], 5.00th=[ 108], 10.00th=[ 129], 20.00th=[ 149],
    | 30.00th=[ 155], 40.00th=[ 161], 50.00th=[ 167], 60.00th=[ 171],
    | 70.00th=[ 177], 80.00th=[ 185], 90.00th=[ 201], 95.00th=[ 270],
    | 99.00th=[ 390], 99.50th=[ 398], 99.90th=[ 418], 99.95th=[ 422],
    | 99.99th=[ 438]

    In other setups, Robert Elliott reported seeing good performance
    improvements:

    https://lkml.org/lkml/2015/4/3/557

    The more applications accessing the device, the worse it gets.

    Add a new direct-io flag, DIO_SKIP_DIO_COUNT, which tells
    do_blockdev_direct_IO() that it need not worry about incrementing
    or decrementing the inode i_dio_count for this caller.

    Cc: Andrew Morton
    Cc: Christoph Hellwig
    Cc: Theodore Ts'o
    Cc: Elliott, Robert (Server Storage)
    Cc: Al Viro
    Signed-off-by: Jens Axboe
    Signed-off-by: Al Viro

    Jens Axboe
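    Inside do_blockdev_direct_IO() the flag simply gates the refcount
    manipulation; roughly, using the inode_dio_begin()/inode_dio_end()
    helpers from this same series:

        /* filesystems keep the truncate protection; block devices skip it */
        if (!(dio->flags & DIO_SKIP_DIO_COUNT))
                inode_dio_begin(inode);         /* atomic_inc(&inode->i_dio_count) */

        /* ... submit the I/O ... */

        if (!(dio->flags & DIO_SKIP_DIO_COUNT))
                inode_dio_end(inode);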
     

16 Apr, 2015

1 commit