09 Feb, 2017

1 commit

  • commit d1908f52557b3230fbd63c0429f3b4b748bf2b6d upstream.

    Tetsuo has noticed that an OOM stress test which performs large write
    requests can fully deplete the memory reserves. He has tracked this
    down to the following path:

    __alloc_pages_nodemask+0x436/0x4d0
    alloc_pages_current+0x97/0x1b0
    __page_cache_alloc+0x15d/0x1a0 mm/filemap.c:728
    pagecache_get_page+0x5a/0x2b0 mm/filemap.c:1331
    grab_cache_page_write_begin+0x23/0x40 mm/filemap.c:2773
    iomap_write_begin+0x50/0xd0 fs/iomap.c:118
    iomap_write_actor+0xb5/0x1a0 fs/iomap.c:190
    ? iomap_write_end+0x80/0x80 fs/iomap.c:150
    iomap_apply+0xb3/0x130 fs/iomap.c:79
    iomap_file_buffered_write+0x68/0xa0 fs/iomap.c:243
    ? iomap_write_end+0x80/0x80
    xfs_file_buffered_aio_write+0x132/0x390 [xfs]
    ? remove_wait_queue+0x59/0x60
    xfs_file_write_iter+0x90/0x130 [xfs]
    __vfs_write+0xe5/0x140
    vfs_write+0xc7/0x1f0
    ? syscall_trace_enter+0x1d0/0x380
    SyS_write+0x58/0xc0
    do_syscall_64+0x6c/0x200
    entry_SYSCALL64_slow_path+0x25/0x25

    The OOM victim has access to all memory reserves to make forward
    progress toward exiting easier. But iomap_file_buffered_write and
    other callers of iomap_apply loop to complete the full request. We
    need to check for fatal signals and back off with a short write
    instead.

    As iomap_apply delegates all the work down to the actor, we have to
    hook into the actors. All actors that work with the page cache call
    iomap_write_begin, so we check for signals there, as sketched below.
    dax_iomap_actor has to handle the situation explicitly because it
    copies data to userspace directly. Other actors either work on a
    single page (iomap_page_mkwrite) or do not allocate memory based on
    the given len (iomap_fiemap_actor).
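
    A minimal sketch of the check, as it lands at the top of
    iomap_write_begin() (abbreviated; only the added bail-out is shown
    in context):

    static int
    iomap_write_begin(struct inode *inode, loff_t pos, unsigned len,
                    unsigned flags, struct page **pagep, struct iomap *iomap)
    {
            pgoff_t index = pos >> PAGE_SHIFT;
            struct page *page;

            BUG_ON(pos + len > iomap->offset + iomap->length);

            /* An OOM victim looping here would drain the memory reserves;
             * back off and let the caller return a short write instead. */
            if (fatal_signal_pending(current))
                    return -EINTR;

            page = grab_cache_page_write_begin(inode->i_mapping, index, flags);
            /* ... remainder of the function is unchanged ... */
    }

    dax_iomap_actor gets an equivalent fatal_signal_pending() check at
    the top of its copy loop, breaking out with -EINTR.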

    Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
    Link: http://lkml.kernel.org/r/20170201092706.9966-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Tetsuo Handa
    Reviewed-by: Christoph Hellwig
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

08 Oct, 2016

1 commit

  • The global zero page is used to satisfy an anonymous read fault. If
    THP (Transparent HugePage) is enabled, the global huge zero page is
    used. The global huge zero page uses an atomic counter for reference
    counting and is allocated/freed dynamically according to its counter
    value.

    CPU time spent on that counter will greatly increase if there are a lot
    of processes doing anonymous read faults. This patch proposes a way to
    reduce the access to the global counter so that the CPU load can be
    reduced accordingly.

    To do this, a new flag of the mm_struct is introduced:
    MMF_USED_HUGE_ZERO_PAGE. With this flag, the process only needs to
    touch the global counter in two cases:

    1. The first time it uses the global huge zero page;
    2. When the mm_users count of its mm_struct reaches zero.

    Note that right now, the huge zero page is eligible to be freed as soon
    as its last use goes away. With this patch, the page will not be
    eligible to be freed until the exit of the last process from which it
    was ever used.

    And with the use of mm_users, kthreads are not eligible to use the
    huge zero page either. Since no kthread uses the huge zero page
    today, there is no difference after applying this patch. But if that
    is not desired, I can change it to trigger when mm_count reaches
    zero instead.
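
    A minimal sketch of the flag-based fast path (helper and flag names
    follow the commit message; the exact kernel code may differ):

    struct page *mm_get_huge_zero_page(struct mm_struct *mm)
    {
            /* Fast path: this mm already holds a reference. */
            if (test_bit(MMF_USED_HUGE_ZERO_PAGE, &mm->flags))
                    return READ_ONCE(huge_zero_page);

            /* Slow path: take one global reference for the whole mm. */
            if (!get_huge_zero_page())
                    return NULL;

            /* If another thread raced us here, keep only one reference. */
            if (test_and_set_bit(MMF_USED_HUGE_ZERO_PAGE, &mm->flags))
                    put_huge_zero_page();

            return READ_ONCE(huge_zero_page);
    }

    The matching global put then happens once per mm, when its mm_users
    count drops to zero.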

    Case used for test on Haswell EP:

    usemem -n 72 --readonly -j 0x200000 100G

    Which spawns 72 processes and each will mmap 100G anonymous space and
    then do read only access to that space sequentially with a step of 2MB.

    CPU cycles from perf report for base commit:
    54.03% usemem [kernel.kallsyms] [k] get_huge_zero_page
    CPU cycles from perf report for this commit:
    0.11% usemem [kernel.kallsyms] [k] mm_get_huge_zero_page

    Performance (throughput) of the workload for base commit: 1784430792
    Performance (throughput) of the workload for this commit: 4726928591
    A 164% increase.

    Runtime of the workload for base commit: 707592 us
    Runtime of the workload for this commit: 303970 us
    A 57% drop.

    Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.com
    Signed-off-by: Aaron Lu
    Cc: Sergey Senozhatsky
    Cc: "Kirill A. Shutemov"
    Cc: Dave Hansen
    Cc: Tim Chen
    Cc: Huang Ying
    Cc: Vlastimil Babka
    Cc: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Ebru Akagunduz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Lu
     

29 Jul, 2016

1 commit

  • Pull libnvdimm updates from Dan Williams:

    - Replace pcommit with ADR / directed-flushing.

    The pcommit instruction, which has not shipped on any product, is
    deprecated. Instead, the requirement is that platforms implement
    either ADR, or provide one or more flush addresses per nvdimm.

    ADR (Asynchronous DRAM Refresh) flushes data in posted write buffers
    to the memory controller on a power-fail event.

    Flush addresses are defined in ACPI 6.x as an NVDIMM Firmware
    Interface Table (NFIT) sub-structure: the "Flush Hint Address
    Structure". A flush hint is an mmio address that, when written and
    fenced, assures that all previous posted writes targeting a given
    dimm have been flushed to media (see the sketch after this list).

    - On-demand ARS (address range scrub).

    Linux uses the results of the ACPI ARS commands to track bad blocks
    in pmem devices. When latent errors are detected, we re-scrub the
    media to refresh the bad block list; userspace can also request a
    re-scrub at any time.

    - Support for the Microsoft DSM (device specific method) command
    format.

    - Support for EDK2/OVMF virtual disk device memory ranges.

    - Various fixes and cleanups across the subsystem.
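
    A minimal sketch of how a flush hint is used, modeled on the
    directed-flushing idea above (the function and parameter names are
    illustrative assumptions, not the exact libnvdimm code):

    /* Flush a dimm's posted-write queues via its flush hint address. */
    static void flush_hint_write(void __iomem *flush_addr)
    {
            wmb();                  /* order prior stores before the hint write */
            writeq(1, flush_addr);  /* value is ignored; the mmio write triggers the flush */
            wmb();                  /* fence the hint write itself */
    }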

    * tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (41 commits)
    libnvdimm-btt: Delete an unnecessary check before the function call "__nd_device_register"
    nfit: do an ARS scrub on hitting a latent media error
    nfit: move to nfit/ sub-directory
    nfit, libnvdimm: allow an ARS scrub to be triggered on demand
    libnvdimm: register nvdimm_bus devices with an nd_bus driver
    pmem: clarify a debug print in pmem_clear_poison
    x86/insn: remove pcommit
    Revert "KVM: x86: add pcommit support"
    nfit, tools/testing/nvdimm/: unify shutdown paths
    libnvdimm: move ->module to struct nvdimm_bus_descriptor
    nfit: cleanup acpi_nfit_init calling convention
    nfit: fix _FIT evaluation memory leak + use after free
    tools/testing/nvdimm: add manufacturing_{date|location} dimm properties
    tools/testing/nvdimm: add virtual ramdisk range
    acpi, nfit: treat virtual ramdisk SPA as pmem region
    pmem: kill __pmem address space
    pmem: kill wmb_pmem()
    libnvdimm, pmem: use nvdimm_flush() for namespace I/O writes
    fs/dax: remove wmb_pmem()
    libnvdimm, pmem: flush posted-write queues on shutdown
    ...

    Linus Torvalds
     

27 Jul, 2016

1 commit

  • Remove the unused wrappers dax_fault() and dax_pmd_fault(). After this
    removal, rename __dax_fault() and __dax_pmd_fault() to dax_fault() and
    dax_pmd_fault() respectively, and update all callers.

    The dax_fault() and dax_pmd_fault() wrappers were initially intended
    to capture some filesystem-independent functionality around page
    faults (calling sb_start_pagefault() & sb_end_pagefault(), and
    updating the file mtime and ctime).
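
    For reference, the PTE wrapper being removed looked roughly like
    this (abbreviated from fs/dax.c):

    int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
                  get_block_t get_block)
    {
            int result;
            struct super_block *sb = file_inode(vma->vm_file)->i_sb;

            if (vmf->flags & FAULT_FLAG_WRITE) {
                    sb_start_pagefault(sb);
                    file_update_time(vma->vm_file);
            }
            result = __dax_fault(vma, vmf, get_block);
            if (vmf->flags & FAULT_FLAG_WRITE)
                    sb_end_pagefault(sb);

            return result;
    }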

    However, the following commits:

    5726b27b09cc ("ext2: Add locking for DAX faults")
    ea3d7209ca01 ("ext4: fix races between page faults and hole punching")

    added locking to the ext2 and ext4 filesystems after these common
    operations but before __dax_fault() and __dax_pmd_fault() were called.
    This means that these wrappers are no longer used, and are unlikely to
    be used in the future.

    XFS has had locking analogous to what was recently added to ext2 and
    ext4 since DAX support was initially introduced by:

    6b698edeeef0 ("xfs: add DAX file operations support")

    Link: http://lkml.kernel.org/r/20160714214049.20075-2-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Dan Williams
    Cc: Dave Chinner
    Reviewed-by: Jan Kara
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     

13 Jul, 2016

2 commits

  • The __pmem address space was meant to annotate codepaths that touch
    persistent memory and need to coordinate a call to wmb_pmem(). Now that
    wmb_pmem() is gone, there is little need to keep this annotation.

    Cc: Christoph Hellwig
    Cc: Ross Zwisler
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Flushing posted-write queues is now deferred to REQ_FLUSH context, or
    otherwise handled by an ADR event at the platform level.

    Cc: Ross Zwisler
    Signed-off-by: Dan Williams

    Dan Williams
     

28 Jun, 2016

1 commit

  • This isn't functionally apparent for some reason, but when we test
    I/O at extreme offsets at the end of the loff_t range, such as in
    fstests xfs/071, the calculation of "max" in dax_io() can be wrong
    due to pos + size overflowing.

    For example,

    # xfs_io -c "pwrite 9223372036854771712 512" /mnt/test/file

    enters dax_io with:

    start 0x7ffffffffffff000
    end 0x7ffffffffffff200

    and the rounded up "size" variable is 0x1000. This yields:

    pos + size 0x8000000000000000 (overflows loff_t)
    end 0x7ffffffffffff200

    Due to the overflow, the min() function picks the wrong value for
    the "max" variable, and when we send (max - pos) into e.g.
    copy_from_iter_pmem() it is also the wrong value.

    This somehow(tm) gets magically absorbed without incident, probably
    because iter->count is correct. But it seems best to fix it up
    properly by comparing the two values as unsigned.
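
    A sketch of the idea (not the verbatim patch): compute the bound in
    a type that cannot wrap.

            /* pos + size can exceed LLONG_MAX and wrap negative as a
             * loff_t, making min() pick the overflowed value. Compare
             * as unsigned instead. */
            loff_t max;

            if ((u64)pos + size > (u64)end)
                    max = end;
            else
                    max = pos + size;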

    Signed-off-by: Eric Sandeen
    Signed-off-by: Dan Williams

    Eric Sandeen
     

27 May, 2016

2 commits

  • Pull DAX locking updates from Ross Zwisler:
    "Filesystem DAX locking for 4.7

    - We use a bit in an exceptional radix tree entry as a lock bit and
    use it similarly to how page lock is used for normal faults. This
    fixes races between hole instantiation and read faults of the same
    index.

    - Filesystem DAX PMD faults are disabled, and will be re-enabled when
    PMD locking is implemented"

    * tag 'dax-locking-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    dax: Remove i_mmap_lock protection
    dax: Use radix tree entry lock to protect cow faults
    dax: New fault locking
    dax: Allow DAX code to replace exceptional entries
    dax: Define DAX lock bit for radix tree exceptional entry
    dax: Make huge page handling depend of CONFIG_BROKEN
    dax: Fix condition for filling of PMD holes

    Linus Torvalds
     
  • Pull misc DAX updates from Vishal Verma:
    "DAX error handling for 4.7

    - Until now, dax has been disabled if media errors were found on any
    device. This enables the use of DAX in the presence of these
    errors by making all sector-aligned zeroing go through the driver.

    - The driver (already) has the ability to clear errors on writes that
    are sent through the block layer using 'DSMs' defined in ACPI 6.1.

    Other misc changes:

    - When mounting DAX filesystems, check to make sure the partition is
    page aligned. This is a requirement for DAX, and previously, we
    allowed such unaligned mounts to succeed, but subsequent
    reads/writes would fail.

    - Misc/cleanup fixes from Jan that remove unused code from DAX
    related to zeroing, writeback, and some size checks"

    * tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    dax: fix a comment in dax_zero_page_range and dax_truncate_page
    dax: for truncate/hole-punch, do zeroing through the driver if possible
    dax: export a low-level __dax_zero_page_range helper
    dax: use sb_issue_zerout instead of calling dax_clear_sectors
    dax: enable dax in the presence of known media errors (badblocks)
    dax: fallback from pmd to pte on error
    block: Update blkdev_dax_capable() for consistency
    xfs: Add alignment check for DAX mount
    ext2: Add alignment check for DAX mount
    ext4: Add alignment check for DAX mount
    block: Add bdev_dax_supported() for dax mount checks
    block: Add vfs_msg() interface
    dax: Remove redundant inode size checks
    dax: Remove pointless writeback from dax_do_io()
    dax: Remove zeroing from dax_io()
    dax: Remove dead zeroing code from fault handlers
    ext2: Avoid DAX zeroing to corrupt data
    ext2: Fix block zeroing in ext2_get_blocks() for DAX
    dax: Remove complete_unwritten argument
    DAX: move RADIX_DAX_ definitions to dax.c

    Linus Torvalds
     

25 May, 2016

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "Fix a number of bugs, most notably a potential stale data exposure
    after a crash and a potential BUG_ON crash if a file has the data
    journalling flag enabled while it has dirty delayed allocation blocks
    that haven't been written yet. Also fix a potential crash in the new
    project quota code and potential crashes when handling a maliciously
    corrupted file system.

    In addition, fix some DAX-specific bugs, including when there is a
    transient ENOSPC situation and races between writes via direct I/O and
    an mmap'ed segment that could lead to lost I/O.

    Finally the usual set of miscellaneous cleanups"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits)
    ext4: pre-zero allocated blocks for DAX IO
    ext4: refactor direct IO code
    ext4: fix race in transient ENOSPC detection
    ext4: handle transient ENOSPC properly for DAX
    dax: call get_blocks() with create == 1 for write faults to unwritten extents
    ext4: remove unmeetable inconsisteny check from ext4_find_extent()
    jbd2: remove excess descriptions for handle_s
    ext4: remove unnecessary bio get/put
    ext4: silence UBSAN in ext4_mb_init()
    ext4: address UBSAN warning in mb_find_order_for_block()
    ext4: fix oops on corrupted filesystem
    ext4: fix check of dqget() return value in ext4_ioctl_setproject()
    ext4: clean up error handling when orphan list is corrupted
    ext4: fix hang when processing corrupted orphaned inode list
    ext4: remove trailing \n from ext4_warning/ext4_error calls
    ext4: fix races between changing inode journal mode and ext4_writepages
    ext4: handle unwritten or delalloc buffers before enabling data journaling
    ext4: fix jbd2 handle extension in ext4_ext_truncate_extend_restart()
    ext4: do not ask jbd2 to write data for delalloc buffers
    jbd2: add support for avoiding data writes during transaction commits
    ...

    Linus Torvalds
     

21 May, 2016

1 commit

  • These don't belong in radix-tree.h any more than PAGECACHE_TAG_* do.
    Let's try to maintain the idea that radix-tree simply implements an
    abstract data type.

    Signed-off-by: NeilBrown
    Reviewed-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Signed-off-by: Matthew Wilcox
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

20 May, 2016

6 commits

  • Currently faults are protected against truncate by the filesystem
    specific i_mmap_sem and, in the case of a hole page, by the page
    lock. Cow faults are protected by DAX radix tree entry locking. So
    there's no need for i_mmap_lock in the DAX code. Remove it.

    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Ross Zwisler

    Jan Kara
     
  • When doing cow faults, we cannot directly fill in the PTE as we do
    for other faults, as we rely on generic code to do proper accounting
    of the cowed page. Unlike other faults, we also have no page to lock
    to protect against races with truncate, and we need that protection
    to extend until the moment generic code inserts the cowed page into
    the PTE, at which point the fs-specific i_mmap_sem no longer covers
    us. So far we relied on i_mmap_lock for this protection, but that is
    completely special to cow faults. To make fault locking more
    uniform, use the DAX entry lock instead.

    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Ross Zwisler

    Jan Kara
     
  • Currently DAX page fault locking is racy.

    CPU0 (write fault)                        CPU1 (read fault)

    __dax_fault()                             __dax_fault()
      get_block(inode, block, &bh, 0)
        -> not mapped
                                              get_block(inode, block, &bh, 0)
                                                -> not mapped
      if (!buffer_mapped(&bh))
        if (vmf->flags & FAULT_FLAG_WRITE)
          get_block(inode, block, &bh, 1)
            -> allocates blocks
      if (page) -> no
                                              if (!buffer_mapped(&bh))
                                                if (vmf->flags & FAULT_FLAG_WRITE) {
                                                } else {
                                                  dax_load_hole();
                                                }
      dax_insert_mapping()

    And we are in a situation where we fail in dax_radix_entry() with -EIO.

    Another problem with the current DAX page fault locking is that there
    is no race-free way to clear the dirty tag in the radix tree. We can
    always end up with a clean radix tree and dirty data in the CPU
    cache.

    We fix the first problem by introducing locking of exceptional radix
    tree entries in DAX mappings, acting very similarly to the page lock
    and thus properly synchronizing faults against the same mapping
    index. The same lock can later be used to avoid races when clearing
    the radix tree dirty tag.

    Reviewed-by: NeilBrown
    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Ross Zwisler

    Jan Kara
     
  • We will use the lowest available bit in the radix tree exceptional
    entry for locking of the entry. Define it. Also clean up the
    definitions of DAX entry type bits in DAX exceptional entries to use
    defined constants instead of hardcoded numbers, and clean up the
    checking of these bits so it does not rely on how other bits in the
    entry are set.
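
    A sketch of the resulting layout (modeled on the fs/dax.c
    definitions; treat the exact constants as assumptions):

    /* The radix tree reserves the low bits of an exceptional entry;
     * the lowest bit available to users becomes the lock bit, and the
     * DAX entry type bits sit above it instead of being hardcoded. */
    #define RADIX_DAX_ENTRY_LOCK  (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
    #define RADIX_DAX_PTE         (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
    #define RADIX_DAX_PMD         (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))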

    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Ross Zwisler

    Jan Kara
     
  • Currently the handling of huge pages for DAX is racy. For example the
    following can happen:

    CPU0 (THP write fault)                    CPU1 (normal read fault)

    __dax_pmd_fault()                         __dax_fault()
      get_block(inode, block, &bh, 0)
        -> not mapped
                                              get_block(inode, block, &bh, 0)
                                                -> not mapped
      if (!buffer_mapped(&bh) && write)
        get_block(inode, block, &bh, 1)
          -> allocates blocks
      truncate_pagecache_range(inode, lstart, lend);
                                              dax_load_hole();

    This results in data corruption since the process on CPU1 won't see
    the changes made to the file by CPU0.

    The race can happen even if two normal faults race, but with THP the
    situation is even worse because the two faults don't operate on the
    same entries in the radix tree and we want to use those entries for
    serialization. So make THP support in the DAX code depend on
    CONFIG_BROKEN for now.

    Signed-off-by: Jan Kara
    Signed-off-by: Ross Zwisler

    Jan Kara
     
  • Currently dax_pmd_fault() decides to fill a PMD-sized hole only if
    the returned buffer has BH_Uptodate set. However, that flag doesn't
    get set for any mapping buffer, so that branch is actually dead code.
    The BH_Uptodate check doesn't make any sense, so just remove it.

    Signed-off-by: Jan Kara
    Signed-off-by: Ross Zwisler

    Jan Kara
     

19 May, 2016

4 commits

  • The distinction between PAGE_SIZE and PAGE_CACHE_SIZE was removed in
    commit 09cbfea ("mm, fs: get rid of PAGE_CACHE_* and
    page_cache_{get,release} macros").

    The comments for the affected functions described a distinction
    between the two that is now redundant, so remove those paragraphs.

    Cc: Kirill A. Shutemov
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: Vishal Verma

    Vishal Verma
     
  • In the truncate or hole-punch path in dax, we clear out sub-page
    ranges. If these sub-page ranges are sector-aligned and sector-sized,
    we can do the zeroing through the driver instead, so that
    error-clearing is handled automatically.

    For sub-sector ranges, we still have to rely on clear_pmem and have the
    possibility of tripping over errors.
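
    The decision boils down to something like this, following the
    __dax_zero_page_range() helper introduced in this series (a sketch;
    the exact helper shape is an assumption):

            if (IS_ALIGNED(offset, 512) && IS_ALIGNED(length, 512)) {
                    /* Whole sectors: zero via the block layer so the
                     * driver can also clear known-bad sectors. */
                    return blkdev_issue_zeroout(bdev, sector + (offset >> 9),
                                                length >> 9, GFP_NOFS, true);
            }
            /* Sub-sector remainder: fall back to clear_pmem() and accept
             * that a latent media error here can still trip us up. */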

    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Jeff Moyer
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jan Kara
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: Vishal Verma

    Vishal Verma
     
  • This allows XFS to perform zeroing using the iomap infrastructure and
    avoid buffer heads.

    Reviewed-by: Jan Kara
    Signed-off-by: Christoph Hellwig
    [vishal: fix conflicts with dax-error-handling]
    Signed-off-by: Vishal Verma

    Christoph Hellwig
     
  • dax_clear_sectors() cannot handle poisoned blocks. These must be
    zeroed using the BIO interface instead. Convert ext2 and XFS to use
    only sb_issue_zeroout().
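
    The conversion itself is a one-liner at each call site, roughly (a
    sketch; the arguments at each site are assumptions):

            /* Zero through the block layer instead of dax_clear_sectors()
             * so that poisoned blocks get cleared rather than faulting. */
            err = sb_issue_zeroout(inode->i_sb, block, nr_blocks, GFP_NOFS);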

    Reviewed-by: Jeff Moyer
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: Matthew Wilcox
    [vishal: Also remove the dax_clear_sectors function entirely]
    Signed-off-by: Vishal Verma

    Matthew Wilcox
     

17 May, 2016

7 commits

  • In preparation for consulting a badblocks list in
    pmem_direct_access(), teach dax_pmd_fault() to fall back rather than
    fail immediately upon encountering an error. The thought is that
    reducing the span of the dax request may avoid the error region.
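
    The shape of the change in dax_pmd_fault() is roughly (a sketch;
    the surrounding error-path details are assumptions):

            length = dax_map_atomic(bdev, &dax);
            if (length < 0) {
                    /* Instead of returning VM_FAULT_SIGBUS, fall back so
                     * the fault is retried with PTEs, which may steer
                     * around the bad region. */
                    count_vm_event(THP_FAULT_FALLBACK);
                    return VM_FAULT_FALLBACK;
            }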

    Reviewed-by: Jeff Moyer
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: Dan Williams
    Signed-off-by: Vishal Verma

    Dan Williams
     
  • Callers of dax fault handlers must make sure these calls cannot race
    with truncate. Thus it is enough to check the inode size when
    entering the function, and we don't have to recheck it again later in
    the handler. Note that the inode size itself can be decreased while
    the fault handler runs, but filesystem locking protects against any
    radix tree or block mapping information changes resulting from the
    truncate, and that is what we really care about.

    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Vishal Verma

    Jan Kara
     
  • dax_do_io() calls filemap_write_and_wait() if the DIO_LOCKING flag
    is set. Presumably this was copied over from the direct IO code.
    However, DAX inodes have no pagecache pages to write, so the call is
    pointless. Remove it.

    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Vishal Verma

    Jan Kara
     
  • All the filesystems now zero out blocks themselves for DAX IO to
    avoid races between dax_io() and dax_fault(). Remove the zeroing code
    from dax_io() and add a warning to catch the case when somebody
    unexpectedly returns a new or unwritten buffer.
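
    The added warning amounts to something like this in the dax_io()
    loop (a sketch; exact placement abbreviated):

            /* Filesystems must pre-zero blocks for DAX IO, so a new or
             * unwritten buffer here indicates a filesystem bug. */
            WARN_ON_ONCE(buffer_unwritten(&bh) || buffer_new(&bh));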

    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Vishal Verma

    Jan Kara
     
  • Now that all filesystems zero out blocks allocated for a fault
    handler, we can just remove the zeroing from the handler itself. Also
    add checks that no filesystem returns an unwritten or new buffer to
    us.

    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Vishal Verma

    Jan Kara
     
  • Fault handlers currently take a complete_unwritten argument to
    convert unwritten extents after PTEs are updated. However, no
    filesystem uses this anymore, as the code is racy. Remove the unused
    argument.

    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Vishal Verma

    Jan Kara
     
  • These don't belong in radix-tree.h any more than PAGECACHE_TAG_* do.
    Let's try to maintain the idea that radix-tree simply implements an
    abstract data type.

    Acked-by: Ross Zwisler
    Reviewed-by: Matthew Wilcox
    Signed-off-by: NeilBrown
    Signed-off-by: Jan Kara
    Signed-off-by: Vishal Verma

    NeilBrown
     

13 May, 2016

1 commit

  • Currently, __dax_fault() does not call get_blocks() with the create
    argument set when we get back an unwritten extent from the initial
    get_blocks() call during a write fault. This is because originally
    filesystems were supposed to convert unwritten extents to written
    ones using the complete_unwritten() callback. Later this was
    abandoned in favor of using pre-zeroed blocks, but the condition
    deciding whether get_blocks() needs to be called with create == 1
    remained.

    Fix the condition so that filesystems are not forced to zero out and
    convert unwritten extents when get_blocks() is called with
    create == 0 (which introduces unnecessary overhead for read faults
    and can be problematic, as the filesystem may possibly be read-only).
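
    The shape of the fixed logic in __dax_fault() is roughly (a sketch
    of the intent, not the verbatim patch):

            /* For write faults, call get_block() with create == 1 even if
             * the extent is merely unwritten, so the filesystem converts
             * it; read faults keep create == 0 and simply see zeros. */
            if ((vmf->flags & FAULT_FLAG_WRITE) &&
                (!buffer_mapped(&bh) || buffer_unwritten(&bh)))
                    error = get_block(inode, block, &bh, 1);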

    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     

05 Apr, 2016

2 commits

  • Mostly direct substitution with occasional adjustment or removing
    outdated comments.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
    time ago with the promise that one day it would be possible to
    implement the page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion on whether
    PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
    especially on the border between fs and mm.

    Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too
    much breakage to be doable.

    Let's stop pretending that pages in the page cache are special. They
    are not.

    The changes are pretty straightforward:

    - << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle
    using the script below. For some reason, coccinelle doesn't patch
    header files; I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to
    the PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation
    will also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

22 Mar, 2016

1 commit

  • Pull xfs updates from Dave Chinner:
    "There's quite a lot in this request, and there's some cross-over with
    ext4, dax and quota code due to the nature of the changes being made.

    As for the rest of the XFS changes, there are lots of little things
    all over the place, which add up to a lot of changes in the end.

    The major changes are that we've reduced the size of the struct
    xfs_inode by ~100 bytes (gives an inode cache footprint reduction of
    >10%), the writepage code now only does a single set of mapping tree
    lookups so uses less CPU, delayed allocation reservations won't
    overrun under random write loads anymore, and we added compile time
    verification for on-disk structure sizes so we find out when a commit
    or platform/compiler change breaks the on disk structure as early as
    possible.

    Change summary:

    - error propagation for direct IO failures fixes for both XFS and
    ext4
    - new quota interfaces and XFS implementation for iterating all the
    quota IDs in the filesystem
    - locking fixes for real-time device extent allocation
    - reduction of duplicate information in the xfs and vfs inode, saving
    roughly 100 bytes of memory per cached inode.
    - buffer flag cleanup
    - rework of the writepage code to use the generic write clustering
    mechanisms
    - several fixes for inode flag based DAX enablement
    - rework of remount option parsing
    - compile time verification of on-disk format structure sizes
    - delayed allocation reservation overrun fixes
    - lots of little error handling fixes
    - small memory leak fixes
    - enable xfsaild freezing again"

    * tag 'xfs-for-linus-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (66 commits)
    xfs: always set rvalp in xfs_dir2_node_trim_free
    xfs: ensure committed is initialized in xfs_trans_roll
    xfs: borrow indirect blocks from freed extent when available
    xfs: refactor delalloc indlen reservation split into helper
    xfs: update freeblocks counter after extent deletion
    xfs: debug mode forced buffered write failure
    xfs: remove impossible condition
    xfs: check sizes of XFS on-disk structures at compile time
    xfs: ioends require logically contiguous file offsets
    xfs: use named array initializers for log item dumping
    xfs: fix computation of inode btree maxlevels
    xfs: reinitialise per-AG structures if geometry changes during recovery
    xfs: remove xfs_trans_get_block_res
    xfs: fix up inode32/64 (re)mount handling
    xfs: fix format specifier , should be %llx and not %llu
    xfs: sanitize remount options
    xfs: convert mount option parsing to tokens
    xfs: fix two memory leaks in xfs_attr_list.c error paths
    xfs: XFS_DIFLAG2_DAX limited by PAGE_SIZE
    xfs: dynamically switch modes when XFS_DIFLAG2_DAX is set/cleared
    ...

    Linus Torvalds
     

10 Mar, 2016

1 commit

  • dax_pfn_mkwrite() previously wasn't checking the return value of the
    call to dax_radix_entry(), which was a mistake.

    Instead, capture this return value and return the appropriate VM_FAULT_
    value.
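
    The shape of the fix (the dax_radix_entry() argument list here is
    approximate and should be treated as an assumption):

            error = dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR,
                                    false, true);
            if (error == -ENOMEM)
                    return VM_FAULT_OOM;
            if (error)
                    return VM_FAULT_SIGBUS;
            return VM_FAULT_NOPAGE;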

    Signed-off-by: Ross Zwisler
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     

28 Feb, 2016

1 commit

  • Previously calls to dax_writeback_mapping_range() for all DAX filesystems
    (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().

    dax_writeback_mapping_range() needs a struct block_device, and it used
    to get that from inode->i_sb->s_bdev. This is correct for normal inodes
    mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
    block devices and for XFS real-time files.

    Instead, call dax_writeback_mapping_range() directly from the filesystem
    ->writepages function so that it can supply us with a valid block
    device. This also fixes DAX code to properly flush caches in response
    to sync(2).
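
    The XFS side, for example, ends up looking roughly like this
    (abbreviated from xfs_vm_writepages()):

    STATIC int
    xfs_vm_writepages(
            struct address_space    *mapping,
            struct writeback_control *wbc)
    {
            xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
            if (dax_mapping(mapping))
                    return dax_writeback_mapping_range(mapping,
                                    xfs_find_bdev_for_inode(mapping->host), wbc);

            return generic_writepages(mapping, wbc);
    }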

    Signed-off-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Cc: Al Viro
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Jens Axboe
    Cc: Matthew Wilcox
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler