09 May, 2017

2 commits

  • Commit afddba49d18f ("fs: introduce write_begin, write_end, and
    perform_write aops") introduced the AOP_FLAG_UNINTERRUPTIBLE flag,
    which was checked in pagecache_write_begin(), but that check was
    removed by commit 4e02ed4b4a2f ("fs: remove prepare_write/commit_write").

    Between these two commits, commit d9414774dc0c ("cifs: Convert cifs to
    new aops.") added a check in cifs_write_begin(), but that check was soon
    removed by commit a98ee8c1c707 ("[CIFS] fix regression in
    cifs_write_begin/cifs_write_end").

    Therefore, the AOP_FLAG_UNINTERRUPTIBLE flag is checked nowhere.
    Remove it. This patch makes no functional changes.

    Link: http://lkml.kernel.org/r/1489294781-53494-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Reviewed-by: Jeff Layton
    Reviewed-by: Christoph Hellwig
    Cc: Nick Piggin
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • The iov_iter_revert() argument had the wrong sign. Unfortunately, it
    slipped through testing, since most of the time we don't do anything
    to the iterator afterwards, and a potential oops from walking
    iter->iov too far backwards is too infrequent to be easily triggered.

    Add a sanity check in iov_iter_revert() to catch bugs like this one;
    fortunately, the same braino hadn't happened in other callers, but
    we'd better have a warning if such a thing crops up.
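
    As a rough sketch (not the exact kernel code), the check presumably
    sits at the top of iov_iter_revert(), with MAX_RW_COUNT as an assumed
    upper bound on any single I/O:

    void iov_iter_revert(struct iov_iter *i, size_t unroll)
    {
            if (!unroll)
                    return;
            /* reverting more than any possible I/O is a caller bug */
            if (WARN_ON(unroll > MAX_RW_COUNT))
                    return;
            /* ... walk the iterator backwards by 'unroll' bytes ... */
    }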

    Signed-off-by: Al Viro

    Al Viro
     

04 May, 2017

2 commits

  • Patch series "Properly invalidate data in the cleancache", v2.

    We've noticed that after a direct IO write, a buffered read sometimes
    gets stale data coming from the cleancache. The reason for this is
    that some direct write hooks call invalidate_inode_pages2[_range]()
    only if mapping->nrpages is non-zero, so we may not invalidate data
    in the cleancache.

    Another odd thing is that we check only ->nrpages and not
    ->nrexceptional, even though invalidate_inode_pages2[_range]()
    invalidates exceptional entries as well. So we invalidate exceptional
    entries only if ->nrpages != 0? This doesn't feel right.

    - Patch 1 fixes direct IO writes by removing the ->nrpages check.
    - Patch 2 fixes a similar case in invalidate_bdev().
    Note: I only fixed the conditional cleancache_invalidate_inode() here.
    Do we also need to add a ->nrexceptional check into invalidate_bdev()?

    - Patches 3-4: some optimizations.

    This patch (of 4):

    Some direct IO write fs hooks call invalidate_inode_pages2[_range]()
    only if mapping->nrpages is non-zero. This can't be right, because
    invalidate_inode_pages2[_range]() also invalidates data in the
    cleancache via a cleancache_invalidate_inode() call. So if the page
    cache is empty but there is some data in the cleancache, a buffered
    read after a direct IO write would get stale data from the
    cleancache.

    Also it doesn't feel right to check only ->nrpages, because
    invalidate_inode_pages2[_range]() invalidates exceptional entries as
    well.

    Fix this by calling invalidate_inode_pages2[_range]() regardless of
    nrpages state.
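
    A minimal sketch of the pattern being fixed (illustrative, not the
    exact diff in any one filesystem hook):

    /* before: skips cleancache_invalidate_inode(), which is hidden
     * inside invalidate_inode_pages2_range(), whenever the page cache
     * happens to be empty */
    if (mapping->nrpages)
            invalidate_inode_pages2_range(mapping, start, end);

    /* after: invalidate unconditionally */
    invalidate_inode_pages2_range(mapping, start, end);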

    Note: nfs, cifs, and 9p don't need a similar fix because they never
    call cleancache_get_page() (neither directly nor via
    mpage_readpage[s]()), so they are not affected by this bug.

    Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache")
    Link: http://lkml.kernel.org/r/20170424164135.22350-2-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Reviewed-by: Jan Kara
    Acked-by: Konrad Rzeszutek Wilk
    Cc: Alexander Viro
    Cc: Ross Zwisler
    Cc: Jens Axboe
    Cc: Johannes Weiner
    Cc: Alexey Kuznetsov
    Cc: Christoph Hellwig
    Cc: Nikolay Borisov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • The round_up() macro generates a couple of unnecessary instructions
    in this usage:

    48cd: 49 8b 47 50 mov 0x50(%r15),%rax
    48d1: 48 83 e8 01 sub $0x1,%rax
    48d5: 48 0d ff 0f 00 00 or $0xfff,%rax
    48db: 48 83 c0 01 add $0x1,%rax
    48df: 48 c1 f8 0c sar $0xc,%rax
    48e3: 48 39 c3 cmp %rax,%rbx
    48e6: 72 2e jb 4916

    If we change round_up() to ((x) + __round_mask(x, y)) & ~__round_mask(x, y)
    then GCC can see through it and remove the mask (because that would be
    dead code given the subsequent shift):

    48cd: 49 8b 47 50 mov 0x50(%r15),%rax
    48d1: 48 05 ff 0f 00 00 add $0xfff,%rax
    48d7: 48 c1 e8 0c shr $0xc,%rax
    48db: 48 39 c3 cmp %rax,%rbx
    48de: 72 2e jb 490e

    But that's problematic because we'd evaluate 'y' twice. Converting
    round_up into an inline function prevents it from being used in other
    definitions. The easiest thing to do is just change these three usages
    of round_up to use DIV_ROUND_UP. Also add an unlikely() because GCC's
    heuristic is wrong in this case.
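
    For reference, a sketch of the macros involved (paraphrased from the
    kernel headers of that era):

    #define __round_mask(x, y)  ((__typeof__(x))((y) - 1))
    #define round_up(x, y)      ((((x) - 1) | __round_mask(x, y)) + 1)
    #define DIV_ROUND_UP(n, d)  (((n) + (d) - 1) / (d))

    so a use like round_up(pos, PAGE_SIZE) >> PAGE_SHIFT becomes
    DIV_ROUND_UP(pos, PAGE_SIZE), which GCC compiles to the shorter
    add+shr sequence shown above.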

    Link: http://lkml.kernel.org/r/20170207192812.5281-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

03 May, 2017

1 commit

  • Pull iov_iter updates from Al Viro:
    "Cleanups that sat in -next + -stable fodder that has just missed 4.11.

    There's more iov_iter work in my local tree, but I'd prefer to push
    the stuff that had been in -next first"

    * 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    iov_iter: don't revert iov buffer if csum error
    generic_file_read_iter(): make use of iov_iter_revert()
    generic_file_direct_write(): make use of iov_iter_revert()
    orangefs: use iov_iter_revert()
    sctp: switch to copy_from_iter_full()
    net/9p: switch to copy_from_iter_full()
    switch memcpy_from_msg() to copy_from_iter_full()
    rds: make use of iov_iter_revert()

    Linus Torvalds
     

03 Apr, 2017

1 commit

  • ./lib/string.c:134: WARNING: Inline emphasis start-string without end-string.
    ./mm/filemap.c:522: WARNING: Inline interpreted text or phrase reference start-string without end-string.
    ./mm/filemap.c:1283: ERROR: Unexpected indentation.
    ./mm/filemap.c:3003: WARNING: Inline interpreted text or phrase reference start-string without end-string.
    ./mm/vmalloc.c:1544: WARNING: Inline emphasis start-string without end-string.
    ./mm/page_alloc.c:4245: ERROR: Unexpected indentation.
    ./ipc/util.c:676: ERROR: Unexpected indentation.
    ./drivers/pci/irq.c:35: WARNING: Block quote ends without a blank line; unexpected unindent.
    ./security/security.c:109: ERROR: Unexpected indentation.
    ./security/security.c:110: WARNING: Definition list ends without a blank line; unexpected unindent.
    ./block/genhd.c:275: WARNING: Inline strong start-string without end-string.
    ./block/genhd.c:283: WARNING: Inline strong start-string without end-string.
    ./include/linux/clk.h:134: WARNING: Inline emphasis start-string without end-string.
    ./include/linux/clk.h:134: WARNING: Inline emphasis start-string without end-string.
    ./ipc/util.c:477: ERROR: Unknown target name: "s".

    Signed-off-by: Mauro Carvalho Chehab
    Acked-by: Bjorn Helgaas
    Signed-off-by: Jonathan Corbet

    mchehab@s-opensource.com
     

25 Feb, 2017

2 commits

  • With rw_page, page_endio is used for completing IO on a page, and it
    propagates a write error to the address space if the IO fails. The
    problem is that it accesses page->mapping directly, which might be
    okay for file-backed pages but not for an anonymous page. Otherwise,
    it can corrupt one of the fields of the anon_vma under us and the
    system panics randomly.

    swap_writepage
      bdev_writepage
        ops->rw_page

    I encountered the BUG while developing a new zram feature, and it was
    really hard to figure out because it crashed randomly: sometimes an
    mmap_sem lockdep splat, sometimes crashes in places never related to
    zram/zsmalloc, and it was not reproducible with some configurations.

    Considering how subtle the bug is and that people do fast-swap
    testing with brd, I think it's worth a stable tag.

    Fixes: dd6bd0d9c7db ("swap: use bdev_read_page() / bdev_write_page()")
    Signed-off-by: Minchan Kim
    Acked-by: Michal Hocko
    Cc: Matthew Wilcox
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • ->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to
    take a vma and vmf parameter when the vma already resides in vmf.

    Remove the vma parameter to simplify things.
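
    The resulting change to vm_operations_struct looks roughly like this
    (sketch; surrounding members omitted):

    /* before */
    int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);

    /* after: handlers reach the VMA through vmf->vma instead */
    int (*fault)(struct vm_fault *vmf);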

    [arnd@arndb.de: fix ARM build]
    Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.com
    Signed-off-by: Dave Jiang
    Signed-off-by: Arnd Bergmann
    Reviewed-by: Ross Zwisler
    Cc: Theodore Ts'o
    Cc: Darrick J. Wong
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Christoph Hellwig
    Cc: Jan Kara
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jiang
     

23 Feb, 2017

2 commits

  • Fix kernel-doc warnings in mm/filemap.c:

    mm/filemap.c:993: warning: No description found for parameter '__page'
    mm/filemap.c:993: warning: Excess function parameter 'page' description in '__lock_page'

    Link: http://lkml.kernel.org/r/a66fe492-518c-ad6c-5f03-5e8b721fb451@infradead.org
    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • These are no longer used outside mm/filemap.c, so un-export them and
    make them static where possible. These were exported specifically for
    NFS use in commit a4796e37c12e ("MM: export page_wakeup functions").

    Link: http://lkml.kernel.org/r/20170103182234.30141-3-npiggin@gmail.com
    Signed-off-by: Nicholas Piggin
    Cc: Trond Myklebust
    Cc: Anna Schumaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     

04 Feb, 2017

1 commit

  • do_generic_file_read() can be told to perform a large request from
    userspace. If the system is under OOM and the reading task is the OOM
    victim, then it has access to memory reserves, and finishing the full
    request can lead to full memory depletion, which is dangerous. Make
    sure we instead go with a short read and allow the killed task to
    terminate.
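
    A hedged sketch of the change (simplified): a fatal-signal check in
    the do_generic_file_read() read loop, so an OOM-killed reader bails
    out with whatever it has copied so far:

    for (;;) {
            /* ... find/read the next page ... */
            if (fatal_signal_pending(current)) {
                    error = -EINTR;
                    goto out;  /* becomes a short read if some data
                                * was already copied */
            }
            /* ... copy the page to userspace ... */
    }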

    Link: http://lkml.kernel.org/r/20170201092706.9966-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Christoph Hellwig
    Cc: Tetsuo Handa
    Cc: Al Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

11 Jan, 2017

1 commit

  • Currently in DAX if we have three read faults on the same hole address we
    can end up with the following:

    Thread 0                Thread 1                Thread 2
    --------                --------                --------
    dax_iomap_fault
     grab_mapping_entry
      lock_slot
       <locks empty DAX entry>

                            dax_iomap_fault
                             grab_mapping_entry
                              get_unlocked_mapping_entry
                               <sleeps on empty DAX entry>

                                                    dax_iomap_fault
                                                     grab_mapping_entry
                                                      get_unlocked_mapping_entry
                                                       <sleeps on empty DAX entry>
    dax_load_hole
     find_or_create_page
     ...
      page_cache_tree_insert
       dax_wake_mapping_entry_waiter
        <wakes one waiter>
       __radix_tree_replace
        <swaps empty DAX entry with the 4k zero page>

                            <woken; locking is now against the zero page>
                            get_page
                            lock_page
                            ...
                            put_locked_mapping_entry
                            unlock_page
                            put_page

                                                    <sleeps forever on the
                                                     empty DAX entry>

    The crux of the problem is that once we insert a 4k zero page, all
    locking from then on is done in terms of that 4k zero page and any
    additional threads sleeping on the empty DAX entry will never be woken.

    Fix this by waking all sleepers when we replace the DAX radix tree entry
    with a 4k zero page. This will allow all sleeping threads to
    successfully transition from locking based on the DAX empty entry to
    locking on the 4k zero page.
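
    Presumably the fix boils down to passing a wake-all flag at the
    replacement site, something like (parameter names assumed):

    /* in page_cache_tree_insert(): wake every waiter on the old DAX
     * empty entry 'p', not just one, when swapping it for the zero page */
    dax_wake_mapping_entry_waiter(mapping, page->index, p, true);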

    With the test case reported by Xiong this happens very regularly in my
    test setup, with some runs resulting in 9+ threads in this deadlocked
    state. With this fix I've been able to run that same test dozens of
    times in a loop without issue.

    Fixes: ac401cc78242 ("dax: New fault locking")
    Link: http://lkml.kernel.org/r/1483479365-13607-1-git-send-email-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Reported-by: Xiong Zhou
    Reviewed-by: Jan Kara
    Cc: [4.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     

30 Dec, 2016

2 commits

  • mm/filemap.c: In function 'clear_bit_unlock_is_negative_byte':
    mm/filemap.c:933:9: error: too few arguments to function 'test_bit'
    return test_bit(PG_waiters);
    ^~~~~~~~

    Fixes: b91e1302ad9b ('mm: optimize PageWaiters bit use for unlock_page()')
    Signed-off-by: Olof Johansson
    Brown-paper-bag-by: Linus Torvalds
    Signed-off-by: Linus Torvalds

    Olof Johansson
     
  • In commit 62906027091f ("mm: add PageWaiters indicating tasks are
    waiting for a page bit") Nick Piggin made our page locking no longer
    unconditionally touch the hashed page waitqueue, which not only helps
    performance in general, but is particularly helpful on NUMA machines
    where the hashed wait queues can bounce around a lot.

    However, the "clear lock bit atomically and then test the waiters bit"
    sequence turns out to be much more expensive than it needs to be,
    because you get a nasty stall when trying to access the same word that
    just got updated atomically.

    On architectures where locking is done with LL/SC, this would be trivial
    to fix with a new primitive that clears one bit and tests another
    atomically, but that ends up not working on x86, where the only atomic
    operations that return the result end up being cmpxchg and xadd. The
    atomic bit operations return the old value of the same bit we changed,
    not the value of an unrelated bit.

    On x86, we could put the lock bit in the high bit of the byte, and use
    "xadd" with that bit (where the overflow ends up not touching other
    bits), and look at the other bits of the result. However, an even
    simpler model is to just use a regular atomic "and" to clear the lock
    bit, and then the sign bit in eflags will indicate the resulting state
    of the unrelated bit #7.

    So by moving the PageWaiters bit up to bit #7, we can atomically clear
    the lock bit and test the waiters bit on x86 too. And on
    architectures with LL/SC (which is all the usual RISC suspects), the
    particular bit doesn't matter, so they are fine with this approach
    too.

    This avoids the extra access to the same atomic word, and thus avoids
    the costly stall at page unlock time.

    The only downside is that the interface ends up being a bit odd and
    specialized: clear a bit in a byte, and test the sign bit. Nick doesn't
    love the resulting name of the new primitive, but I'd rather make the
    name be descriptive and very clear about the limitation imposed by
    trying to work across all relevant architectures than make it be some
    generic thing that doesn't make the odd semantics explicit.

    So this introduces the new architecture primitive

    clear_bit_unlock_is_negative_byte();

    and adds the trivial implementation for x86. We have a generic
    non-optimized fallback (that just does a "clear_bit()"+"test_bit(7)"
    combination) which can be overridden by any architecture that can do
    better. According to Nick, Power has the same hiccup x86 has, for
    example, but some other architectures may not even care.
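
    A sketch of that generic fallback, assuming the shape described above
    (simplified; the real definition may differ in details):

    #ifndef clear_bit_unlock_is_negative_byte
    static inline bool clear_bit_unlock_is_negative_byte(long nr,
                                                volatile unsigned long *mem)
    {
            clear_bit_unlock(nr, mem);
            /* bit #7 is the sign bit of the low byte */
            return test_bit(7, mem);
    }
    #endif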

    All these optimizations mean that my page locking stress-test (which is
    just executing a lot of small short-lived shell scripts: "make test" in
    the git source tree) no longer makes our page locking look horribly bad.
    Before all these optimizations, the unlock_page() costs alone were
    just over 3% of all CPU overhead on "make test". After this, it's
    down to
    0.66%, so just a quarter of the cost it used to be.

    (The difference on NUMA is bigger, but there this micro-optimization is
    likely less noticeable, since the big issue on NUMA was not the accesses
    to 'struct page', but the waitqueue accesses that were already removed
    by Nick's earlier commit).

    Acked-by: Nick Piggin
    Cc: Dave Hansen
    Cc: Bob Peterson
    Cc: Steven Whitehouse
    Cc: Andrew Lutomirski
    Cc: Andreas Gruenbacher
    Cc: Peter Zijlstra
    Cc: Mel Gorman
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

26 Dec, 2016

1 commit

  • Add a new page flag, PageWaiters, to indicate the page waitqueue has
    tasks waiting. This can be tested rather than testing waitqueue_active
    which requires another cacheline load.

    This bit is always set when the page has tasks on page_waitqueue(page),
    and is set and cleared under the waitqueue lock. It may be set when
    there are no tasks on the waitqueue, which will cause a harmless extra
    wakeup check that will clear the bit.

    The generic bit-waitqueue infrastructure is no longer used for pages.
    Instead, waitqueues are used directly with a custom key type. The
    generic code was not flexible enough to have PageWaiters manipulation
    under the waitqueue lock (which simplifies concurrency).

    This improves the performance of page lock intensive microbenchmarks by
    2-3%.

    Putting two bits in the same word opens the opportunity to remove the
    memory barrier between clearing the lock bit and testing the waiters
    bit, after some work on the arch primitives (e.g., ensuring memory
    operand widths match and cover both bits).
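
    A minimal sketch of the unlock fast path this enables (simplified;
    wake_up_page_bit() is assumed as the custom-key waker, and the
    explicit barrier is what the arch work above would remove):

    void unlock_page(struct page *page)
    {
            page = compound_head(page);
            clear_bit_unlock(PG_locked, &page->flags);
            smp_mb__after_atomic();
            /* only touch the hashed waitqueue if someone is waiting */
            if (PageWaiters(page))
                    wake_up_page_bit(page, PG_locked);
    }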

    Signed-off-by: Nicholas Piggin
    Cc: Dave Hansen
    Cc: Bob Peterson
    Cc: Steven Whitehouse
    Cc: Andrew Lutomirski
    Cc: Andreas Gruenbacher
    Cc: Peter Zijlstra
    Cc: Mel Gorman
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     

15 Dec, 2016

4 commits

  • Merge more updates from Andrew Morton:

    - a few misc things

    - kexec updates

    - DMA-mapping updates to better support networking DMA operations

    - IPC updates

    - various MM changes to improve DAX fault handling

    - lots of radix-tree changes, mainly to the test suite. All leading up
    to reimplementing the IDA/IDR code to be a wrapper layer over the
    radix-tree. However the final trigger-pulling patch is held off for
    4.11.

    * emailed patches from Andrew Morton : (114 commits)
    radix tree test suite: delete unused rcupdate.c
    radix tree test suite: add new tag check
    radix-tree: ensure counts are initialised
    radix tree test suite: cache recently freed objects
    radix tree test suite: add some more functionality
    idr: reduce the number of bits per level from 8 to 6
    rxrpc: abstract away knowledge of IDR internals
    tpm: use idr_find(), not idr_find_slowpath()
    idr: add ida_is_empty
    radix tree test suite: check multiorder iteration
    radix-tree: fix replacement for multiorder entries
    radix-tree: add radix_tree_split_preload()
    radix-tree: add radix_tree_split
    radix-tree: add radix_tree_join
    radix-tree: delete radix_tree_range_tag_if_tagged()
    radix-tree: delete radix_tree_locate_item()
    radix-tree: improve multiorder iterators
    btrfs: fix race in btrfs_free_dummy_fs_info()
    radix-tree: improve dump output
    radix-tree: make radix_tree_find_next_bit more useful
    ...

    Linus Torvalds
     
  • Currently we have two different structures for passing fault information
    around - struct vm_fault and struct fault_env. DAX will need more
    information in struct vm_fault to handle its faults so the content of
    that structure would become even closer to fault_env. Furthermore it
    would need to generate struct fault_env to be able to call some of the
    generic functions. So at this point I don't think there's much use in
    keeping these two structures separate. Just embed into struct vm_fault
    all that is needed to use it for both purposes.

    Link: http://lkml.kernel.org/r/1479460644-25076-2-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Kirill A. Shutemov
    Cc: Ross Zwisler
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • We truncated the possible read iterator to s_maxbytes in commit
    c2a9737f45e2 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()"),
    but our end condition handling was wrong: it's not an error to try to
    read at the end of the file.

    Reading past the end should return EOF (0), not EINVAL.

    See for example

    https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1649342
    http://lists.gnu.org/archive/html/bug-coreutils/2016-12/msg00008.html

    where an md5sum of a maximally sized file fails because the final
    read is exactly at s_maxbytes.
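
    A hedged sketch of the fix (simplified): treat a read starting at or
    past s_maxbytes as EOF rather than an error:

    if (unlikely(*ppos >= inode->i_sb->s_maxbytes))
            return 0;       /* previously: return -EINVAL; */
    iov_iter_truncate(iter, inode->i_sb->s_maxbytes);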

    Fixes: c2a9737f45e2 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()")
    Reported-by: Joseph Salisbury
    Cc: Wei Fang
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Al Viro
    Cc: Andrew Morton
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Pull ext4 updates from Ted Ts'o:
    "This merge request includes the dax-4.0-iomap-pmd branch which is
    needed for both ext4 and xfs dax changes to use iomap for DAX. It also
    includes the fscrypt branch which is needed for ubifs encryption work
    as well as ext4 encryption and fscrypt cleanups.

    Lots of cleanups and bug fixes, especially making sure ext4 is robust
    against maliciously corrupted file systems --- especially maliciously
    corrupted xattr blocks and a maliciously corrupted superblock. Also
    fix ext4 support for 64k block sizes so it works well on ppcle. Fixed
    mbcache so we don't miss some common xattr blocks that can be merged"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits)
    dax: Fix sleep in atomic contex in grab_mapping_entry()
    fscrypt: Rename FS_WRITE_PATH_FL to FS_CTX_HAS_BOUNCE_BUFFER_FL
    fscrypt: Delay bounce page pool allocation until needed
    fscrypt: Cleanup page locking requirements for fscrypt_{decrypt,encrypt}_page()
    fscrypt: Cleanup fscrypt_{decrypt,encrypt}_page()
    fscrypt: Never allocate fscrypt_ctx on in-place encryption
    fscrypt: Use correct index in decrypt path.
    fscrypt: move the policy flags and encryption mode definitions to uapi header
    fscrypt: move non-public structures and constants to fscrypt_private.h
    fscrypt: unexport fscrypt_initialize()
    fscrypt: rename get_crypt_info() to fscrypt_get_crypt_info()
    fscrypto: move ioctl processing more fully into common code
    fscrypto: remove unneeded Kconfig dependencies
    MAINTAINERS: fscrypto: recommend linux-fsdevel for fscrypto patches
    ext4: do not perform data journaling when data is encrypted
    ext4: return -ENOMEM instead of success
    ext4: reject inodes with negative size
    ext4: remove another test in ext4_alloc_file_blocks()
    Documentation: fix description of ext4's block_validity mount option
    ext4: fix checks for data=ordered and journal_async_commit options
    ...

    Linus Torvalds
     

13 Dec, 2016

4 commits

  • Shadow entries in the page cache used to be accounted behind the radix
    tree implementation's back in the upper bits of node->count, and the
    radix tree code extending a single-entry tree with a shadow entry in
    root->rnode would corrupt that counter. As a result, we could not put
    shadow entries at index 0 if the tree didn't have any other entries, and
    that means no refault detection for any single-page file.

    Now that the shadow entries are tracked natively in the radix tree's
    exceptional counter, this is no longer necessary. Extending and
    shrinking the tree from and to single entries in root->rnode now does
    the right thing when the entry is exceptional; remove that limitation.

    Link: http://lkml.kernel.org/r/20161117193244.GF23430@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Jan Kara
    Cc: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Currently, we track the shadow entries in the page cache in the upper
    bits of the radix_tree_node->count, behind the back of the radix tree
    implementation. Because the radix tree code has no awareness of them,
    we rely on random subtleties throughout the implementation (such as the
    node->count != 1 check in the shrinking code, which is meant to exclude
    multi-entry nodes but also happens to skip nodes with only one shadow
    entry, as that's accounted in the upper bits). This is error prone and
    has, in fact, caused the bug fixed in d3798ae8c6f3 ("mm: filemap: don't
    plant shadow entries without radix tree node").

    To remove these subtleties, this patch moves shadow entry tracking from
    the upper bits of node->count to the existing counter for exceptional
    entries. node->count goes back to being a simple counter of valid
    entries in the tree node and can be shrunk to a single byte.

    This vastly simplifies the page cache code. All accounting happens
    natively inside the radix tree implementation, and maintaining the LRU
    linkage of shadow nodes is consolidated into a single function in the
    workingset code that is called for leaf nodes affected by a change in
    the page cache tree.

    This also removes the last user of the __radix_delete_node() return
    value. Eliminate it.

    Link: http://lkml.kernel.org/r/20161117193211.GE23430@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Jan Kara
    Cc: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The bug in khugepaged fixed earlier in this series shows that radix tree
    slot replacement is fragile; and it will become more so when not only
    NULL <-> !NULL transitions need to be caught but transitions from and to
    exceptional entries as well. We need checks.

    Re-implement radix_tree_replace_slot() on top of the sanity-checked
    __radix_tree_replace(). This requires existing callers to also pass the
    radix tree root, but it'll warn us when somebody replaces slots with
    contents that need proper accounting (transitions between NULL entries,
    real entries, exceptional entries) and where a replacement through the
    slot pointer would corrupt the radix tree node counts.

    Link: http://lkml.kernel.org/r/20161117193021.GB23430@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Suggested-by: Jan Kara
    Reviewed-by: Jan Kara
    Cc: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Unlike THP, hugetlb pages are represented by one entry in the
    radix-tree.

    [akpm@linux-foundation.org: tweak comment]
    Link: http://lkml.kernel.org/r/20161110163640.126124-1-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Johannes Weiner
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

08 Nov, 2016

2 commits

  • DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
    locking. This patch allows DAX PMDs to participate in the DAX radix tree
    based locking scheme so that they can be re-enabled using the new struct
    iomap based fault handlers.

    There are currently three types of DAX 4k entries: 4k zero pages, 4k DAX
    mappings that have an associated block allocation, and 4k DAX empty
    entries. The empty entries exist to provide locking for the duration of a
    given page fault.

    This patch adds three equivalent 2MiB DAX entries: Huge Zero Page (HZP)
    entries, PMD DAX entries that have associated block allocations, and 2 MiB
    DAX empty entries.

    Unlike the 4k case where we insert a struct page* into the radix tree for
    4k zero pages, for HZP we insert a DAX exceptional entry with the new
    RADIX_DAX_HZP flag set. This is because we use a single 2 MiB zero page in
    every 2MiB hole mapping, and it doesn't make sense to have that same struct
    page* with multiple entries in multiple trees. This would cause contention
    on the single page lock for the one Huge Zero Page, and it would break the
    page->index and page->mapping associations that are assumed to be valid in
    many other places in the kernel.

    One difficult use case is when one thread is trying to use 4k entries in
    radix tree for a given offset, and another thread is using 2 MiB entries
    for that same offset. The current code handles this by making the 2 MiB
    user fall back to 4k entries for most cases. This was done because it is
    the simplest solution, and because the use of 2MiB pages is already
    opportunistic.

    If we were to try to upgrade from 4k pages to 2MiB pages for a given range,
    we run into the problem of how we lock out 4k page faults for the entire
    2MiB range while we clean out the radix tree so we can insert the 2MiB
    entry. We can solve this problem if we need to, but I think that the cases
    where both 2MiB entries and 4K entries are being used for the same range
    will be rare enough and the gain small enough that it probably won't be
    worth the complexity.

    Signed-off-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Signed-off-by: Dave Chinner

    Ross Zwisler
     
  • DAX radix tree locking currently locks entries based on the unique
    combination of the 'mapping' pointer and the pgoff_t 'index' for the entry.
    This works for PTEs, but as we move to PMDs we will need to have all the
    offsets within the range covered by the PMD to map to the same bit lock.
    To accomplish this, for ranges covered by a PMD entry we will instead lock
    based on the page offset of the beginning of the PMD entry. The 'mapping'
    pointer is still used in the same way.
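
    As a hypothetical illustration (the mask name below is assumed for
    this sketch): every index inside a 2 MiB PMD range locks on the same
    PMD-aligned key.

    #define PG_PMD_COLOUR   ((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1)
    pgoff_t lock_index = index & ~PG_PMD_COLOUR;  /* start of the PMD */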

    Signed-off-by: Ross Zwisler
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: Dave Chinner

    Ross Zwisler
     

07 Nov, 2016

1 commit

  • Starting from the 4.9-rc1 kernel, I noticed some test failures of
    sendfile(2) and splice(2) (sendfile0N and splice01 from LTP) when
    testing on sub-page block size filesystems (tested both XFS and
    ext4): these syscalls started returning EIO in the tests. e.g.

    sendfile02 1 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 26, got: -1
    sendfile02 2 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 24, got: -1
    sendfile02 3 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 22, got: -1
    sendfile02 4 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 20, got: -1

    This is because in sub-page block size cases we don't need the whole
    page to be uptodate; it's OK if only the part we care about is
    uptodate (when the fs has ->is_partially_uptodate defined). But
    page_cache_pipe_buf_confirm() doesn't have the ability to check the
    partially-uptodate case; it needs the whole page to be uptodate. So
    it returns EIO in this case.

    This is a regression introduced by commit 82c156f85384 ("switch
    generic_file_splice_read() to use of ->read_iter()"). Prior to that
    change, generic_file_splice_read() didn't allow partially-uptodate
    pages either, so it worked fine.

    Fix it by skipping the partially-uptodate check if we're working on
    a pipe in do_generic_file_read(), so we read the whole page from
    disk as long as the page is not uptodate.
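
    A hedged sketch of the fix in do_generic_file_read() (simplified):

    /* pipes can't handle partially uptodate pages */
    if (unlikely(iter->type & ITER_PIPE))
            goto page_not_up_to_date;
    if (!mapping->a_ops->is_partially_uptodate)
            goto page_not_up_to_date;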

    Signed-off-by: Eryu Guan
    Signed-off-by: Al Viro

    Eryu Guan
     

28 Oct, 2016

1 commit

  • The per-zone waitqueues exist because of a scalability issue with the
    page waitqueues on some NUMA machines, but it turns out that they hurt
    normal loads, and now with the vmalloced stacks they also end up
    breaking gfs2 that uses a bit_wait on a stack object:

    wait_on_bit(&gh->gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE)

    where 'gh' can be a reference to the local variable 'mount_gh' on the
    stack of fill_super().

    The reason the per-zone hash table breaks for this case is that there is
    no "zone" for virtual allocations, and trying to look up the physical
    page to get at it will fail (with a BUG_ON()).

    It turns out that I actually complained to the mm people about the
    per-zone hash table for another reason just a month ago: the zone lookup
    also hurts the regular use of "unlock_page()" a lot, because the zone
    lookup ends up forcing several unnecessary cache misses and generates
    horrible code.

    As part of that earlier discussion, we had a much better solution for
    the NUMA scalability issue - by just making the page lock have a
    separate contention bit, the waitqueue doesn't even have to be looked at
    for the normal case.

    Peter Zijlstra already has a patch for that, but let's see if anybody
    even notices. In the meantime, let's fix the actual gfs2 breakage by
    simplifying the bitlock waitqueues and removing the per-zone issue.

    Reported-by: Andreas Gruenbacher
    Tested-by: Bob Peterson
    Acked-by: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andy Lutomirski
    Cc: Steven Whitehouse
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

11 Oct, 2016

2 commits

  • Pull splice fixups from Al Viro:
    "A couple of fixups for interaction of pipe-backed iov_iter with
    O_DIRECT reads + constification of a couple of primitives in uio.h
    missed by previous rounds.

    Kudos to davej - his fuzzing has caught those bugs"

    * 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    [btrfs] fix check_direct_IO() for non-iovec iterators
    constify iov_iter_count() and iter_is_iovec()
    fix ITER_PIPE interaction with direct_IO

    Linus Torvalds
     
  • Fix ITER_PIPE interaction with direct_IO by making sure we call
    iov_iter_advance() on the original iov_iter even if direct_IO (done
    on its copy) has returned 0. It's a no-op for old iov_iter flavours
    and does the right thing (i.e. truncation of the stuff we'd
    allocated but not filled) in the ITER_PIPE case. Failures (e.g.
    -EIO) get caught and dealt with by the cleanup in
    generic_file_read_iter().

    Signed-off-by: Al Viro

    Al Viro
     

08 Oct, 2016

2 commits

  • We triggered a dead loop in truncate_inode_pages_range() on a 32-bit
    architecture with the test case below:

    ...
    fd = open();
    write(fd, buf, 4096);
    preadv64(fd, &iovec, 1, 0xffffffff000);
    ftruncate(fd, 0);
    ...

    Then ftruncate() will never return.

    The filesystem used in this case is ubifs, but it can be triggered on
    many other filesystems.

    When preadv64() is called with offset=0xffffffff000, a page with
    index=0xffffffff will be added to the radix tree of ->mapping. Then
    this page can be found in ->mapping with pagevec_lookup(). After that,
    truncate_inode_pages_range(), which is called in ftruncate(), will fall
    into an infinite loop:

    - find a page with index=0xffffffff, since index>=end, this page won't
    be truncated

    - index++, and index become 0

    - the page with index=0xffffffff will be found again

    The data type of index is unsigned long, so index won't overflow to 0
    on a 64-bit architecture in this case, and the dead loop won't happen.

    Since truncate_inode_pages_range() is executed while holding
    inode->i_rwsem, any operation related to this lock will be blocked,
    and a hung task will happen, e.g.:

    INFO: task truncate_test:3364 blocked for more than 120 seconds.
    ...
    call_rwsem_down_write_failed+0x17/0x30
    generic_file_write_iter+0x32/0x1c0
    ubifs_write_iter+0xcc/0x170
    __vfs_write+0xc4/0x120
    vfs_write+0xb2/0x1b0
    SyS_write+0x46/0xa0

    The page with index=0xffffffff added to ->mapping is useless. Fix this
    by checking the read position before allocating pages.
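
    A hedged sketch of the fix in do_generic_file_read() (simplified):
    reject reads starting beyond s_maxbytes before any page is allocated,
    and clamp the iterator so a page with an overflowing index is never
    inserted into the mapping:

    if (unlikely(*ppos >= inode->i_sb->s_maxbytes))
            return -EINVAL;
    iov_iter_truncate(iter, inode->i_sb->s_maxbytes);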

    Link: http://lkml.kernel.org/r/1475151010-40166-1-git-send-email-fangwei1@huawei.com
    Signed-off-by: Wei Fang
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Al Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Fang
     
  • If a fatal signal has been received, fail immediately instead of trying
    to read more data.

    If wait_on_page_locked_killable() was interrupted, then this page
    most likely is not PageUptodate(), and in this case
    do_generic_file_read() will fail after lock_page_killable().

    See also commit ebded02788b5 ("mm: filemap: avoid unnecessary calls to
    lock_page when waiting for IO to complete during a read")
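
    A hedged sketch of the change in do_generic_file_read() (simplified):

    /* was: wait_on_page_locked(page); */
    error = wait_on_page_locked_killable(page);
    if (unlikely(error))
            goto readpage_error;    /* fail instead of reading more */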

    [oleg@redhat.com: changelog addition]
    Link: http://lkml.kernel.org/r/63068e8e-8bee-b208-8441-a3c39a9d9eb6@sandisk.com
    Signed-off-by: Bart Van Assche
    Reviewed-by: Jan Kara
    Acked-by: Oleg Nesterov
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bart Van Assche
     

06 Oct, 2016

3 commits

  • Pull xfs and iomap updates from Dave Chinner:
    "The main things in this update are the iomap-based DAX infrastructure,
    an XFS delalloc rework, and a chunk of fixes to how log recovery
    schedules writeback to prevent spurious corruption detections when
    recovery of certain items was not required.

    The other main chunk of code is some preparation for the upcoming
    reflink functionality. Most of it is generic and cleanups that stand
    alone, but they were ready and reviewed so are in this pull request.

    Speaking of reflink, I'm currently planning to send you another pull
    request next week containing all the new reflink functionality. I'm
    working through a similar process to the last cycle, where I sent the
    reverse mapping code in a separate request because of how large it
    was. The reflink code merge is even bigger than reverse mapping, so
    I'll be doing the same thing again....

    Summary for this update:

    - change of XFS mailing list to linux-xfs@vger.kernel.org

    - iomap-based DAX infrastructure w/ XFS and ext2 support

    - small iomap fixes and additions

    - more efficient XFS delayed allocation infrastructure based on iomap

    - a rework of log recovery writeback scheduling to ensure we don't
    fail recovery when trying to replay items that are already on disk

    - some preparation patches for upcoming reflink support

    - configurable error handling fixes and documentation

    - aio access time update race fixes for XFS and
    generic_file_read_iter"

    * tag 'xfs-for-linus-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (40 commits)
    fs: update atime before I/O in generic_file_read_iter
    xfs: update atime before I/O in xfs_file_dio_aio_read
    ext2: fix possible integer truncation in ext2_iomap_begin
    xfs: log recovery tracepoints to track current lsn and buffer submission
    xfs: update metadata LSN in buffers during log recovery
    xfs: don't warn on buffers not being recovered due to LSN
    xfs: pass current lsn to log recovery buffer validation
    xfs: rework log recovery to submit buffers on LSN boundaries
    xfs: quiesce the filesystem after recovery on readonly mount
    xfs: remote attribute blocks aren't really userdata
    ext2: use iomap to implement DAX
    ext2: stop passing buffer_head to ext2_get_blocks
    xfs: use iomap to implement DAX
    xfs: refactor xfs_setfilesize
    xfs: take the ilock shared if possible in xfs_file_iomap_begin
    xfs: fix locking for DAX writes
    dax: provide an iomap based fault handler
    dax: provide an iomap based dax read/write path
    dax: don't pass buffer_head to copy_user_dax
    dax: don't pass buffer_head to dax_insert_mapping
    ...

    Linus Torvalds
     
  • Commit 22f2ac51b6d6 ("mm: workingset: fix crash in shadow node shrinker
    caused by replace_page_cache_page()") switched replace_page_cache() from
    raw radix tree operations to page_cache_tree_insert() but didn't take
    into account that the latter function, unlike the raw radix tree op,
    handles mapping->nrpages. As a result, that counter is bumped for
    each page replacement rather than kept balanced.

    The mapping->nrpages counter is used to skip needless radix tree walks
    when invalidating, truncating, syncing inodes without pages, as well as
    statistics for userspace. Since the error is positive, we'll do more
    page cache tree walks than necessary; we won't miss a necessary one.
    And we'll report more buffer pages to userspace than there are. The
    error is limited to fuse inodes.

    Fixes: 22f2ac51b6d6 ("mm: workingset: fix crash in shadow node shrinker caused by replace_page_cache_page()")
    Signed-off-by: Johannes Weiner
    Cc: Andrew Morton
    Cc: Miklos Szeredi
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When the underflow checks were added to workingset_node_shadow_dec(),
    they triggered immediately:

    kernel BUG at ./include/linux/swap.h:276!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: isofs usb_storage fuse xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_REJECT nf_reject_ipv6
    soundcore wmi acpi_als pinctrl_sunrisepoint kfifo_buf tpm_tis industrialio acpi_pad pinctrl_intel tpm_tis_core tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_crypt
    CPU: 0 PID: 20929 Comm: blkid Not tainted 4.8.0-rc8-00087-gbe67d60ba944 #1
    Hardware name: System manufacturer System Product Name/Z170-K, BIOS 1803 05/06/2016
    task: ffff8faa93ecd940 task.stack: ffff8faa7f478000
    RIP: page_cache_tree_insert+0xf1/0x100
    Call Trace:
    __add_to_page_cache_locked+0x12e/0x270
    add_to_page_cache_lru+0x4e/0xe0
    mpage_readpages+0x112/0x1d0
    blkdev_readpages+0x1d/0x20
    __do_page_cache_readahead+0x1ad/0x290
    force_page_cache_readahead+0xaa/0x100
    page_cache_sync_readahead+0x3f/0x50
    generic_file_read_iter+0x5af/0x740
    blkdev_read_iter+0x35/0x40
    __vfs_read+0xe1/0x130
    vfs_read+0x96/0x130
    SyS_read+0x55/0xc0
    entry_SYSCALL_64_fastpath+0x13/0x8f
    Code: 03 00 48 8b 5d d8 65 48 33 1c 25 28 00 00 00 44 89 e8 75 19 48 83 c4 18 5b 41 5c 41 5d 41 5e 5d c3 0f 0b 41 bd ef ff ff ff eb d7 0b e8 88 68 ef ff 0f 1f 84 00
    RIP page_cache_tree_insert+0xf1/0x100

    This is a long-standing bug in the way shadow entries are accounted in
    the radix tree nodes. The shrinker needs to know when radix tree nodes
    contain only shadow entries, no pages, so node->count is split in half
    to count shadows in the upper bits and pages in the lower bits.
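
    For reference, a sketch of the split accounting as it looked at the
    time (paraphrased; details may differ from the exact headers):

    #define RADIX_TREE_COUNT_SHIFT  (RADIX_TREE_MAP_SHIFT + 1)
    #define RADIX_TREE_COUNT_MASK   ((1UL << RADIX_TREE_COUNT_SHIFT) - 1)

    /* pages counted in the low bits, shadow entries in the high bits */
    static inline unsigned workingset_node_pages(struct radix_tree_node *node)
    {
            return node->count & RADIX_TREE_COUNT_MASK;
    }

    static inline unsigned workingset_node_shadows(struct radix_tree_node *node)
    {
            return node->count >> RADIX_TREE_COUNT_SHIFT;
    }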

    Unfortunately, the radix tree implementation doesn't know of this and
    assumes all entries are in node->count. When there is a shadow entry
    directly in root->rnode and the tree is later extended, the radix tree
    implementation will copy that entry into the new node and bump its
    node->count, i.e. increase the page count bits. Once the shadow gets
    removed and we subtract from the upper counter, node->count underflows
    and triggers the warning. Afterwards, without node->count reaching 0
    again, the radix tree node is leaked.

    Limit shadow entries to when we have actual radix tree nodes and can
    count them properly. That means we lose the ability to detect refaults
    from files that had only the first page faulted in at eviction time.

    Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
    Signed-off-by: Johannes Weiner
    Reported-and-tested-by: Linus Torvalds
    Reviewed-by: Jan Kara
    Cc: Andrew Morton
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

03 Oct, 2016

1 commit

  • After the call to ->direct_IO the final reference to the file might have
    been dropped by aio_complete already, and the call to file_accessed might
    cause a use after free.

    Instead update the access time before the I/O, similar to how we
    update the time stamps before writes.
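
    A minimal sketch of the reordering in generic_file_read_iter()
    (simplified):

    /* previously done after ->direct_IO returned, by which point
     * aio_complete may already have dropped the last file reference */
    file_accessed(file);
    retval = mapping->a_ops->direct_IO(iocb, &data);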

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig