08 Jun, 2018

1 commit

  • Use new return type vm_fault_t for fault handler in struct
    vm_operations_struct. For now, this is just documenting that the
    function returns a VM_FAULT value rather than an errno. Once all
    instances are converted, vm_fault_t will become a distinct type.
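
    A minimal sketch of what a converted handler looks like; the handler
    name and body here are hypothetical, and only the vm_fault_t return
    type and the .fault hookup reflect the change being described:

        static vm_fault_t example_fault(struct vm_fault *vmf)
        {
                /* returns a VM_FAULT_* code rather than an errno */
                return filemap_fault(vmf);
        }

        static const struct vm_operations_struct example_vm_ops = {
                .fault = example_fault,
        };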

    Link: http://lkml.kernel.org/r/20180511190542.GA2412@jordon-HP-15-Notebook-PC
    Signed-off-by: Souptick Joarder
    Reviewed-by: Matthew Wilcox
    Cc: Dan Williams
    Cc: Jan Kara
    Cc: Ross Zwisler
    Cc: Rik van Riel
    Cc: Matthew Wilcox
    Cc: Hugh Dickins
    Cc: Pavel Tatashin
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Souptick Joarder
     

21 Apr, 2018

1 commit

  • f2fs specifies the __GFP_ZERO flag for allocating some of its pages.
    Unfortunately, the page cache also uses the mapping's GFP flags for
    allocating radix tree nodes. It has always masked off the __GFP_HIGHMEM
    flag, and masks off __GFP_ZERO in some paths, but not all. That causes
    radix tree nodes to be allocated with a NULL list_head, which causes
    backtraces like:

    __list_del_entry+0x30/0xd0
    list_lru_del+0xac/0x1ac
    page_cache_tree_insert+0xd8/0x110

    The __GFP_DMA and __GFP_DMA32 flags would also be able to sneak through
    if they are ever used. Fix them all by using GFP_RECLAIM_MASK at the
    innermost location, and remove it from earlier in the callchain.
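
    A rough illustration of the pattern described above (the helper below
    is hypothetical; only the masking with GFP_RECLAIM_MASK mirrors the
    fix):

        static int example_preload(gfp_t gfp)
        {
                /* strips __GFP_ZERO, __GFP_HIGHMEM, __GFP_DMA, __GFP_DMA32 */
                gfp &= GFP_RECLAIM_MASK;
                return radix_tree_maybe_preload(gfp);
        }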

    Link: http://lkml.kernel.org/r/20180411060320.14458-2-willy@infradead.org
    Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
    Signed-off-by: Matthew Wilcox
    Reported-by: Chris Fries
    Debugged-by: Minchan Kim
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Reviewed-by: Jan Kara
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

14 Apr, 2018

1 commit

  • Building orangefs on MMU-less machines now results in a link error
    because of the newly introduced use of the filemap_page_mkwrite()
    function:

    ERROR: "filemap_page_mkwrite" [fs/orangefs/orangefs.ko] undefined!

    This adds a dummy version for it, similar to the existing
    generic_file_mmap and generic_file_readonly_mmap stubs in the same file,
    to avoid the link error without adding #ifdefs in each file system that
    uses these.
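
    The stub presumably looks something like the sketch below (the exact
    return value used by the real dummy is an assumption here, made only
    for illustration):

        #ifndef CONFIG_MMU
        int filemap_page_mkwrite(struct vm_fault *vmf)
        {
                return VM_FAULT_SIGBUS;         /* assumed return value */
        }
        #endif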

    Link: http://lkml.kernel.org/r/20180409105555.2439976-1-arnd@arndb.de
    Fixes: a5135eeab2e5 ("orangefs: implement vm_ops->fault")
    Signed-off-by: Arnd Bergmann
    Reviewed-by: Jan Kara
    Reviewed-by: Andrew Morton
    Cc: Martin Brandenburg
    Cc: Mike Marshall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     

12 Apr, 2018

1 commit

  • Remove the address_space ->tree_lock and use the xa_lock newly added to
    the radix_tree_root. Rename the address_space ->page_tree to ->i_pages,
    since we don't really care that it's a tree.
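
    In practice the conversion amounts to swapping the lock and the field
    name; a sketch of a typical call site (illustrative only):

        /* before */
        spin_lock_irq(&mapping->tree_lock);
        radix_tree_delete(&mapping->page_tree, page->index);
        spin_unlock_irq(&mapping->tree_lock);

        /* after */
        xa_lock_irq(&mapping->i_pages);
        radix_tree_delete(&mapping->i_pages, page->index);
        xa_unlock_irq(&mapping->i_pages);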

    [willy@infradead.org: fix nds32, fs/dax.c]
    Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.org
    Link: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Jeff Layton
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Ryusuke Konishi
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

01 Feb, 2018

1 commit

  • in_atomic() has been moved to include/linux/preempt.h, and filemap.c
    doesn't use in_atomic() directly at all, so it is unnecessary to
    include hardirq.h.

    Link: http://lkml.kernel.org/r/1509985319-38633-1-git-send-email-yang.s@alibaba-inc.com
    Signed-off-by: Yang Shi
    Reviewed-by: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

17 Nov, 2017

1 commit

  • Pull AFS updates from David Howells:
    "kAFS filesystem driver overhaul.

    The major points of the overhaul are:

    (1) Preliminary groundwork is laid for supporting network-namespacing
    of kAFS. The remainder of the namespacing work requires some way
    to pass namespace information to submounts triggered by an
    automount. This requires something like the mount overhaul that's
    in progress.

    (2) sockaddr_rxrpc is used in preference to in_addr for holding
    addresses internally, and support is added for talking to the YFS VL
    server. With this, kAFS can do everything over IPv6 as well as
    IPv4 if it's talking to servers that support it.

    (3) Callback handling is overhauled to be generally passive rather
    than active. 'Callbacks' are promises by the server to tell us
    about data and metadata changes. Callbacks are now checked when
    we next touch an inode rather than actively going and looking for
    it where possible.

    (4) File access permit caching is overhauled to store the caching
    information per-inode rather than per-directory, shared over
    subordinate files. Whilst older AFS servers only allow ACLs on
    directories (shared to the files in that directory), newer AFS
    servers break that restriction.

    To improve memory usage and to make it easier to do mass-key
    removal, permit combinations are cached and shared.

    (5) Cell database management is overhauled to allow lighter locks to
    be used and to make cell records autonomous state machines that
    look after getting their own DNS records and cleaning themselves
    up, in particular preventing races in acquiring and relinquishing
    the fscache token for the cell.

    (6) Volume caching is overhauled. The afs_vlocation record is got rid
    of to simplify things and the superblock is now keyed on the cell
    and the numeric volume ID only. The volume record is tied to a
    superblock and normal superblock management is used to mediate
    the lifetime of the volume fscache token.

    (7) File server record caching is overhauled to make server records
    independent of cells and volumes. A server can be in multiple
    cells (in such a case, the administrator must make sure that the
    VL services for all cells correctly reflect the volumes shared
    between those cells).

    Server records are now indexed using the UUID of the server
    rather than the address since a server can have multiple
    addresses.

    (8) File server rotation is overhauled to handle VMOVED, VBUSY (and
    similar), VOFFLINE and VNOVOL indications and to handle rotation
    both of servers and addresses of those servers. The rotation will
    also wait and retry if the server says it is busy.

    (9) Data writeback is overhauled. Each inode no longer stores a list
    of modified sections tagged with the key that authorised it in
    favour of noting the modified region of a page in page->private
    and storing a list of keys that made modifications in the inode.

    This simplifies things and allows other keys to be used to
    actually write to the server if a key that made a modification
    becomes useless.

    (10) Writable mmap() is implemented. This allows a kernel to be built
    entirely on AFS.

    Note that pre-AFS-3.4 servers are no longer supported, though this can
    be added back if necessary (AFS-3.4 was released in 1998)"

    * tag 'afs-next-20171113' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (35 commits)
    afs: Protect call->state changes against signals
    afs: Trace page dirty/clean
    afs: Implement shared-writeable mmap
    afs: Get rid of the afs_writeback record
    afs: Introduce a file-private data record
    afs: Use a dynamic port if 7001 is in use
    afs: Fix directory read/modify race
    afs: Trace the sending of pages
    afs: Trace the initiation and completion of client calls
    afs: Fix documentation on # vs % prefix in mount source specification
    afs: Fix total-length calculation for multiple-page send
    afs: Only progress call state at end of Tx phase from rxrpc callback
    afs: Make use of the YFS service upgrade to fully support IPv6
    afs: Overhaul volume and server record caching and fileserver rotation
    afs: Move server rotation code into its own file
    afs: Add an address list concept
    afs: Overhaul cell database management
    afs: Overhaul permit caching
    afs: Overhaul the callback handling
    afs: Rename struct afs_call server member to cm_server
    ...

    Linus Torvalds
     

16 Nov, 2017

11 commits

  • As the page free path makes no distinction between cache hot and cold
    pages, there is no real useful ordering of pages in the free list that
    allocation requests can take advantage of. Judging from the users of
    __GFP_COLD, it is likely that a number of them are the result of copying
    other sites instead of actually measuring the impact. Remove the
    __GFP_COLD parameter which simplifies a number of paths in the page
    allocator.

    This is potentially controversial but bear in mind that the size of the
    per-cpu pagelists versus modern cache sizes means that the whole per-cpu
    list can often fit in the L3 cache. Hence, there is only a potential
    benefit for microbenchmarks that alloc/free pages in a tight loop. It's
    even worse when THP is taken into account which has little or no chance
    of getting a cache-hot page as the per-cpu list is bypassed and the
    zeroing of multiple pages will thrash the cache anyway.

    The truncate microbenchmarks are not shown as this patch affects the
    allocation path and not the free path. A page fault microbenchmark was
    tested but it showed no significant difference, which is not surprising
    given that the __GFP_COLD branches are a minuscule percentage of the
    fault path.

    Link: http://lkml.kernel.org/r/20171018075952.10627-9-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Jan Kara
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Every pagevec_init user claims the pages being released are hot even in
    cases where it is unlikely the pages are hot. As no one cares about the
    hotness of pages being released to the allocator, just ditch the
    parameter.

    No performance impact is expected as the overhead is marginal. The
    parameter is removed simply because it is a bit stupid to have a useless
    parameter copied everywhere.
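
    After this change the initialisation is simply (sketch; the dropped
    second argument was the cold hint):

        struct pagevec pvec;

        pagevec_init(&pvec);            /* was: pagevec_init(&pvec, 0); */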

    Link: http://lkml.kernel.org/r/20171018075952.10627-6-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Jan Kara
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • During truncation, the mapping has already been checked for shmem and
    dax so it's known that workingset_update_node is required.

    This patch avoids the checks on mapping for each page being truncated.
    In all other cases, a lookup helper is used to determine if
    workingset_update_node() needs to be called. The one danger is that the
    API is slightly harder to use as calling workingset_update_node directly
    without checking for dax or shmem mappings could lead to surprises.
    However, the API rarely needs to be used and hopefully the comment is
    enough to give people the hint.

    sparsetruncate (tiny)
                                     4.14.0-rc4         4.14.0-rc4
                                    oneirq-v1r1    pickhelper-v1r1
    Min            Time        141.00 ( 0.00%)    140.00 ( 0.71%)
    1st-qrtle      Time        142.00 ( 0.00%)    141.00 ( 0.70%)
    2nd-qrtle      Time        142.00 ( 0.00%)    142.00 ( 0.00%)
    3rd-qrtle      Time        143.00 ( 0.00%)    143.00 ( 0.00%)
    Max-90%        Time        144.00 ( 0.00%)    144.00 ( 0.00%)
    Max-95%        Time        147.00 ( 0.00%)    145.00 ( 1.36%)
    Max-99%        Time        195.00 ( 0.00%)    191.00 ( 2.05%)
    Max            Time        230.00 ( 0.00%)    205.00 ( 10.87%)
    Amean          Time        144.37 ( 0.00%)    143.82 ( 0.38%)
    Stddev         Time         10.44 ( 0.00%)      9.00 ( 13.74%)
    Coeff          Time          7.23 ( 0.00%)      6.26 ( 13.41%)
    Best99%Amean   Time        143.72 ( 0.00%)    143.34 ( 0.26%)
    Best95%Amean   Time        142.37 ( 0.00%)    142.00 ( 0.26%)
    Best90%Amean   Time        142.19 ( 0.00%)    141.85 ( 0.24%)
    Best75%Amean   Time        141.92 ( 0.00%)    141.58 ( 0.24%)
    Best50%Amean   Time        141.69 ( 0.00%)    141.31 ( 0.27%)
    Best25%Amean   Time        141.38 ( 0.00%)    140.97 ( 0.29%)

    As you'd expect, the gain is marginal but it can be detected. The
    differences in bonnie are all within the noise which is not surprising
    given the impact on the microbenchmark.

    radix_tree_update_node_t is a callback for some radix operations that
    optionally passes in a private field. The only user of the callback is
    workingset_update_node and as it no longer requires a mapping, the
    private field is removed.

    Link: http://lkml.kernel.org/r/20171018075952.10627-3-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Jan Kara
    Cc: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently we remove pages from the radix tree one by one. To speed up
    page cache truncation, lock several pages at once and free them in one
    go. This allows us to batch radix tree operations in a more efficient
    way and also save round-trips on mapping->tree_lock. As a result we
    gain about 20% speed improvement in page cache truncation.
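
    The batched path is used roughly like this (a sketch; the signature of
    delete_from_page_cache_batch() shown here is an assumption based on
    the description above):

        struct pagevec pvec;

        pagevec_init(&pvec);
        /* ... collect locked pages from the mapping into pvec ... */
        delete_from_page_cache_batch(mapping, &pvec);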

    Data from a simple benchmark timing 10000 truncates of 1024 pages (on
    ext4 on ramdisk but the filesystem is barely visible in the profiles).
    The range shows 1% and 95% percentiles of the measured times:

       4.14-rc2    4.14-rc2 + batched truncation
        248-256              209-219
        249-258              209-217
        248-255              211-239
        248-255              209-217
        247-256              210-218

    [jack@suse.cz: convert delete_from_page_cache_batch() to pagevec]
    Link: http://lkml.kernel.org/r/20171018111648.13714-1-jack@suse.cz
    [akpm@linux-foundation.org: move struct pagevec forward declaration to top-of-file]
    Link: http://lkml.kernel.org/r/20171010151937.26984-8-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Mel Gorman
    Reviewed-by: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Move checks and accounting updates from __delete_from_page_cache() into
    a separate function. We will reuse it when batching page cache
    truncation operations.

    Link: http://lkml.kernel.org/r/20171010151937.26984-7-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Mel Gorman
    Reviewed-by: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Clearing of page->mapping makes sense in page_cache_tree_delete() as
    well and it will help us with batching things this way.

    Link: http://lkml.kernel.org/r/20171010151937.26984-6-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Mel Gorman
    Reviewed-by: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Move updates of various counters before page_cache_tree_delete() call.
    It will be easier to batch things this way and there is no difference
    whether the counters get updated before or after removal from the radix
    tree.

    Link: http://lkml.kernel.org/r/20171010151937.26984-5-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Mel Gorman
    Reviewed-by: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Factor out page freeing from delete_from_page_cache() into a separate
    function. We will need to call the same when batching pagecache
    deletion operations.

    invalidate_complete_page2() and replace_page_cache_page() might want to
    call this function as well; however, they currently don't seem to handle
    THPs, so it's unnecessary for them to take the hit of checking whether a
    page is a THP or not.

    Link: http://lkml.kernel.org/r/20171010151937.26984-4-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Mel Gorman
    Reviewed-by: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • All users of pagevec_lookup() and pagevec_lookup_range() now pass
    PAGEVEC_SIZE as a desired number of pages. Just drop the argument.
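
    In other words, a lookup now reads (sketch of a typical call; the old
    trailing argument is shown for contrast):

        /* before */
        pagevec_lookup(&pvec, mapping, &index, PAGEVEC_SIZE);

        /* after: the pagevec is always filled up to PAGEVEC_SIZE */
        pagevec_lookup(&pvec, mapping, &index);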

    Link: http://lkml.kernel.org/r/20171009151359.31984-15-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Daniel Jordan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Use pagevec_lookup_range_tag() in __filemap_fdatawait_range() as it is
    interested only in pages from given range. Remove unnecessary code
    resulting from this.

    Link: http://lkml.kernel.org/r/20171009151359.31984-11-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Daniel Jordan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Patch series "Ranged pagevec tagged lookup", v3.

    In this series I provide a ranged variant of pagevec_lookup_tag() and
    use it in places where it makes sense. This series removes some common
    code, and it also has the potential to speed up some operations,
    similarly to pagevec_lookup_range() (but for now I can think of only
    artificial cases where this happens).

    This patch (of 16):

    Implement a variant of find_get_pages_tag() that stops iterating at a
    given index. Lots of users of this function (through pagevec_lookup())
    actually want a range lookup and all of them are currently open-coding
    this.

    Also create corresponding pagevec_lookup_range_tag() function.
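
    A sketch of how the new ranged, tagged lookup is typically driven (the
    loop shape is illustrative; it assumes the post-cleanup pagevec API):

        struct pagevec pvec;
        pgoff_t index = start;

        pagevec_init(&pvec);
        while (pagevec_lookup_range_tag(&pvec, mapping, &index, end,
                                        PAGECACHE_TAG_DIRTY)) {
                unsigned i;

                for (i = 0; i < pagevec_count(&pvec); i++) {
                        /* ... write back pvec.pages[i] ... */
                }
                pagevec_release(&pvec);
        }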

    Link: http://lkml.kernel.org/r/20171009151359.31984-2-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Daniel Jordan
    Cc: Bob Peterson
    Cc: Chao Yu
    Cc: David Howells
    Cc: David Sterba
    Cc: Ilya Dryomov
    Cc: Jaegeuk Kim
    Cc: Ryusuke Konishi
    Cc: Steve French
    Cc: "Theodore Ts'o"
    Cc: "Yan, Zheng"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

04 Oct, 2017

1 commit

  • Eryu noticed that he could sometimes get a leftover error reported when
    it shouldn't be on fsync with ext2 and non-journalled ext4.

    The problem is that writeback_single_inode still uses filemap_fdatawait.
    That picks up a previously set AS_EIO flag, which would ordinarily have
    been cleared before.

    Since we're mostly using this function as a replacement for
    filemap_check_errors, have filemap_check_and_advance_wb_err clear AS_EIO
    and AS_ENOSPC when reporting an error. That should allow the new
    function to better emulate the behavior of the old with respect to these
    flags.
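
    Conceptually the helper now also folds in what filemap_check_errors()
    used to do; a hedged sketch of that part (not the full function):

        /* also clear the legacy flags so old-style checks stay quiet */
        if (test_and_clear_bit(AS_ENOSPC, &mapping->flags))
                err = -ENOSPC;
        if (test_and_clear_bit(AS_EIO, &mapping->flags))
                err = -EIO;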

    Link: http://lkml.kernel.org/r/20170922133331.28812-1-jlayton@kernel.org
    Signed-off-by: Jeff Layton
    Reported-by: Eryu Guan
    Reviewed-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Layton
     

25 Sep, 2017

1 commit

  • Currently when mixing buffered reads and asynchronous direct writes it
    is possible to end up with the situation where we have stale data in the
    page cache while the new data is already written to disk. This is
    permanent until the affected pages are flushed away. Despite the fact
    that mixing buffered and direct IO is ill-advised, it does pose a threat
    to data integrity, is unexpected, and should be fixed.

    Fix this by deferring completion of asynchronous direct writes to a
    process context in the case that there are mapped pages to be found in
    the inode. Later before the completion in dio_complete() invalidate
    the pages in question. This ensures that after the completion the pages
    in the written area are either unmapped, or populated with up-to-date
    data. Also do the same for the iomap case which uses
    iomap_dio_complete() instead.

    This has a side effect of deferring the completion to a process context
    for every AIO DIO that happens on an inode that has pages mapped. However
    since the consensus is that this is ill-advised practice the performance
    implication should not be a problem.

    This was based on proposal from Jeff Moyer, thanks!

    Reviewed-by: Jan Kara
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Jeff Moyer
    Signed-off-by: Lukas Czerner
    Signed-off-by: Jens Axboe

    Lukas Czerner
     

15 Sep, 2017

2 commits

  • Pull nowait read support from Al Viro:
    "Support IOCB_NOWAIT for buffered reads and block devices"

    * 'work.read_write' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    block_dev: support RWF_NOWAIT on block device nodes
    fs: support RWF_NOWAIT for buffered reads
    fs: support IOCB_NOWAIT in generic_file_buffered_read
    fs: pass iocb to do_generic_file_read
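
    From userspace this surfaces through preadv2() with RWF_NOWAIT; a
    minimal sketch (assumes a kernel and libc that expose the flag):

        #define _GNU_SOURCE
        #include <sys/uio.h>
        #include <errno.h>

        static ssize_t read_if_cached(int fd, void *buf, size_t len)
        {
                struct iovec iov = { .iov_base = buf, .iov_len = len };
                ssize_t n = preadv2(fd, &iov, 1, -1, RWF_NOWAIT);

                if (n < 0 && errno == EAGAIN)
                        return -1;      /* data not in the page cache yet */
                return n;
        }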

    Linus Torvalds
     
  • Now that we have added breaks in the wait queue scan and allow a
    bookmark on the scan position, we put this logic in the wake_up_page_bit
    function.

    We can have very long page wait lists in large systems where multiple
    pages share the same wait list. We break up the wake-up walk here to
    give other CPUs a chance to access the list, and to avoid keeping
    interrupts disabled while traversing the list for too long. This reduces
    the interrupt and rescheduling latency, and excessive page wait queue
    lock hold time.

    [ v2: Remove bookmark_wake_function ]

    Signed-off-by: Tim Chen
    Signed-off-by: Linus Torvalds

    Tim Chen
     

07 Sep, 2017

6 commits

  • Merge updates from Andrew Morton:

    - various misc bits

    - DAX updates

    - OCFS2

    - most of MM

    * emailed patches from Andrew Morton : (119 commits)
    mm,fork: introduce MADV_WIPEONFORK
    x86,mpx: make mpx depend on x86-64 to free up VMA flag
    mm: add /proc/pid/smaps_rollup
    mm: hugetlb: clear target sub-page last when clearing huge page
    mm: oom: let oom_reap_task and exit_mmap run concurrently
    swap: choose swap device according to numa node
    mm: replace TIF_MEMDIE checks by tsk_is_oom_victim
    mm, oom: do not rely on TIF_MEMDIE for memory reserves access
    z3fold: use per-cpu unbuddied lists
    mm, swap: don't use VMA based swap readahead if HDD is used as swap
    mm, swap: add sysfs interface for VMA based swap readahead
    mm, swap: VMA based swap readahead
    mm, swap: fix swap readahead marking
    mm, swap: add swap readahead hit statistics
    mm/vmalloc.c: don't reinvent the wheel but use existing llist API
    mm/vmstat.c: fix wrong comment
    selftests/memfd: add memfd_create hugetlbfs selftest
    mm/shmem: add hugetlbfs support to memfd_create()
    mm, devm_memremap_pages: use multi-order radix for ZONE_DEVICE lookups
    mm/vmalloc.c: halve the number of comparisons performed in pcpu_get_vm_areas()
    ...

    Linus Torvalds
     
  • We want only pages from the given range in filemap_range_has_page();
    furthermore, we want at most a single page.

    So use find_get_pages_range() instead of pagevec_lookup() and remove
    unnecessary code.

    Link: http://lkml.kernel.org/r/20170726114704.7626-10-jack@suse.cz
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Implement a variant of find_get_pages() that stops iterating at a given
    index. This may be a substantial performance gain if the mapping is
    sparse. See following commit for details. Furthermore lots of users of
    this function (through pagevec_lookup()) actually want a range lookup
    and all of them are currently open-coding this.

    Also create corresponding pagevec_lookup_range() function.
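
    A sketch of how such a ranged lookup is driven (written against the
    4.14-era API, which still took an nr_pages argument; a later cleanup,
    listed further up in this log, drops it):

        struct pagevec pvec;
        pgoff_t index = start;

        pagevec_init(&pvec, 0);
        while (pagevec_lookup_range(&pvec, mapping, &index, end,
                                    PAGEVEC_SIZE)) {
                unsigned i;

                for (i = 0; i < pagevec_count(&pvec); i++) {
                        /* ... pvec.pages[i] lies within [start, end] ... */
                }
                pagevec_release(&pvec);
        }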

    Link: http://lkml.kernel.org/r/20170726114704.7626-4-jack@suse.cz
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Make pagevec_lookup() (and the underlying find_get_pages()) update the
    index to the next page where iteration should continue. Most callers
    want this, and pagevec_lookup_tag() already does this.

    Link: http://lkml.kernel.org/r/20170726114704.7626-3-jack@suse.cz
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Now that we no longer insert struct page pointers in DAX radix trees we
    can remove the special casing for DAX in page_cache_tree_insert().

    This also allows us to make dax_wake_mapping_entry_waiter() local to
    fs/dax.c, removing it from dax.h.

    Link: http://lkml.kernel.org/r/20170724170616.25810-5-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Suggested-by: Jan Kara
    Reviewed-by: Jan Kara
    Cc: "Darrick J. Wong"
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Christoph Hellwig
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Ingo Molnar
    Cc: Jonathan Corbet
    Cc: Matthew Wilcox
    Cc: Steven Rostedt
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • Pull writeback error handling updates from Jeff Layton:
    "This pile continues the work from last cycle on better tracking
    writeback errors. In v4.13 we added some basic errseq_t infrastructure
    and converted a few filesystems to use it.

    This set continues refining that infrastructure, adds documentation,
    and converts most of the other filesystems to use it. The main
    exception at this point is the NFS client"

    * tag 'wberr-v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
    ecryptfs: convert to file_write_and_wait in ->fsync
    mm: remove optimizations based on i_size in mapping writeback waits
    fs: convert a pile of fsync routines to errseq_t based reporting
    gfs2: convert to errseq_t based writeback error reporting for fsync
    fs: convert sync_file_range to use errseq_t based error-tracking
    mm: add file_fdatawait_range and file_write_and_wait
    fuse: convert to errseq_t based error tracking for fsync
    mm: consolidate dax / non-dax checks for writeback
    Documentation: add some docs for errseq_t
    errseq: rename __errseq_set to errseq_set

    Linus Torvalds
     

29 Aug, 2017

1 commit

  • Commit 3510ca20ece0 ("Minor page waitqueue cleanups") made the page
    queue code always add new waiters to the back of the queue, which helps
    upcoming patches to batch the wakeups for some horrid loads where the
    wait queues grow to thousands of entries.

    However, I forgot about the nasty add_page_wait_queue() special case
    code that is only used by the cachefiles code. That one still continued
    to add the new wait queue entries at the beginning of the list.

    Fix it, because any sane batched wakeup will require that we don't
    suddenly start getting new entries at the beginning of the list that we
    already handled in a previous batch.

    [ The current code always does the whole list while holding the lock, so
    wait queue ordering doesn't matter for correctness, but even then it's
    better to add later entries at the end from a fairness standpoint ]

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

28 Aug, 2017

2 commits

  • The "lock_page_killable()" function waits for exclusive access to the
    page lock bit using the WQ_FLAG_EXCLUSIVE bit in the waitqueue entry
    set.

    That means that if it gets woken up, other waiters may have been
    skipped.

    That, in turn, means that if it sees the page being unlocked, it *must*
    take that lock and return success, even if a lethal signal is also
    pending.

    So instead of checking for lethal signals first, we need to check for
    them after we've checked the actual bit that we were waiting for. Even
    if that might then delay the killing of the process.

    This matches the order of the old "wait_on_bit_lock()" infrastructure
    that the page locking used to use (and is still used in a few other
    areas).

    Note that if we still return an error after having unsuccessfully tried
    to acquire the page lock, that is ok: that means that some other thread
    was able to get ahead of us and lock the page, and when that other
    thread then unlocks the page, the wakeup event will be repeated. So any
    other pending waiters will now get properly woken up.

    Fixes: 62906027091f ("mm: add PageWaiters indicating tasks are waiting for a page bit")
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Mel Gorman
    Cc: Jan Kara
    Cc: Davidlohr Bueso
    Cc: Andi Kleen
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Tim Chen and Kan Liang have been battling a customer load that shows
    extremely long page wakeup lists. The cause seems to be constant NUMA
    migration of a hot page that is shared across a lot of threads, but the
    actual root cause for the exact behavior has not been found.

    Tim has a patch that batches the wait list traversal at wakeup time, so
    that we at least don't get long uninterruptible cases where we traverse
    and wake up thousands of processes and get nasty latency spikes. That
    is likely 4.14 material, but we're still discussing the page waitqueue
    specific parts of it.

    In the meantime, I've tried to look at making the page wait queues less
    expensive, and failing miserably. If you have thousands of threads
    waiting for the same page, it will be painful. We'll need to try to
    figure out the NUMA balancing issue some day, in addition to avoiding
    the excessive spinlock hold times.

    That said, having tried to rewrite the page wait queues, I can at least
    fix up some of the braindamage in the current situation. In particular:

    (a) we don't want to continue walking the page wait list if the bit
    we're waiting for already got set again (which seems to be one of
    the patterns of the bad load). That makes no progress and just
    causes pointless cache pollution chasing the pointers.

    (b) we don't want to put the non-locking waiters always on the front of
    the queue, and the locking waiters always on the back. Not only is
    that unfair, it means that we wake up thousands of reading threads
    that will just end up being blocked by the writer later anyway.

    Also add a comment about the layout of 'struct wait_page_key' - there is
    an external user of it in the cachefiles code that means that it has to
    match the layout of 'struct wait_bit_key' in the two first members. It
    so happens to match, because 'struct page *' and 'unsigned long *' end
    up having the same values simply because the page flags are the first
    member in struct page.

    Cc: Tim Chen
    Cc: Kan Liang
    Cc: Mel Gorman
    Cc: Christopher Lameter
    Cc: Andi Kleen
    Cc: Davidlohr Bueso
    Cc: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

01 Aug, 2017

2 commits

  • Marcelo added this i_size based optimization with a patch in 2004
    (commitid is from the linux-history tree):

    commit 765dad09b4ac101a32d87af2bb793c3060497d3c
    Author: Marcelo Tosatti
    Date: Tue Sep 7 17:51:17 2004 -0700

    small wait_on_page_writeback_range() optimization

    filemap_fdatawait() calls wait_on_page_writeback_range() with -1
    as "end" parameter. This is not needed since we know the EOF
    from the inode. Use that instead.

    There may be races here, particularly with clustered or network
    filesystems. It also seems like a bit of a layering violation since
    we're operating on an address_space here, not an inode.

    Finally, it's also questionable whether this optimization really helps
    on workloads that we care about. Should we be optimizing for writeback
    vs. truncate races in a codepath where we expect to wait anyway? It
    doesn't seem worth the risk.

    Remove this optimization from the filemap_fdatawait codepaths. This
    means that filemap_fdatawait becomes a trivial wrapper around
    filemap_fdatawait_range.

    Reviewed-by: Jan Kara
    Signed-off-by: Jeff Layton

    Jeff Layton
     
  • Necessary now for gfs2_fsync and sync_file_range, but there will
    eventually be other callers.

    Reviewed-by: Jan Kara
    Signed-off-by: Jeff Layton

    Jeff Layton
     

11 Jul, 2017

1 commit

  • We avoid calling __mod_node_page_state(NR_FILE_PAGES) for hugetlb pages
    now, but that's not enough because later code doesn't handle hugetlb
    properly. Actually in our testing, WARN_ON_ONCE(PageDirty(page)) at the
    end of this function fires for hugetlb, which makes no sense. So we
    should return immediately for hugetlb pages.
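
    The fix boils down to an early return for hugetlb (sketch; the exact
    placement inside the accounting code is not shown in this log):

        if (PageHuge(page))
                return;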

    Link: http://lkml.kernel.org/r/1496305019-5493-3-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: "Aneesh Kumar K.V"
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

08 Jul, 2017

1 commit

  • Pull Writeback error handling updates from Jeff Layton:
    "This pile represents the bulk of the writeback error handling fixes
    that I have for this cycle. Some of the earlier patches in this pile
    may look trivial but they are prerequisites for later patches in the
    series.

    The aim of this set is to improve how we track and report writeback
    errors to userland. Most applications that care about data integrity
    will periodically call fsync/fdatasync/msync to ensure that their
    writes have made it to the backing store.

    For a very long time, we have tracked writeback errors using two flags
    in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a
    writeback error occurs (via mapping_set_error) and are cleared as a
    side-effect of filemap_check_errors (as you noted yesterday). This
    model really sucks for userland.

    Only the first task to call fsync (or msync or fdatasync) will see the
    error. Any subsequent task calling fsync on a file will get back 0
    (unless another writeback error occurs in the interim). If I have
    several tasks writing to a file and calling fsync to ensure that their
    writes got stored, then I need to have them coordinate with one
    another. That's difficult enough, but in a world of containerized
    setups that coordination may even not be possible.

    But wait...it gets worse!

    The calls to filemap_check_errors can be buried pretty far down in the
    call stack, and there are internal callers of filemap_write_and_wait
    and the like that also end up clearing those errors. Many of those
    callers ignore the error return from that function or return it to
    userland at nonsensical times (e.g. truncate() or stat()). If I get
    back -EIO on a truncate, there is no reason to think that it was
    because some previous writeback failed, and a subsequent fsync() will
    (incorrectly) return 0.

    This pile aims to do three things:

    1) ensure that when a writeback error occurs, that error will be
    reported to userland on a subsequent fsync/fdatasync/msync call,
    regardless of what internal callers are doing

    2) report writeback errors on all file descriptions that were open at
    the time that the error occurred. This is a user-visible change,
    but I think most applications are written to assume this behavior
    anyway. Those that aren't are unlikely to be hurt by it.

    3) document what filesystems should do when there is a writeback
    error. Today, there is very little consistency between them, and a
    lot of cargo-cult copying. We need to make it very clear what
    filesystems should do in this situation.

    To achieve this, the set adds a new data type (errseq_t) and then
    builds new writeback error tracking infrastructure around that. Once
    all of that is in place, we change the filesystems to use the new
    infrastructure for reporting wb errors to userland.

    Note that this is just the initial foray into cleaning up this mess.
    There is a lot of work remaining here:

    1) convert the rest of the filesystems in a similar fashion. Once the
    initial set is in, then I think most other fs' will be fairly
    simple to convert. Hopefully most of those can go in via individual
    filesystem trees.

    2) convert internal waiters on writeback to use errseq_t for
    detecting errors instead of relying on the AS_* flags. I have some
    draft patches for this for ext4, but they are not quite ready for
    prime time yet.

    This was a discussion topic this year at LSF/MM too. If you're
    interested in the gory details, LWN has some good articles about this:

    https://lwn.net/Articles/718734/
    https://lwn.net/Articles/724307/"
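
    The heart of the errseq_t scheme, in sketch form (the helpers named
    below are the ones this series introduces; the usage shown is
    illustrative, not lifted from the patches):

        /* each file samples the error sequence when it is opened */
        errseq_t since = errseq_sample(&mapping->wb_err);

        /* the writeback path records a failure */
        errseq_set(&mapping->wb_err, -EIO);

        /* every fsync caller checks against its own sample point, so
         * each open file description sees the error exactly once */
        int err = errseq_check_and_advance(&mapping->wb_err, &since);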

    * tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
    btrfs: minimal conversion to errseq_t writeback error reporting on fsync
    xfs: minimal conversion to errseq_t writeback error reporting
    ext4: use errseq_t based error handling for reporting data writeback errors
    fs: convert __generic_file_fsync to use errseq_t based reporting
    block: convert to errseq_t based writeback error tracking
    dax: set errors in mapping when writeback fails
    Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors
    mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error
    fs: new infrastructure for writeback error handling and reporting
    lib: add errseq_t type and infrastructure for handling it
    mm: don't TestClearPageError in __filemap_fdatawait_range
    mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails
    jbd2: don't clear and reset errors after waiting on writeback
    buffer: set errors in mapping at the time that the error occurs
    fs: check for writeback errors after syncing out buffers in generic_file_fsync
    buffer: use mapping_set_error instead of setting the flag
    mm: fix mapping_set_error call in me_pagecache_dirty

    Linus Torvalds
     

07 Jul, 2017

1 commit

  • Track the following reclaim counters for every memory cgroup: PGREFILL,
    PGSCAN, PGSTEAL, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE and PGLAZYFREED.

    These values are exposed using the memory.stat interface of cgroup v2.

    The meaning of each value is the same as for global counters, available
    using /proc/vmstat.

    Also, for consistency, rename mem_cgroup_count_vm_event() to
    count_memcg_event_mm().
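
    The rename is mechanical; a call site changes roughly like this
    (illustrative fragment):

        /* old name */
        mem_cgroup_count_vm_event(current->mm, PGMAJFAULT);

        /* new name, same semantics */
        count_memcg_event_mm(current->mm, PGMAJFAULT);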

    Link: http://lkml.kernel.org/r/1494530183-30808-1-git-send-email-guro@fb.com
    Signed-off-by: Roman Gushchin
    Suggested-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Cc: Li Zefan
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin