17 Oct, 2015

1 commit

  • The following two locking commits in the DAX code:

    commit 843172978bb9 ("dax: fix race between simultaneous faults")
    commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for DAX")

    introduced a number of deadlocks and other issues which need to be fixed
    for the v4.3 kernel. The list of issues in DAX after these commits
    (some newly introduced by the commits, some preexisting) can be found
    here:

    https://lkml.org/lkml/2015/9/25/602 (Subject: "Re: [PATCH] dax: fix deadlock in __dax_fault").

    This undoes most of the changes introduced by those two commits,
    essentially returning us to the DAX locking scheme that was used in
    v4.2.

    Signed-off-by: Ross Zwisler
    Cc: Alexander Viro
    Cc: Dan Williams
    Tested-by: Dave Chinner
    Cc: Jan Kara
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     

09 Sep, 2015

5 commits

  • __dax_fault() takes i_mmap_lock for write. Let's pair it with write
    unlock on do_cow_fault() side.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • DAX is not so special: we need i_mmap_lock to protect mapping->i_mmap.

    __dax_pmd_fault() uses unmap_mapping_range() to shoot the zero page out
    of all mappings. We need to drop i_mmap_lock there to avoid a deadlock.

    Re-acquiring the lock should be fine since we check i_size after that
    point.
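
    A minimal sketch of that pattern, assuming the write flavour of the
    i_mmap lock (illustrative only, not the literal diff; the names follow
    the __dax_pmd_fault() of that era):

    i_mmap_unlock_write(mapping);
    unmap_mapping_range(mapping, pgoff << PAGE_SHIFT, PMD_SIZE, 0);
    i_mmap_lock_write(mapping);

    /* Re-acquiring is safe because i_size is rechecked afterwards, so a
     * truncate that raced with the unlocked window is still caught. */
    size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
    if (pgoff >= size)
            return VM_FAULT_SIGBUS;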

    Signed-off-by: Kirill A. Shutemov
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • If two threads write-fault on the same hole at the same time, the winner
    of the race will return to userspace and complete their store, only to
    have the loser overwrite their store with zeroes. Fix this for now by
    taking the i_mmap_sem for write instead of read, and doing so outside
    the call to get_block(). Now the loser of the race will see that the
    block has already been zeroed, and will not zero it again.

    This severely limits our scalability. I have ideas for improving it, but
    those can wait for a later patch.

    Signed-off-by: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Allow non-anonymous VMAs to provide huge pages in response to a page fault.

    Signed-off-by: Matthew Wilcox
    Cc: Hillf Danton
    Cc: "Kirill A. Shutemov"
    Cc: Theodore Ts'o
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • special_mapping_fault() is absolutely broken. It seems it was always
    wrong, but this didn't matter until vdso/vvar started to use more than
    one page.

    And after this change vma_is_anonymous() becomes really trivial: it
    simply checks vm_ops == NULL. However, I do think the helper makes
    sense. There are a lot of ->vm_ops != NULL checks; the helper makes the
    caller's code more understandable (self-documenting) and more
    grep-friendly.

    This patch (of 3):

    Preparation. Add the new simple helper, vma_is_anonymous(vma), and change
    handle_pte_fault() to use it. It will have more users.

    The name is not strictly accurate; for example, an hpet_mmap()'ed vma is
    not anonymous. Perhaps it should be named vma_has_fault() instead, but
    it matches the logic in mmap.c/memory.c (see the next changes). "True"
    just means that a page fault will use do_anonymous_page().
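
    The helper itself is tiny; a sketch consistent with the description
    above (vm_ops == NULL is what "anonymous" means here):

    /* mm.h sketch: a VMA with no vm_ops takes the anonymous fault path,
     * i.e. do_anonymous_page(). */
    static inline bool vma_is_anonymous(struct vm_area_struct *vma)
    {
            return !vma->vm_ops;
    }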

    Signed-off-by: Oleg Nesterov
    Acked-by: Kirill A. Shutemov
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

05 Sep, 2015

2 commits

    This makes tlb_next_batch() return bool, since this particular function
    only ever returns one or zero.

    Signed-off-by: Nicholas Krause
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicholas Krause
     
  • This is where the page faults must be modified to call
    handle_userfault() if userfaultfd_missing() is true (so if the
    vma->vm_flags had VM_UFFD_MISSING set).

    handle_userfault() then takes care of blocking the page fault and
    delivering it to userland.

    The fault flags must also be passed as parameter so the "read|write"
    kind of fault can be passed to userland.
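
    A sketch of the kind of hook this adds to the anonymous fault path
    (simplified; the real call site sits in do_anonymous_page() and must
    drop the PTE lock first):

    if (userfaultfd_missing(vma)) {
            pte_unmap_unlock(page_table, ptl);
            /* Block the fault and hand it to the monitor; "reason" says
             * this is a missing page, read/write arrives via the fault
             * flags. */
            return handle_userfault(vma, address, flags, VM_UFFD_MISSING);
    }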

    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

10 Jul, 2015

1 commit

    Reading the page fault handler code I've noticed that under the right
    circumstances the kernel would map anonymous pages into file mappings:
    if the VMA doesn't have vm_ops->fault() and the VMA wasn't fully
    populated on ->mmap(), the kernel would handle a page fault on a
    not-populated pte with do_anonymous_page().

    Let's change the page fault handler to use do_anonymous_page() only on
    an anonymous VMA (->vm_ops == NULL) and make sure that the VMA is not
    shared.

    For file mappings without vm_ops->fault(), or a shared VMA without
    vm_ops, a page fault on a pte_none() entry would lead to SIGBUS.
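
    A sketch of the resulting checks (simplified; names as in mm/memory.c
    of that era). The fault path only takes the anonymous route for VMAs
    without vm_ops:

    /* handle_pte_fault(), pte_none() case (sketch): */
    if (vma->vm_ops)
            return do_fault(mm, vma, address, pte, pmd, flags, entry);
    return do_anonymous_page(mm, vma, address, pte, pmd, flags);

    and do_anonymous_page() itself refuses a shared mapping instead of
    silently handing out anonymous pages (sketch):

    /* File mapping without ->vm_ops? */
    if (vma->vm_flags & VM_SHARED)
            return VM_FAULT_SIGBUS;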

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Oleg Nesterov
    Cc: Andrew Morton
    Cc: Willy Tarreau
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

05 Jul, 2015

1 commit

  • Pull more vfs updates from Al Viro:
    "Assorted VFS fixes and related cleanups (IMO the most interesting in
    that part are f_path-related things and Eric's descriptor-related
    stuff). UFS regression fixes (it got broken last cycle). 9P fixes.
    fs-cache series, DAX patches, Jan's file_remove_suid() work"

    [ I'd say this is much more than "fixes and related cleanups". The
    file_table locking rule change by Eric Dumazet is a rather big and
    fundamental update even if the patch isn't huge. - Linus ]

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
    9p: cope with bogus responses from server in p9_client_{read,write}
    p9_client_write(): avoid double p9_free_req()
    9p: forgetting to cancel request on interrupted zero-copy RPC
    dax: bdev_direct_access() may sleep
    block: Add support for DAX reads/writes to block devices
    dax: Use copy_from_iter_nocache
    dax: Add block size note to documentation
    fs/file.c: __fget() and dup2() atomicity rules
    fs/file.c: don't acquire files->file_lock in fd_install()
    fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
    vfs: avoid creation of inode number 0 in get_next_ino
    namei: make set_root_rcu() return void
    make simple_positive() public
    ufs: use dir_pages instead of ufs_dir_pages()
    pagemap.h: move dir_pages() over there
    remove the pointless include of lglock.h
    fs: cleanup slight list_entry abuse
    xfs: Correctly lock inode when removing suid and file capabilities
    fs: Call security_ops->inode_killpriv on truncate
    fs: Provide function telling whether file_remove_privs() will do anything
    ...

    Linus Torvalds
     

25 Jun, 2015

1 commit

  • Historically memcg overhead was high even if memcg was unused. This has
    improved a lot but it still showed up in a profile summary as being a
    problem.

    /usr/src/linux-4.0-vanilla/mm/memcontrol.c 6.6441 395842
    mem_cgroup_try_charge 2.950% 175781
    __mem_cgroup_count_vm_event 1.431% 85239
    mem_cgroup_page_lruvec 0.456% 27156
    mem_cgroup_commit_charge 0.392% 23342
    uncharge_list 0.323% 19256
    mem_cgroup_update_lru_size 0.278% 16538
    memcg_check_events 0.216% 12858
    mem_cgroup_charge_statistics.isra.22 0.188% 11172
    try_charge 0.150% 8928
    commit_charge 0.141% 8388
    get_mem_cgroup_from_mm 0.121% 7184

    That is showing that 6.64% of system CPU cycles were in memcontrol.c and
    dominated by mem_cgroup_try_charge. The annotation shows that the bulk
    of the cost was checking PageSwapCache which is expected to be cache hot
    but is very expensive. The problem appears to be that __SetPageUptodate
    is called just before the check which is a write barrier. It is
    required to make sure struct page and page data is written before the
    PTE is updated and the data visible to userspace. memcg charging does
    not require or need the barrier but gets unfairly hit with the cost so
    this patch attempts the charging before the barrier. Aside from the
    accidental cost to memcg there is the added benefit that the barrier is
    avoided if the page cannot be charged. When applied the relevant
    profile summary is as follows.

    /usr/src/linux-4.0-chargefirst-v2r1/mm/memcontrol.c 3.7907 223277
    __mem_cgroup_count_vm_event 1.143% 67312
    mem_cgroup_page_lruvec 0.465% 27403
    mem_cgroup_commit_charge 0.381% 22452
    uncharge_list 0.332% 19543
    mem_cgroup_update_lru_size 0.284% 16704
    get_mem_cgroup_from_mm 0.271% 15952
    mem_cgroup_try_charge 0.237% 13982
    memcg_check_events 0.222% 13058
    mem_cgroup_charge_statistics.isra.22 0.185% 10920
    commit_charge 0.140% 8235
    try_charge 0.131% 7716

    That brings the overhead down to 3.79% and leaves the memcg fault
    accounting to the root cgroup but it's an improvement. The difference
    in headline performance of the page fault microbench is marginal as
    memcg is such a small component of it.

    pft faults
    4.0.0 4.0.0
    vanilla chargefirst
    Hmean faults/cpu-1 1443258.1051 ( 0.00%) 1509075.7561 ( 4.56%)
    Hmean faults/cpu-3 1340385.9270 ( 0.00%) 1339160.7113 ( -0.09%)
    Hmean faults/cpu-5 875599.0222 ( 0.00%) 874174.1255 ( -0.16%)
    Hmean faults/cpu-7 601146.6726 ( 0.00%) 601370.9977 ( 0.04%)
    Hmean faults/cpu-8 510728.2754 ( 0.00%) 510598.8214 ( -0.03%)
    Hmean faults/sec-1 1432084.7845 ( 0.00%) 1497935.5274 ( 4.60%)
    Hmean faults/sec-3 3943818.1437 ( 0.00%) 3941920.1520 ( -0.05%)
    Hmean faults/sec-5 3877573.5867 ( 0.00%) 3869385.7553 ( -0.21%)
    Hmean faults/sec-7 3991832.0418 ( 0.00%) 3992181.4189 ( 0.01%)
    Hmean faults/sec-8 3987189.8167 ( 0.00%) 3986452.2204 ( -0.02%)

    It's only visible at single threaded. The overhead is there for higher
    threads but other factors dominate.
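
    A sketch of the reordering in the anonymous fault path, assuming the
    do_anonymous_page() of that era (illustrative, not the full function):

    page = alloc_zeroed_user_highpage_movable(vma, address);
    if (!page)
            goto oom;

    /* Charge before __SetPageUptodate(): the charge path no longer sits
     * behind the write barrier, and the barrier is skipped entirely when
     * the charge fails. */
    if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg))
            goto oom_free_page;

    /* The barrier ensures the page contents are visible before the PTE
     * is set and the page becomes visible to userspace. */
    __SetPageUptodate(page);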

    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

19 May, 2015

1 commit

  • Commit 662bbcb2747c ("mm, sched: Allow uaccess in atomic with
    pagefault_disable()") removed might_sleep() checks for all user access
    code (that uses might_fault()).

    The reason was to disable wrong "sleep in atomic" warnings in the
    following scenario:

    pagefault_disable()
    rc = copy_to_user(...)
    pagefault_enable()

    Which is valid, as pagefault_disable() increments the preempt counter
    and therefore disables the pagefault handler. copy_to_user() will not
    sleep and return an error code if a page is not available.

    However, as all might_sleep() checks are removed,
    CONFIG_DEBUG_ATOMIC_SLEEP would no longer detect the following scenario:

    spin_lock(&lock);
    rc = copy_to_user(...)
    spin_unlock(&lock)

    If the kernel is compiled with preemption turned on, preempt_disable()
    will make in_atomic() detect disabled preemption. The fault handler would
    correctly never sleep on user access.
    However, with preemption turned off, preempt_disable() is usually a NOP
    (with !CONFIG_PREEMPT_COUNT), therefore in_atomic() will not be able to
    detect disabled preemption nor disabled pagefaults. The fault handler
    could sleep.
    We really want to enable CONFIG_DEBUG_ATOMIC_SLEEP checks for user access
    functions again, otherwise we can end up with horrible deadlocks.

    Root of all evil is that pagefault_disable() acts almost as
    preempt_disable(), depending on preemption being turned on/off.

    As we now have pagefault_disabled(), we can use it to distinguish
    whether user access functions might sleep.

    Convert might_fault() into a macro that calls __might_fault(), to
    allow proper file + line messages in case of a might_sleep() warning.
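
    The resulting shape is roughly the following, under the configs that
    enable the check (sketch):

    /* Report the caller's file:line in the might_sleep() warning. */
    void __might_fault(const char *file, int line);
    #define might_fault() __might_fault(__FILE__, __LINE__)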

    Reviewed-and-tested-by: Thomas Gleixner
    Signed-off-by: David Hildenbrand
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: David.Laight@ACULAB.COM
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: airlied@linux.ie
    Cc: akpm@linux-foundation.org
    Cc: benh@kernel.crashing.org
    Cc: bigeasy@linutronix.de
    Cc: borntraeger@de.ibm.com
    Cc: daniel.vetter@intel.com
    Cc: heiko.carstens@de.ibm.com
    Cc: herbert@gondor.apana.org.au
    Cc: hocko@suse.cz
    Cc: hughd@google.com
    Cc: mst@redhat.com
    Cc: paulus@samba.org
    Cc: ralf@linux-mips.org
    Cc: schwidefsky@de.ibm.com
    Cc: yang.shi@windriver.com
    Link: http://lkml.kernel.org/r/1431359540-32227-3-git-send-email-dahi@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    David Hildenbrand
     

16 Apr, 2015

3 commits

    This will allow filesystems that use VM_PFNMAP | VM_MIXEDMAP (no page
    structs) to get notified when an access is a write to a read-only PFN.

    This can happen if we mmap() a file, then first mmap-read from it to
    page-in a read-only PFN, then mmap-write to the same page.

    We need this functionality to fix a DAX bug, where in the scenario above
    we fail to set ctime/mtime though we modified the file. An xfstest is
    attached to this patchset that shows the failure and the fix. (A DAX
    patch will follow)

    This functionality is extra important for us, because upon dirtying of a
    pmem page we also want to RDMA the page to a remote cluster node.

    We define a new pfn_mkwrite and do not reuse page_mkwrite because:
    1 - The name ;-)
    2 - But mainly because it would take a very long and tedious audit of
    all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP users to make sure
    they do not now crash. For example, the current DAX code (which this is
    for) would crash. If we wanted to reuse page_mkwrite, we would need to
    first patch all users so they do not crash on a NULL page, and only
    then enable this patch. But even if I did that I would not sleep so
    well at night. Adding a new vector is the safest thing to do, and it is
    not that expensive: an extra pointer in a static function vector per
    driver. The new vector is also better for performance, because
    otherwise we would call every current page_mkwrite implementation just
    so it can find there is no page, do nothing and return.

    No need to call it from do_shared_fault because do_wp_page is called to
    change pte permissions anyway.
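
    A sketch of the new vector (the signature matches the
    vm_operations_struct of that era):

    /* Notification that a previously read-only PFN is about to become
     * writable; the FS can update ctime/mtime here. */
    int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);

    and of how the write-protect fault path consults it for shared
    VM_PFNMAP/VM_MIXEDMAP VMAs (simplified, illustrative):

    if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
            pte_unmap_unlock(page_table, ptl);
            ret = vma->vm_ops->pfn_mkwrite(vma, &vmf);
            if (ret & VM_FAULT_ERROR)
                    return ret;
            /* retake the PTE lock and revalidate before making the PTE
             * writable */
    }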

    Signed-off-by: Yigal Korman
    Signed-off-by: Boaz Harrosh
    Acked-by: Kirill A. Shutemov
    Cc: Matthew Wilcox
    Cc: Jan Kara
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Boaz Harrosh
     
    A lot of filesystems use generic_file_mmap() and filemap_fault(), so
    f_op->mmap and vm_ops->fault aren't enough to identify the filesystem.

    This prints file name, vm_ops->fault, f_op->mmap and a_ops->readpage
    (which is almost always implemented and filesystem-specific).

    Example:

    [ 23.676410] BUG: Bad page map in process sh pte:1b7e6025 pmd:19bbd067
    [ 23.676887] page:ffffea00006df980 count:4 mapcount:1 mapping:ffff8800196426c0 index:0x97
    [ 23.677481] flags: 0x10000000000000c(referenced|uptodate)
    [ 23.677896] page dumped because: bad pte
    [ 23.678205] addr:00007f52fcb17000 vm_flags:00000075 anon_vma: (null) mapping:ffff8800196426c0 index:97
    [ 23.678922] file:libc-2.19.so fault:filemap_fault mmap:generic_file_readonly_mmap readpage:v9fs_vfs_readpage

    [akpm@linux-foundation.org: use pr_alert, per Kirill]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Sasha Levin
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
    We converted some of the usages of ACCESS_ONCE to READ_ONCE in the mm/
    tree since ACCESS_ONCE doesn't work reliably on non-scalar types.

    This patch removes the rest of the usages of ACCESS_ONCE and uses the
    new READ_ONCE API for the read accesses. This makes things cleaner,
    instead of using multiple separate sets of APIs.
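
    An illustrative example of the kind of one-line conversion this makes
    (a typical pattern, not necessarily one of the exact changed lines):

    /* before: breaks when pmd_t is a structure (non-scalar) */
    pmd_t pmdval = ACCESS_ONCE(*pmd);

    /* after: READ_ONCE copes with non-scalar types */
    pmd_t pmdval = READ_ONCE(*pmd);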

    Signed-off-by: Jason Low
    Acked-by: Michal Hocko
    Acked-by: Davidlohr Bueso
    Acked-by: Rik van Riel
    Reviewed-by: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Low
     

15 Apr, 2015

4 commits

  • The do_wp_page function is extremely long. Extract the logic for
    handling a page belonging to a shared vma into a function of its own.

    This helps the readability of the code without making any functional
    change.

    Signed-off-by: Shachar Raindel
    Acked-by: Linus Torvalds
    Acked-by: Kirill A. Shutemov
    Acked-by: Rik van Riel
    Acked-by: Andi Kleen
    Acked-by: Haggai Eran
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: Peter Feiner
    Cc: Michel Lespinasse
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shachar Raindel
     
    In some cases, do_wp_page had to copy the page suffering a write fault
    to a new location. If the function logic decided to do this, it was
    done by jumping with a "goto" operation to the relevant code block.
    This made the code really hard to understand. It is also against the
    kernel coding style guidelines.

    This patch extracts the page copy and page table update logic to a
    separate function. It also cleans up the naming, from "gotten" to
    "wp_page_copy", and adds a few comments.

    Signed-off-by: Shachar Raindel
    Acked-by: Linus Torvalds
    Acked-by: Kirill A. Shutemov
    Acked-by: Rik van Riel
    Acked-by: Andi Kleen
    Acked-by: Haggai Eran
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: Peter Feiner
    Cc: Michel Lespinasse
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shachar Raindel
     
  • When do_wp_page is ending, in several cases it needs to unlock the pages
    and ptls it was accessing.

    Currently, this logic was "called" by using a goto jump. This makes
    following the control flow of the function harder. Readability was
    further hampered by the unlock case containing a large amount of logic
    needed in only one of the 3 cases.

    Using goto for cleanup is generally allowed. However, moving the
    trivial unlocking flows to the relevant call sites allows deeper
    refactoring in the next patch.

    Signed-off-by: Shachar Raindel
    Acked-by: Linus Torvalds
    Acked-by: Kirill A. Shutemov
    Acked-by: Rik van Riel
    Acked-by: Andi Kleen
    Acked-by: Haggai Eran
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: Peter Feiner
    Cc: Michel Lespinasse
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shachar Raindel
     
  • Currently do_wp_page contains 265 code lines. It also contains 9 goto
    statements, of which 5 are targeting labels which are not cleanup
    related. This makes the function extremely difficult to understand.

    The following patches are an attempt at breaking the function into its
    basic components, and making it easier to understand.

    The patches are straightforward function extractions from do_wp_page.
    As we extract functions, we remove unneeded parameters and simplify the
    code as much as possible. However, the functionality is supposed to
    remain completely unchanged. The patches also attempt to document the
    functionality of each extracted function. In patch 2, we split the
    unlock logic so that each use case contains only the logic relevant to
    its specific needs, instead of having a huge number of conditional
    decisions in a single unlock flow.

    This patch (of 4):

    When do_wp_page is ending, in several cases it needs to reuse the existing
    page. This is achieved by making the page table writable, and possibly
    updating the page-cache state.

    Currently, this logic was "called" by using a goto jump. This makes
    following the control flow of the function harder. It is also against the
    coding style guidelines for using goto.

    As the code can easily be refactored into a specialized function, refactor
    it out and simplify the code flow in do_wp_page.
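
    A sketch of the kind of helper this extracts (parameter list trimmed
    for illustration; the real wp_page_reuse() also handles the
    page_mkwrite/dirty-shared bookkeeping):

    static inline int wp_page_reuse(struct mm_struct *mm,
                    struct vm_area_struct *vma, unsigned long address,
                    pte_t *page_table, spinlock_t *ptl, pte_t orig_pte)
    {
            pte_t entry;

            /* Make the existing page writable, mark it accessed/dirty. */
            flush_cache_page(vma, address, pte_pfn(orig_pte));
            entry = pte_mkyoung(orig_pte);
            entry = maybe_mkwrite(pte_mkdirty(entry), vma);
            if (ptep_set_access_flags(vma, address, page_table, entry, 1))
                    update_mmu_cache(vma, address, page_table);
            pte_unmap_unlock(page_table, ptl);

            return VM_FAULT_WRITE;
    }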

    Acked-by: Linus Torvalds
    Acked-by: Kirill A. Shutemov
    Acked-by: Rik van Riel
    Acked-by: Andi Kleen
    Acked-by: Haggai Eran
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: Peter Feiner
    Cc: Michel Lespinasse
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shachar Raindel
     

26 Mar, 2015

3 commits

  • Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226

    Across the board the 4.0-rc1 numbers are much slower, and the degradation
    is far worse when using the large memory footprint configs. Perf points
    straight at the cause - this is from 4.0-rc1 on the "-o bhash=101073" config:

    - 56.07% 56.07% [kernel] [k] default_send_IPI_mask_sequence_phys
    - default_send_IPI_mask_sequence_phys
    - 99.99% physflat_send_IPI_mask
    - 99.37% native_send_call_func_ipi
    smp_call_function_many
    - native_flush_tlb_others
    - 99.85% flush_tlb_page
    ptep_clear_flush
    try_to_unmap_one
    rmap_walk
    try_to_unmap
    migrate_pages
    migrate_misplaced_page
    - handle_mm_fault
    - 99.73% __do_page_fault
    trace_do_page_fault
    do_async_page_fault
    + async_page_fault
    0.63% native_send_call_func_single_ipi
    generic_exec_single
    smp_call_function_single

    This is showing excessive migration activity even though excessive
    migrations are meant to get throttled. Normally, the scan rate is tuned
    on a per-task basis depending on the locality of faults. However, if
    migrations fail for any reason then the PTE scanner may scan faster if
    the faults continue to be remote. This means there is higher system CPU
    overhead and fault trapping at exactly the time we know that migrations
    cannot happen. This patch tracks when migration failures occur and
    slows the PTE scanner.

    Signed-off-by: Mel Gorman
    Reported-by: Dave Chinner
    Tested-by: Dave Chinner
    Cc: Ingo Molnar
    Cc: Aneesh Kumar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Protecting a PTE to trap a NUMA hinting fault clears the writable bit
    and further faults are needed after trapping a NUMA hinting fault to set
    the writable bit again. This patch preserves the writable bit when
    trapping NUMA hinting faults. The impact is obvious from the number of
    minor faults trapped during the basis balancing benchmark and the system
    CPU usage;

    autonumabench
    4.0.0-rc4 4.0.0-rc4
    baseline preserve
    Time System-NUMA01 107.13 ( 0.00%) 103.13 ( 3.73%)
    Time System-NUMA01_THEADLOCAL 131.87 ( 0.00%) 83.30 ( 36.83%)
    Time System-NUMA02 8.95 ( 0.00%) 10.72 (-19.78%)
    Time System-NUMA02_SMT 4.57 ( 0.00%) 3.99 ( 12.69%)
    Time Elapsed-NUMA01 515.78 ( 0.00%) 517.26 ( -0.29%)
    Time Elapsed-NUMA01_THEADLOCAL 384.10 ( 0.00%) 384.31 ( -0.05%)
    Time Elapsed-NUMA02 48.86 ( 0.00%) 48.78 ( 0.16%)
    Time Elapsed-NUMA02_SMT 47.98 ( 0.00%) 48.12 ( -0.29%)

    4.0.0-rc4 4.0.0-rc4
    baseline preserve
    User 44383.95 43971.89
    System 252.61 201.24
    Elapsed 998.68 1000.94

    Minor Faults 2597249 1981230
    Major Faults 365 364

    There is a similar drop in system CPU usage using Dave Chinner's xfsrepair
    workload

    4.0.0-rc4 4.0.0-rc4
    baseline preserve
    Amean real-xfsrepair 454.14 ( 0.00%) 442.36 ( 2.60%)
    Amean syst-xfsrepair 277.20 ( 0.00%) 204.68 ( 26.16%)

    The patch looks hacky but the alternatives looked worse. The tidiest
    was to rewalk the page tables after a hinting fault, but it was more
    complex than this approach and the performance was worse. It's not generally
    safe to just mark the page writable during the fault if it's a write
    fault as it may have been read-only for COW so that approach was
    discarded.
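
    A sketch of the idea, assuming the change_pte_range()/do_numa_page()
    pair of that era (simplified). First, remember the writability when
    the PTE is made PROT_NONE for NUMA hinting:

    bool preserve_write = prot_numa && pte_write(oldpte);

    ptent = ptep_modify_prot_start(mm, addr, pte);
    ptent = pte_modify(ptent, newprot);
    if (preserve_write)
            ptent = pte_mkwrite(ptent);

    and correspondingly restore it when the hinting fault is handled, so no
    second write fault is needed (sketch of do_numa_page()):

    bool was_writable = pte_write(pte);

    /* ... NUMA hinting handled, make the PTE present again ... */
    pte = pte_modify(pte, vma->vm_page_prot);
    pte = pte_mkyoung(pte);
    if (was_writable)
            pte = pte_mkwrite(pte);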

    Signed-off-by: Mel Gorman
    Reported-by: Dave Chinner
    Tested-by: Dave Chinner
    Cc: Ingo Molnar
    Cc: Aneesh Kumar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • These are three follow-on patches based on the xfsrepair workload Dave
    Chinner reported was problematic in 4.0-rc1 due to changes in page table
    management -- https://lkml.org/lkml/2015/3/1/226.

    Much of the problem was reduced by commit 53da3bc2ba9e ("mm: fix up numa
    read-only thread grouping logic") and commit ba68bc0115eb ("mm: thp:
    Return the correct value for change_huge_pmd"). It was known that the
    performance in 3.19 was still better, even though it is far less safe. This
    series aims to restore the performance without compromising on safety.

    For the tests in this mail, I'm comparing 3.19 against 4.0-rc4 and the
    three patches applied on top

    autonumabench
    3.19.0 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4
    vanilla vanilla vmwrite-v5r8 preserve-v5r8 slowscan-v5r8
    Time System-NUMA01 124.00 ( 0.00%) 161.86 (-30.53%) 107.13 ( 13.60%) 103.13 ( 16.83%) 145.01 (-16.94%)
    Time System-NUMA01_THEADLOCAL 115.54 ( 0.00%) 107.64 ( 6.84%) 131.87 (-14.13%) 83.30 ( 27.90%) 92.35 ( 20.07%)
    Time System-NUMA02 9.35 ( 0.00%) 10.44 (-11.66%) 8.95 ( 4.28%) 10.72 (-14.65%) 8.16 ( 12.73%)
    Time System-NUMA02_SMT 3.87 ( 0.00%) 4.63 (-19.64%) 4.57 (-18.09%) 3.99 ( -3.10%) 3.36 ( 13.18%)
    Time Elapsed-NUMA01 570.06 ( 0.00%) 567.82 ( 0.39%) 515.78 ( 9.52%) 517.26 ( 9.26%) 543.80 ( 4.61%)
    Time Elapsed-NUMA01_THEADLOCAL 393.69 ( 0.00%) 384.83 ( 2.25%) 384.10 ( 2.44%) 384.31 ( 2.38%) 380.73 ( 3.29%)
    Time Elapsed-NUMA02 49.09 ( 0.00%) 49.33 ( -0.49%) 48.86 ( 0.47%) 48.78 ( 0.63%) 50.94 ( -3.77%)
    Time Elapsed-NUMA02_SMT 47.51 ( 0.00%) 47.15 ( 0.76%) 47.98 ( -0.99%) 48.12 ( -1.28%) 49.56 ( -4.31%)

    3.19.0 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4
    vanilla vanilla vmwrite-v5r8 preserve-v5r8 slowscan-v5r8
    User 46334.60 46391.94 44383.95 43971.89 44372.12
    System 252.84 284.66 252.61 201.24 249.00
    Elapsed 1062.14 1050.96 998.68 1000.94 1026.78

    Overall the system CPU usage is comparable and the test is naturally a
    bit variable. The slowing of the scanner hurts numa01 but on this
    machine it is an adverse workload and patches that dramatically help it
    often hurt absolutely everything else.

    Due to patch 2, the fault activity is interesting

    3.19.0 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4
    vanilla vanilla vmwrite-v5r8 preserve-v5r8 slowscan-v5r8
    Minor Faults 2097811 2656646 2597249 1981230 1636841
    Major Faults 362 450 365 364 365

    Note how preserving the write bit across protection updates and faults
    reduces the number of minor faults.

    NUMA alloc hit 1229008 1217015 1191660 1178322 1199681
    NUMA alloc miss 0 0 0 0 0
    NUMA interleave hit 0 0 0 0 0
    NUMA alloc local 1228514 1216317 1190871 1177448 1199021
    NUMA base PTE updates 245706197 240041607 238195516 244704842 115012800
    NUMA huge PMD updates 479530 468448 464868 477573 224487
    NUMA page range updates 491225557 479886983 476207932 489222218 229950144
    NUMA hint faults 659753 656503 641678 656926 294842
    NUMA hint local faults 381604 373963 360478 337585 186249
    NUMA hint local percent 57 56 56 51 63
    NUMA pages migrated 5412140 6374899 6266530 5277468 5755096
    AutoNUMA cost 5121% 5083% 4994% 5097% 2388%

    Here the impact of slowing the PTE scanner on migration failures is
    obvious, as "NUMA base PTE updates" and "NUMA huge PMD updates" are
    massively reduced even though the headline performance is very similar.

    As xfsrepair was the reported workload, here is the impact of the
    series on it.

    xfsrepair
    3.19.0 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4
    vanilla vanilla vmwrite-v5r8 preserve-v5r8 slowscan-v5r8
    Min real-fsmark 1183.29 ( 0.00%) 1165.73 ( 1.48%) 1152.78 ( 2.58%) 1153.64 ( 2.51%) 1177.62 ( 0.48%)
    Min syst-fsmark 4107.85 ( 0.00%) 4027.75 ( 1.95%) 3986.74 ( 2.95%) 3979.16 ( 3.13%) 4048.76 ( 1.44%)
    Min real-xfsrepair 441.51 ( 0.00%) 463.96 ( -5.08%) 449.50 ( -1.81%) 440.08 ( 0.32%) 439.87 ( 0.37%)
    Min syst-xfsrepair 195.76 ( 0.00%) 278.47 (-42.25%) 262.34 (-34.01%) 203.70 ( -4.06%) 143.64 ( 26.62%)
    Amean real-fsmark 1188.30 ( 0.00%) 1177.34 ( 0.92%) 1157.97 ( 2.55%) 1158.21 ( 2.53%) 1182.22 ( 0.51%)
    Amean syst-fsmark 4111.37 ( 0.00%) 4055.70 ( 1.35%) 3987.19 ( 3.02%) 3998.72 ( 2.74%) 4061.69 ( 1.21%)
    Amean real-xfsrepair 450.88 ( 0.00%) 468.32 ( -3.87%) 454.14 ( -0.72%) 442.36 ( 1.89%) 440.59 ( 2.28%)
    Amean syst-xfsrepair 199.66 ( 0.00%) 290.60 (-45.55%) 277.20 (-38.84%) 204.68 ( -2.51%) 150.55 ( 24.60%)
    Stddev real-fsmark 4.12 ( 0.00%) 10.82 (-162.29%) 4.14 ( -0.28%) 5.98 (-45.05%) 4.60 (-11.53%)
    Stddev syst-fsmark 2.63 ( 0.00%) 20.32 (-671.82%) 0.37 ( 85.89%) 16.47 (-525.59%) 15.05 (-471.79%)
    Stddev real-xfsrepair 6.87 ( 0.00%) 4.55 ( 33.75%) 3.46 ( 49.58%) 1.78 ( 74.12%) 0.52 ( 92.50%)
    Stddev syst-xfsrepair 3.02 ( 0.00%) 10.30 (-241.37%) 13.17 (-336.37%) 0.71 ( 76.63%) 5.00 (-65.61%)
    CoeffVar real-fsmark 0.35 ( 0.00%) 0.92 (-164.73%) 0.36 ( -2.91%) 0.52 (-48.82%) 0.39 (-12.10%)
    CoeffVar syst-fsmark 0.06 ( 0.00%) 0.50 (-682.41%) 0.01 ( 85.45%) 0.41 (-543.22%) 0.37 (-478.78%)
    CoeffVar real-xfsrepair 1.52 ( 0.00%) 0.97 ( 36.21%) 0.76 ( 49.94%) 0.40 ( 73.62%) 0.12 ( 92.33%)
    CoeffVar syst-xfsrepair 1.51 ( 0.00%) 3.54 (-134.54%) 4.75 (-214.31%) 0.34 ( 77.20%) 3.32 (-119.63%)
    Max real-fsmark 1193.39 ( 0.00%) 1191.77 ( 0.14%) 1162.90 ( 2.55%) 1166.66 ( 2.24%) 1188.50 ( 0.41%)
    Max syst-fsmark 4114.18 ( 0.00%) 4075.45 ( 0.94%) 3987.65 ( 3.08%) 4019.45 ( 2.30%) 4082.80 ( 0.76%)
    Max real-xfsrepair 457.80 ( 0.00%) 474.60 ( -3.67%) 457.82 ( -0.00%) 444.42 ( 2.92%) 441.03 ( 3.66%)
    Max syst-xfsrepair 203.11 ( 0.00%) 303.65 (-49.50%) 294.35 (-44.92%) 205.33 ( -1.09%) 155.28 ( 23.55%)

    The really relevant lines are syst-xfsrepair, which is the system CPU
    usage when running xfsrepair. Note that on my machine the overhead was
    45% higher on 4.0-rc4, which may be part of what Dave is seeing. Once we
    preserve the write bit across faults, it's only 2.51% higher on average.
    With the full series applied, system CPU usage is 24.6% lower on
    average.

    Again, the impact of preserving the write bit on minor faults is obvious
    and the impact of slowing scanning after migration failures is obvious
    on the PTE updates. Note also that the number of pages migrated is much
    reduced even though the headline performance is comparable.

    3.19.0 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4
    vanilla vanilla vmwrite-v5r8 preserve-v5r8 slowscan-v5r8
    Minor Faults 153466827 254507978 249163829 153501373 105737890
    Major Faults 610 702 690 649 724
    NUMA base PTE updates 217735049 210756527 217729596 216937111 144344993
    NUMA huge PMD updates 129294 85044 106921 127246 79887
    NUMA pages migrated 21938995 29705270 28594162 22687324 16258075

    3.19.0 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4 4.0.0-rc4
    vanilla vanilla vmwrite-v5r8 preserve-v5r8 slowscan-v5r8
    Mean sdb-avgqusz 13.47 2.54 2.55 2.47 2.49
    Mean sdb-avgrqsz 202.32 140.22 139.50 139.02 138.12
    Mean sdb-await 25.92 5.09 5.33 5.02 5.22
    Mean sdb-r_await 4.71 0.19 0.83 0.51 0.11
    Mean sdb-w_await 104.13 5.21 5.38 5.05 5.32
    Mean sdb-svctm 0.59 0.13 0.14 0.13 0.14
    Mean sdb-rrqm 0.16 0.00 0.00 0.00 0.00
    Mean sdb-wrqm 3.59 1799.43 1826.84 1812.21 1785.67
    Max sdb-avgqusz 111.06 12.13 14.05 11.66 15.60
    Max sdb-avgrqsz 255.60 190.34 190.01 187.33 191.78
    Max sdb-await 168.24 39.28 49.22 44.64 65.62
    Max sdb-r_await 660.00 52.00 280.00 76.00 12.00
    Max sdb-w_await 7804.00 39.28 49.22 44.64 65.62
    Max sdb-svctm 4.00 2.82 2.86 1.98 2.84
    Max sdb-rrqm 8.30 0.00 0.00 0.00 0.00
    Max sdb-wrqm 34.20 5372.80 5278.60 5386.60 5546.15

    FWIW, I also checked SPECjbb in different configurations, and the
    observations are similar -- minor faults lower, PTE update activity
    lower and performance roughly comparable against 3.19.

    This patch (of 3):

    Threads that share writable data within pages are grouped together as
    related tasks. This decision is based on whether the PTE is marked
    dirty which is subject to timing races between the PTE scanner update
    and when the application writes the page. If the page is file-backed,
    then background flushes and sync also affect placement. This is
    unpredictable behaviour which is impossible to reason about so this
    patch makes grouping decisions based on the VMA flags.

    Signed-off-by: Mel Gorman
    Reported-by: Dave Chinner
    Tested-by: Dave Chinner
    Cc: Ingo Molnar
    Cc: Aneesh Kumar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

12 Mar, 2015

1 commit

  • Dave Chinner reported that commit 4d9424669946 ("mm: convert
    p[te|md]_mknonnuma and remaining page table manipulations") slowed down
    his xfsrepair test enormously. In particular, it was using more system
    time due to extra TLB flushing.

    The ultimate reason turns out to be how the change to use the regular
    page table accessor functions broke the NUMA grouping logic. The old
    special mknuma/mknonnuma code accessed the page table present bit and
    the magic NUMA bit directly, while the new code just changes the page
    protections using PROT_NONE and the regular vma protections.

    That sounds equivalent, and from a fault standpoint it really is, but a
    subtle side effect is that the *other* protection bits of the page table
    entries also change. And the code to decide how to group the NUMA
    entries together used the writable bit to decide whether a particular
    page was likely to be shared read-only or not.

    And with the change to make the NUMA handling use the regular permission
    setting functions, that writable bit was basically always cleared for
    private mappings due to COW. So even if the page actually ends up being
    written to in the end, the NUMA balancing would act as if it was always
    shared RO.

    This code is a heuristic anyway, so the fix - at least for now - is to
    instead check whether the page is dirty rather than writable. The bit
    doesn't change with protection changes.

    NOTE! This also adds a FIXME comment to revisit this issue:

    Not only should we probably re-visit the whole "is this a shared
    read-only page" heuristic (we might want to take the vma permissions
    into account and base this more on those than the per-page ones, and
    also look at whether the particular access that triggers it is a write
    or not), but the whole COW issue shows that we should think about the
    NUMA fault handling some more.

    For example, maybe we should do the early-COW thing that a regular fault
    does. Or maybe we should accept that while using the same bits as
    PROTNONE was a good thing (and got rid of the special NUMA bit), we
    might still want to just preserve the other protection bits across NUMA
    faulting.

    Those are bigger questions, left for later. This just fixes up the
    heuristic so that it at least approximates working again. More analysis
    and work needed.
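
    The heuristic change itself is a one-liner in the NUMA fault path; a
    sketch (simplified from do_numa_page()):

    /*
     * Avoid grouping on RO pages in general.  COW clears pte_write() on
     * private mappings even for pages that will be written, so use the
     * dirty bit as the "was this written" approximation instead.
     */
    if (!pte_dirty(pte))
            flags |= TNF_NO_GROUP;  /* was: if (!pte_write(pte)) */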

    Reported-by: Dave Chinner
    Tested-by: Mel Gorman
    Cc: Andrew Morton
    Cc: Aneesh Kumar
    Cc: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

17 Feb, 2015

2 commits

  • Currently COW of an XIP file is done by first bringing in a read-only
    mapping, then retrying the fault and copying the page. It is much more
    efficient to tell the fault handler that a COW is being attempted (by
    passing in the pre-allocated page in the vm_fault structure), and allow
    the handler to perform the COW operation itself.

    The handler cannot insert the page itself if there is already a read-only
    mapping at that address, so allow the handler to return VM_FAULT_LOCKED
    and set the fault_page to be NULL. This indicates to the MM code that the
    i_mmap_lock is held instead of the page lock.
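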

    Signed-off-by: Matthew Wilcox
    Acked-by: Kirill A. Shutemov
    Cc: Andreas Dilger
    Cc: Boaz Harrosh
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Jens Axboe
    Cc: Mathieu Desnoyers
    Cc: Randy Dunlap
    Cc: Ross Zwisler
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • DAX is a replacement for the variation of XIP currently supported by the
    ext2 filesystem. We have three different things in the tree called 'XIP',
    and the new focus is on access to data rather than executables, so a name
    change was in order. DAX stands for Direct Access. The X is for
    eXciting.

    The new focus on data access has resulted in more careful attention to
    races that exist in the current XIP code, but are not hit by the use-case
    that it was designed for. XIP's architecture worked fine for ext2, but
    DAX is architected to work with modern filesystems such as ext4 and XFS.
    DAX is not intended for use with btrfs; the value that btrfs adds relies
    on manipulating data and writing data to different locations, while DAX's
    value is for write-in-place and keeping the kernel from touching the data.

    DAX was developed in order to support NV-DIMMs, but it's become clear that
    its usefulness extends beyond NV-DIMMs and there are several potential
    customers including the tracing machinery. Other people want to place the
    kernel log in an area of memory, as long as they have a BIOS that does not
    clear DRAM on reboot.

    Patch 1 is a bug fix, probably worth including in 3.18.

    Patches 2 & 3 are infrastructure for DAX.

    Patches 4-8 replace the XIP code with its DAX equivalents, transforming
    ext2 to use the DAX code as we go. Note that patch 10 is the
    Documentation patch.

    Patches 9-15 clean up after the XIP code, removing the infrastructure
    that is no longer needed and renaming various XIP things to DAX.
    Most of these patches were added after Jan found things he didn't
    like in an earlier version of the ext4 patch ... that had been copied
    from ext2. So ext2 is being transformed to do things the same way that
    ext4 will later. The ability to mount ext2 filesystems with the 'xip'
    option is retained, although the 'dax' option is now preferred.

    Patch 16 adds some DAX infrastructure to support ext4.

    Patch 17 adds DAX support to ext4. It is broadly similar to ext2's DAX
    support, but it is more efficient than ext2's due to its support for
    unwritten extents.

    Patch 18 is another cleanup patch renaming XIP to DAX.

    My thanks to Mathieu Desnoyers for his reviews of the v11 patchset. Most
    of the changes below were based on his feedback.

    This patch (of 18):

    Pagecache faults recheck i_size after taking the page lock to ensure that
    the fault didn't race against a truncate. We don't have a page to lock in
    the XIP case, so use i_mmap_lock_read() instead. It is locked in the
    truncate path in unmap_mapping_range() after updating i_size. So while we
    hold it in the fault path, we are guaranteed that either i_size has
    already been updated in the truncate path, or that the truncate will
    subsequently call zap_page_range_single() and so remove the mapping we
    have just inserted.

    There is a window of time in which i_size has been reduced and the thread
    has a mapping to a page which will be removed from the file, but this is
    harmless as the page will not be allocated to a different purpose before
    the thread's access to it is revoked.
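
    A sketch of the recheck pattern being described (illustrative;
    PAGE_CACHE_SIZE was the page-cache unit of that era, PAGE_SIZE is used
    here for brevity):

    i_mmap_lock_read(mapping);
    /* i_size is updated before unmap_mapping_range() takes this lock in
     * the truncate path, so the check below cannot miss a truncate. */
    size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
    if (vmf->pgoff >= size) {
            i_mmap_unlock_read(mapping);
            return VM_FAULT_SIGBUS;
    }
    /* ... insert the PFN while still holding i_mmap_lock_read ... */
    i_mmap_unlock_read(mapping);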

    [akpm@linux-foundation.org: switch to i_mmap_lock_read(), add comment in unmap_single_vma()]
    Signed-off-by: Matthew Wilcox
    Reviewed-by: Jan Kara
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Mathieu Desnoyers
    Cc: Andreas Dilger
    Cc: Boaz Harrosh
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jens Axboe
    Cc: Randy Dunlap
    Cc: Ross Zwisler
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

13 Feb, 2015

5 commits

    For whatever reason, generic_access_phys() only remaps one page, but
    actually allows access of arbitrary size. It's quite easy to trigger
    large reads, like printing out a large structure with gdb, which leads
    to a crash. Fix it by remapping the correct size.
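
    The fix amounts to mapping enough for the whole access instead of a
    single page; a sketch of the relevant line in generic_access_phys()
    (surrounding variables as in the mainline function):

    int offset = addr & (PAGE_SIZE - 1);

    /* was: maddr = ioremap_prot(phys_addr, PAGE_SIZE, prot); */
    maddr = ioremap_prot(phys_addr, PAGE_ALIGN(len + offset), prot);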

    Fixes: 28b2ee20c7cb ("access_process_vm device memory infrastructure")
    Signed-off-by: Grazvydas Ignotas
    Cc: Rik van Riel
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Grazvydas Ignotas
     
  • pte_protnone_numa is only safe to use after VMA checks for PROT_NONE are
    complete. Treating a real PROT_NONE PTE as a NUMA hinting fault is going
    to result in strangeness so add a check for it. BUG_ON looks like
    overkill but if this is hit then it's a serious bug that could result in
    corruption so do not even try recovering. It would have been more
    comprehensive to check VMA flags in pte_protnone_numa but it would have
    made the API ugly just for a debugging check.

    Signed-off-by: Mel Gorman
    Cc: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Dave Jones
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Kirill Shutemov
    Cc: Linus Torvalds
    Cc: Paul Mackerras
    Cc: Rik van Riel
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Faults on the huge zero page are pointless and there is a BUG_ON to catch
    them during fault time. This patch reintroduces a check that avoids
    marking the zero page PAGE_NONE.

    Signed-off-by: Mel Gorman
    Cc: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Dave Jones
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Kirill Shutemov
    Cc: Linus Torvalds
    Cc: Paul Mackerras
    Cc: Rik van Riel
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • With PROT_NONE, the traditional page table manipulation functions are
    sufficient.

    [andre.przywara@arm.com: fix compiler warning in pmdp_invalidate()]
    [akpm@linux-foundation.org: fix build with STRICT_MM_TYPECHECKS]
    Signed-off-by: Mel Gorman
    Acked-by: Linus Torvalds
    Acked-by: Aneesh Kumar
    Tested-by: Sasha Levin
    Cc: Benjamin Herrenschmidt
    Cc: Dave Jones
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Kirill Shutemov
    Cc: Paul Mackerras
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Convert existing users of pte_numa and friends to the new helper. Note
    that the kernel is broken after this patch is applied until the other page
    table modifiers are also altered. This patch layout is to make review
    easier.

    Signed-off-by: Mel Gorman
    Acked-by: Linus Torvalds
    Acked-by: Aneesh Kumar
    Acked-by: Benjamin Herrenschmidt
    Tested-by: Sasha Levin
    Cc: Dave Jones
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Kirill Shutemov
    Cc: Paul Mackerras
    Cc: Rik van Riel
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

12 Feb, 2015

1 commit

    Dave noticed that an unprivileged process can allocate a significant
    amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by the
    oom-killer and memory cgroup. The trick is to allocate a lot of PMD
    page tables. The Linux kernel doesn't account PMD tables to the
    process, only PTE tables.

    The use-cases below use a few tricks to allocate a lot of PMD page
    tables while keeping VmRSS and VmPTE low. oom_score for the process
    will be 0.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/prctl.h>

    #define PUD_SIZE (1UL << 30)
    #define PMD_SIZE (1UL << 21)

    #define NR_PUD 130000

    int main(void)
    {
            char *addr = NULL;
            unsigned long i;

            prctl(PR_SET_THP_DISABLE);
            for (i = 0; i < NR_PUD ; i++) {
                    addr = mmap(addr + PUD_SIZE, PUD_SIZE,
                                PROT_WRITE|PROT_READ,
                                MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
                    if (addr == MAP_FAILED) {
                            perror("mmap");
                            break;
                    }
                    *addr = 'x';
                    munmap(addr, PMD_SIZE);
                    mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
                         MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0);
                    if (addr == MAP_FAILED)
                            perror("re-mmap"), exit(1);
            }
            printf("PID %d consumed %lu KiB in PMD page tables\n",
                   getpid(), i * 4096 >> 10);
            return pause();
    }

    The patch addresses the issue by accounting PMD tables to the process
    the same way we account PTE tables.

    The main places where PMD tables are accounted are __pmd_alloc() and
    free_pmd_range(). But there are a few corner cases:

    - HugeTLB can share PMD page tables. The patch handles this by
    accounting the table to all processes that share it.

    - x86 PAE pre-allocates a few PMD tables on fork.

    - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust the
    sanity check on exit(2).

    Accounting only happens on configurations where the PMD page table
    level is present (PMD is not folded). As with nr_ptes we use a per-mm
    counter. The counter value is used to calculate the baseline for the
    badness score by the oom-killer.
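
    A sketch of where the accounting happens (simplified from
    __pmd_alloc(); mm_inc_nr_pmds()/mm_dec_nr_pmds() are the helpers this
    patch introduces, mirroring nr_ptes):

    spin_lock(&mm->page_table_lock);
    if (!pud_present(*pud)) {
            mm_inc_nr_pmds(mm);             /* account the new PMD table */
            pud_populate(mm, pud, new);
    } else                                  /* lost the race, no account */
            pmd_free(mm, new);
    spin_unlock(&mm->page_table_lock);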

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Dave Hansen
    Cc: Hugh Dickins
    Reviewed-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: David Rientjes
    Tested-by: Sedat Dilek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

11 Feb, 2015

7 commits

  • Merge misc updates from Andrew Morton:
    "Bite-sized chunks this time, to avoid the MTA ratelimiting woes.

    - fs/notify updates

    - ocfs2

    - some of MM"

    That laconic "some MM" is mainly the removal of remap_file_pages(),
    which is a big simplification of the VM, and which gets rid of a *lot*
    of random cruft and special cases because we no longer support the
    non-linear mappings that it used.

    From a user interface perspective, nothing has changed, because the
    remap_file_pages() syscall still exists, it's just done by emulating the
    old behavior by creating a lot of individual small mappings instead of
    one non-linear one.

    The emulation is slower than the old "native" non-linear mappings, but
    nobody really uses or cares about remap_file_pages(), and simplifying
    the VM is a big advantage.

    * emailed patches from Andrew Morton : (78 commits)
    memcg: zap memcg_slab_caches and memcg_slab_mutex
    memcg: zap memcg_name argument of memcg_create_kmem_cache
    memcg: zap __memcg_{charge,uncharge}_slab
    mm/page_alloc.c: place zone_id check before VM_BUG_ON_PAGE check
    mm: hugetlb: fix type of hugetlb_treat_as_movable variable
    mm, hugetlb: remove unnecessary lower bound on sysctl handlers"?
    mm: memory: merge shared-writable dirtying branches in do_wp_page()
    mm: memory: remove ->vm_file check on shared writable vmas
    xtensa: drop _PAGE_FILE and pte_file()-related helpers
    x86: drop _PAGE_FILE and pte_file()-related helpers
    unicore32: drop pte_file()-related helpers
    um: drop _PAGE_FILE and pte_file()-related helpers
    tile: drop pte_file()-related helpers
    sparc: drop pte_file()-related helpers
    sh: drop _PAGE_FILE and pte_file()-related helpers
    score: drop _PAGE_FILE and pte_file()-related helpers
    s390: drop pte_file()-related helpers
    parisc: drop _PAGE_FILE and pte_file()-related helpers
    openrisc: drop _PAGE_FILE and pte_file()-related helpers
    nios2: drop _PAGE_FILE and pte_file()-related helpers
    ...

    Linus Torvalds
     
  • Whether there is a vm_ops->page_mkwrite or not, the page dirtying is
    pretty much the same. Make sure the page references are the same in both
    cases, then merge the two branches.

    It's tempting to go even further and page-lock the !page_mkwrite case, to
    get it in line with everybody else setting the page table and thus further
    simplify the model. But that's not quite compelling enough to justify
    dropping the pte lock, then relocking and verifying the entry for
    filesystems without ->page_mkwrite, which notably includes tmpfs. Leave
    it for now and lock the page late in the !page_mkwrite case.

    Signed-off-by: Johannes Weiner
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Shared anonymous mmaps are implemented with shmem files, so all VMAs with
    shared writable semantics also have an underlying backing file.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Jan Kara
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • One bit in ->vm_flags is unused now!

    Signed-off-by: Kirill A. Shutemov
    Cc: Dan Carpenter
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We don't create non-linear mappings anymore. Let's drop code which
    handles them on page fault.

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    We have had remap_file_pages(2) emulation in the -mm tree for a few
    release cycles and we plan to have it mainline in v3.20. This patchset
    removes the rest of the VM_NONLINEAR infrastructure.

    Patches 1-8 take care of generic code. They are pretty straightforward
    and can be applied without the other patches.

    The rest of the patches remove pte_file()-related stuff from
    architecture-specific code. It usually frees up one bit in the
    non-present pte. I've tried to reuse that bit for the swap offset where
    I was able to figure out how to do that.

    For obvious reasons I cannot test all that arch-specific code and would
    like to see acks from maintainers.

    In total, remap_file_pages(2) required about 1.4K lines of
    not-so-trivial kernel code. That's too much for functionality nobody
    uses.

    Tested-by: Felipe Balbi

    This patch (of 38):

    We don't create non-linear mappings anymore. Let's drop code which
    handles them on unmap/zap.

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Pull xen features and fixes from David Vrabel:

    - Reworked handling for foreign (grant mapped) pages to simplify the
    code, enable a number of additional use cases and fix a number of
    long-standing bugs.

    - Prefer the TSC over the Xen PV clock when dom0 (and the TSC is
    stable).

    - Assorted other cleanup and minor bug fixes.

    * tag 'stable/for-linus-3.20-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (25 commits)
    xen/manage: Fix USB interaction issues when resuming
    xenbus: Add proper handling of XS_ERROR from Xenbus for transactions.
    xen/gntdev: provide find_special_page VMA operation
    xen/gntdev: mark userspace PTEs as special on x86 PV guests
    xen-blkback: safely unmap grants in case they are still in use
    xen/gntdev: safely unmap grants in case they are still in use
    xen/gntdev: convert priv->lock to a mutex
    xen/grant-table: add a mechanism to safely unmap pages that are in use
    xen-netback: use foreign page information from the pages themselves
    xen: mark grant mapped pages as foreign
    xen/grant-table: add helpers for allocating pages
    x86/xen: require ballooned pages for grant maps
    xen: remove scratch frames for ballooned pages and m2p override
    xen/grant-table: pre-populate kernel unmap ops for xen_gnttab_unmap_refs()
    mm: add 'foreign' alias for the 'pinned' page flag
    mm: provide a find_special_page vma operation
    x86/xen: cleanup arch/x86/xen/mmu.c
    x86/xen: add some __init annotations in arch/x86/xen/mmu.c
    x86/xen: add some __init and static annotations in arch/x86/xen/setup.c
    x86/xen: use correct types for addresses in arch/x86/xen/setup.c
    ...

    Linus Torvalds