25 Jun, 2015
1 commit
-
The simple check for a zero-length memory mapping may be performed
earlier, so that in the zero-length case the unnecessary code is not
executed at all. This does not make the code less readable, and it
saves some CPU cycles (see the sketch below).
Signed-off-by: Piotr Kwapulinski
Acked-by: Michal Hocko
Acked-by: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
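A minimal sketch of the idea (editor's illustration, not the actual
hunk; do_mmap_pgoff() is where the mapping setup lives in mm/mmap.c):

        unsigned long do_mmap_pgoff_sketch(struct file *file, unsigned long addr,
                                           unsigned long len, unsigned long prot,
                                           unsigned long flags, unsigned long pgoff)
        {
                /* moved up: nothing below needs to run for a zero-length request */
                if (!len)
                        return -EINVAL;
                /* ... PAGE_ALIGN(len), flag sanitizing, get_unmapped_area() ... */
        }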
16 Apr, 2015
2 commits
-
The creators of the C language gave us the while keyword. Let's use
that instead of synthesizing it from if+goto (see the sketch below).
Made possible by 6597d783397a ("mm/mmap.c: replace find_vma_prepare()
with clearer find_vma_links()").
[akpm@linux-foundation.org: fix 80-col overflows]
Signed-off-by: Rasmus Villemoes
Cc: "Kirill A. Shutemov"
Cc: Sasha Levin
Cc: Cyrill Gorcunov
Cc: Roman Gushchin
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
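An editor's reconstruction of the before/after shape (e.g. in
mmap_region(); simplified, not necessarily the exact hunk):

        /* before: a while loop synthesized from if+goto */
        munmap_back:
                if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent)) {
                        if (do_munmap(mm, addr, len))
                                return -ENOMEM;
                        goto munmap_back;
                }

        /* after: the same control flow, stated directly */
                while (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent)) {
                        if (do_munmap(mm, addr, len))
                                return -ENOMEM;
                }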
-
We converted some of the usages of ACCESS_ONCE to READ_ONCE in the mm/
tree since it doesn't work reliably on non-scalar types.
This patch removes the rest of the usages of ACCESS_ONCE and uses the
new READ_ONCE API for the read accesses. This makes things cleaner,
instead of using separate/multiple sets of APIs (example below).
Signed-off-by: Jason Low
Acked-by: Michal Hocko
Acked-by: Davidlohr Bueso
Acked-by: Rik van Riel
Reviewed-by: Christian Borntraeger
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
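An example of the mechanical conversion (field chosen for
illustration):

        /* before: ACCESS_ONCE is only reliable for scalar types */
        seq = ACCESS_ONCE(mm->vmacache_seqnum);
        /* after */
        seq = READ_ONCE(mm->vmacache_seqnum);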
15 Apr, 2015
1 commit
-
__mlock_vma_pages_range() doesn't necessarily mlock pages. It depends
on vma flags. The same codepath is used for MAP_POPULATE.
Let's rename __mlock_vma_pages_range() to populate_vma_page_range().
This patch also drops mlock_vma_pages_range() references from the
documentation. It has gone in cea10a19b797 ("mm: directly use
__mlock_vma_pages_range() in find_extend_vma()").
Signed-off-by: Kirill A. Shutemov
Acked-by: Linus Torvalds
Acked-by: David Rientjes
Cc: Michel Lespinasse
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
26 Mar, 2015
1 commit
-
I have constantly stumbled upon "kernel BUG at mm/rmap.c:399!" after
upgrading to 3.19 and had no luck with 4.0-rc1 either.
So, after looking into the new logic introduced by commit 7a3ef208e662
("mm: prevent endless growth of anon_vma hierarchy"), I found that
chances are unlink_anon_vmas() is called without incrementing
dst->anon_vma->degree in anon_vma_clone() due to allocation failure.
If dst->anon_vma is not NULL in the error path, its degree will be
incorrectly decremented in unlink_anon_vmas() and eventually underflow
when exiting, as a result of another call to unlink_anon_vmas().
That's how "kernel BUG at mm/rmap.c:399!" is triggered for me.
This patch fixes the underflow by dropping dst->anon_vma when the
allocation fails. It's safe to do so regardless of the original value
of dst->anon_vma, because dst->anon_vma doesn't have a valid meaning
if anon_vma_clone() fails. Besides, callers don't care about
dst->anon_vma in that case either (see the sketch below).
Also suggested by Michal Hocko, we can clean up vma_adjust() a bit, as
anon_vma_clone() now does the work.
[akpm@linux-foundation.org: tweak comment]
Fixes: 7a3ef208e662 ("mm: prevent endless growth of anon_vma hierarchy")
Signed-off-by: Leon Yu
Signed-off-by: Konstantin Khlebnikov
Reviewed-by: Michal Hocko
Acked-by: Rik van Riel
Acked-by: David Rientjes
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
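The shape of the fix as described (editor's sketch, simplified from
mm/rmap.c):

        int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
        {
                /* ... allocation loop that can fail with -ENOMEM ... */
         enomem_failure:
                /*
                 * dst->anon_vma is dropped here: it has no valid meaning when
                 * the clone fails, and leaving it set would make
                 * unlink_anon_vmas() decrement a degree that was never
                 * incremented.
                 */
                dst->anon_vma = NULL;
                unlink_anon_vmas(dst);
                return -ENOMEM;
        }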
12 Feb, 2015
3 commits
-
I noticed that "allowed" can easily overflow by falling below 0,
because (total_vm / 32) can be larger than "allowed". The problem
occurs in OVERCOMMIT_NEVER mode.
In this case a huge allocation can succeed and overcommit the system
(despite OVERCOMMIT_NEVER mode). All subsequent allocations will fail
(system-wide), so the system becomes unusable.
The problem was masked out by commit c9b1d0981fcc ("mm: limit growth
of 3% hardcoded other user reserve"), but it's easy to reproduce on
older kernels:
1) set the overcommit_memory sysctl to 2
2) mmap() a large file multiple times (with the VM_SHARED flag)
3) try to malloc() a large amount of memory
It can also be reproduced on newer kernels, but a misconfigured
sysctl_user_reserve_kbytes is required.
Fix this issue by switching to signed arithmetic here (the
demonstration below shows the wraparound).
[akpm@linux-foundation.org: use min_t]
Signed-off-by: Roman Gushchin
Cc: Andrew Shewmaker
Cc: Rik van Riel
Cc: Konstantin Khlebnikov
Reviewed-by: Michal Hocko
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
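A self-contained illustration of the wraparound (editor's example with
made-up numbers; not the kernel's __vm_enough_memory() code):

        #include <stdio.h>

        int main(void)
        {
                /* hypothetical values: the reserve (total_vm / 32) exceeds allowed */
                unsigned long allowed = 100;
                unsigned long reserve = 130;

                allowed -= reserve;     /* unsigned: wraps to a huge number */
                printf("unsigned: %lu\n", allowed);

                long s_allowed = 100;
                s_allowed -= 130;       /* signed: goes negative, the check works */
                printf("signed: %ld\n", s_allowed);
                return 0;
        }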
-
The problem is that we check nr_ptes/nr_pmds in exit_mmap(), which
happens *before* pgd_free(). And if an arch does pte/pmd allocation in
pgd_alloc() and frees them in pgd_free(), we see an offset in the
counters by the time of the checks.
We tried to work around this by offsetting the expected counter value
according to FIRST_USER_ADDRESS for both nr_ptes and nr_pmds in
exit_mmap(). But it doesn't work in some cases:
1. ARM with LPAE enabled also has a non-zero USER_PGTABLES_CEILING,
   but the upper addresses are occupied with huge pmd entries, so the
   trick with offsetting the expected counter value would get really
   ugly: we would have to apply it to nr_pmds, but not to nr_ptes.
2. Metag has a non-zero FIRST_USER_ADDRESS, but doesn't do pte/pmd
   page table allocation in pgd_alloc(); it just sets up a pgd entry
   which is allocated at boot and shared across all processes.
The proposal is to move the check to check_mm(), which happens *after*
pgd_free(), and to do proper accounting during pgd_alloc() and
pgd_free(), which would bring the counters to zero if nothing leaked
(see the sketch below).
Signed-off-by: Kirill A. Shutemov
Reported-by: Tyler Baker
Tested-by: Tyler Baker
Tested-by: Nishanth Menon
Cc: Russell King
Cc: James Hogan
Cc: Guan Xuetao
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
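A sketch of the moved check (editor's simplification; check_mm() lives
in kernel/fork.c):

        static void check_mm(struct mm_struct *mm)
        {
                /* runs after pgd_free(): the counters must be exactly zero now */
                if (atomic_long_read(&mm->nr_ptes))
                        pr_alert("BUG: non-zero nr_ptes on freeing mm: %ld\n",
                                 atomic_long_read(&mm->nr_ptes));
                if (mm_nr_pmds(mm))
                        pr_alert("BUG: non-zero nr_pmds on freeing mm: %ld\n",
                                 mm_nr_pmds(mm));
        }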
-
Dave noticed that an unprivileged process can allocate a significant
amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by the
oom-killer and memory cgroup. The trick is to allocate a lot of PMD
page tables. The Linux kernel doesn't account PMD tables to the
process, only PTE tables.
The test case below uses a few tricks to allocate a lot of PMD page
tables while keeping VmRSS and VmPTE low. oom_score for the process
will be 0.

        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>
        #include <sys/mman.h>
        #include <sys/prctl.h>

        #define PUD_SIZE (1UL << 30)
        #define PMD_SIZE (1UL << 21)

        #define NR_PUD 130000

        int main(void)
        {
                char *addr = NULL;
                unsigned long i;

                prctl(PR_SET_THP_DISABLE);
                for (i = 0; i < NR_PUD ; i++) {
                        addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
                                        MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
                        if (addr == MAP_FAILED) {
                                perror("mmap");
                                break;
                        }
                        *addr = 'x';
                        munmap(addr, PMD_SIZE);
                        mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
                                        MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0);
                        if (addr == MAP_FAILED)
                                perror("re-mmap"), exit(1);
                }
                printf("PID %d consumed %lu KiB in PMD page tables\n",
                                getpid(), i * 4096 >> 10);
                return pause();
        }

The patch addresses the issue by accounting PMD tables to the process
in the same way we account PTE tables.
The main places where PMD tables are accounted are __pmd_alloc() and
free_pmd_range(). But there are a few corner cases:
- HugeTLB can share PMD page tables. The patch handles this by
  accounting the table to all processes who share it.
- x86 PAE pre-allocates a few PMD tables on fork.
- Architectures with FIRST_USER_ADDRESS > 0. We need to adjust the
  sanity check on exit(2).
Accounting only happens on configurations where the PMD page table
level is present (PMD is not folded). As with nr_ptes, we use a per-mm
counter. The counter value is used to calculate the baseline for the
badness score by the oom-killer.
Signed-off-by: Kirill A. Shutemov
Reported-by: Dave Hansen
Cc: Hugh Dickins
Reviewed-by: Cyrill Gorcunov
Cc: Pavel Emelyanov
Cc: David Rientjes
Tested-by: Sedat Dilek
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
11 Feb, 2015
2 commits
-
We don't create non-linear mappings anymore. Let's drop the code which
handles them in rmap.
Signed-off-by: Kirill A. Shutemov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
remap_file_pages(2) was invented to be able to efficiently map parts
of a huge file into a limited 32-bit virtual address space, such as in
database workloads.
Nonlinear mappings are a pain to support, and it seems there are no
legitimate use-cases nowadays, since 64-bit systems are widely
available.
Let's drop them and get rid of all this special-cased code.
The patch replaces the syscall with emulation which creates a new VMA
on each remap_file_pages(), unless it can be merged with an adjacent
one.
I didn't find *any* real code that uses remap_file_pages(2) to test
the emulation impact on. I've checked Debian code search and the
source of all packages in ALT Linux. No real users: libc wrappers,
mentions in strace, gdb, valgrind and that kind of stuff.
There are a few basic tests in LTP for the syscall. They work just
fine with the emulation.
To test the performance impact, I've written a small test case which
demonstrates pretty much the worst case scenario: map a 4G shmfs file,
write the pgoff of each page to the beginning of that page, remap the
pages in reverse order, then read every page.
The test creates 1 million VMAs if the emulation is in use, so I had
to set vm.max_map_count to 1100000 to avoid -ENOMEM.
Before: 23.3 ( +- 4.31% ) seconds
After: 43.9 ( +- 0.85% ) seconds
Slowdown: 1.88x
I believe we can live with that.
Test case:

        #define _GNU_SOURCE
        #include <assert.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/mman.h>

        #define MB (1024UL * 1024)
        #define SIZE (4096 * MB)

        int main(int argc, char **argv)
        {
                unsigned long *p;
                long i, pass;

                for (pass = 0; pass < 10; pass++) {
                        p = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
                                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
                        if (p == MAP_FAILED) {
                                perror("mmap");
                                return -1;
                        }

                        for (i = 0; i < SIZE / 4096; i++)
                                p[i * 4096 / sizeof(*p)] = i;

                        for (i = 0; i < SIZE / 4096; i++) {
                                if (remap_file_pages(p + i * 4096 / sizeof(*p), 4096,
                                                0, (SIZE - 4096 * (i + 1)) >> 12, 0)) {
                                        perror("remap_file_pages");
                                        return -1;
                                }
                        }

                        for (i = SIZE / 4096 - 1; i >= 0; i--)
                                assert(p[i * 4096 / sizeof(*p)] == SIZE / 4096 - i - 1);

                        munmap(p, SIZE);
                }

                return 0;
        }

[akpm@linux-foundation.org: fix spello]
[sasha.levin@oracle.com: initialize populate before usage]
[sasha.levin@oracle.com: grab file ref to prevent race while mmaping]
Signed-off-by: "Kirill A. Shutemov"
Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Dave Jones
Cc: Linus Torvalds
Cc: Armin Rigo
Signed-off-by: Sasha Levin
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
12 Jan, 2015
2 commits
-
Fix for BUG_ON(anon_vma->degree) splashes in unlink_anon_vmas()
("kernel BUG at mm/rmap.c:399!") caused by commit 7a3ef208e662 ("mm:
prevent endless growth of anon_vma hierarchy").
anon_vma_clone() is usually called for a copy of the source vma in the
destination argument. If the source vma has an anon_vma, it should
already be in dst->anon_vma. NULL in dst->anon_vma is used as a sign
that it's called from anon_vma_fork(). In this case anon_vma_clone()
finds an anon_vma for reusing.
vma_adjust() calls it differently, and this breaks the anon_vma
reusing logic: anon_vma_clone() links the vma to the old anon_vma and
updates the degree counters, but vma_adjust() overrides vma->anon_vma
right after that. As a result, the final unlink_anon_vmas() decrements
the degree for the wrong anon_vma.
This patch assigns ->anon_vma before calling anon_vma_clone() (see the
sketch below).
Signed-off-by: Konstantin Khlebnikov
Reported-and-tested-by: Chris Clayton
Reported-and-tested-by: Oded Gabbay
Reported-and-tested-by: Chih-Wei Huang
Acked-by: Rik van Riel
Acked-by: Vlastimil Babka
Cc: Daniel Forrest
Cc: Michal Hocko
Cc: stable@vger.kernel.org # to match back-porting of 7a3ef208e662
Signed-off-by: Linus Torvalds
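The shape of the fix (editor's sketch of the vma_adjust() path,
simplified):

        if (exporter && exporter->anon_vma && !importer->anon_vma) {
                int error;

                /*
                 * Set the anon_vma up front so anon_vma_clone() does not
                 * mistake this call for anon_vma_fork() and try to reuse
                 * some other anon_vma.
                 */
                importer->anon_vma = exporter->anon_vma;
                error = anon_vma_clone(importer, exporter);
                if (error)
                        return error;
        }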
-
Commit fee7e49d4514 ("mm: propagate error from stack expansion even
for guard page") made sure that we return the error properly for stack
growth conditions. It also theorized that counting the guard page
towards the stack limit might break something, but also said "Let's
see if anybody notices".
Somebody did notice. Apparently android-x86 sets the stack limit very
close to the limit indeed, and including the guard page in the rlimit
check causes the android 'zygote' process problems.
So this adds the (fairly trivial) code to make the stack rlimit check
be against the actual stack size, rather than the size of the vma that
includes the guard page (see the sketch below).
Reported-and-tested-by: Chih-Wei Huang
Cc: Jay Foad
Cc: stable@kernel.org # to match back-porting of fee7e49d4514
Signed-off-by: Linus Torvalds
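A sketch of the check (editor's simplification of the
acct_stack_growth() path in mm/mmap.c):

        /* the guard page is part of the vma, but not of the stack proper */
        unsigned long actual_size = size;

        if (size && (vma->vm_flags & (VM_GROWSUP | VM_GROWSDOWN)))
                actual_size -= PAGE_SIZE;
        if (actual_size > ACCESS_ONCE(rlim[RLIMIT_STACK].rlim_cur))
                return -ENOMEM;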
14 Dec, 2014
3 commits
-
This lets drivers like the AMD IOMMUv2 driver handle faults a bit more
simply, rather than doing tricks with page refs and get_user_pages().
Signed-off-by: Jesse Barnes
Cc: Oded Gabbay
Cc: Joerg Roedel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
The i_mmap_mutex is a close cousin of the anon vma lock, both
protecting similar data, one for file backed pages and the other for
anon memory. To this end, this lock can also be a rwsem. In addition,
there are some important opportunities to share the lock when there
are no tree modifications.
This conversion is straightforward. For now, all users take the write
lock.
[sfr@canb.auug.org.au: update fremap.c]
Signed-off-by: Davidlohr Bueso
Reviewed-by: Rik van Riel
Acked-by: "Kirill A. Shutemov"
Acked-by: Hugh Dickins
Cc: Oleg Nesterov
Acked-by: Peter Zijlstra (Intel)
Cc: Srikar Dronamraju
Acked-by: Mel Gorman
Signed-off-by: Stephen Rothwell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Convert all open coded mutex_lock/unlock calls to the
i_mmap_[lock/unlock]_write() helpers.Signed-off-by: Davidlohr Bueso
Acked-by: Rik van Riel
Acked-by: "Kirill A. Shutemov"
Acked-by: Hugh Dickins
Cc: Oleg Nesterov
Acked-by: Peter Zijlstra (Intel)
Cc: Srikar Dronamraju
Acked-by: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
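A typical conversion (call site chosen for illustration):

        /* before */
        mutex_lock(&mapping->i_mmap_mutex);
        vma_interval_tree_insert(vma, &mapping->i_mmap);
        mutex_unlock(&mapping->i_mmap_mutex);

        /* after */
        i_mmap_lock_write(mapping);
        vma_interval_tree_insert(vma, &mapping->i_mmap);
        i_mmap_unlock_write(mapping);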
11 Dec, 2014
1 commit
-
Pull x86 MPX support from Thomas Gleixner:
"This enables support for x86 MPX.
MPX is a new debug feature for bound checking in user space. It
requires kernel support to handle the bound tables and decode the
bound violating instruction in the trap handler"
* 'x86-mpx-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
asm-generic: Remove asm-generic arch_bprm_mm_init()
mm: Make arch_unmap()/bprm_mm_init() available to all architectures
x86: Cleanly separate use of asm-generic/mm_hooks.h
x86 mpx: Change return type of get_reg_offset()
fs: Do not include mpx.h in exec.c
x86, mpx: Add documentation on Intel MPX
x86, mpx: Cleanup unused bound tables
x86, mpx: On-demand kernel allocation of bounds tables
x86, mpx: Decode MPX instruction to get bound violation information
x86, mpx: Add MPX-specific mmap interface
x86, mpx: Introduce VM_MPX to indicate that a VMA is MPX specific
x86, mpx: Add MPX to disabled features
ia64: Sync struct siginfo with general version
mips: Sync struct siginfo with general version
mpx: Extend siginfo structure to include bound violation information
x86, mpx: Rename cfg_reg_u and status_reg
x86: mpx: Give bndX registers actual names
x86: Remove arbitrary instruction size limit in instruction decoder
04 Dec, 2014
1 commit
-
Andrew Morton noticed that the error return from anon_vma_clone() was
being dropped and replaced with -ENOMEM (which is not itself a bug,
because the only error return value from anon_vma_clone() is -ENOMEM).
I did an audit of the callers of anon_vma_clone() and discovered an
actual bug where the error return was being lost. In __split_vma(),
between Linux 3.11 and 3.12 the code was changed so the err variable
is used before the call to anon_vma_clone() and the default initial
value of -ENOMEM is overwritten. So a failure of anon_vma_clone() will
return success since err at this point is now zero.
Below is a patch which fixes this bug and also propagates the error
return value from anon_vma_clone() in all cases (see the sketch
below).
Fixes: ef0855d334e1 ("mm: mempolicy: turn vma_set_policy() into vma_dup_policy()")
Signed-off-by: Daniel Forrest
Reviewed-by: Michal Hocko
Cc: Konstantin Khlebnikov
Cc: Andrea Arcangeli
Cc: Rik van Riel
Cc: Tim Hartrick
Cc: Hugh Dickins
Cc: Michel Lespinasse
Cc: Vlastimil Babka
Cc: [3.12+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
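The shape of the __split_vma() fix (editor's sketch, simplified):

        /* before: err may already be 0 here, so a clone failure is
         * reported as success */
        if (anon_vma_clone(new, vma))
                goto out_free_mpol;

        /* after: keep and propagate the actual error */
        err = anon_vma_clone(new, vma);
        if (err)
                goto out_free_mpol;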
18 Nov, 2014
1 commit
-
The previous patch allocates bounds tables on-demand. As noted in
an earlier description, these can add up to *HUGE* amounts of
memory. This has caused OOMs in practice when running tests.
This patch adds support for freeing bounds tables when they are no
longer in use.
There are two types of mappings in play when unmapping tables:
1. The mapping with the actual data, which userspace is
   munmap()ing or brk()ing away, etc...
2. The mapping for the bounds table *backing* the data
   (it is tagged with VM_MPX, see the patch "add MPX specific
   mmap interface").
If userspace uses the prctl() introduced earlier in this patchset
to enable kernel management of bounds tables, then when it
unmaps the first type of mapping with the actual data, the kernel
needs to free the mapping for the bounds table backing the data.
This patch hooks in at the very end of do_munmap() to do so.
We look at the addresses being unmapped and find the bounds
directory entries and tables which cover those addresses. If
an entire table is unused, we clear the associated directory entry
and free the table.
Once we unmap the bounds table, we would have a bounds directory
entry pointing at empty address space. That address space might
now be allocated for some other (random) use, and the MPX
hardware might now try to walk it as if it were a bounds table.
That would be bad. So any unmapping of an entire bounds table
has to be accompanied by a corresponding write to the bounds
directory entry to invalidate it. That write to the bounds
directory can fault, which causes the following problem:
Since we are doing the freeing from munmap() (and other paths
like it), we hold mmap_sem for write. If we fault, the page
fault handler will attempt to acquire mmap_sem for read and
we will deadlock. To avoid the deadlock, we pagefault_disable()
when touching the bounds directory entry and use
get_user_pages() to resolve the fault (see the sketch below).
The unmapping of bounds tables happens under vm_munmap(). We
also (indirectly) call vm_munmap() to _do_ the unmapping of the
bounds tables. We avoid unbounded recursion by disallowing
freeing of bounds tables *for* bounds tables. This would not
occur normally, so it should not have any practical impact. Being
strict about it here helps ensure that we do not have an
exploitable stack overflow.
Based-on-patch-by: Qiaowei Ren
Signed-off-by: Dave Hansen
Cc: linux-mm@kvack.org
Cc: linux-mips@linux-mips.org
Cc: Dave Hansen
Link: http://lkml.kernel.org/r/20141114151831.E4531C4A@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner
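A sketch of the fault-resolution pattern described above (editor's
illustration; get_user_pages() arguments follow the 3.18-era
signature, and bd_entry is a hypothetical user pointer into the
bounds directory):

        /*
         * mmap_sem is held for write: we may not take a page fault.
         * Poke the bounds directory entry with faults disabled and
         * resolve any miss by hand, then retry.
         */
        for (;;) {
                pagefault_disable();
                ret = __put_user(0, bd_entry);  /* invalidate the entry */
                pagefault_enable();
                if (!ret)
                        break;
                if (get_user_pages(current, mm, (unsigned long)bd_entry,
                                   1, 1, 0, NULL, NULL) <= 0)
                        return -EFAULT;         /* genuinely bad address */
        }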
30 Oct, 2014
1 commit
-
If an anonymous mapping is not allowed to fault thp memory and then
madvise(MADV_HUGEPAGE) is used after fault, khugepaged will never
collapse this memory into thp memory.
This occurs because the madvise(2) handler for thp,
hugepage_madvise(), clears VM_NOHUGEPAGE on the stack and it isn't
stored in vma->vm_flags until the final action of madvise_behavior().
This causes khugepaged_enter_vma_merge() to be a no-op in
hugepage_madvise() when the vma had previously had VM_NOHUGEPAGE set.
Fix this by passing the correct vma flags to the khugepaged mm slot
handler (see the sketch below). There's no chance khugepaged can run
on this vma until after madvise_behavior() returns, since we hold
mm->mmap_sem.
It would be possible to clear VM_NOHUGEPAGE directly from
vma->vm_flags in hugepage_madvise(), but I didn't want to introduce
special case behavior into madvise_behavior(). I think it's best to
just let it always set vma->vm_flags itself.
Signed-off-by: David Rientjes
Reported-by: Suleiman Souhlal
Cc: "Kirill A. Shutemov"
Cc: Andrea Arcangeli
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
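The shape of the fix (editor's sketch): khugepaged_enter_vma_merge()
grows a vm_flags argument so callers can pass the updated flags
instead of the stale vma->vm_flags:

        /* before: consults vma->vm_flags, where VM_NOHUGEPAGE is still set */
        khugepaged_enter_vma_merge(vma);

        /* after: hugepage_madvise() passes the flags it just computed */
        khugepaged_enter_vma_merge(vma, *vm_flags);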
14 Oct, 2014
1 commit
-
For VMAs that don't want write notifications, PTEs created for read
faults have their write bit set. If the read fault happens after
VM_SOFTDIRTY is cleared, then the PTE's softdirty bit will remain
clear after subsequent writes.
Here's a simple code snippet to demonstrate the bug (soft_dirty()
stands in for reading the page's soft-dirty bit, e.g. from
/proc/self/pagemap):

        char *m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
                       MAP_ANONYMOUS | MAP_SHARED, -1, 0);
        system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
        assert(*m == '\0');     /* new PTE allows write access */
        assert(!soft_dirty(m));
        *m = 'x';               /* should dirty the page */
        assert(soft_dirty(m));  /* fails */

With this patch, write notifications are enabled when VM_SOFTDIRTY is
cleared. Furthermore, to avoid unnecessary faults, write notifications
are disabled when VM_SOFTDIRTY is set.
As a side effect of enabling and disabling write notifications with
care, this patch fixes a bug in mprotect where vm_page_prot bits set
by drivers were zapped on mprotect. An analogous bug was fixed in mmap
by commit c9d0bf241451 ("mm: uncached vma support with writenotify").
Signed-off-by: Peter Feiner
Reported-by: Peter Feiner
Suggested-by: Kirill A. Shutemov
Cc: Cyrill Gorcunov
Cc: Pavel Emelyanov
Cc: Jamie Liu
Cc: Hugh Dickins
Cc: Naoya Horiguchi
Cc: Bjorn Helgaas
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
10 Oct, 2014
6 commits
-
Pull percpu updates from Tejun Heo:
"A lot of activities on percpu front. Notable changes are...
- percpu allocator now can take @gfp. If @gfp doesn't contain
  GFP_KERNEL, it tries to allocate from what's already available to
  the allocator and a work item tries to keep the reserve around
  certain level so that these atomic allocations usually succeed.
  This will replace the ad-hoc percpu memory pool used by
  blk-throttle and also be used by the planned blkcg support for
  writeback IOs.
  Please note that I noticed a bug in how @gfp is interpreted while
  preparing this pull request and applied the fix 6ae833c7fe0c
  ("percpu: fix how @gfp is interpreted by the percpu allocator")
  just now.
- percpu_ref now uses longs for percpu and global counters instead of
  ints. It leads to more sparse packing of the percpu counters on
  64bit machines but the overhead should be negligible and this
  allows using percpu_ref for refcnting pages and in-memory objects
  directly.
- The switching between percpu and single counter modes of a
  percpu_ref is made independent of putting the base ref and a
  percpu_ref can now optionally be initialized in single or killed
  mode. This allows avoiding percpu shutdown latency for cases where
  the refcounted objects may be synchronously created and destroyed
  in rapid succession with only a fraction of them reaching fully
  operational status (SCSI probing does this when combined with
  blk-mq support). It's also planned to be used to implement forced
  single mode to detect underflow more timely for debugging.
There's a separate branch percpu/for-3.18-consistent-ops which cleans
up the duplicate percpu accessors. That branch causes a number of
conflicts with s390 and other trees. I'll send a separate pull
request w/ resolutions once other branches are merged"
* 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (33 commits)
percpu: fix how @gfp is interpreted by the percpu allocator
blk-mq, percpu_ref: start q->mq_usage_counter in atomic mode
percpu_ref: make INIT_ATOMIC and switch_to_atomic() sticky
percpu_ref: add PERCPU_REF_INIT_* flags
percpu_ref: decouple switching to percpu mode and reinit
percpu_ref: decouple switching to atomic mode and killing
percpu_ref: add PCPU_REF_DEAD
percpu_ref: rename things to prepare for decoupling percpu/atomic mode switch
percpu_ref: replace pcpu_ prefix with percpu_
percpu_ref: minor code and comment updates
percpu_ref: relocate percpu_ref_reinit()
Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe"
Revert "percpu: free percpu allocation info for uniprocessor system"
percpu-refcount: make percpu_ref based on longs instead of ints
percpu-refcount: improve WARN messages
percpu: fix locking regression in the failure path of pcpu_alloc()
percpu-refcount: add @gfp to percpu_ref_init()
proportions: add @gfp to init functions
percpu_counter: add @gfp to percpu_counter_init()
percpu_counter: make percpu_counters_lock irq-safe
...
-
Dump the contents of the relevant struct mm_struct when we hit the bug
condition.
Signed-off-by: Sasha Levin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
- be consistent in printing the test which failed
- one message was actually wrong (aa)
- don't print second bogus warning if browse_rb() failed
Cc: Sasha Levin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Trivially convert a few VM_BUG_ON calls to VM_BUG_ON_VMA to extract
more information when they trigger.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Sasha Levin
Reviewed-by: Naoya Horiguchi
Cc: Kirill A. Shutemov
Cc: Konstantin Khlebnikov
Cc: Rik van Riel
Cc: Mel Gorman
Cc: Michal Hocko
Cc: Hugh Dickins
Cc: Vlastimil Babka
Cc: Michel Lespinasse
Cc: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Signed-off-by: Cyrill Gorcunov
Cc: Kees Cook
Cc: Tejun Heo
Cc: Andrew Vagin
Cc: Eric W. Biederman
Cc: H. Peter Anvin
Acked-by: Serge Hallyn
Cc: Pavel Emelyanov
Cc: Vasiliy Kulikov
Cc: KAMEZAWA Hiroyuki
Cc: Michael Kerrisk
Cc: Julien Tinnes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Signed-off-by: vishnu.ps
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
25 Sep, 2014
1 commit
-
…linux-block into for-3.18
This is to receive 0a30288da1ae ("blk-mq, percpu_ref: implement a
kludge for SCSI blk-mq stall during probe") which implements
__percpu_ref_kill_expedited() to work around the SCSI blk-mq stall.
The commit will be reverted and patches to implement a proper fix will
be added.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
11 Sep, 2014
1 commit
-
Make sure we actually see the output of validate_mm() and browse_rb()
before triggering a BUG(). pr_info isn't shown by default, so the
reason for the BUG() isn't obvious.
Signed-off-by: Sasha Levin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
08 Sep, 2014
1 commit
-
The percpu allocator now supports an allocation mask. Add @gfp to
percpu_counter_init() so that !GFP_KERNEL allocation masks can be used
with percpu_counters too (example below).
We could have left percpu_counter_init() alone and added
percpu_counter_init_gfp(); however, the number of users isn't that
high and introducing _gfp variants to all percpu data structures would
be quite ugly, so let's just do the conversion. This is the one with
the most users. Other percpu data structures are a lot easier to
convert.
This patch doesn't make any functional difference.
Signed-off-by: Tejun Heo
Acked-by: Jan Kara
Acked-by: "David S. Miller"
Cc: x86@kernel.org
Cc: Jens Axboe
Cc: "Theodore Ts'o"
Cc: Alexander Viro
Cc: Andrew Morton
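An example of the converted signature (vm_committed_as is one of the
mm users):

        /* before: percpu_counter_init(&vm_committed_as, 0); */
        percpu_counter_init(&vm_committed_as, 0, GFP_KERNEL);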
09 Aug, 2014
1 commit
-
This patch (of 6):
The i_mmap_writable field counts existing writable mappings of an
address_space. To allow drivers to prevent new writable mappings, make
this counter signed and prevent new writable mappings if it is
negative. This is modelled after i_writecount and DENYWRITE (see the
sketch below).
This will be required by the shmem-sealing infrastructure to prevent
any new writable mappings after the WRITE seal has been set. In case
there exists a writable mapping, this operation will fail with EBUSY.
Note that we rely on the fact that iff you already own a writable
mapping, you can increase the counter without using the helpers. This
is the same as we do for i_writecount.
Signed-off-by: David Herrmann
Acked-by: Hugh Dickins
Cc: Michael Kerrisk
Cc: Ryan Lortie
Cc: Lennart Poettering
Cc: Daniel Mack
Cc: Andy Lutomirski
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
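A sketch of the scheme (editor's reconstruction of the helpers the
patch adds; simplified):

        /* a new writable mapping is only allowed while the count is >= 0 */
        static inline int mapping_map_writable(struct address_space *mapping)
        {
                return atomic_inc_unless_negative(&mapping->i_mmap_writable) ?
                        0 : -EPERM;
        }

        /* a driver denies new writable mappings by forcing the count
         * negative, which only succeeds while no writable mapping exists */
        static inline int mapping_deny_writable(struct address_space *mapping)
        {
                return atomic_dec_unless_positive(&mapping->i_mmap_writable) ?
                        0 : -EBUSY;
        }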
07 Aug, 2014
1 commit
-
Print a warning (if CONFIG_DEBUG_VM=y) when memory commitment becomes
too negative.
This shouldn't happen any more - the previous two patches fixed the
committed_as underflow issues.
[akpm@linux-foundation.org: use VM_WARN_ONCE, per Dave]
Signed-off-by: Konstantin Khlebnikov
Cc: Hugh Dickins
Cc: Dave Hansen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
07 Jun, 2014
1 commit
-
printk is meant to be used with an associated log level. There are
some instances of printk scattered around the mm code where the log
level is missing. Add a log level and adhere to suggestions by
scripts/checkpatch.pl by moving to the pr_* macros.
Also add the typical pr_fmt definition so that print statements can be
easily traced back to the modules where they occur, correlated with
one another, etc. This will require the removal of some (now
redundant) prefixes on a few print statements (example below).
Signed-off-by: Mitchel Humpherys
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
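A typical conversion (the prefix string is illustrative, not
necessarily the one the patch uses):

        /* at the top of the file, before any include that pulls in printk.h */
        #define pr_fmt(fmt) "mm: " fmt

        /* before: printk(KERN_WARNING "mm: something went wrong\n"); */
        pr_warn("something went wrong\n");      /* prefix now comes from pr_fmt */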
05 Jun, 2014
3 commits
-
Pull x86 vdso updates from Peter Anvin:
"Vdso cleanups and improvements largely from Andy Lutomirski. This
makes the vdso a lot less 'special'"
* 'x86/vdso' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/vdso, build: Make LE access macros clearer, host-safe
x86/vdso, build: Fix cross-compilation from big-endian architectures
x86/vdso, build: When vdso2c fails, unlink the output
x86, vdso: Fix an OOPS accessing the HPET mapping w/o an HPET
x86, mm: Replace arch_vma_name with vm_ops->name for vsyscalls
x86, mm: Improve _install_special_mapping and fix x86 vdso naming
mm, fs: Add vm_ops->name as an alternative to arch_vma_name
x86, vdso: Fix an OOPS accessing the HPET mapping w/o an HPET
x86, vdso: Remove vestiges of VDSO_PRELINK and some outdated comments
x86, vdso: Move the vvar and hpet mappings next to the 64-bit vDSO
x86, vdso: Move the 32-bit vdso special pages after the text
x86, vdso: Reimplement vdso.so preparation in build-time C
x86, vdso: Move syscall and sysenter setup into kernel/cpu/common.c
x86, vdso: Clean up 32-bit vs 64-bit vdso params
x86, mm: Ensure correct alignment of the fixmap
-
Remove the first mapping check for vma_link. Move the mutex_lock into
the braces when vma->vm_file is true.
Signed-off-by: Huang Shijie
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Fix a coccinelle error regarding usage of IS_ERR and PTR_ERR instead
of PTR_ERR_OR_ZERO (example below).
Signed-off-by: Duan Jiong
Acked-by: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
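The pattern coccinelle flags (illustrative):

        /* before */
        if (IS_ERR(vma))
                return PTR_ERR(vma);
        return 0;

        /* after */
        return PTR_ERR_OR_ZERO(vma);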
21 May, 2014
1 commit
-
Using arch_vma_name to give special mappings a name is awkward. x86
currently implements it by comparing the start address of the vma to
the expected address of the vdso. This requires tracking the start
address of special mappings and is probably buggy if a special vma
is split or moved.
Improve _install_special_mapping to just name the vma directly. Use
it to give the x86 vvar area a name, which should make CRIU's life
easier (see the sketch below).
As a side effect, the vvar area will show up in core dumps. This
could be considered weird and is fixable.
[hpa: I say we accept this as-is but be prepared to deal with knocking
out the vvars from core dumps if this becomes a problem.]
Cc: Cyrill Gorcunov
Cc: Pavel Emelyanov
Signed-off-by: Andy Lutomirski
Link: http://lkml.kernel.org/r/276b39b6b645fb11e345457b503f17b83c2c6fd0.1400538962.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin
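A sketch of the naming mechanism added by the series (simplified): the
vma reports its own name through vm_ops->name instead of being matched
by address in arch_vma_name():

        static const char *vvar_vma_name(struct vm_area_struct *vma)
        {
                return "[vvar]";
        }

        static const struct vm_operations_struct vvar_vm_ops = {
                .name = vvar_vma_name,
        };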
08 Apr, 2014
1 commit
-
This patch is a continuation of efforts trying to optimize find_vma(),
avoiding potentially expensive rbtree walks to locate a vma upon
faults. The original approach (https://lkml.org/lkml/2013/11/1/410),
where the largest vma was also cached, ended up being too specific and
random, so further comparison with other approaches was needed. There
are two things to consider when dealing with this: the cache hit rate
and the latency of find_vma(). Improving the hit rate does not
necessarily translate into finding the vma any faster, as the overhead
of any fancy caching scheme can be too high to consider.
We currently cache the last used vma for the whole address space,
which provides a nice optimization, reducing the total cycles in
find_vma() by up to 250%, for workloads with good locality. On the
other hand, this simple scheme is pretty much useless for workloads
with poor locality. Analyzing ebizzy runs shows that, no matter how
many threads are running, the mmap_cache hit rate is less than 2%, and
in many situations below 1%.
The proposed approach is to replace this scheme with a small
per-thread cache, maximizing hit rates at a very low maintenance cost
(see the sketch below). Invalidations are performed by simply bumping
up a 32-bit sequence number. The only expensive operation is in the
rare case of a seq number overflow, where all caches that share the
same address space are flushed. Upon a miss, the proposed replacement
policy is based on the page number that contains the virtual address
in question. Concretely, the following results are seen on an 80 core,
8 socket x86-64 box:
1) System bootup: Most programs are single threaded, so the per-thread
scheme does improve ~50% hit rate by just adding a few more slots to
the cache.
+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 50.61%   | 19.90            |
| patched        | 73.45%   | 13.58            |
+----------------+----------+------------------+
2) Kernel build: This one is already pretty good with the current
approach as we're dealing with good locality.
+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 75.28%   | 11.03            |
| patched        | 88.09%   | 9.31             |
+----------------+----------+------------------+
3) Oracle 11g Data Mining (4k pages): Similar to the kernel build
workload.
+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 70.66%   | 17.14            |
| patched        | 91.15%   | 12.57            |
+----------------+----------+------------------+
4) Ebizzy: There's a fair amount of variation from run to run, but
this approach always shows nearly perfect hit rates, while baseline is
just about non-existent. The cycle counts can fluctuate anywhere from
~60 to ~116 for the baseline scheme, but this approach reduces that
considerably. For instance, with 80 threads:
+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 1.06%    | 91.54            |
| patched        | 99.97%   | 14.18            |
+----------------+----------+------------------+
[akpm@linux-foundation.org: fix nommu build, per Davidlohr]
[akpm@linux-foundation.org: document vmacache_valid() logic]
[akpm@linux-foundation.org: attempt to untangle header files]
[akpm@linux-foundation.org: add vmacache_find() BUG_ON]
[hughd@google.com: add vmacache_valid_mm() (from Oleg)]
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: adjust and enhance comments]
Signed-off-by: Davidlohr Bueso
Reviewed-by: Rik van Riel
Acked-by: Linus Torvalds
Reviewed-by: Michel Lespinasse
Cc: Oleg Nesterov
Tested-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
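An editor's sketch of the per-thread cache (simplified from
mm/vmacache.c; constants and hash follow the description above):

        #define VMACACHE_BITS 2
        #define VMACACHE_SIZE (1U << VMACACHE_BITS)
        #define VMACACHE_MASK (VMACACHE_SIZE - 1)

        /* replacement policy: index by the faulting page number */
        static void vmacache_update(unsigned long addr, struct vm_area_struct *newvma)
        {
                int idx = (addr >> PAGE_SHIFT) & VMACACHE_MASK;

                current->vmacache[idx] = newvma;
        }

        /* invalidation: bump the per-mm sequence number; lookups compare
         * their thread's copy against it before trusting the cache */
        static void vmacache_invalidate(struct mm_struct *mm)
        {
                mm->vmacache_seqnum++;
        }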
05 Apr, 2014
1 commit
-
Pull file locking updates from Jeff Layton:
"Highlights:
- maintainership change for fs/locks.c. Willy's not interested in
maintaining it these days, and is OK with Bruce and I taking it.
- fix for open vs setlease race that Al ID'ed
- cleanup and consolidation of file locking code
- eliminate unneeded BUG() call
- merge of file-private lock implementation"
* 'locks-3.15' of git://git.samba.org/jlayton/linux:
locks: make locks_mandatory_area check for file-private locks
locks: fix locks_mandatory_locked to respect file-private locks
locks: require that flock->l_pid be set to 0 for file-private locks
locks: add new fcntl cmd values for handling file private locks
locks: skip deadlock detection on FL_FILE_PVT locks
locks: pass the cmd value to fcntl_getlk/getlk64
locks: report l_pid as -1 for FL_FILE_PVT locks
locks: make /proc/locks show IS_FILE_PVT locks as type "FLPVT"
locks: rename locks_remove_flock to locks_remove_file
locks: consolidate checks for compatible filp->f_mode values in setlk handlers
locks: fix posix lock range overflow handling
locks: eliminate BUG() call when there's an unexpected lock on file close
locks: add __acquires and __releases annotations to locks_start and locks_stop
locks: remove "inline" qualifier from fl_link manipulation functions
locks: clean up comment typo
locks: close potential race between setlease and open
MAINTAINERS: update entry for fs/locks.c
04 Apr, 2014
1 commit
-
Mark validate_mm() as static in mmap.c because it is not used outside
this file.
This eliminates the following warning in mm/mmap.c:
mm/mmap.c:407:6: warning: no previous prototype for `validate_mm' [-Wmissing-prototypes]
Signed-off-by: Rashika Kheria
Reviewed-by: Josh Triplett
Reviewed-by: Rik van Riel
Acked-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
31 Mar, 2014
1 commit
-
As Trond pointed out, you can currently deadlock yourself by setting a
file-private lock on a file that requires mandatory locking and then
trying to do I/O on it.
Avoid this problem by plumbing some knowledge of file-private locks
into the mandatory locking code. In order to do this, we must pass
down information about the struct file that's being used to
locks_verify_locked().
Reported-by: Trond Myklebust
Signed-off-by: Jeff Layton
Acked-by: J. Bruce Fields