08 Apr, 2014

1 commit

  • This patch is a continuation of efforts to optimize find_vma(),
    avoiding potentially expensive rbtree walks to locate a vma upon
    faults. The original approach (https://lkml.org/lkml/2013/11/1/410),
    where the largest vma was also cached, ended up being too specific
    and random, so further comparison with other approaches was needed.
    There are two things to consider here: the cache hit rate and the
    latency of find_vma(). Improving the hit rate does not necessarily
    translate into finding the vma any faster, as the overhead of any
    fancy caching scheme can be too high to be worthwhile.

    We currently cache the last used vma for the whole address space, which
    provides a nice optimization, reducing the total cycles in find_vma() by
    up to 250%, for workloads with good locality. On the other hand, this
    simple scheme is pretty much useless for workloads with poor locality.
    Analyzing ebizzy runs shows that, no matter how many threads are
    running, the mmap_cache hit rate is less than 2%, and in many situations
    below 1%.

    The proposed approach is to replace this scheme with a small per-thread
    cache, maximizing hit rates at a very low maintenance cost.
    Invalidations are performed by simply bumping up a 32-bit sequence
    number. The only expensive operation is the rare case of a sequence
    number overflow, where all caches that share the same address space
    are flushed. Upon a miss, the proposed replacement policy is based on
    the page number that contains the virtual address in question (a
    sketch of the scheme follows the results below). Concretely, the
    following results are seen on an 80 core, 8 socket x86-64 box:

    1) System bootup: Most programs are single threaded, so the per-thread
    scheme improves the ~50% hit rate by just adding a few more slots to
    the cache.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |   50.61% |            19.90 |
    | patched        |   73.45% |            13.58 |
    +----------------+----------+------------------+

    2) Kernel build: This one is already pretty good with the current
    approach as we're dealing with good locality.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |   75.28% |            11.03 |
    | patched        |   88.09% |             9.31 |
    +----------------+----------+------------------+

    3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |   70.66% |            17.14 |
    | patched        |   91.15% |            12.57 |
    +----------------+----------+------------------+

    4) Ebizzy: There's a fair amount of variation from run to run, but this
    approach always shows nearly perfect hit rates, while the baseline's
    are just about non-existent. The cycle counts fluctuate anywhere from
    ~60 to ~116 billion for the baseline scheme, but this approach reduces
    them considerably. For instance, with 80 threads:

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |    1.06% |            91.54 |
    | patched        |   99.97% |            14.18 |
    +----------------+----------+------------------+
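
    As a rough sketch of the scheme just described -- not the actual
    mm/vmacache.c code; the slot count and page size here are illustrative
    assumptions:

    #define PAGE_SHIFT    12                /* assumed 4K pages */
    #define VMACACHE_BITS 2
    #define VMACACHE_SIZE (1U << VMACACHE_BITS)

    struct vma;                             /* stand-in for vm_area_struct */

    struct vmacache {
        unsigned int seqnum;                /* snapshot of mm-wide seqnum */
        struct vma *slots[VMACACHE_SIZE];
    };

    /* Replacement policy: index slots by the page number of the address. */
    static unsigned int vmacache_idx(unsigned long addr)
    {
        return (addr >> PAGE_SHIFT) & (VMACACHE_SIZE - 1);
    }

    /* Invalidation: bump the mm-wide sequence number; a thread whose
     * cached seqnum no longer matches treats its cache as empty. Only
     * the rare 32-bit overflow requires explicitly flushing all caches
     * that share the address space. */
    static void vmacache_invalidate(unsigned int *mm_seqnum)
    {
        (*mm_seqnum)++;
    }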

    [akpm@linux-foundation.org: fix nommu build, per Davidlohr]
    [akpm@linux-foundation.org: document vmacache_valid() logic]
    [akpm@linux-foundation.org: attempt to untangle header files]
    [akpm@linux-foundation.org: add vmacache_find() BUG_ON]
    [hughd@google.com: add vmacache_valid_mm() (from Oleg)]
    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: adjust and enhance comments]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Acked-by: Linus Torvalds
    Reviewed-by: Michel Lespinasse
    Cc: Oleg Nesterov
    Tested-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

05 Apr, 2014

1 commit

  • Pull file locking updates from Jeff Layton:
    "Highlights:

    - maintainership change for fs/locks.c. Willy's not interested in
    maintaining it these days, and is OK with Bruce and me taking it.
    - fix for open vs setlease race that Al ID'ed
    - cleanup and consolidation of file locking code
    - eliminate unneeded BUG() call
    - merge of file-private lock implementation"

    * 'locks-3.15' of git://git.samba.org/jlayton/linux:
    locks: make locks_mandatory_area check for file-private locks
    locks: fix locks_mandatory_locked to respect file-private locks
    locks: require that flock->l_pid be set to 0 for file-private locks
    locks: add new fcntl cmd values for handling file private locks
    locks: skip deadlock detection on FL_FILE_PVT locks
    locks: pass the cmd value to fcntl_getlk/getlk64
    locks: report l_pid as -1 for FL_FILE_PVT locks
    locks: make /proc/locks show IS_FILE_PVT locks as type "FLPVT"
    locks: rename locks_remove_flock to locks_remove_file
    locks: consolidate checks for compatible filp->f_mode values in setlk handlers
    locks: fix posix lock range overflow handling
    locks: eliminate BUG() call when there's an unexpected lock on file close
    locks: add __acquires and __releases annotations to locks_start and locks_stop
    locks: remove "inline" qualifier from fl_link manipulation functions
    locks: clean up comment typo
    locks: close potential race between setlease and open
    MAINTAINERS: update entry for fs/locks.c
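
    As a usage illustration of the file-private locks merged here -- a
    minimal sketch using the F_OFD_* names these locks ultimately shipped
    under; the file path is arbitrary:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        struct flock fl;
        int fd = open("/tmp/ofd-demo", O_RDWR | O_CREAT, 0644);

        if (fd < 0) {
            perror("open");
            return 1;
        }

        memset(&fl, 0, sizeof(fl));
        fl.l_type = F_WRLCK;
        fl.l_whence = SEEK_SET;
        fl.l_start = 0;
        fl.l_len = 0;              /* zero length means "whole file" */
        fl.l_pid = 0;              /* must be 0 for file-private locks */

        /* Unlike traditional F_SETLK, the lock is owned by this open
         * file description rather than the process, so it is not lost
         * when an unrelated fd for the same file is closed. */
        if (fcntl(fd, F_OFD_SETLK, &fl) == -1)
            perror("F_OFD_SETLK");
        else
            printf("got file-private write lock\n");

        close(fd);
        return 0;
    }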

    Linus Torvalds
     

04 Apr, 2014

1 commit

  • Mark a function as static in mmap.c because it is not used outside
    this file.

    This eliminates the following warning in mm/mmap.c:

    mm/mmap.c:407:6: warning: no previous prototype for `validate_mm' [-Wmissing-prototypes]

    Signed-off-by: Rashika Kheria
    Reviewed-by: Josh Triplett
    Reviewed-by: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rashika Kheria
     

31 Mar, 2014

1 commit

  • As Trond pointed out, you can currently deadlock yourself by setting a
    file-private lock on a file that requires mandatory locking and then
    trying to do I/O on it.

    Avoid this problem by plumbing some knowledge of file-private locks into
    the mandatory locking code. In order to do this, we must pass down
    information about the struct file that's being used to
    locks_verify_locked.

    Reported-by: Trond Myklebust
    Signed-off-by: Jeff Layton
    Acked-by: J. Bruce Fields

    Jeff Layton
     

19 Mar, 2014

1 commit

  • _install_special_mapping() is the new base function for
    install_special_mapping(). This function returns a pointer to the
    created VMA, or an error code wrapped in an ERR_PTR().

    This new function is needed by the vdso 32-bit support to map the
    additional vvar and hpet pages into the 32-bit address space. This is
    done with io_remap_pfn_range() and remap_pfn_range(), which require a
    vm_area_struct.
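
    A hedged sketch of the resulting calling convention (kernel-internal
    code; the VM flags shown are illustrative):

    struct vm_area_struct *vma;

    vma = _install_special_mapping(mm, addr, len,
                                   VM_READ | VM_MAYREAD, pages);
    if (IS_ERR(vma))
        return PTR_ERR(vma);    /* error code travels inside the pointer */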

    Reviewed-by: Andy Lutomirski
    Signed-off-by: Stefani Seibold
    Link: http://lkml.kernel.org/r/1395094933-14252-3-git-send-email-stefani@seibold.net
    Signed-off-by: H. Peter Anvin

    Stefani Seibold
     

24 Jan, 2014

2 commits

  • The VM_SOFTDIRTY bit affects the vma merge routine: if two VMAs have
    all bits in vm_flags matched except the dirty bit, the kernel can no
    longer merge them and is forced to generate new VMAs instead.

    This may eventually lead to the situation where a userspace
    application reaches the vm.max_map_count limit and, in the worst
    case, crashes:

    | (gimp:11768): GLib-ERROR **: gmem.c:110: failed to allocate 4096 bytes
    |
    | (file-tiff-load:12038): LibGimpBase-WARNING **: file-tiff-load: gimp_wire_read(): error
    | xinit: connection to X server lost
    |
    | waiting for X server to shut down
    | /usr/lib64/gimp/2.0/plug-ins/file-tiff-load terminated: Hangup
    | /usr/lib64/gimp/2.0/plug-ins/script-fu terminated: Hangup
    | /usr/lib64/gimp/2.0/plug-ins/script-fu terminated: Hangup

    https://bugzilla.kernel.org/show_bug.cgi?id=67651
    https://bugzilla.gnome.org/show_bug.cgi?id=719619#c0

    The initial problem came from a missed VM_SOFTDIRTY in the do_brk()
    routine, but even if we set up VM_SOFTDIRTY there, there is still a
    way to prevent VMAs from merging: one can call

    | echo 4 > /proc/$PID/clear_refs

    and clear VM_SOFTDIRTY over all VMAs present in the memory map; a
    subsequent do_brk() will then try to extend the old VMA, find that
    the dirty bit doesn't match, and generate a new VMA instead.

    As discussed with Pavel, the right approach is to ignore the
    VM_SOFTDIRTY bit when trying to merge VMAs and, if the merge
    succeeds, mark the extended VMA with the dirty bit where needed.
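
    The idea of the fix, as a sketch (the helper name is hypothetical;
    the real comparison lives in mm/mmap.c's mergeability checks):

    /* Compare vm_flags for mergeability while ignoring the soft-dirty
     * bit; on a successful merge the extended VMA is re-marked dirty
     * where needed. */
    static inline int mergeable_flags_sketch(unsigned long a, unsigned long b)
    {
        return (a & ~VM_SOFTDIRTY) == (b & ~VM_SOFTDIRTY);
    }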

    Signed-off-by: Cyrill Gorcunov
    Reported-by: Bastian Hougaard
    Reported-by: Mel Gorman
    Cc: Pavel Emelyanov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Code that is obj-y (always built-in) or dependent on a bool Kconfig
    (built-in or absent) can never be modular. So using module_init as an
    alias for __initcall can be somewhat misleading.

    Fix these up now, so that we can relocate module_init from init.h into
    module.h in the future. If we don't do this, we'd have to add module.h
    to obviously non-modular code, and that would be a worse thing.

    The audit targets the following module_init users for change:
    mm/ksm.c bool KSM
    mm/mmap.c bool MMU
    mm/huge_memory.c bool TRANSPARENT_HUGEPAGE
    mm/mmu_notifier.c bool MMU_NOTIFIER

    Note that direct use of __initcall is discouraged, vs. one of the
    priority categorized subgroups. As __initcall gets mapped onto
    device_initcall, our use of subsys_initcall (which makes sense for these
    files) will thus change this registration from level 6-device to level
    4-subsys (i.e. slightly earlier).

    However no observable impact of that difference has been observed during
    testing.

    One might think that core_initcall (l2) or postcore_initcall (l3) would
    be more appropriate for anything in mm/ but if we look at some actual
    init functions themselves, we see things like:

    mm/huge_memory.c --> hugepage_init --> hugepage_init_sysfs
    mm/mmap.c --> init_user_reserve --> sysctl_user_reserve_kbytes
    mm/ksm.c --> ksm_init --> sysfs_create_group

    and hence the choice of subsys_initcall (l4) seems reasonable, and at
    the same time minimizes the risk of changing the priority too
    drastically all at once. We can adjust further in the future.
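
    The change pattern, in brief, using one of the audited init functions
    (body elided):

    static int __init init_user_reserve(void)
    {
        /* ... set up sysctl_user_reserve_kbytes ... */
        return 0;
    }
    subsys_initcall(init_user_reserve);  /* was: module_init(init_user_reserve); */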

    Also, several instances of missing ";" at EOL are fixed.

    Signed-off-by: Paul Gortmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Gortmaker
     

22 Jan, 2014

2 commits

  • Both do_brk and do_mmap_pgoff verify that we are actually capable of
    locking future pages if the corresponding VM_LOCKED flags are used.
    Encapsulate this logic into a single mlock_future_check() helper
    function.
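
    A sketch of the helper's shape, assuming the usual RLIMIT_MEMLOCK
    accounting (hedged, not necessarily the exact merged code):

    static inline int mlock_future_check(struct mm_struct *mm,
                                         unsigned long flags,
                                         unsigned long len)
    {
        unsigned long locked, lock_limit;

        if (flags & VM_LOCKED) {
            locked = len >> PAGE_SHIFT;
            locked += mm->locked_vm;
            lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
            if (locked > lock_limit && !capable(CAP_IPC_LOCK))
                return -EAGAIN;
        }
        return 0;
    }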

    Signed-off-by: Davidlohr Bueso
    Cc: Rik van Riel
    Reviewed-by: Michel Lespinasse
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Some applications that run on HPC clusters are designed around the
    availability of RAM, and the overcommit ratio is fine-tuned to get the
    maximum usage of memory without swapping. With growing memory, the
    1%-of-all-RAM granularity provided by overcommit_ratio has become too
    coarse for these workloads (on a 2TB machine it represents no less
    than 20GB).

    This patch adds the new overcommit_kbytes sysctl variable that allows
    a much finer granularity.
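
    Usage sketch -- when set non-zero, this value takes the place of the
    percentage when computing the commit limit:

    $ sudo sysctl -w vm.overcommit_kbytes=8388608   # commit limit = swap + 8GB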

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix nommu build]
    Signed-off-by: Jerome Marchand
    Cc: Dave Hansen
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     

15 Nov, 2013

1 commit

  • With split page table lock for PMD level we can't hold mm->page_table_lock
    while updating nr_ptes.

    Let's convert it to atomic_long_t to avoid races.
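
    The conversion pattern, roughly:

    /* before: counter updates relied on mm->page_table_lock */
    spin_lock(&mm->page_table_lock);
    mm->nr_ptes++;
    spin_unlock(&mm->page_table_lock);

    /* after: nr_ptes is an atomic_long_t, safe under split PMD locks */
    atomic_long_inc(&mm->nr_ptes);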

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: Naoya Horiguchi
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

13 Nov, 2013

4 commits

  • Merge first patch-bomb from Andrew Morton:
    "Quite a lot of other stuff is banked up awaiting further
    next->mainline merging, but this batch contains:

    - Lots of random misc patches
    - OCFS2
    - Most of MM
    - backlight updates
    - lib/ updates
    - printk updates
    - checkpatch updates
    - epoll tweaking
    - rtc updates
    - hfs
    - hfsplus
    - documentation
    - procfs
    - update gcov to gcc-4.7 format
    - IPC"

    * emailed patches from Andrew Morton: (269 commits)
    ipc, msg: fix message length check for negative values
    ipc/util.c: remove unnecessary work pending test
    devpts: plug the memory leak in kill_sb
    ./Makefile: export initial ramdisk compression config option
    init/Kconfig: add option to disable kernel compression
    drivers: w1: make w1_slave::flags long to avoid memory corruption
    drivers/w1/masters/ds1wm.c: use dev_get_platdata()
    drivers/memstick/core/ms_block.c: fix unreachable state in h_msb_read_page()
    drivers/memstick/core/mspro_block.c: fix attributes array allocation
    drivers/pps/clients/pps-gpio.c: remove redundant of_match_ptr
    kernel/panic.c: reduce 1 byte usage for print tainted buffer
    gcov: reuse kbasename helper
    kernel/gcov/fs.c: use pr_warn()
    kernel/module.c: use pr_foo()
    gcov: compile specific gcov implementation based on gcc version
    gcov: add support for gcc 4.7 gcov format
    gcov: move gcov structs definitions to a gcc version specific file
    kernel/taskstats.c: return -ENOMEM when alloc memory fails in add_del_listener()
    kernel/taskstats.c: add nla_nest_cancel() for failure processing between nla_nest_start() and nla_nest_end()
    kernel/sysctl_binary.c: use scnprintf() instead of snprintf()
    ...

    Linus Torvalds
     
  • The same calculation is currently done in three different places.
    Factor out that code so future changes have to be made in only one
    place.
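
    A sketch of the factored-out helper along the lines this series takes
    (hedged):

    unsigned long vm_commit_limit(void)
    {
        /* a percentage of RAM (minus hugetlb pages) plus all of swap */
        return ((totalram_pages - hugetlb_total_pages())
                * sysctl_overcommit_ratio / 100) + total_swap_pages;
    }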

    [akpm@linux-foundation.org: uninline vm_commit_limit()]
    Signed-off-by: Jerome Marchand
    Cc: Dave Hansen
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     
  • This patch fixes the problem that get_unmapped_area() can return illegal
    address and result in failing mmap(2) etc.

    If an address higher than PAGE_SIZE is set in
    /proc/sys/vm/mmap_min_addr, an address lower than mmap_min_addr can be
    returned by get_unmapped_area(), even if you do not pass any virtual
    address hint (i.e. the second argument).

    This is because the current get_unmapped_area() code does not take into
    account mmap_min_addr.

    This leads to two actual problems as follows:

    1. mmap(2) can fail with EPERM on a process without CAP_SYS_RAWIO,
    even though no illegal parameter is passed.

    2. The bottom-up search path after the top-down search might not work in
    arch_get_unmapped_area_topdown().

    Note: The first and third chunks of my patch, which change the "len"
    check, are for a more precise check using mmap_min_addr, and not for
    solving the above problem.

    [How to reproduce]

    --- test.c -------------------------------------------------
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <errno.h>

    int main(int argc, char *argv[])
    {
        void *ret = NULL, *last_map;
        size_t pagesize = sysconf(_SC_PAGESIZE);

        do {
            last_map = ret;
            ret = mmap(0, pagesize, PROT_NONE,
                       MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
            // printf("ret=%p\n", ret);
        } while (ret != MAP_FAILED);

        if (errno != ENOMEM) {
            printf("ERR: unexpected errno: %d (last map=%p)\n",
                   errno, last_map);
        }

        return 0;
    }
    ---------------------------------------------------------------

    $ gcc -m32 -o test test.c
    $ sudo sysctl -w vm.mmap_min_addr=65536
    vm.mmap_min_addr = 65536
    $ ./test (run as a non-privileged user)
    ERR: unexpected errno: 1 (last map=0x10000)

    Signed-off-by: Akira Takeuchi
    Signed-off-by: Kiyoshi Owada
    Reviewed-by: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akira Takeuchi
     
  • This is more or less the generic variant of commit 41aacc1eea64 ("x86
    get_unmapped_area: Access mmap_legacy_base through mm_struct member").

    So effectively, architectures which use their own
    arch_pick_mmap_layout() implementation but call the generic
    arch_get_unmapped_area() can now also randomize their mmap_base.

    All architectures which have their own arch_pick_mmap_layout() and
    call the generic arch_get_unmapped_area() (arm64, s390, tile)
    currently set mmap_base to TASK_UNMAPPED_BASE. This is also true for
    the generic arch_pick_mmap_layout() function. So this change is
    currently a no-op.

    Signed-off-by: Heiko Carstens
    Cc: Radu Caragea
    Cc: Michel Lespinasse
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Chris Metcalf
    Cc: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     

12 Sep, 2013

6 commits

  • pgoff is not used after the statement "pgoff = vma->vm_pgoff;", so the
    assignment is redundant.

    Signed-off-by: Yanchuan Nian
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yanchuan Nian
     
  • Pavel reported that in case a vma area gets unmapped and then mapped
    (or expanded) in place, the soft dirty tracker won't be able to
    recognize this situation, since it works on the pte level and ptes get
    zapped on unmap, losing the soft dirty bit of course.

    To resolve this we need to track actions on the vma level; this is
    where the VM_SOFTDIRTY flag comes in. When a new vma area is created
    (or an old one is expanded) we set this bit, and keep it set until the
    application asks for the soft dirty bit to be cleared.

    Thus when a user space application tracks memory changes, it can now
    detect whether a vma area has been renewed.

    Reported-by: Pavel Emelyanov
    Signed-off-by: Cyrill Gorcunov
    Cc: Andy Lutomirski
    Cc: Matt Mackall
    Cc: Xiao Guangrong
    Cc: Marcelo Tosatti
    Cc: KOSAKI Motohiro
    Cc: Stephen Rothwell
    Cc: Peter Zijlstra
    Cc: "Aneesh Kumar K.V"
    Cc: Rob Landley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • correct_wcount and inode in mmap_region() just complicate the code.
    This boolean was needed previously, when deny_write_access() was
    called before vma_merge(); now we can simply check VM_DENYWRITE and
    do allow_write_access() if it is set.

    allow_write_access() checks file != NULL, so this is safe even if it
    were possible to use VM_DENYWRITE && !file. We just need to ensure we
    use the same file which was deny_write_access()'ed, so the patch also
    moves "file = vma->vm_file" down after allow_write_access().

    Signed-off-by: Oleg Nesterov
    Cc: Hugh Dickins
    Cc: Al Viro
    Cc: Colin Cross
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Simple cleanup. Move "struct inode *inode" variable into "if (file)"
    block to simplify the code and avoid the unnecessary check.

    Signed-off-by: Oleg Nesterov
    Cc: Hugh Dickins
    Cc: Al Viro
    Cc: Colin Cross
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • mmap() doesn't allow non-anonymous mappings with a VM_GROWS* bit set.
    In particular this means that mmap_region()->vma_merge(file, vm_flags)
    must always fail if "vm_flags & VM_GROWS*" is set incorrectly.

    So it does not make sense to check VM_GROWS* after we have already
    allocated the new vma; the only caller that can pass this flag,
    do_mmap_pgoff(), can do the check itself.

    This also looks a bit more correct: mmap_region() has already unmapped
    the old mapping at this stage, but if mmap() is going to fail, it
    should avoid do_munmap() if possible.

    Note: we check VM_GROWS* at the end to ensure that do_mmap_pgoff()
    won't return EINVAL in cases where it currently returns another error
    code.

    Many thanks to Hugh who nacked the buggy v1.

    Signed-off-by: Oleg Nesterov
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Simple cleanup. Every user of vma_set_policy() does the same work,
    which looks a bit annoying imho. Add a new trivial helper which does
    mpol_dup() + vma_set_policy() to simplify the callers.
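
    A sketch of such a helper's shape (hedged):

    int vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst)
    {
        struct mempolicy *pol = mpol_dup(vma_policy(src));

        if (IS_ERR(pol))
            return PTR_ERR(pol);
        dst->vm_policy = pol;   /* mpol_dup() + vma_set_policy() in one */
        return 0;
    }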

    Signed-off-by: Oleg Nesterov
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

16 Aug, 2013

1 commit

  • Ben Tebulin reported:

    "Since v3.7.2 on two independent machines a very specific Git
    repository fails in 9/10 cases on git-fsck due to an SHA1/memory
    failures. This only occurs on a very specific repository and can be
    reproduced stably on two independent laptops. Git mailing list ran
    out of ideas and for me this looks like some very exotic kernel issue"

    and bisected the failure to the backport of commit 53a59fc67f97 ("mm:
    limit mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT").

    That commit itself is not actually buggy, but what it does is to make it
    much more likely to hit the partial TLB invalidation case, since it
    introduces a new case in tlb_next_batch() that previously only ever
    happened when running out of memory.

    The real bug is that the TLB gather virtual memory range setup is subtly
    buggered. It was introduced in commit 597e1c3580b7 ("mm/mmu_gather:
    enable tlb flush range in generic mmu_gather"), and the range handling
    was already fixed at least once in commit e6c495a96ce0 ("mm: fix the TLB
    range flushed when __tlb_remove_page() runs out of slots"), but that fix
    was not complete.

    The problem with the TLB gather virtual address range is that it isn't
    set up by the initial tlb_gather_mmu() initialization (which didn't get
    the TLB range information), but it is set up ad-hoc later by the
    functions that actually flush the TLB. And so any such case that forgot
    to update the TLB range entries would potentially miss TLB invalidates.

    Rather than try to figure out exactly which particular ad-hoc range
    setup was missing (I personally suspect it's the hugetlb case in
    zap_huge_pmd(), which didn't have the same logic as zap_pte_range()
    did), this patch just gets rid of the problem at the source: make the
    TLB range information available to tlb_gather_mmu(), and initialize it
    when initializing all the other tlb gather fields.

    This makes the patch larger, but conceptually much simpler. And the end
    result is much more understandable; even if you want to play games with
    partial ranges when invalidating the TLB contents in chunks, now the
    range information is always there, and anybody who doesn't want to
    bother with it won't introduce subtle bugs.
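
    Conceptually, the interface change looks like this (illustrative
    signatures):

    /* before: the range was filled in ad-hoc by whoever flushed */
    tlb_gather_mmu(&tlb, mm, fullmm);

    /* after: the virtual address range is part of initialization */
    tlb_gather_mmu(&tlb, mm, start, end);
    unmap_vmas(&tlb, vma, start, end);
    tlb_finish_mmu(&tlb, start, end);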

    Ben verified that this fixes his problem.

    Reported-bisected-and-tested-by: Ben Tebulin
    Build-testing-by: Stephen Rothwell
    Build-testing-by: Richard Weinberger
    Reviewed-by: Michal Hocko
    Acked-by: Peter Zijlstra
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

01 Aug, 2013

1 commit

  • vma_adjust() does vma_set_policy(vma, vma_policy(next)) and this
    is doubly wrong:

    1. This leaks vma->vm_policy if it is not NULL and not equal to
    next->vm_policy.

    This can happen if vma_merge() expands "area", not prev (case 8).

    2. This sets the wrong policy if vma_merge() joins prev and area,
    area is the vma the caller needs to update and it still has the
    old policy.

    Revert commit 1444f92c8498 ("mm: merging memory blocks resets
    mempolicy") which introduced these problems.

    Change mbind_range() to recheck mpol_equal() after vma_merge() to fix
    the problem that commit tried to address.

    Signed-off-by: Oleg Nesterov
    Acked-by: KOSAKI Motohiro
    Cc: Steven T Hampson
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

11 Jul, 2013

1 commit

  • Since all architectures have been converted to use vm_unmapped_area(),
    there is no remaining use for the free_area_cache.

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Cc: "James E.J. Bottomley"
    Cc: "Luck, Tony"
    Cc: Benjamin Herrenschmidt
    Cc: David Howells
    Cc: Helge Deller
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Paul Mackerras
    Cc: Richard Henderson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

10 May, 2013

1 commit

  • Dave reported an oops triggered by trinity:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    IP: newseg+0x10d/0x390
    PGD cf8c1067 PUD cf8c2067 PMD 0
    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    CPU: 2 PID: 7636 Comm: trinity-child2 Not tainted 3.9.0+#67
    ...
    Call Trace:
    ipcget+0x182/0x380
    SyS_shmget+0x5a/0x60
    tracesys+0xdd/0xe2

    This bug was introduced by commit af73e4d9506d ("hugetlbfs: fix mmap
    failure in unaligned size request").

    Reported-by: Dave Jones
    Cc:
    Signed-off-by: Li Zefan
    Reviewed-by: Naoya Horiguchi
    Acked-by: Rik van Riel
    Signed-off-by: Linus Torvalds

    Li Zefan
     

08 May, 2013

1 commit

  • The current kernel returns -EINVAL unless a given mmap length is
    "almost" hugepage aligned. This is because in sys_mmap_pgoff() the
    given length is passed to vm_mmap_pgoff() as it is without being aligned
    with hugepage boundary.

    This is a regression introduced in commit 40716e29243d ("hugetlbfs: fix
    alignment of huge page requests"), where alignment code is pushed into
    hugetlb_file_setup() and the variable len in caller side is not changed.

    To fix this, this patch partially reverts that commit and adds
    alignment code on the caller side. It also introduces hstate_sizelog()
    in order to get the proper hstate for the specified hugepage size.
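
    A sketch of the new helper (hedged):

    static inline struct hstate *hstate_sizelog(int page_size_log)
    {
        if (!page_size_log)
            return &default_hstate;     /* no explicit size requested */
        return size_to_hstate(1UL << page_size_log);
    }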

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=56881

    [akpm@linux-foundation.org: fix warning when CONFIG_HUGETLB_PAGE=n]
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Johannes Weiner
    Reported-by:
    Cc: Steven Truelove
    Cc: Jianguo Wu
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

30 Apr, 2013

7 commits

  • Fix a corner case for MAP_FIXED when the requested mapping length is
    larger than the rlimit for virtual memory. In such a case any
    overlapping mappings are unmapped before we check for the limit and
    return ENOMEM.

    The check is moved before the loop that unmaps overlapping parts of
    existing mappings. When we are about to hit the limit (currently mapped
    pages + len > limit) we scan for overlapping pages and check again
    accounting for them.
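
    The shape of the check, roughly (hedged sketch; count_vma_pages_range()
    tallies pages inside already-mapped overlapping regions):

    /* Check against RLIMIT_AS *before* unmapping anything. For MAP_FIXED,
     * pages we are about to replace don't count toward the new total. */
    if (!may_expand_vm(mm, len >> PAGE_SHIFT)) {
        unsigned long nr_pages;

        if (!(vm_flags & MAP_FIXED))
            return -ENOMEM;

        nr_pages = count_vma_pages_range(mm, addr, addr + len);
        if (!may_expand_vm(mm, (len >> PAGE_SHIFT) - nr_pages))
            return -ENOMEM;
    }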

    This fixes the situation where a userspace program expects the
    previous mappings to be preserved after the mmap() syscall has
    returned with an error. (POSIX clearly states that a successful
    mapping shall replace any previous mappings.)

    This corner case was found and can be tested with LTP testcase:

    testcases/open_posix_testsuite/conformance/interfaces/mmap/24-2.c

    In this case the mmap, which is clearly over current limit, unmaps
    dynamic libraries and the testcase segfaults right after returning into
    userspace.

    I've also looked at the second instance of the unmapping loop in
    do_brk(). The do_brk() is called from the brk() syscall and from
    vm_brk(). The brk() syscall checks for overlapping mappings and bails
    out when there are any (so it can't be triggered from the brk
    syscall). The vm_brk() is called only from binfmt handlers, so it
    shouldn't be triggered unless a binfmt handler created overlapping
    mappings.

    Signed-off-by: Cyril Hrubis
    Reviewed-by: Mel Gorman
    Reviewed-by: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyril Hrubis
     
  • Alter the admin and user reserves of the previous patches in this series
    when memory is added or removed.

    If memory is added and the reserves have been eliminated or increased
    above the default max, then we'll trust the admin.

    If memory is removed and there isn't enough free memory, then we need to
    reset the reserves.

    Otherwise keep the reserve set by the admin.

    The reserve reset code is the same as the reserve initialization code.

    I tested hot addition and removal by triggering it via sysfs. The
    reserves shrunk when they were set high and memory was removed. They
    were reset higher when memory was added again.

    [akpm@linux-foundation.org: use register_hotmemory_notifier()]
    [akpm@linux-foundation.org: init_user_reserve() and init_admin_reserve can no longer be __meminit]
    [fengguang.wu@intel.com: make init_reserve_notifier() static]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Andrew Shewmaker
    Signed-off-by: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Shewmaker
     
  • Add an admin_reserve_kbytes knob to allow admins to change the hardcoded
    memory reserve to something other than 3%, which may be multiple
    gigabytes on large memory systems. Only about 8MB is necessary to
    enable recovery in the default mode, and only a few hundred MB are
    required even when overcommit is disabled.

    This affects OVERCOMMIT_GUESS and OVERCOMMIT_NEVER.

    admin_reserve_kbytes is initialized to min(3% free pages, 8MB)

    I arrived at 8MB by summing the RSS of sshd or login, bash, and top.
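
    Usage sketch, in line with the sizes discussed:

    $ sudo sysctl -w vm.admin_reserve_kbytes=8192     # ~8MB, overcommit 'guess'
    $ sudo sysctl -w vm.admin_reserve_kbytes=131072   # ~128MB, overcommit 'never'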

    Please see first patch in this series for full background, motivation,
    testing, and full changelog.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: make init_admin_reserve() static]
    Signed-off-by: Andrew Shewmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Shewmaker
     
  • Add user_reserve_kbytes knob.

    Limit the growth of the memory reserved for other user processes to
    min(3% current process size, user_reserve_kbytes). Only about 8MB is
    necessary to enable recovery in the default mode, and only a few
    hundred MB are required even when overcommit is disabled.

    user_reserve_kbytes defaults to min(3% free pages, 128MB)

    I arrived at 128MB by taking the max VSZ of sshd, login, bash, and top ...
    then adding the RSS of each.

    This only affects OVERCOMMIT_NEVER mode.

    Background

    1. user reserve

    __vm_enough_memory reserves a hardcoded 3% of the current process size for
    other applications when overcommit is disabled. This was done so that a
    user could recover if they launched a memory hogging process. Without the
    reserve, a user would easily run into a message such as:

    bash: fork: Cannot allocate memory

    2. admin reserve

    Additionally, a hardcoded 3% of free memory is reserved for root in
    both overcommit 'guess' and 'never' modes. This was intended to
    prevent a scenario where root can't log in and perform recovery
    operations.

    Note that this reserve shrinks, and doesn't guarantee a useful reserve.

    Motivation

    The two hardcoded memory reserves should be updated to account for current
    memory sizes.

    Also, the admin reserve would be more useful if it didn't shrink too much.

    When the current code was originally written, 1GB was considered
    "enterprise". Now the 3% reserve can grow to multiple GB on large memory
    systems, and it only needs to be a few hundred MB at most to enable a user
    or admin to recover a system with an unwanted memory hogging process.

    I've found that reducing these reserves is especially beneficial for a
    specific type of application load:

    * single application system
    * one or few processes (e.g. one per core)
    * allocating all available memory
    * not initializing every page immediately
    * long running

    I've run scientific clusters with this sort of load. A long running job
    sometimes failed many hours (weeks of CPU time) into a calculation. They
    weren't initializing all of their memory immediately, and they weren't
    using calloc, so I put systems into overcommit 'never' mode. These
    clusters run diskless and have no swap.

    However, with the current reserves, a user wishing to allocate as much
    memory as possible to one process may be prevented from using, for
    example, almost 2GB out of 32GB.

    The effect is less, but still significant when a user starts a job with
    one process per core. I have repeatedly seen a set of processes
    requesting the same amount of memory fail because one of them could not
    allocate the amount of memory a user would expect to be able to allocate.
    For example, Message Passing Interface (MPI) processes, one per core. And
    it is similar for other parallel programming frameworks.

    Changing this reserve code will make the overcommit never mode more useful
    by allowing applications to allocate nearly all of the available memory.

    Also, the new admin_reserve_kbytes will be safer than the current behavior
    since the hardcoded 3% of available memory reserve can shrink to something
    useless in the case where applications have grabbed all available memory.

    Risks

    * "bash: fork: Cannot allocate memory"

    The downside of the first patch-- which creates a tunable user reserve
    that is only used in overcommit 'never' mode--is that an admin can set
    it so low that a user may not be able to kill their process, even if
    they already have a shell prompt.

    Of course, a user can get in the same predicament with the current 3%
    reserve--they just have to launch processes until 3% becomes negligible.

    * root-cant-log-in problem

    The second patch, adding the tunable admin_reserve_kbytes, allows
    the admin to shoot themselves in the foot by setting it too small.
    They can easily get the system into a state where root can't log in.

    However, the new admin_reserve_kbytes will be safer than the current
    behavior since the hardcoded 3% of available memory reserve can shrink
    to something useless in the case where applications have grabbed all
    available memory.

    Alternatives

    * Memory cgroups provide a more flexible way to limit application memory.

    Not everyone wants to set up cgroups or deal with their overhead.

    * We could create a fourth overcommit mode which provides smaller reserves.

    The size of useful reserves may be drastically different depending on
    whether the system is embedded or enterprise.

    * Force users to initialize all of their memory or use calloc.

    Some users don't want/expect the system to overcommit when they malloc.
    Overcommit 'never' mode is for this scenario, and it should work well.

    The new user and admin reserve tunables are simple to use, with low
    overhead compared to cgroups. The patches preserve current behavior where
    3% of memory is less than 128MB, except that the admin reserve doesn't
    shrink to an unusable size under pressure. The code allows admins to tune
    for embedded and enterprise usage.

    FAQ

    * How is the root-cant-login problem addressed?
    What happens if admin_reserve_kbytes is set to 0?

    Root is free to shoot themselves in the foot by setting
    admin_reserve_kbytes too low.

    On x86_64, the minimum useful reserve is:
    8MB for overcommit 'guess'
    128MB for overcommit 'never'

    admin_reserve_kbytes defaults to min(3% free memory, 8MB)

    So, anyone switching to 'never' mode needs to adjust
    admin_reserve_kbytes.

    * How do you calculate a minimum useful reserve?

    A user or the admin needs enough memory to login and perform
    recovery operations, which includes, at a minimum:

    sshd or login + bash (or some other shell) + top (or ps, kill, etc.)

    For overcommit 'guess', we can sum resident set sizes (RSS)
    because we only need enough memory to handle what the recovery
    programs will typically use. On x86_64 this is about 8MB.

    For overcommit 'never', we can take the max of their virtual sizes
    (VSZ) and add the sum of their RSS. We use VSZ instead of RSS because
    this mode forces us to ensure we can fulfill all of the requested
    memory allocations -- even if the programs only use a fraction of what
    they ask for. On x86_64 this is about 128MB.

    When swap is enabled, reserves are useful even when they are as
    small as 10MB, regardless of overcommit mode.

    When both swap and overcommit are disabled, the admin should tune the
    reserves higher to be absolutely safe. Over 230MB each was safest in
    my testing.

    * What happens if user_reserve_kbytes is set to 0?

    Note, this only affects overcommit 'never' mode.

    Then a user will be able to allocate all available memory minus
    admin_reserve_kbytes.

    However, they will easily see a message such as:

    "bash: fork: Cannot allocate memory"

    And they won't be able to recover/kill their application.
    The admin should be able to recover the system if
    admin_reserve_kbytes is set appropriately.

    * What's the difference between overcommit 'guess' and 'never'?

    "Guess" allows an allocation if there are enough free + reclaimable
    pages. It has a hardcoded 3% of free pages reserved for root.

    "Never" allows an allocation if there is enough swap + a configurable
    percentage (default is 50) of physical RAM. It has a hardcoded 3% of
    free pages reserved for root, like "Guess" mode. It also has a
    hardcoded 3% of the current process size reserved for additional
    applications.

    * Why is overcommit 'guess' not suitable even when an app eventually
    writes to every page? It takes free pages, file pages, available
    swap pages, and reclaimable slab pages into consideration. In other
    words, all of these pages are available, so why isn't it suitable?

    Because it only looks at the present state of the system. It
    does not take into account the memory that other applications have
    malloced, but haven't initialized yet. It overcommits the system.

    Test Summary

    There was little change in behavior in the default overcommit 'guess'
    mode with swap enabled before and after the patch. This was expected.

    Systems run most predictably (i.e. no oom kills) in overcommit 'never'
    mode with swap enabled. This also allowed the most memory to be allocated
    to a user application.

    Overcommit 'guess' mode without swap is a bad idea. It is easy to
    crash the system. None of the other tested combinations crashed.
    This matches my experience on the Roadrunner supercomputer.

    Without the tunable user reserve, a system in overcommit 'never' mode
    and without swap does not allow the user to recover, although the
    admin can.

    With the new tunable reserves, a system in overcommit 'never' mode
    and without swap can be configured to:

    1. maximize user-allocatable memory, running close to the edge of
    recoverability

    2. maximize recoverability, sacrificing allocatable memory to
    ensure that a user cannot take down a system

    Test Description

    Fedora 18 VM - 4 x86_64 cores, 5725MB RAM, 4GB Swap

    System is booted into multiuser console mode, with unnecessary services
    turned off. Caches were dropped before each test.

    Hogs are user memtester processes that attempt to allocate all free memory
    as reported by /proc/meminfo

    In overcommit 'never' mode, memory_ratio=100

    Test Results

    3.9.0-rc1-mm1

    Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
    ----------   ----   ----   -------------   ----   -------------   --------------
    guess        yes    1      5432/5432       no     yes             yes
    guess        yes    4      5444/5444       1      yes             yes
    guess        no     1      5302/5449       no     yes             yes
    guess        no     4      -               crash  no              no

    never        yes    1      5460/5460       1      yes             yes
    never        yes    4      5460/5460       1      yes             yes
    never        no     1      5218/5432       no     no              yes
    never        no     4      5203/5448       no     no              yes

    3.9.0-rc1-mm1-tunablereserves

    User and Admin Recovery show their respective reserves, if applicable.

    Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
    ----------   ----   ----   -------------   ----   -------------   --------------
    guess        yes    1      5419/5419       no     -     yes       8MB   yes
    guess        yes    4      5436/5436       1      -     yes       8MB   yes
    guess        no     1      5440/5440       *      -     yes       8MB   yes
    guess        no     4      -               crash  -     no        8MB   no

    * process would successfully mlock, then the oom killer would pick it

    never        yes    1      5446/5446       no     10MB  yes       20MB  yes
    never        yes    4      5456/5456       no     10MB  yes       20MB  yes
    never        no     1      5387/5429       no     128MB no        8MB   barely
    never        no     1      5323/5428       no     226MB barely    8MB   barely
    never        no     1      5323/5428       no     226MB barely    8MB   barely

    never        no     1      5359/5448       no     10MB  no        10MB  barely

    never        no     1      5323/5428       no     0MB   no        10MB  barely
    never        no     1      5332/5428       no     0MB   no        50MB  yes
    never        no     1      5293/5429       no     0MB   no        90MB  yes

    never        no     1      5001/5427       no     230MB yes       338MB yes
    never        no     4*     4998/5424       no     230MB yes       338MB yes

    * more memtesters were launched, able to allocate approximately another 100MB

    Future Work

    - Test larger memory systems.

    - Test an embedded image.

    - Test other architectures.

    - Time malloc microbenchmarks.

    - Would it be useful to be able to set overcommit policy for
    each memory cgroup?

    - Some lines are slightly above 80 chars.
    Perhaps define a macro to convert between pages and kb?
    Other places in the kernel do this.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: make init_user_reserve() static]
    Signed-off-by: Andrew Shewmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Shewmaker
     
  • Using mbind to change the mempolicy to MPOL_BIND on several adjacent
    mmapped blocks may result in a reset of the mempolicy to MPOL_DEFAULT in
    vma_adjust.

    Test code. Correct result is three lines containing "OK".

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <numaif.h>
    #include <errno.h>

    /* gcc mbind_test.c -lnuma -o mbind_test -Wall */
    #define MAXNODE 4096

    void allocate()
    {
        int ret;
        int len;
        int policy = -1;
        unsigned char *p;
        unsigned long mask[MAXNODE] = { 0 };
        unsigned long retmask[MAXNODE] = { 0 };

        len = getpagesize() * 0x2fc00;
        p = mmap(NULL, len, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
                 -1, 0);
        if (p == MAP_FAILED)
            printf("mmap err: %d\n", errno);

        mask[0] = 1;
        ret = mbind(p, len, MPOL_BIND, mask, MAXNODE, 0);
        if (ret < 0)
            printf("mbind err: %d %d\n", ret, errno);
        ret = get_mempolicy(&policy, retmask, MAXNODE, p, MPOL_F_ADDR);
        if (ret < 0)
            printf("get_mempolicy err: %d %d\n", ret, errno);

        if (policy == MPOL_BIND)
            printf("OK\n");
        else
            printf("ERROR: policy is %d\n", policy);
    }

    int main()
    {
        allocate();
        allocate();
        allocate();
        return 0;
    }

    Signed-off-by: Steven T Hampson
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hampson, Steven T
     
  • On architectures where a pgd entry may be shared between user and kernel
    (e.g. ARM+LPAE), freeing page tables needs a ceiling other than 0.
    This patch introduces a generic USER_PGTABLES_CEILING that arch code can
    override. It is the responsibility of the arch code setting the ceiling
    to ensure the complete freeing of the page tables (usually in
    pgd_free()).
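
    The generic fallback then looks like this, with arch code free to
    override it:

    #ifndef USER_PGTABLES_CEILING
    #define USER_PGTABLES_CEILING 0UL   /* no ceiling needed by default */
    #endif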

    [catalin.marinas@arm.com: commit log; shift_arg_pages(), asm-generic/pgtables.h changes]
    Signed-off-by: Hugh Dickins
    Signed-off-by: Catalin Marinas
    Cc: Russell King
    Cc: [3.3+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove the WARN_ON_ONCE(!mm) check as the comment suggested. Kernel
    code calls find_vma only when it is absolutely sure that the mm_struct
    arg to it is non-NULL.

    Signed-off-by: Zhang Yanfei
    Cc: k80c
    Cc: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     

05 Apr, 2013

1 commit

  • find_vma() can be called by multiple threads with the read lock
    held on mm->mmap_sem, and any of them can update mm->mmap_cache.
    Prevent the compiler from re-fetching mm->mmap_cache, because other
    readers could update it in the meantime:

    thread 1                              thread 2
                                          |
    find_vma()                            |  find_vma()
      struct vm_area_struct *vma = NULL;  |
      vma = mm->mmap_cache;               |
      if (!(vma && vma->vm_end > addr     |
          && vma->vm_start <= addr)) {    |
                                          |    mm->mmap_cache = vma;
      return vma;                         |
       ^^ compiler may optimize this      |
          local variable out and re-read  |
          mm->mmap_cache                  |

    This issue can be reproduced with gcc-4.8.0-1 on s390x by running
    mallocstress testcase from LTP, which triggers:

    kernel BUG at mm/rmap.c:1088!
    Call Trace:
    ([] 0x3d100c57000)
    [] do_wp_page+0x2fc/0xa88
    [] handle_pte_fault+0x41a/0xac8
    [] handle_mm_fault+0x17a/0x268
    [] do_protection_exception+0x1e2/0x394
    [] pgm_check_handler+0x138/0x13c
    [] 0x3fffcf1f07a
    Last Breaking-Event-Address:
    [] page_add_new_anon_rmap+0xc2/0x168

    Thanks to Jakub Jelinek for his insight on gcc and helping to
    track this down.
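
    The fix forces a single read of the shared pointer, so the validity
    check and the return use the same value:

    /* read mm->mmap_cache exactly once; the compiler may not re-fetch */
    vma = ACCESS_ONCE(mm->mmap_cache);
    if (!(vma && vma->vm_end > addr && vma->vm_start <= addr)) {
        /* ... slow path: rbtree walk, then update mm->mmap_cache ... */
    }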

    Signed-off-by: Jan Stancek
    Acked-by: David Rientjes
    Signed-off-by: Hugh Dickins
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Jan Stancek
     

29 Mar, 2013

1 commit

  • This reverts commit 186930500985 ("mm: introduce VM_POPULATE flag to
    better deal with racy userspace programs").

    VM_POPULATE only has any effect when userspace plays racy games with
    vmas by trying to unmap and remap memory regions that mmap or mlock are
    operating on.

    Also, the only effect of VM_POPULATE when userspace plays such games is
    that it avoids populating new memory regions that get remapped into the
    address range that was being operated on by the original mmap or mlock
    calls.

    Let's remove VM_POPULATE as there isn't any strong argument to mandate a
    new vm_flag.

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

28 Feb, 2013

1 commit

  • The stack vma is designed to grow automatically (marked with VM_GROWSUP
    or VM_GROWSDOWN depending on architecture) when an access is made beyond
    the existing boundary. However, particularly if you have not limited
    your stack at all ("ulimit -s unlimited"), this can cause the stack to
    grow even if the access was really just one past *another* segment.

    And that's wrong, especially since we first grow the segment, but then
    immediately later enforce the stack guard page on the last page of the
    segment. So _despite_ first growing the stack segment as a result of
    the access, the kernel will then make the access cause a SIGSEGV anyway!

    So do the same logic as the guard page check does, and consider an
    access to within one page of the next segment to be a bad access, rather
    than growing the stack to abut the next segment.

    Reported-and-tested-by: Heiko Carstens
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

27 Feb, 2013

1 commit

  • Pull vfs pile (part one) from Al Viro:
    "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
    locking violations, etc.

    The most visible changes here are death of FS_REVAL_DOT (replaced with
    "has ->d_weak_revalidate()") and a new helper getting from struct file
    to inode. Some bits of preparation to xattr method interface changes.

    Misc patches by various people sent this cycle *and* ocfs2 fixes from
    several cycles ago that should've been upstream right then.

    PS: the next vfs pile will be xattr stuff."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
    saner proc_get_inode() calling conventions
    proc: avoid extra pde_put() in proc_fill_super()
    fs: change return values from -EACCES to -EPERM
    fs/exec.c: make bprm_mm_init() static
    ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
    ocfs2: fix possible use-after-free with AIO
    ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
    get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
    target: writev() on single-element vector is pointless
    export kernel_write(), convert open-coded instances
    fs: encode_fh: return FILEID_INVALID if invalid fid_type
    kill f_vfsmnt
    vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
    nfsd: handle vfs_getattr errors in acl protocol
    switch vfs_getattr() to struct path
    default SET_PERSONALITY() in linux/elf.h
    ceph: prepopulate inodes only when request is aborted
    d_hash_and_lookup(): export, switch open-coded instances
    9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
    9p: split dropping the acls from v9fs_set_create_acl()
    ...

    Linus Torvalds