03 Jul, 2015

1 commit

  • …scm/linux/kernel/git/paulg/linux

    Pull module_init replacement part two from Paul Gortmaker:
    "Replace module_init with appropriate alternate initcall in non
    modules.

    This series converts non-modular code that is using the module_init()
    call to hook itself into the system to instead use one of our
    alternate priority initcalls.

    Unlike the previous series that used device_initcall and hence was a
    runtime no-op, these commits change to one of the alternate initcalls,
    because (a) we have them and (b) it seems like the right thing to do.

    For example, it would seem logical to use arch_initcall for arch
    specific setup code and fs_initcall for filesystem setup code.

This does mean, however, that changes in the init ordering will be
taking place, and so there is a small risk that some kind of implicit
init-ordering issue may be uncovered. But I think it is still better
to give these ones sensible priorities than to assign them all to
device_initcall just to preserve the old ordering exactly.

That said, we have already made similar changes in core kernel code in
commit c96d6660dc65 ("kernel: audit/fix non-modular users of
module_init in core code") without any regressions reported, so this
type of change isn't without precedent. It has also had the same
local testing and linux-next coverage as all the other pull requests
I'm sending for this merge window.

    Once again, there is an unused module_exit function removal that shows
    up as an outlier upon casual inspection of the diffstat"

    * tag 'module_init-alternate_initcall-v4.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux:
    x86: perf_event_intel_pt.c: use arch_initcall to hook in enabling
    x86: perf_event_intel_bts.c: use arch_initcall to hook in enabling
    mm/page_owner.c: use late_initcall to hook in enabling
    lib/list_sort: use late_initcall to hook in self tests
    arm: use subsys_initcall in non-modular pl320 IPC code
    powerpc: don't use module_init for non-modular core hugetlb code
    powerpc: use subsys_initcall for Freescale Local Bus
    x86: don't use module_init for non-modular core bootflag code
    netfilter: don't use module_init/exit in core IPV4 code
    fs/notify: don't use module_init for non-modular inotify_user code
    mm: replace module_init usages with subsys_initcall in nommu.c

    Linus Torvalds
     

25 Jun, 2015

1 commit

kenter/kleave/kdebug are wrapper macros to print function flow and debug
information. This set was written before pr_devel() was introduced, so it
was controlled by an "#if 0" construct. It is questionable whether anyone
is using them [1] now.

This patch removes these macros, converts numerous printk(KERN_WARNING
...) calls to the generic pr_warn(...), and removes a debug print line
from the validate_mmap_request() function.

    Signed-off-by: Leon Romanovsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Leon Romanovsky
     

17 Jun, 2015

1 commit

  • Compiling some arm/m68k configs with "# CONFIG_MMU is not set" reveals
    two more instances of module_init being used for code that can't
    possibly be modular, as CONFIG_MMU is either on or off.

    We replace them with subsys_initcall as per what was done in other
    mmu-enabled code.

    Note that direct use of __initcall is discouraged, vs. one of the
    priority categorized subgroups. As __initcall gets mapped onto
    device_initcall, our use of subsys_initcall (which makes sense for these
    files) will thus change this registration from level 6-device to level
    4-subsys (i.e. slightly earlier).

    One might think that core_initcall (l2) or postcore_initcall (l3) would
    be more appropriate for anything in mm/ but if we look at the actual init
    functions themselves, we see they are just sysctl setup stuff, and
    hence the choice of subsys_initcall (l4) seems reasonable. At the same
    time it minimizes the risk of changing the priority too drastically all
    at once. We can adjust further in the future.

    Also, a couple instances of missing ";" at EOL are fixed.
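
    The level ordering described above can be mimicked in ordinary userspace C
    with GCC constructor priorities; this is only an analogy (the function
    names below are made up), but it shows the same lower-level-runs-first
    behavior that moving a registration from device to subsys level relies on:

    ```c
    #include <stdio.h>

    /* GCC runs constructors in ascending priority order, mirroring how the
     * kernel walks initcall levels: a lower level runs earlier. The names
     * here are illustrative, not real kernel functions. */
    static char order[8];
    static int n;

    __attribute__((constructor(104))) static void subsys_setup(void) { order[n++] = 's'; }
    __attribute__((constructor(106))) static void device_setup(void) { order[n++] = 'd'; }
    __attribute__((constructor(107))) static void late_setup(void)   { order[n++] = 'l'; }

    int main(void)
    {
        /* subsys-level setup ran before device-level, which ran before late */
        printf("%s\n", order);
        return 0;
    }
    ```

    Compiled with gcc or clang, this prints the constructors in priority
    order, which is exactly the property the reordered initcalls depend on.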

    Cc: Andrew Morton
    Cc: linux-mm@kvack.org
    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

12 Apr, 2015

1 commit


13 Mar, 2015

1 commit

Several modules may need max_mapnr, so export it. The related errors
with allmodconfig under c6x:

    MODPOST 3327 modules
    ERROR: "max_mapnr" [fs/pstore/ramoops.ko] undefined!
    ERROR: "max_mapnr" [drivers/media/v4l2-core/videobuf2-dma-contig.ko] undefined!

    Signed-off-by: Chen Gang
    Cc: Mark Salter
    Cc: Aurelien Jacquiot
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    gchen gchen
     

01 Mar, 2015

1 commit

  • Maxime reported the following memory leak regression due to commit
    dbc8358c7237 ("mm/nommu: use alloc_pages_exact() rather than its own
    implementation").

    On v3.19, I am facing a memory leak. Each time I run a command one page
    is lost. Here an example with busybox's free command:

    / # free
    total used free shared buffers cached
    Mem: 7928 1972 5956 0 0 492
    -/+ buffers/cache: 1480 6448
    / # free
    total used free shared buffers cached
    Mem: 7928 1976 5952 0 0 492
    -/+ buffers/cache: 1484 6444
    / # free
    total used free shared buffers cached
    Mem: 7928 1980 5948 0 0 492
    -/+ buffers/cache: 1488 6440
    / # free
    total used free shared buffers cached
    Mem: 7928 1984 5944 0 0 492
    -/+ buffers/cache: 1492 6436
    / # free
    total used free shared buffers cached
    Mem: 7928 1988 5940 0 0 492
    -/+ buffers/cache: 1496 6432

At some point, the system fails to satisfy 256KB allocations:

    free: page allocation failure: order:6, mode:0xd0
    CPU: 0 PID: 67 Comm: free Not tainted 3.19.0-05389-gacf2cf1-dirty #64
    Hardware name: STM32 (Device Tree Support)
    show_stack+0xb/0xc
    warn_alloc_failed+0x97/0xbc
    __alloc_pages_nodemask+0x295/0x35c
    __get_free_pages+0xb/0x24
    alloc_pages_exact+0x19/0x24
    do_mmap_pgoff+0x423/0x658
    vm_mmap_pgoff+0x3f/0x4e
    load_flat_file+0x20d/0x4f8
    load_flat_binary+0x3f/0x26c
    search_binary_handler+0x51/0xe4
    do_execveat_common+0x271/0x35c
    do_execve+0x19/0x1c
    ret_fast_syscall+0x1/0x4a
    Mem-info:
    Normal per-cpu:
    CPU 0: hi: 0, btch: 1 usd: 0
    active_anon:0 inactive_anon:0 isolated_anon:0
    active_file:0 inactive_file:0 isolated_file:0
    unevictable:123 dirty:0 writeback:0 unstable:0
    free:1515 slab_reclaimable:17 slab_unreclaimable:139
    mapped:0 shmem:0 pagetables:0 bounce:0
    free_cma:0
    Normal free:6060kB min:352kB low:440kB high:528kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:492kB isolated(anon):0ks
    lowmem_reserve[]: 0 0
    Normal: 23*4kB (U) 22*8kB (U) 24*16kB (U) 23*32kB (U) 23*64kB (U) 23*128kB (U) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6060kB
    123 total pagecache pages
    2048 pages of RAM
    1538 free pages
    66 reserved pages
    109 slab pages
    -46 pages shared
    0 pages swap cached
    nommu: Allocation of length 221184 from process 67 (free) failed
    Normal per-cpu:
    CPU 0: hi: 0, btch: 1 usd: 0
    active_anon:0 inactive_anon:0 isolated_anon:0
    active_file:0 inactive_file:0 isolated_file:0
    unevictable:123 dirty:0 writeback:0 unstable:0
    free:1515 slab_reclaimable:17 slab_unreclaimable:139
    mapped:0 shmem:0 pagetables:0 bounce:0
    free_cma:0
    Normal free:6060kB min:352kB low:440kB high:528kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:492kB isolated(anon):0ks
    lowmem_reserve[]: 0 0
    Normal: 23*4kB (U) 22*8kB (U) 24*16kB (U) 23*32kB (U) 23*64kB (U) 23*128kB (U) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6060kB
    123 total pagecache pages
    Unable to allocate RAM for process text/data, errno 12 SEGV

This problem happens because in some cases we allocate a high-order page
through __get_free_pages() in do_mmap_private(), but then try to free
individual pages rather than the ordered page in free_page_series(). In
this case, pages whose refcount is not 0 won't be freed to the
page allocator, so a memory leak happens.

To fix the problem, this patch changes __get_free_pages() to
alloc_pages_exact(), since alloc_pages_exact() returns
physically contiguous pages of which each page is individually refcounted.
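
    As a rough userspace illustration (plain arithmetic, not kernel code) of
    why order-based allocation over-allocates: an order-N request rounds up to
    2^N pages, while alloc_pages_exact() returns only as many pages as needed.
    The 221184-byte request below matches the order:6 failure in the log above:

    ```c
    #include <stdio.h>

    int main(void)
    {
        unsigned long len = 221184;                /* failing request from the log */
        unsigned long pages = (len + 4095) / 4096; /* 54 pages actually needed */

        int order = 0;
        while ((1UL << order) < pages)
            order++;                               /* rounds up to order 6 */

        printf("pages=%lu order=%d allocated=%lu\n",
               pages, order, 1UL << order);
        return 0;
    }
    ```

    So an order-6 allocation hands back 64 contiguous pages for a 54-page
    request; freeing those 64 pages one by one is what went wrong here.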

    Fixes: dbc8358c7237 ("mm/nommu: use alloc_pages_exact() rather than its own implementation").
    Reported-by: Maxime Coquelin
    Tested-by: Maxime Coquelin
    Signed-off-by: Joonsoo Kim
    Cc: [3.19]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

13 Feb, 2015

1 commit

  • Pull backing device changes from Jens Axboe:
    "This contains a cleanup of how the backing device is handled, in
    preparation for a rework of the life time rules. In this part, the
    most important change is to split the unrelated nommu mmap flags from
    it, but also removing a backing_dev_info pointer from the
    address_space (and inode), and a cleanup of other various minor bits.

    Christoph did all the work here, I just fixed an oops with pages that
    have a swap backing. Arnd fixed a missing export, and Oleg killed the
    lustre backing_dev_info from staging. Last patch was from Al,
    unexporting parts that are now no longer needed outside"

    * 'for-3.20/bdi' of git://git.kernel.dk/linux-block:
    Make super_blocks and sb_lock static
    mtd: export new mtd_mmap_capabilities
    fs: make inode_to_bdi() handle NULL inode
    staging/lustre/llite: get rid of backing_dev_info
    fs: remove default_backing_dev_info
    fs: don't reassign dirty inodes to default_backing_dev_info
    nfs: don't call bdi_unregister
    ceph: remove call to bdi_unregister
    fs: remove mapping->backing_dev_info
    fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info
    nilfs2: set up s_bdi like the generic mount_bdev code
    block_dev: get bdev inode bdi directly from the block device
    block_dev: only write bdev inode on close
    fs: introduce f_op->mmap_capabilities for nommu mmap support
    fs: kill BDI_CAP_SWAP_BACKED
    fs: deduplicate noop_backing_dev_info

    Linus Torvalds
     

12 Feb, 2015

3 commits

  • I noticed that "allowed" can easily overflow by falling below 0, because
    (total_vm / 32) can be larger than "allowed". The problem occurs in
    OVERCOMMIT_NONE mode.

    In this case, a huge allocation can success and overcommit the system
    (despite OVERCOMMIT_NONE mode). All subsequent allocations will fall
    (system-wide), so system become unusable.

    The problem was masked out by commit c9b1d0981fcc
    ("mm: limit growth of 3% hardcoded other user reserve"),
    but it's easy to reproduce it on older kernels:
    1) set overcommit_memory sysctl to 2
    2) mmap() large file multiple times (with VM_SHARED flag)
    3) try to malloc() large amount of memory

It can also be reproduced on newer kernels, but a misconfigured
sysctl_user_reserve_kbytes is required.

    Fix this issue by switching to signed arithmetic here.
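
    A minimal userspace sketch of the bug (illustrative numbers, not the
    kernel's actual accounting): subtracting a larger reserve from an unsigned
    "allowed" wraps to a huge value that passes the limit check, whereas
    signed arithmetic goes properly negative:

    ```c
    #include <stdio.h>

    /* "reserve" plays the role of total_vm / 32, which can exceed what the
     * overcommit policy would otherwise allow. */
    int main(void)
    {
        unsigned long allowed = 100;
        unsigned long reserve = 4000;

        unsigned long u = allowed - reserve;      /* unsigned: wraps to a huge value */
        long s = (long)allowed - (long)reserve;   /* signed: goes properly negative */

        printf("wrapped above allowed: %d\n", u > allowed);
        printf("signed result: %ld\n", s);
        return 0;
    }
    ```

    The wrapped unsigned result looks like an enormous remaining allowance,
    which is exactly why the huge allocation slips through.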

    Signed-off-by: Roman Gushchin
    Cc: Andrew Shewmaker
    Cc: Rik van Riel
    Cc: Konstantin Khlebnikov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
Some callers (like KVM) may want to set gup_flags like FOLL_HWPOISON
to get a proper -EHWPOISON retval instead of -EFAULT, to take a more
appropriate action if get_user_pages runs into a memory failure.

    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Kirill A. Shutemov
    Cc: Andres Lagar-Cavilla
    Cc: Peter Feiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
FAULT_FLAG_ALLOW_RETRY allows the page fault to drop the mmap_sem for
reading, to reduce the mmap_sem contention (for writing), such as while
waiting for I/O completion. The problem is that right now practically no
get_user_pages call uses FAULT_FLAG_ALLOW_RETRY, so we're not leveraging
that nifty feature.

Andres fixed it for the KVM page fault. However, get_user_pages_fast
remains uncovered, and 99% of other get_user_pages callers aren't using it
either (the only exception being FOLL_NOWAIT in KVM, which is really
nonblocking and in fact doesn't even release the mmap_sem).

So this patchset extends the optimization Andres did in the KVM page
fault to the whole kernel. It makes the most important places (including
gup_fast) use FAULT_FLAG_ALLOW_RETRY to reduce the mmap_sem hold times
during I/O.

The few places that remain uncovered are drivers like v4l and other
exceptions that tend to work on their own memory rather than on random
user memory (unlike, for example, O_DIRECT, which uses gup_fast and is
fully covered by this patch).

A follow-up patch should probably also add a printk_once warning to
get_user_pages, which should become obsolete and be phased out eventually.
The "vmas" parameter of get_user_pages makes it fundamentally incompatible
with FAULT_FLAG_ALLOW_RETRY (the vmas array becomes meaningless the moment
the mmap_sem is released).

    While this is just an optimization, this becomes an absolute requirement
    for the userfaultfd feature http://lwn.net/Articles/615086/ .

Userfaultfd allows blocking the page fault, and in order to do so I
need to drop the mmap_sem first. So this patch also ensures that, for all
memory where userfaultfd could be registered by KVM, the very first fault
(no matter if it is a regular page fault or a get_user_pages) always has
FAULT_FLAG_ALLOW_RETRY set. The userfaultfd then blocks, and is woken
only when the pagetable is already mapped. The second fault attempt after
the wakeup doesn't need FAULT_FLAG_ALLOW_RETRY, so it's ok to retry
without it.

    This patch (of 5):

    We can leverage the VM_FAULT_RETRY functionality in the page fault paths
    better by using either get_user_pages_locked or get_user_pages_unlocked.

    The former allows conversion of get_user_pages invocations that will have
    to pass a "&locked" parameter to know if the mmap_sem was dropped during
    the call. Example from:

    down_read(&mm->mmap_sem);
    do_something();
    get_user_pages(tsk, mm, ..., pages, NULL);
    up_read(&mm->mmap_sem);

    to:

    int locked = 1;

    down_read(&mm->mmap_sem);
    do_something();
    get_user_pages_locked(tsk, mm, ..., pages, &locked);
    if (locked)
        up_read(&mm->mmap_sem);

    The latter is suitable only as a drop in replacement of the form:

    down_read(&mm->mmap_sem);
    get_user_pages(tsk, mm, ..., pages, NULL);
    up_read(&mm->mmap_sem);

    into:

    get_user_pages_unlocked(tsk, mm, ..., pages);

Where tsk, mm, the intermediate "..." parameters and "pages" can be any
value as before. Just the last parameter of get_user_pages (vmas) must be
NULL for get_user_pages_locked|unlocked to be usable (the latter original
form wouldn't have been safe anyway if vmas wasn't NULL; for the former we
just make it explicit by dropping the parameter).

    If vmas is not NULL these two methods cannot be used.

    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Andres Lagar-Cavilla
    Reviewed-by: Peter Feiner
    Reviewed-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

11 Feb, 2015

1 commit

remap_file_pages(2) was invented to efficiently map parts of a huge
file into a limited 32-bit virtual address space, such as in database
workloads.

Nonlinear mappings are a pain to support, and it seems there are no
legitimate use-cases nowadays, since 64-bit systems are widely available.

Let's drop it and get rid of all this special-cased code.

The patch replaces the syscall with emulation which creates a new VMA on
each remap_file_pages(), unless it can be merged with an adjacent
one.

    I didn't find *any* real code that uses remap_file_pages(2) to test
    emulation impact on. I've checked Debian code search and source of all
    packages in ALT Linux. No real users: libc wrappers, mentions in
    strace, gdb, valgrind and this kind of stuff.

    There are few basic tests in LTP for the syscall. They work just fine
    with emulation.

To test the performance impact, I've written a small test case which
demonstrates pretty much the worst-case scenario: map a 4G shmfs file,
write the page's pgoff to the beginning of every page, remap the pages in
reverse order, then read every page.

The test creates 1 million VMAs if emulation is in use, so I had to
set vm.max_map_count to 1100000 to avoid -ENOMEM.

    Before: 23.3 ( +- 4.31% ) seconds
    After: 43.9 ( +- 0.85% ) seconds
    Slowdown: 1.88x

    I believe we can live with that.

    Test case:

    #define _GNU_SOURCE
    #include <assert.h>
    #include <stdlib.h>
    #include <stdio.h>
    #include <sys/mman.h>

    #define MB (1024UL * 1024)
    #define SIZE (4096 * MB)

    int main(int argc, char **argv)
    {
        unsigned long *p;
        long i, pass;

        for (pass = 0; pass < 10; pass++) {
            p = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED) {
                perror("mmap");
                return -1;
            }

            for (i = 0; i < SIZE / 4096; i++)
                p[i * 4096 / sizeof(*p)] = i;

            for (i = 0; i < SIZE / 4096; i++) {
                if (remap_file_pages(p + i * 4096 / sizeof(*p), 4096,
                                     0, (SIZE - 4096 * (i + 1)) >> 12, 0)) {
                    perror("remap_file_pages");
                    return -1;
                }
            }

            for (i = SIZE / 4096 - 1; i >= 0; i--)
                assert(p[i * 4096 / sizeof(*p)] == SIZE / 4096 - i - 1);

            munmap(p, SIZE);
        }

        return 0;
    }

    [akpm@linux-foundation.org: fix spello]
    [sasha.levin@oracle.com: initialize populate before usage]
    [sasha.levin@oracle.com: grab file ref to prevent race while mmaping]
    Signed-off-by: "Kirill A. Shutemov"
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Dave Jones
    Cc: Linus Torvalds
    Cc: Armin Rigo
    Signed-off-by: Sasha Levin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

06 Feb, 2015

1 commit

  • The symbol 'high_memory' is provided on both MMU- and NOMMU-kernels, but
    only one of them is exported, which leads to module build errors in
    drivers that work fine built-in:

    ERROR: "high_memory" [drivers/net/virtio_net.ko] undefined!
    ERROR: "high_memory" [drivers/net/ppp/ppp_mppe.ko] undefined!
    ERROR: "high_memory" [drivers/mtd/nand/nand.ko] undefined!
    ERROR: "high_memory" [crypto/tcrypt.ko] undefined!
    ERROR: "high_memory" [crypto/cts.ko] undefined!

    This exports the symbol to get these to work on NOMMU as well.

    Signed-off-by: Arnd Bergmann
    Cc: Kirill A. Shutemov
    Acked-by: Greg Ungerer
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     

21 Jan, 2015

1 commit

  • Since "BDI: Provide backing device capability information [try #3]" the
    backing_dev_info structure also provides flags for the kind of mmap
    operation available in a nommu environment, which is entirely unrelated
    to it's original purpose.

    Introduce a new nommu-only file operation to provide this information to
    the nommu mmap code instead. Splitting this from the backing_dev_info
    structure allows to remove lots of backing_dev_info instance that aren't
    otherwise needed, and entirely gets rid of the concept of providing a
    backing_dev_info for a character device. It also removes the need for
    the mtd_inodefs filesystem.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Tejun Heo
    Acked-by: Brian Norris
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

14 Dec, 2014

3 commits

do_mmap_private() in nommu.c tries to allocate physically contiguous pages
of arbitrary size in some cases, and we now have a good abstraction that
does exactly the same thing, alloc_pages_exact(). So, change it to use that.

There is no functional change. This is a preparation step for accurately
supporting the page owner feature.

    Signed-off-by: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Dave Hansen
    Cc: Michal Nazarewicz
    Cc: Jungsoo Son
    Cc: Ingo Molnar
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
Shrinking/truncate logic can call nommu_shrink_inode_mappings() to verify
that any shared mappings of the inode in question aren't broken (dead
zone). AFAICT the only user is ramfs, to handle the size-change
attribute.

Pretty much a no-brainer to share the lock.

    Signed-off-by: Davidlohr Bueso
    Acked-by: "Kirill A. Shutemov"
    Acked-by: Hugh Dickins
    Cc: Oleg Nesterov
    Acked-by: Peter Zijlstra (Intel)
    Cc: Rik van Riel
    Cc: Srikar Dronamraju
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Convert all open coded mutex_lock/unlock calls to the
    i_mmap_[lock/unlock]_write() helpers.

    Signed-off-by: Davidlohr Bueso
    Acked-by: Rik van Riel
    Acked-by: "Kirill A. Shutemov"
    Acked-by: Hugh Dickins
    Cc: Oleg Nesterov
    Acked-by: Peter Zijlstra (Intel)
    Cc: Srikar Dronamraju
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

08 Sep, 2014

1 commit

The percpu allocator now supports an allocation mask. Add @gfp to
percpu_counter_init() so that !GFP_KERNEL allocation masks can be used
with percpu_counters too.

    We could have left percpu_counter_init() alone and added
    percpu_counter_init_gfp(); however, the number of users isn't that
    high and introducing _gfp variants to all percpu data structures would
    be quite ugly, so let's just do the conversion. This is the one with
    the most users. Other percpu data structures are a lot easier to
    convert.

    This patch doesn't make any functional difference.

    Signed-off-by: Tejun Heo
    Acked-by: Jan Kara
    Acked-by: "David S. Miller"
    Cc: x86@kernel.org
    Cc: Jens Axboe
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andrew Morton

    Tejun Heo
     

09 Aug, 2014

1 commit

  • The core mm code will provide a default gate area based on
    FIXADDR_USER_START and FIXADDR_USER_END if
    !defined(__HAVE_ARCH_GATE_AREA) && defined(AT_SYSINFO_EHDR).

    This default is only useful for ia64. arm64, ppc, s390, sh, tile, 64-bit
    UML, and x86_32 have their own code just to disable it. arm, 32-bit UML,
    and x86_64 have gate areas, but they have their own implementations.

    This gets rid of the default and moves the code into ia64.

    This should save some code on architectures without a gate area: it's now
    possible to inline the gate_area functions in the default case.

    Signed-off-by: Andy Lutomirski
    Acked-by: Nathan Lynch
    Acked-by: H. Peter Anvin
    Acked-by: Benjamin Herrenschmidt [in principle]
    Acked-by: Richard Weinberger [for um]
    Acked-by: Will Deacon [for arm64]
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Chris Metcalf
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Nathan Lynch
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     

24 Jun, 2014

1 commit

The mm could be removed from the current task struct, so use the previous
vma->vm_mm instead.

Blackfin crashes after updating to Linux 3.15. The commit "mm:
per-thread vma caching" caused the crash: the mm can be removed from the
current task struct before

mmput() ->
exit_mmap() ->
delete_vma_from_mm()

    the detailed fault information:

    NULL pointer access
    Kernel OOPS in progress
    Deferred Exception context
    CURRENT PROCESS:
    COMM=modprobe PID=278 CPU=0
    invalid mm
    return address: [0x000531de]; contents of:
    0x000531b0: c727 acea 0c42 181d 0000 0000 0000 a0a8
    0x000531c0: b090 acaa 0c42 1806 0000 0000 0000 a0e8
    0x000531d0: b0d0 e801 0000 05b3 0010 e522 0046 [a090]
    0x000531e0: 6408 b090 0c00 17cc 3042 e3ff f37b 2fc8

    CPU: 0 PID: 278 Comm: modprobe Not tainted 3.15.0-ADI-2014R1-pre-00345-gea9f446 #25
    task: 0572b720 ti: 0569e000 task.ti: 0569e000
    Compiled for cpu family 0x27fe (Rev 0), but running on:0x0000 (Rev 0)
    ADSP-BF609-0.0 500(MHz CCLK) 125(MHz SCLK) (mpu off)
    Linux version 3.15.0-ADI-2014R1-pre-00345-gea9f446 (steven@steven-OptiPlex-390) (gcc version 4.3.5 (ADI-trunk/svn-5962) ) #25 Tue Jun 10 17:47:46 CST 2014

    SEQUENCER STATUS: Not tainted
    SEQSTAT: 00000027 IPEND: 8008 IMASK: ffff SYSCFG: 2806
    EXCAUSE : 0x27
    physical IVG3 asserted : { _trap + 0x0 }
    physical IVG15 asserted : { _evt_system_call + 0x0 }
    logical irq 6 mapped : { _bfin_coretmr_interrupt + 0x0 }
    logical irq 7 mapped : { _bfin_fault_routine + 0x0 }
    logical irq 11 mapped : { _l2_ecc_err + 0x0 }
    logical irq 13 mapped : { _bfin_fault_routine + 0x0 }
    logical irq 39 mapped : { _bfin_twi_interrupt_entry + 0x0 }
    logical irq 40 mapped : { _bfin_twi_interrupt_entry + 0x0 }
    RETE: /* Maybe null pointer? */
    RETN: /* kernel dynamic memory (maybe user-space) */
    RETX: /* Maybe fixed code section */
    RETS: { _exit_mmap + 0x28 }
    PC : { _delete_vma_from_mm + 0x92 }
    DCPLB_FAULT_ADDR: /* Maybe null pointer? */
    ICPLB_FAULT_ADDR: { _delete_vma_from_mm + 0x92 }
    PROCESSOR STATE:
    R0 : 00000004 R1 : 0569e000 R2 : 00bf3db4 R3 : 00000000
    R4 : 057f9800 R5 : 00000001 R6 : 0569ddd0 R7 : 0572b720
    P0 : 0572b854 P1 : 00000004 P2 : 00000000 P3 : 0569dda0
    P4 : 0572b720 P5 : 0566c368 FP : 0569fe5c SP : 0569fd74
    LB0: 057f523f LT0: 057f523e LC0: 00000000
    LB1: 0005317c LT1: 00053172 LC1: 00000002
    B0 : 00000000 L0 : 00000000 M0 : 0566f5bc I0 : 00000000
    B1 : 00000000 L1 : 00000000 M1 : 00000000 I1 : ffffffff
    B2 : 00000001 L2 : 00000000 M2 : 00000000 I2 : 00000000
    B3 : 00000000 L3 : 00000000 M3 : 00000000 I3 : 057f8000
    A0.w: 00000000 A0.x: 00000000 A1.w: 00000000 A1.x: 00000000
    USP : 056ffcf8 ASTAT: 02003024

    Hardware Trace:
    0 Target : { _trap_c + 0x0 }
    Source : { _exception_to_level5 + 0xa0 } JUMP.L
    1 Target : { _exception_to_level5 + 0x0 }
    Source : { _bfin_return_from_exception + 0x6 } RTX
    2 Target : { _bfin_return_from_exception + 0x0 }
    Source : { _ex_trap_c + 0x70 } JUMP.S
    3 Target : { _ex_trap_c + 0x0 }
    Source : { _trap + 0x2a } JUMP (P4)
    4 Target : { _trap + 0x0 }
    FAULT : { _delete_vma_from_mm + 0x92 } P0 = W[P2 + 2]
    Source : { _delete_vma_from_mm + 0x8e } P2 = [P4 + 0x18]
    5 Target : { _delete_vma_from_mm + 0x8e }
    Source : { _delete_vma_from_mm + 0x2a } IF CC JUMP pcrel
    6 Target : { _delete_vma_from_mm + 0x0 }
    Source : { _exit_mmap + 0x24 } JUMP.L
    7 Target : { _exit_mmap + 0x1c }
    Source : { _exit_mmap + 0x38 } IF !CC JUMP pcrel (BP)
    8 Target : { _exit_mmap + 0x34 }
    Source : { __cond_resched + 0x20 } RTS
    9 Target : { __cond_resched + 0x0 }
    Source : { _exit_mmap + 0x30 } JUMP.L
    10 Target : { _exit_mmap + 0x30 }
    Source : { _delete_vma + 0xb2 } RTS
    11 Target : { _delete_vma + 0xac }
    Source : { _kmem_cache_free + 0xba } RTS
    12 Target : { _kmem_cache_free + 0xa8 }
    Source : { _kmem_cache_free + 0x9e } IF !CC JUMP pcrel (BP)
    13 Target : { _kmem_cache_free + 0x92 }
    Source : { _kmem_cache_free + 0x5a } IF CC JUMP pcrel
    14 Target : { _kmem_cache_free + 0x34 }
    Source : { _kmem_cache_free + 0xe } IF CC JUMP pcrel (BP)
    15 Target : { _kmem_cache_free + 0x0 }
    Source : { _delete_vma + 0xa8 } JUMP.L
    Kernel Stack
    Stack info:
    SP: [0x0569ff24] /* kernel dynamic memory (maybe user-space) */
    Memory from 0x0569ff20 to 056a0000
    0569ff20: 00000001 [04e8da5a] 00008000 00000000 00000000 056a0000 04e8da5a 04e8da5a
    0569ff40: 04eb9eea ffa00dce 02003025 04ea09c5 057f523f 04ea09c4 057f523e 00000000
    0569ff60: 00000000 00000000 00000000 00000000 00000000 00000000 00000001 00000000
    0569ff80: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
    0569ffa0: 0566f5bc 057f8000 057f8000 00000001 04ec0170 056ffcf8 056ffd04 057f9800
    0569ffc0: 04d1d498 057f9800 057f8fe4 057f8ef0 00000001 057f928c 00000001 00000001
    0569ffe0: 057f9800 00000000 00000008 00000007 00000001 00000001 00000001
    Return addresses in stack:
    address : { _show_cpuinfo + 0x2d2 }
    Modules linked in:
    Kernel panic - not syncing: Kernel exception
    [ end Kernel panic - not syncing: Kernel exception

    Signed-off-by: Steven Miao
    Acked-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Acked-by: David Rientjes
    Cc: [3.15.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Miao
     

07 Jun, 2014

1 commit

  • printk is meant to be used with an associated log level. There are some
    instances of printk scattered around the mm code where the log level is
    missing. Add a log level and adhere to suggestions by
    scripts/checkpatch.pl by moving to the pr_* macros.

    Also add the typical pr_fmt definition so that print statements can be
    easily traced back to the modules where they occur, correlated one with
    another, etc. This will require the removal of some (now redundant)
    prefixes on a few print statements.
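
    A userspace sketch of the pr_fmt pattern the patch adds (the "nommu: "
    prefix and the macro bodies here are illustrative stand-ins, not the
    kernel's actual definitions): pr_fmt() is defined before the pr_* macros,
    so every message is automatically tagged with its module's prefix.

    ```c
    #include <stdio.h>

    /* pr_fmt() must be defined before the pr_* macros that expand it. */
    #define pr_fmt(fmt) "nommu: " fmt

    #define pr_info(fmt, ...) printf(pr_fmt(fmt), ##__VA_ARGS__)
    #define pr_warn(fmt, ...) fprintf(stderr, pr_fmt(fmt), ##__VA_ARGS__)

    int main(void)
    {
        /* the prefix is prepended at compile time, per translation unit */
        pr_info("mapping %d pages\n", 4);
        return 0;
    }
    ```

    This is why the patch can drop hand-written prefixes from individual
    print statements: the macro adds the prefix once, consistently.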

    Signed-off-by: Mitchel Humpherys
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mitchel Humpherys
     

08 Apr, 2014

4 commits

  • Signed-off-by: Choi Gi-yong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Choi Gi-yong
     
To increase compiler portability there is <linux/compiler.h>, which
provides convenience macros for various gcc constructs. Eg: __weak for
__attribute__((weak)). I've replaced all instances of gcc attributes with
the right macro in the memory management (/mm) subsystem.
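
    A minimal sketch of what one such convenience macro expands to, assuming a
    gcc-compatible compiler (the function name is made up for illustration):

    ```c
    #include <stdio.h>

    /* __weak expands to the gcc attribute, so another object file can
     * override this default definition at link time. */
    #define __weak __attribute__((weak))

    __weak int platform_hook(void)
    {
        return 0;   /* default used when no strong override is linked in */
    }

    int main(void)
    {
        printf("platform_hook() = %d\n", platform_hook());
        return 0;
    }
    ```

    Linking in a second file with a non-weak platform_hook() would silently
    replace the default, which is the point of the attribute.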

    [akpm@linux-foundation.org: while-we're-there consistency tweaks]
    Signed-off-by: Gideon Israel Dsouza
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gideon Israel Dsouza
     
This patch is a continuation of efforts to optimize find_vma(),
avoiding potentially expensive rbtree walks to locate a vma upon faults.
The original approach (https://lkml.org/lkml/2013/11/1/410), where the
largest vma was also cached, ended up being too specific and random,
so further comparison with other approaches was needed. There are
two things to consider when dealing with this: the cache hit rate and
the latency of find_vma(). Improving the hit rate does not necessarily
translate into finding the vma any faster, as the overhead of any fancy
caching scheme can be too high to consider.

    We currently cache the last used vma for the whole address space, which
    provides a nice optimization, reducing the total cycles in find_vma() by
    up to 250%, for workloads with good locality. On the other hand, this
    simple scheme is pretty much useless for workloads with poor locality.
    Analyzing ebizzy runs shows that, no matter how many threads are
    running, the mmap_cache hit rate is less than 2%, and in many situations
    below 1%.

    The proposed approach is to replace this scheme with a small per-thread
    cache, maximizing hit rates at a very low maintenance cost.
    Invalidations are performed by simply bumping up a 32-bit sequence
    number. The only expensive operation is in the rare case of a seq
    number overflow, where all caches that share the same address space are
    flushed. Upon a miss, the proposed replacement policy is based on the
    page number that contains the virtual address in question. Concretely,
    the following results are seen on an 80 core, 8 socket x86-64 box:

    1) System bootup: Most programs are single threaded, so the per-thread
    scheme improves on the ~50% baseline hit rate just by adding a few more
    slots to the cache.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |   50.61% |            19.90 |
    | patched        |   73.45% |            13.58 |
    +----------------+----------+------------------+

    2) Kernel build: This one is already pretty good with the current
    approach as we're dealing with good locality.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |   75.28% |            11.03 |
    | patched        |   88.09% |             9.31 |
    +----------------+----------+------------------+

    3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |   70.66% |            17.14 |
    | patched        |   91.15% |            12.57 |
    +----------------+----------+------------------+

    4) Ebizzy: There's a fair amount of variation from run to run, but this
    approach always shows nearly perfect hit rates, while the baseline's are
    just about non-existent. The cycle counts can fluctuate anywhere from
    ~60 to ~116 billion for the baseline scheme, but this approach reduces
    them considerably. For instance, with 80 threads:

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |    1.06% |            91.54 |
    | patched        |   99.97% |            14.18 |
    +----------------+----------+------------------+

    [akpm@linux-foundation.org: fix nommu build, per Davidlohr]
    [akpm@linux-foundation.org: document vmacache_valid() logic]
    [akpm@linux-foundation.org: attempt to untangle header files]
    [akpm@linux-foundation.org: add vmacache_find() BUG_ON]
    [hughd@google.com: add vmacache_valid_mm() (from Oleg)]
    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: adjust and enhance comments]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Acked-by: Linus Torvalds
    Reviewed-by: Michel Lespinasse
    Cc: Oleg Nesterov
    Tested-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • filemap_map_pages() is a generic implementation of ->map_pages() for
    filesystems that use the page cache.

    It should be safe to use filemap_map_pages() for ->map_pages() if the
    filesystem uses filemap_fault() for ->fault().
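    In practice the wiring is a one-line addition to a filesystem's
    vm_operations_struct; mm/filemap.c's generic_file_vm_ops pairs the two
    helpers exactly this way (kernel-side sketch, not buildable standalone):

```c
/* A page-cache-backed filesystem that already faults through
 * filemap_fault() can adopt the batched fault-around path too. */
static const struct vm_operations_struct generic_file_vm_ops = {
	.fault		= filemap_fault,
	.map_pages	= filemap_map_pages,	/* map nearby cached pages in one go */
};
```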

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Linus Torvalds
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Alexander Viro
    Cc: Dave Chinner
    Cc: Ning Qu
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

31 Mar, 2014

1 commit

  • As Trond pointed out, you can currently deadlock yourself by setting a
    file-private lock on a file that requires mandatory locking and then
    trying to do I/O on it.

    Avoid this problem by plumbing some knowledge of file-private locks into
    the mandatory locking code. In order to do this, we must pass down
    information about the struct file that's being used to
    locks_verify_locked.

    Reported-by: Trond Myklebust
    Signed-off-by: Jeff Layton
    Acked-by: J. Bruce Fields

    Jeff Layton
     

22 Jan, 2014

1 commit

  • Some applications that run on HPC clusters are designed around the
    availability of RAM, and the overcommit ratio is fine-tuned to get the
    maximum usage of memory without swapping. With growing memory sizes, the
    1%-of-all-RAM grain provided by overcommit_ratio has become too coarse
    for these workloads (on a 2TB machine it represents no less than 20GB).

    This patch adds the new overcommit_kbytes sysctl variable, which allows
    a much finer grain.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix nommu build]
    Signed-off-by: Jerome Marchand
    Cc: Dave Hansen
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     

13 Nov, 2013

2 commits

  • Merge first patch-bomb from Andrew Morton:
    "Quite a lot of other stuff is banked up awaiting further
    next->mainline merging, but this batch contains:

    - Lots of random misc patches
    - OCFS2
    - Most of MM
    - backlight updates
    - lib/ updates
    - printk updates
    - checkpatch updates
    - epoll tweaking
    - rtc updates
    - hfs
    - hfsplus
    - documentation
    - procfs
    - update gcov to gcc-4.7 format
    - IPC"

    * emailed patches from Andrew Morton: (269 commits)
    ipc, msg: fix message length check for negative values
    ipc/util.c: remove unnecessary work pending test
    devpts: plug the memory leak in kill_sb
    ./Makefile: export initial ramdisk compression config option
    init/Kconfig: add option to disable kernel compression
    drivers: w1: make w1_slave::flags long to avoid memory corruption
    drivers/w1/masters/ds1wm.c: use dev_get_platdata()
    drivers/memstick/core/ms_block.c: fix unreachable state in h_msb_read_page()
    drivers/memstick/core/mspro_block.c: fix attributes array allocation
    drivers/pps/clients/pps-gpio.c: remove redundant of_match_ptr
    kernel/panic.c: reduce 1 byte usage for print tainted buffer
    gcov: reuse kbasename helper
    kernel/gcov/fs.c: use pr_warn()
    kernel/module.c: use pr_foo()
    gcov: compile specific gcov implementation based on gcc version
    gcov: add support for gcc 4.7 gcov format
    gcov: move gcov structs definitions to a gcc version specific file
    kernel/taskstats.c: return -ENOMEM when alloc memory fails in add_del_listener()
    kernel/taskstats.c: add nla_nest_cancel() for failure processing between nla_nest_start() and nla_nest_end()
    kernel/sysctl_binary.c: use scnprintf() instead of snprintf()
    ...

    Linus Torvalds
     
  • The same calculation is currently done in three different places.
    Factor that code out so future changes have to be made in only one
    place.

    [akpm@linux-foundation.org: uninline vm_commit_limit()]
    Signed-off-by: Jerome Marchand
    Cc: Dave Hansen
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     

25 Oct, 2013

1 commit


11 Jul, 2013

1 commit

  • Since all architectures have been converted to use vm_unmapped_area(),
    there is no remaining use for the free_area_cache.

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Cc: "James E.J. Bottomley"
    Cc: "Luck, Tony"
    Cc: Benjamin Herrenschmidt
    Cc: David Howells
    Cc: Helge Deller
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Paul Mackerras
    Cc: Richard Henderson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

04 Jul, 2013

2 commits

  • Now all references to num_physpages have been removed, so kill it.

    Signed-off-by: Jiang Liu
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Jiang Liu
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Al Viro
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • vwrite() checks for overflow. vread() should do the same thing.

    Since vwrite() checks the source buffer address, vread() should check
    the destination buffer address.

    Signed-off-by: Chen Gang
    Cc: Al Viro
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     

01 May, 2013

1 commit

  • Pull compat cleanup from Al Viro:
    "Mostly about syscall wrappers this time; there will be another pile
    with patches in the same general area from various people, but I'd
    rather push those after both that and vfs.git pile are in."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    syscalls.h: slightly reduce the jungles of macros
    get rid of union semop in sys_semctl(2) arguments
    make do_mremap() static
    sparc: no need to sign-extend in sync_file_range() wrapper
    ppc compat wrappers for add_key(2) and request_key(2) are pointless
    x86: trim sys_ia32.h
    x86: sys32_kill and sys32_mprotect are pointless
    get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC
    merge compat sys_ipc instances
    consolidate compat lookup_dcookie()
    convert vmsplice to COMPAT_SYSCALL_DEFINE
    switch getrusage() to COMPAT_SYSCALL_DEFINE
    switch epoll_pwait to COMPAT_SYSCALL_DEFINE
    convert sendfile{,64} to COMPAT_SYSCALL_DEFINE
    switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE
    make SYSCALL_DEFINE-generated wrappers do asmlinkage_protect
    make HAVE_SYSCALL_WRAPPERS unconditional
    consolidate cond_syscall and SYSCALL_ALIAS declarations
    teach SYSCALL_DEFINE how to deal with long long/unsigned long long
    get rid of duplicate logics in __SC_....[1-6] definitions

    Linus Torvalds
     

30 Apr, 2013

3 commits

  • Add an admin_reserve_kbytes knob to allow admins to change the hardcoded
    memory reserve to something other than 3%, which may be multiple
    gigabytes on large memory systems. Only about 8MB is necessary to
    enable recovery in the default mode, and only a few hundred MB are
    required even when overcommit is disabled.

    This affects OVERCOMMIT_GUESS and OVERCOMMIT_NEVER.

    admin_reserve_kbytes is initialized to min(3% free pages, 8MB)

    I arrived at 8MB by summing the RSS of sshd or login, bash, and top.

    Please see first patch in this series for full background, motivation,
    testing, and full changelog.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: make init_admin_reserve() static]
    Signed-off-by: Andrew Shewmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Shewmaker
     
  • Add user_reserve_kbytes knob.

    Limit the growth of the memory reserved for other user processes to
    min(3% current process size, user_reserve_pages). Only about 8MB is
    necessary to enable recovery in the default mode, and only a few hundred
    MB are required even when overcommit is disabled.

    user_reserve_pages defaults to min(3% free pages, 128MB)

    I arrived at 128MB by taking the max VSZ of sshd, login, bash, and top ...
    then adding the RSS of each.

    This only affects OVERCOMMIT_NEVER mode.

    Background

    1. user reserve

    __vm_enough_memory reserves a hardcoded 3% of the current process size for
    other applications when overcommit is disabled. This was done so that a
    user could recover if they launched a memory hogging process. Without the
    reserve, a user would easily run into a message such as:

    bash: fork: Cannot allocate memory

    2. admin reserve

    Additionally, a hardcoded 3% of free memory is reserved for root in both
    overcommit 'guess' and 'never' modes. This was intended to prevent a
    scenario where root cannot log in and perform recovery operations.

    Note that this reserve shrinks, and doesn't guarantee a useful reserve.

    Motivation

    The two hardcoded memory reserves should be updated to account for current
    memory sizes.

    Also, the admin reserve would be more useful if it didn't shrink too much.

    When the current code was originally written, 1GB was considered
    "enterprise". Now the 3% reserve can grow to multiple GB on large memory
    systems, and it only needs to be a few hundred MB at most to enable a user
    or admin to recover a system with an unwanted memory hogging process.

    I've found that reducing these reserves is especially beneficial for a
    specific type of application load:

    * single application system
    * one or few processes (e.g. one per core)
    * allocating all available memory
    * not initializing every page immediately
    * long running

    I've run scientific clusters with this sort of load. A long running job
    sometimes failed many hours (weeks of CPU time) into a calculation. They
    weren't initializing all of their memory immediately, and they weren't
    using calloc, so I put systems into overcommit 'never' mode. These
    clusters run diskless and have no swap.

    However, with the current reserves, a user wishing to allocate as much
    memory as possible to one process may be prevented from using, for
    example, almost 2GB out of 32GB.

    The effect is less, but still significant when a user starts a job with
    one process per core. I have repeatedly seen a set of processes
    requesting the same amount of memory fail because one of them could not
    allocate the amount of memory a user would expect to be able to allocate.
    For example, Message Passing Interface (MPI) processes, one per core. And
    it is similar for other parallel programming frameworks.

    Changing this reserve code will make the overcommit never mode more useful
    by allowing applications to allocate nearly all of the available memory.

    Also, the new admin_reserve_kbytes will be safer than the current behavior
    since the hardcoded 3% of available memory reserve can shrink to something
    useless in the case where applications have grabbed all available memory.

    Risks

    * "bash: fork: Cannot allocate memory"

    The downside of the first patch -- which creates a tunable user reserve
    that is only used in overcommit 'never' mode -- is that an admin can set
    it so low that a user may not be able to kill their process, even if
    they already have a shell prompt.

    Of course, a user can get in the same predicament with the current 3%
    reserve--they just have to launch processes until 3% becomes negligible.

    * root-cant-log-in problem

    The second patch, adding the tunable rootuser_reserve_pages, allows
    the admin to shoot themselves in the foot by setting it too small. They
    can easily get the system into a state where root-can't-log-in.

    However, the new admin_reserve_kbytes will be safer than the current
    behavior since the hardcoded 3% of available memory reserve can shrink
    to something useless in the case where applications have grabbed all
    available memory.

    Alternatives

    * Memory cgroups provide a more flexible way to limit application memory.

    Not everyone wants to set up cgroups or deal with their overhead.

    * We could create a fourth overcommit mode which provides smaller reserves.

    The size of useful reserves may be drastically different depending
    on the whether the system is embedded or enterprise.

    * Force users to initialize all of their memory or use calloc.

    Some users don't want/expect the system to overcommit when they malloc.
    Overcommit 'never' mode is for this scenario, and it should work well.

    The new user and admin reserve tunables are simple to use, with low
    overhead compared to cgroups. The patches preserve current behavior where
    3% of memory is less than 128MB, except that the admin reserve doesn't
    shrink to an unusable size under pressure. The code allows admins to tune
    for embedded and enterprise usage.

    FAQ

    * How is the root-cant-login problem addressed?
    What happens if admin_reserve_pages is set to 0?

    Root is free to shoot themselves in the foot by setting
    admin_reserve_kbytes too low.

    On x86_64, the minimum useful reserve is:
    8MB for overcommit 'guess'
    128MB for overcommit 'never'

    admin_reserve_pages defaults to min(3% free memory, 8MB)

    So, anyone switching to 'never' mode needs to adjust
    admin_reserve_pages.

    * How do you calculate a minimum useful reserve?

    A user or the admin needs enough memory to login and perform
    recovery operations, which includes, at a minimum:

    sshd or login + bash (or some other shell) + top (or ps, kill, etc.)

    For overcommit 'guess', we can sum resident set sizes (RSS)
    because we only need enough memory to handle what the recovery
    programs will typically use. On x86_64 this is about 8MB.

    For overcommit 'never', we can take the max of their virtual sizes (VSZ)
    and add the sum of their RSS. We use VSZ instead of RSS because this
    mode forces us to ensure we can fulfill all of the requested memory
    allocations -- even if the programs only use a fraction of what they ask
    for. On x86_64 this is about 128MB.

    When swap is enabled, reserves are useful even when they are as
    small as 10MB, regardless of overcommit mode.

    When both swap and overcommit are disabled, then the admin should
    tune the reserves higher to be absolutely safe. Over 230MB each
    was safest in my testing.

    * What happens if user_reserve_pages is set to 0?

    Note, this only affects overcommit 'never' mode.

    Then a user will be able to allocate all available memory minus
    admin_reserve_kbytes.

    However, they will easily see a message such as:

    "bash: fork: Cannot allocate memory"

    And they won't be able to recover/kill their application.
    The admin should be able to recover the system if
    admin_reserve_kbytes is set appropriately.

    * What's the difference between overcommit 'guess' and 'never'?

    "Guess" allows an allocation if there are enough free + reclaimable
    pages. It has a hardcoded 3% of free pages reserved for root.

    "Never" allows an allocation if there is enough swap + a configurable
    percentage (default is 50) of physical RAM. It has a hardcoded 3% of
    free pages reserved for root, like "Guess" mode. It also has a
    hardcoded 3% of the current process size reserved for additional
    applications.

    * Why is overcommit 'guess' not suitable even when an app eventually
    writes to every page? It takes free pages, file pages, available
    swap pages, and reclaimable slab pages into consideration. In other
    words, all of these pages are available, so why isn't 'guess' suitable?

    Because it only looks at the present state of the system. It
    does not take into account the memory that other applications have
    malloced, but haven't initialized yet. It overcommits the system.

    Test Summary

    There was little change in behavior in the default overcommit 'guess'
    mode with swap enabled before and after the patch. This was expected.

    Systems run most predictably (i.e. no oom kills) in overcommit 'never'
    mode with swap enabled. This also allowed the most memory to be allocated
    to a user application.

    Overcommit 'guess' mode without swap is a bad idea. It is easy to
    crash the system. None of the other tested combinations crashed.
    This matches my experience on the Roadrunner supercomputer.

    Without the tunable user reserve, a system in overcommit 'never' mode
    and without swap does not allow the user to recover, although the
    admin can.

    With the new tunable reserves, a system in overcommit 'never' mode
    and without swap can be configured to:

    1. maximize user-allocatable memory, running close to the edge of
    recoverability

    2. maximize recoverability, sacrificing allocatable memory to
    ensure that a user cannot take down a system

    Test Description

    Fedora 18 VM - 4 x86_64 cores, 5725MB RAM, 4GB Swap

    System is booted into multiuser console mode, with unnecessary services
    turned off. Caches were dropped before each test.

    Hogs are user memtester processes that attempt to allocate all free memory
    as reported by /proc/meminfo

    In overcommit 'never' mode, memory_ratio=100

    Test Results

    3.9.0-rc1-mm1

    Overcommit | Swap | Hogs | MB Got/Wanted | OOMs  | User Recovery | Admin Recovery
    ---------- | ---- | ---- | ------------- | ----- | ------------- | --------------
    guess      | yes  |  1   |   5432/5432   |  no   |      yes      |      yes
    guess      | yes  |  4   |   5444/5444   |  1    |      yes      |      yes
    guess      | no   |  1   |   5302/5449   |  no   |      yes      |      yes
    guess      | no   |  4   |       -       | crash |      no       |      no

    never      | yes  |  1   |   5460/5460   |  1    |      yes      |      yes
    never      | yes  |  4   |   5460/5460   |  1    |      yes      |      yes
    never      | no   |  1   |   5218/5432   |  no   |      no       |      yes
    never      | no   |  4   |   5203/5448   |  no   |      no       |      yes

    3.9.0-rc1-mm1-tunablereserves

    User and Admin Recovery show their respective reserves, if applicable.

    Overcommit | Swap | Hogs | MB Got/Wanted | OOMs  |  User Recovery  | Admin Recovery
    ---------- | ---- | ---- | ------------- | ----- | --------------- | --------------
    guess      | yes  |  1   |   5419/5419   |  no   |   -     yes     |   8MB   yes
    guess      | yes  |  4   |   5436/5436   |  1    |   -     yes     |   8MB   yes
    guess      | no   |  1   |   5440/5440   |  *    |   -     yes     |   8MB   yes
    guess      | no   |  4   |       -       | crash |   -     no      |   8MB   no

    * process would successfully mlock, then the oom killer would pick it

    never      | yes  |  1   |   5446/5446   |  no   |  10MB   yes     |  20MB   yes
    never      | yes  |  4   |   5456/5456   |  no   |  10MB   yes     |  20MB   yes
    never      | no   |  1   |   5387/5429   |  no   | 128MB   no      |   8MB   barely
    never      | no   |  1   |   5323/5428   |  no   | 226MB   barely  |   8MB   barely
    never      | no   |  1   |   5323/5428   |  no   | 226MB   barely  |   8MB   barely

    never      | no   |  1   |   5359/5448   |  no   |  10MB   no      |  10MB   barely

    never      | no   |  1   |   5323/5428   |  no   |   0MB   no      |  10MB   barely
    never      | no   |  1   |   5332/5428   |  no   |   0MB   no      |  50MB   yes
    never      | no   |  1   |   5293/5429   |  no   |   0MB   no      |  90MB   yes

    never      | no   |  1   |   5001/5427   |  no   | 230MB   yes     | 338MB   yes
    never      | no   |  4*  |   4998/5424   |  no   | 230MB   yes     | 338MB   yes

    * more memtesters were launched, able to allocate approximately another 100MB

    Future Work

    - Test larger memory systems.

    - Test an embedded image.

    - Test other architectures.

    - Time malloc microbenchmarks.

    - Would it be useful to be able to set overcommit policy for
    each memory cgroup?

    - Some lines are slightly above 80 chars.
    Perhaps define a macro to convert between pages and kb?
    Other places in the kernel do this.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: make init_user_reserve() static]
    Signed-off-by: Andrew Shewmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Shewmaker
     
  • Although our intention is to unexport internal structures entirely,
    there is one exception: kexec. kexec dumps the address of vmlist, and
    makedumpfile uses this information.

    We are about to remove vmlist, so another way to retrieve information
    about the vmalloc layer is needed for makedumpfile. For this purpose,
    we export vmap_area_list instead of vmlist.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Joonsoo Kim
    Cc: Eric Biederman
    Cc: Dave Anderson
    Cc: Vivek Goyal
    Cc: Atsushi Kumagai
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Chris Metcalf
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

28 Apr, 2013

1 commit

  • I think we could just move the full vm_iomap_memory() function into
    util.h or similar, but I didn't get any reply from anybody actually
    using nommu even to this trivial patch, so I'm not going to touch it any
    more than required.

    Here's the fairly minimal stub to make the nommu case at least
    potentially work. It doesn't seem like anybody cares, though.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

05 Apr, 2013

1 commit

  • find_vma() can be called by multiple threads with the read lock
    held on mm->mmap_sem, and any of them can update mm->mmap_cache.
    Prevent the compiler from re-fetching mm->mmap_cache, because other
    readers could update it in the meantime:

    thread 1                                thread 2
                                            |
    find_vma()                              |  find_vma()
      struct vm_area_struct *vma = NULL;    |
      vma = mm->mmap_cache;                 |
      if (!(vma && vma->vm_end > addr       |
            && vma->vm_start <= addr)) {    |
                                            |    mm->mmap_cache = vma;
      return vma;                           |
       ^^ compiler may optimize this        |
          local variable out and re-read    |
          mm->mmap_cache                    |

    This issue can be reproduced with gcc-4.8.0-1 on s390x by running
    mallocstress testcase from LTP, which triggers:

    kernel BUG at mm/rmap.c:1088!
    Call Trace:
    ([] 0x3d100c57000)
    [] do_wp_page+0x2fc/0xa88
    [] handle_pte_fault+0x41a/0xac8
    [] handle_mm_fault+0x17a/0x268
    [] do_protection_exception+0x1e2/0x394
    [] pgm_check_handler+0x138/0x13c
    [] 0x3fffcf1f07a
    Last Breaking-Event-Address:
    [] page_add_new_anon_rmap+0xc2/0x168

    Thanks to Jakub Jelinek for his insight on gcc and helping to
    track this down.

    Signed-off-by: Jan Stancek
    Acked-by: David Rientjes
    Signed-off-by: Hugh Dickins
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Jan Stancek
     

04 Mar, 2013

1 commit

  • The extern in sys_sparc_64.c was a relic of the time when do_mremap()
    existed in the MMU case (it doesn't anymore). As for the !MMU one,
    nothing uses it outside of mm/nommu.c...

    Signed-off-by: Al Viro

    Al Viro
     

27 Feb, 2013

1 commit

  • Pull vfs pile (part one) from Al Viro:
    "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
    locking violations, etc.

    The most visible changes here are death of FS_REVAL_DOT (replaced with
    "has ->d_weak_revalidate()") and a new helper getting from struct file
    to inode. Some bits of preparation to xattr method interface changes.

    Misc patches by various people sent this cycle *and* ocfs2 fixes from
    several cycles ago that should've been upstream right then.

    PS: the next vfs pile will be xattr stuff."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
    saner proc_get_inode() calling conventions
    proc: avoid extra pde_put() in proc_fill_super()
    fs: change return values from -EACCES to -EPERM
    fs/exec.c: make bprm_mm_init() static
    ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
    ocfs2: fix possible use-after-free with AIO
    ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
    get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
    target: writev() on single-element vector is pointless
    export kernel_write(), convert open-coded instances
    fs: encode_fh: return FILEID_INVALID if invalid fid_type
    kill f_vfsmnt
    vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
    nfsd: handle vfs_getattr errors in acl protocol
    switch vfs_getattr() to struct path
    default SET_PERSONALITY() in linux/elf.h
    ceph: prepopulate inodes only when request is aborted
    d_hash_and_lookup(): export, switch open-coded instances
    9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
    9p: split dropping the acls from v9fs_set_create_acl()
    ...

    Linus Torvalds