13 Oct, 2016
1 commit
-
This affectively reverts commit 377ccbb48373 ("Makefile: Mute warning
for __builtin_return_address(>0) for tracing only") because it turns out
that it really isn't tracing only - it's all over the tree.We already also had the warning disabled separately for mm/usercopy.c
(which this commit also removes), and it turns out that we will also
want to disable it for get_lock_parent_ip(), that is used for at least
TRACE_IRQFLAGS. Which (when enabled) ends up being all over the tree.Steven Rostedt had a patch that tried to limit it to just the config
options that actually triggered this, but quite frankly, the extra
complexity and abstraction just isn't worth it. We have never actually
had a case where the warning is actually useful, so let's just disable
it globally and not worry about it.Acked-by: Steven Rostedt
Cc: Thomas Gleixner
Cc: Andrew Morton
Cc: Ingo Molnar
Cc: Peter Anvin
Signed-off-by: Linus Torvalds
09 Aug, 2016
1 commit
-
Pull usercopy protection from Kees Cook:
"Tbhis implements HARDENED_USERCOPY verification of copy_to_user and
copy_from_user bounds checking for most architectures on SLAB and
SLUB"* tag 'usercopy-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
mm: SLUB hardened usercopy support
mm: SLAB hardened usercopy support
s390/uaccess: Enable hardened usercopy
sparc/uaccess: Enable hardened usercopy
powerpc/uaccess: Enable hardened usercopy
ia64/uaccess: Enable hardened usercopy
arm64/uaccess: Enable hardened usercopy
ARM: uaccess: Enable hardened usercopy
x86/uaccess: Enable hardened usercopy
mm: Hardened usercopy
mm: Implement stack frame object validation
mm: Add is_migrate_cma_page
27 Jul, 2016
2 commits
-
khugepaged implementation grew to the point when it deserve separate
file in source.Let's move it to mm/khugepaged.c.
Link: http://lkml.kernel.org/r/1466021202-61880-32-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This is the start of porting PAX_USERCOPY into the mainline kernel. This
is the first set of features, controlled by CONFIG_HARDENED_USERCOPY. The
work is based on code by PaX Team and Brad Spengler, and an earlier port
from Casey Schaufler. Additional non-slab page tests are from Rik van Riel.This patch contains the logic for validating several conditions when
performing copy_to_user() and copy_from_user() on the kernel object
being copied to/from:
- address range doesn't wrap around
- address range isn't NULL or zero-allocated (with a non-zero copy size)
- if on the slab allocator:
- object size must be less than or equal to copy size (when check is
implemented in the allocator, which appear in subsequent patches)
- otherwise, object must not span page allocations (excepting Reserved
and CMA ranges)
- if on the stack
- object must not extend before/after the current process stack
- object must be contained by a valid stack frame (when there is
arch/build support for identifying stack frames)
- object must not overlap with kernel textSigned-off-by: Kees Cook
Tested-by: Valdis Kletnieks
Tested-by: Michael Ellerman
21 May, 2016
1 commit
-
This patch introduces z3fold, a special purpose allocator for storing
compressed pages. It is designed to store up to three compressed pages
per physical page. It is a ZBUD derivative which allows for higher
compression ratio keeping the simplicity and determinism of its
predecessor.This patch comes as a follow-up to the discussions at the Embedded Linux
Conference in San-Diego related to the talk [1]. The outcome of these
discussions was that it would be good to have a compressed page
allocator as stable and deterministic as zbud with with higher
compression ratio.To keep the determinism and simplicity, z3fold, just like zbud, always
stores an integral number of compressed pages per page, but it can store
up to 3 pages unlike zbud which can store at most 2. Therefore the
compression ratio goes to around 2.6x while zbud's one is around 1.7x.The patch is based on the latest linux.git tree.
This version has been updated after testing on various simulators (e.g.
ARM Versatile Express, MIPS Malta, x86_64/Haswell) and basing on
comments from Dan Streetman [3].[1] https://openiotelc2016.sched.org/event/6DAC/swapping-and-embedded-compression-relieves-the-pressure-vitaly-wool-softprise-consulting-ou
[2] https://lkml.org/lkml/2016/4/21/799
[3] https://lkml.org/lkml/2016/5/4/852Link: http://lkml.kernel.org/r/20160509151753.ec3f9fda3c9898d31ff52a32@gmail.com
Signed-off-by: Vitaly Wool
Cc: Seth Jennings
Cc: Dan Streetman
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
26 Mar, 2016
1 commit
-
Add KASAN hooks to SLAB allocator.
This patch is based on the "mm: kasan: unified support for SLUB and SLAB
allocators" patch originally prepared by Dmitry Chernenkov.Signed-off-by: Alexander Potapenko
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Cc: Andrey Konovalov
Cc: Dmitry Vyukov
Cc: Andrey Ryabinin
Cc: Steven Rostedt
Cc: Konstantin Serebryany
Cc: Dmitry Chernenkov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
23 Mar, 2016
1 commit
-
kcov provides code coverage collection for coverage-guided fuzzing
(randomized testing). Coverage-guided fuzzing is a testing technique
that uses coverage feedback to determine new interesting inputs to a
system. A notable user-space example is AFL
(http://lcamtuf.coredump.cx/afl/). However, this technique is not
widely used for kernel testing due to missing compiler and kernel
support.kcov does not aim to collect as much coverage as possible. It aims to
collect more or less stable coverage that is function of syscall inputs.
To achieve this goal it does not collect coverage in soft/hard
interrupts and instrumentation of some inherently non-deterministic or
non-interesting parts of kernel is disbled (e.g. scheduler, locking).Currently there is a single coverage collection mode (tracing), but the
API anticipates additional collection modes. Initially I also
implemented a second mode which exposes coverage in a fixed-size hash
table of counters (what Quentin used in his original patch). I've
dropped the second mode for simplicity.This patch adds the necessary support on kernel side. The complimentary
compiler support was added in gcc revision 231296.We've used this support to build syzkaller system call fuzzer, which has
found 90 kernel bugs in just 2 months:https://github.com/google/syzkaller/wiki/Found-Bugs
We've also found 30+ bugs in our internal systems with syzkaller.
Another (yet unexplored) direction where kcov coverage would greatly
help is more traditional "blob mutation". For example, mounting a
random blob as a filesystem, or receiving a random blob over wire.Why not gcov. Typical fuzzing loop looks as follows: (1) reset
coverage, (2) execute a bit of code, (3) collect coverage, repeat. A
typical coverage can be just a dozen of basic blocks (e.g. an invalid
input). In such context gcov becomes prohibitively expensive as
reset/collect coverage steps depend on total number of basic
blocks/edges in program (in case of kernel it is about 2M). Cost of
kcov depends only on number of executed basic blocks/edges. On top of
that, kernel requires per-thread coverage because there are always
background threads and unrelated processes that also produce coverage.
With inlined gcov instrumentation per-thread coverage is not possible.kcov exposes kernel PCs and control flow to user-space which is
insecure. But debugfs should not be mapped as user accessible.Based on a patch by Quentin Casasnovas.
[akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
[akpm@linux-foundation.org: unbreak allmodconfig]
[akpm@linux-foundation.org: follow x86 Makefile layout standards]
Signed-off-by: Dmitry Vyukov
Reviewed-by: Kees Cook
Cc: syzkaller
Cc: Vegard Nossum
Cc: Catalin Marinas
Cc: Tavis Ormandy
Cc: Will Deacon
Cc: Quentin Casasnovas
Cc: Kostya Serebryany
Cc: Eric Dumazet
Cc: Alexander Potapenko
Cc: Kees Cook
Cc: Bjorn Helgaas
Cc: Sasha Levin
Cc: David Drysdale
Cc: Ard Biesheuvel
Cc: Andrey Ryabinin
Cc: Kirill A. Shutemov
Cc: Jiri Slaby
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
18 Mar, 2016
1 commit
-
CMA allocation should be guaranteed to succeed by definition, but,
unfortunately, it would be failed sometimes. It is hard to track down
the problem, because it is related to page reference manipulation and we
don't have any facility to analyze it.This patch adds tracepoints to track down page reference manipulation.
With it, we can find exact reason of failure and can fix the problem.
Following is an example of tracepoint output. (note: this example is
stale version that printing flags as the number. Recent version will
print it as human readable string.)-9018 [004] 92.678375: page_ref_set: pfn=0x17ac9 flags=0x0 count=1 mapcount=0 mapping=(nil) mt=4 val=1
-9018 [004] 92.678378: kernel_stack:
=> get_page_from_freelist (ffffffff81176659)
=> __alloc_pages_nodemask (ffffffff81176d22)
=> alloc_pages_vma (ffffffff811bf675)
=> handle_mm_fault (ffffffff8119e693)
=> __do_page_fault (ffffffff810631ea)
=> trace_do_page_fault (ffffffff81063543)
=> do_async_page_fault (ffffffff8105c40a)
=> async_page_fault (ffffffff817581d8)
[snip]
-9018 [004] 92.678379: page_ref_mod: pfn=0x17ac9 flags=0x40048 count=2 mapcount=1 mapping=0xffff880015a78dc1 mt=4 val=1
[snip]
...
...
-9131 [001] 93.174468: test_pages_isolated: start_pfn=0x17800 end_pfn=0x17c00 fin_pfn=0x17ac9 ret=fail
[snip]
-9018 [004] 93.174843: page_ref_mod_and_test: pfn=0x17ac9 flags=0x40068 count=0 mapcount=0 mapping=0xffff880015a78dc1 mt=4 val=-1 ret=1
=> release_pages (ffffffff8117c9e4)
=> free_pages_and_swap_cache (ffffffff811b0697)
=> tlb_flush_mmu_free (ffffffff81199616)
=> tlb_finish_mmu (ffffffff8119a62c)
=> exit_mmap (ffffffff811a53f7)
=> mmput (ffffffff81073f47)
=> do_exit (ffffffff810794e9)
=> do_group_exit (ffffffff81079def)
=> SyS_exit_group (ffffffff81079e74)
=> entry_SYSCALL_64_fastpath (ffffffff817560b6)This output shows that problem comes from exit path. In exit path, to
improve performance, pages are not freed immediately. They are gathered
and processed by batch. During this process, migration cannot be
possible and CMA allocation is failed. This problem is hard to find
without this page reference tracepoint facility.Enabling this feature bloat kernel text 30 KB in my configuration.
text data bss dec hex filename
12127327 2243616 1507328 15878271 f2487f vmlinux_disabled
12157208 2258880 1507328 15923416 f2f8d8 vmlinux_enabledNote that, due to header file dependency problem between mm.h and
tracepoint.h, this feature has to open code the static key functions for
tracepoints. Proposed by Steven Rostedt in following link.https://lkml.org/lkml/2015/12/9/699
[arnd@arndb.de: crypto/async_pq: use __free_page() instead of put_page()]
[iamjoonsoo.kim@lge.com: fix build failure for xtensa]
[akpm@linux-foundation.org: tweak Kconfig text, per Vlastimil]
Signed-off-by: Joonsoo Kim
Acked-by: Michal Nazarewicz
Acked-by: Vlastimil Babka
Cc: Minchan Kim
Cc: Mel Gorman
Cc: "Kirill A. Shutemov"
Cc: Sergey Senozhatsky
Acked-by: Steven Rostedt
Signed-off-by: Arnd Bergmann
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
16 Mar, 2016
1 commit
-
Page poisoning is currently set up as a feature if architectures don't
have architecture debug page_alloc to allow unmapping of pages. It has
uses apart from that though. Clearing of the pages on free provides an
increase in security as it helps to limit the risk of information leaks.
Allow page poisoning to be enabled as a separate option independent of
kernel_map pages since the two features do separate work. Because of
how hiberanation is implemented, the checks on alloc cannot occur if
hibernation is enabled. The runtime alloc checks can also be enabled
with an option when !HIBERNATION.Credit to Grsecurity/PaX team for inspiring this work
Signed-off-by: Laura Abbott
Cc: Rafael J. Wysocki
Cc: "Kirill A. Shutemov"
Cc: Vlastimil Babka
Cc: Michal Hocko
Cc: Kees Cook
Cc: Mathias Krause
Cc: Dave Hansen
Cc: Jianyu Zhan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
12 Sep, 2015
1 commit
-
Pull media updates from Mauro Carvalho Chehab:
"A series of patches that move part of the code used to allocate memory
from the media subsystem to the mm subsystem"[ The mm parts have been acked by VM people, and the series was
apparently in -mm for a while - Linus ]* tag 'media/v4.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
[media] drm/exynos: Convert g2d_userptr_get_dma_addr() to use get_vaddr_frames()
[media] media: vb2: Remove unused functions
[media] media: vb2: Convert vb2_dc_get_userptr() to use frame vector
[media] media: vb2: Convert vb2_vmalloc_get_userptr() to use frame vector
[media] media: vb2: Convert vb2_dma_sg_get_userptr() to use frame vector
[media] vb2: Provide helpers for mapping virtual addresses
[media] media: omap_vout: Convert omap_vout_uservirt_to_phys() to use get_vaddr_pfns()
[media] mm: Provide new get_vaddr_frames() helper
[media] vb2: Push mmap_sem down to memops
11 Sep, 2015
1 commit
-
Knowing the portion of memory that is not used by a certain application or
memory cgroup (idle memory) can be useful for partitioning the system
efficiently, e.g. by setting memory cgroup limits appropriately.
Currently, the only means to estimate the amount of idle memory provided
by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
access bit for all pages mapped to a particular process by writing 1 to
clear_refs, wait for some time, and then count smaps:Referenced. However,
this method has two serious shortcomings:- it does not count unmapped file pages
- it affects the reclaimer logicTo overcome these drawbacks, this patch introduces two new page flags,
Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
A page's Idle flag can only be set from userspace by setting bit in
/sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page,
and it is cleared whenever the page is accessed either through page tables
(it is cleared in page_referenced() in this case) or using the read(2)
system call (mark_page_accessed()). Thus by setting the Idle flag for
pages of a particular workload, which can be found e.g. by reading
/proc/PID/pagemap, waiting for some time to let the workload access its
working set, and then reading the bitmap file, one can estimate the amount
of pages that are not used by the workload.The Young page flag is used to avoid interference with the memory
reclaimer. A page's Young flag is set whenever the Access bit of a page
table entry pointing to the page is cleared by writing to the bitmap file.
If page_referenced() is called on a Young page, it will add 1 to its
return value, therefore concealing the fact that the Access bit was
cleared.Note, since there is no room for extra page flags on 32 bit, this feature
uses extended page flags when compiled on 32 bit.[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: kpageidle requires an MMU]
[akpm@linux-foundation.org: decouple from page-flags rework]
Signed-off-by: Vladimir Davydov
Reviewed-by: Andres Lagar-Cavilla
Cc: Minchan Kim
Cc: Raghavendra K T
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Greg Thelen
Cc: Michel Lespinasse
Cc: David Rientjes
Cc: Pavel Emelyanov
Cc: Cyrill Gorcunov
Cc: Jonathan Corbet
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
05 Sep, 2015
1 commit
-
This implements mcopy_atomic and mfill_zeropage that are the lowlevel
VM methods that are invoked respectively by the UFFDIO_COPY and
UFFDIO_ZEROPAGE userfaultfd commands.Signed-off-by: Andrea Arcangeli
Acked-by: Pavel Emelyanov
Cc: Sanidhya Kashyap
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov"
Cc: Andres Lagar-Cavilla
Cc: Dave Hansen
Cc: Paolo Bonzini
Cc: Rik van Riel
Cc: Mel Gorman
Cc: Andy Lutomirski
Cc: Hugh Dickins
Cc: Peter Feiner
Cc: "Dr. David Alan Gilbert"
Cc: Johannes Weiner
Cc: "Huangpeng (Peter)"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
17 Aug, 2015
1 commit
-
Provide new function get_vaddr_frames(). This function maps virtual
addresses from given start and fills given array with page frame numbers of
the corresponding pages. If given start belongs to a normal vma, the function
grabs reference to each of the pages to pin them in memory. If start
belongs to VM_IO | VM_PFNMAP vma, we don't touch page structures. Caller
must make sure pfns aren't reused for anything else while he is using
them.This function is created for various drivers to simplify handling of
their buffers.Signed-off-by: Jan Kara
Acked-by: Mel Gorman
Acked-by: Vlastimil Babka
Acked-by: Andrew Morton
Signed-off-by: Hans Verkuil
Signed-off-by: Mauro Carvalho Chehab
15 Apr, 2015
2 commits
-
Memtest is a simple feature which fills the memory with a given set of
patterns and validates memory contents, if bad memory regions is detected
it reserves them via memblock API. Since memblock API is widely used by
other architectures this feature can be enabled outside of x86 world.This patch set promotes memtest to live under generic mm umbrella and
enables memtest feature for arm/arm64.It was reported that this patch set was useful for tracking down an issue
with some errant DMA on an arm64 platform.This patch (of 6):
There is nothing platform dependent in the core memtest code, so other
platforms might benefit from this feature too.[linux@roeck-us.net: MEMTEST depends on MEMBLOCK]
Signed-off-by: Vladimir Murzin
Acked-by: Will Deacon
Tested-by: Mark Rutland
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Cc: Catalin Marinas
Cc: Russell King
Cc: Paul Bolle
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
I've noticed that there is no interfaces exposed by CMA which would let me
fuzz what's going on in there.This small patchset exposes some information out to userspace, plus adds
the ability to trigger allocation and freeing from userspace.This patch (of 3):
Implement a simple debugfs interface to expose information about CMA areas
in the system.Useful for testing/sanity checks for CMA since it was impossible to
previously retrieve this information in userspace.Signed-off-by: Sasha Levin
Acked-by: Joonsoo Kim
Cc: Marek Szyprowski
Cc: Laura Abbott
Cc: Konrad Rzeszutek Wilk
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
18 Feb, 2015
1 commit
-
Signed-off-by: Al Viro
17 Feb, 2015
1 commit
-
All callers of get_xip_mem() are now gone. Remove checks for it,
initialisers of it, documentation of it and the only implementation of it.
Also remove mm/filemap_xip.c as it is now empty. Also remove
documentation of the long-gone get_xip_page().Signed-off-by: Matthew Wilcox
Cc: Andreas Dilger
Cc: Boaz Harrosh
Cc: Christoph Hellwig
Cc: Dave Chinner
Cc: Jan Kara
Cc: Jens Axboe
Cc: Kirill A. Shutemov
Cc: Mathieu Desnoyers
Cc: Randy Dunlap
Cc: Ross Zwisler
Cc: Theodore Ts'o
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
14 Feb, 2015
2 commits
-
With this patch kasan will be able to catch bugs in memory allocated by
slub. Initially all objects in newly allocated slab page, marked as
redzone. Later, when allocation of slub object happens, requested by
caller number of bytes marked as accessible, and the rest of the object
(including slub's metadata) marked as redzone (inaccessible).We also mark object as accessible if ksize was called for this object.
There is some places in kernel where ksize function is called to inquire
size of really allocated area. Such callers could validly access whole
allocated memory, so it should be marked as accessible.Code in slub.c and slab_common.c files could validly access to object's
metadata, so instrumentation for this files are disabled.Signed-off-by: Andrey Ryabinin
Signed-off-by: Dmitry Chernenkov
Cc: Dmitry Vyukov
Cc: Konstantin Serebryany
Signed-off-by: Andrey Konovalov
Cc: Yuri Gribov
Cc: Konstantin Khlebnikov
Cc: Sasha Levin
Cc: Christoph Lameter
Cc: Joonsoo Kim
Cc: Dave Hansen
Cc: Andi Kleen
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Kernel Address sanitizer (KASan) is a dynamic memory error detector. It
provides fast and comprehensive solution for finding use-after-free and
out-of-bounds bugs.KASAN uses compile-time instrumentation for checking every memory access,
therefore GCC > v4.9.2 required. v4.9.2 almost works, but has issues with
putting symbol aliases into the wrong section, which breaks kasan
instrumentation of globals.This patch only adds infrastructure for kernel address sanitizer. It's
not available for use yet. The idea and some code was borrowed from [1].Basic idea:
The main idea of KASAN is to use shadow memory to record whether each byte
of memory is safe to access or not, and use compiler's instrumentation to
check the shadow memory on each memory access.Address sanitizer uses 1/8 of the memory addressable in kernel for shadow
memory and uses direct mapping with a scale and offset to translate a
memory address to its corresponding shadow address.Here is function to translate address to corresponding shadow address:
unsigned long kasan_mem_to_shadow(unsigned long addr)
{
return (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET;
}where KASAN_SHADOW_SCALE_SHIFT = 3.
So for every 8 bytes there is one corresponding byte of shadow memory.
The following encoding used for each shadow byte: 0 means that all 8 bytes
of the corresponding memory region are valid for access; k (1
Acked-by: Michal Marek
Signed-off-by: Andrey Konovalov
Cc: Dmitry Vyukov
Cc: Konstantin Serebryany
Cc: Dmitry Chernenkov
Cc: Yuri Gribov
Cc: Konstantin Khlebnikov
Cc: Sasha Levin
Cc: Christoph Lameter
Cc: Joonsoo Kim
Cc: Dave Hansen
Cc: Andi Kleen
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Stephen Rothwell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
11 Feb, 2015
1 commit
-
remap_file_pages(2) was invented to be able efficiently map parts of
huge file into limited 32-bit virtual address space such as in database
workloads.Nonlinear mappings are pain to support and it seems there's no
legitimate use-cases nowadays since 64-bit systems are widely available.Let's drop it and get rid of all these special-cased code.
The patch replaces the syscall with emulation which creates new VMA on
each remap_file_pages(), unless they it can be merged with an adjacent
one.I didn't find *any* real code that uses remap_file_pages(2) to test
emulation impact on. I've checked Debian code search and source of all
packages in ALT Linux. No real users: libc wrappers, mentions in
strace, gdb, valgrind and this kind of stuff.There are few basic tests in LTP for the syscall. They work just fine
with emulation.To test performance impact, I've written small test case which
demonstrate pretty much worst case scenario: map 4G shmfs file, write to
begin of every page pgoff of the page, remap pages in reverse order,
read every page.The test creates 1 million of VMAs if emulation is in use, so I had to
set vm.max_map_count to 1100000 to avoid -ENOMEM.Before: 23.3 ( +- 4.31% ) seconds
After: 43.9 ( +- 0.85% ) seconds
Slowdown: 1.88xI believe we can live with that.
Test case:
#define _GNU_SOURCE
#include
#include
#include
#include#define MB (1024UL * 1024)
#define SIZE (4096 * MB)int main(int argc, char **argv)
{
unsigned long *p;
long i, pass;for (pass = 0; pass < 10; pass++) {
p = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
MAP_SHARED | MAP_ANONYMOUS, -1, 0);
if (p == MAP_FAILED) {
perror("mmap");
return -1;
}for (i = 0; i < SIZE / 4096; i++)
p[i * 4096 / sizeof(*p)] = i;for (i = 0; i < SIZE / 4096; i++) {
if (remap_file_pages(p + i * 4096 / sizeof(*p), 4096,
0, (SIZE - 4096 * (i + 1)) >> 12, 0)) {
perror("remap_file_pages");
return -1;
}
}for (i = SIZE / 4096 - 1; i >= 0; i--)
assert(p[i * 4096 / sizeof(*p)] == SIZE / 4096 - i - 1);munmap(p, SIZE);
}return 0;
}[akpm@linux-foundation.org: fix spello]
[sasha.levin@oracle.com: initialize populate before usage]
[sasha.levin@oracle.com: grab file ref to prevent race while mmaping]
Signed-off-by: "Kirill A. Shutemov"
Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Dave Jones
Cc: Linus Torvalds
Cc: Armin Rigo
Signed-off-by: Sasha Levin
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
14 Dec, 2014
2 commits
-
This is the page owner tracking code which is introduced so far ago. It
is resident on Andrew's tree, though, nobody tried to upstream so it
remain as is. Our company uses this feature actively to debug memory leak
or to find a memory hogger so I decide to upstream this feature.This functionality help us to know who allocates the page. When
allocating a page, we store some information about allocation in extra
memory. Later, if we need to know status of all pages, we can get and
analyze it from this stored information.In previous version of this feature, extra memory is statically defined in
struct page, but, in this version, extra memory is allocated outside of
struct page. It enables us to turn on/off this feature at boottime
without considerable memory waste.Although we already have tracepoint for tracing page allocation/free,
using it to analyze page owner is rather complex. We need to enlarge the
trace buffer for preventing overlapping until userspace program launched.
And, launched program continually dump out the trace buffer for later
analysis and it would change system behaviour with more possibility rather
than just keeping it in memory, so bad for debug.Moreover, we can use page_owner feature further for various purposes. For
example, we can use it for fragmentation statistics implemented in this
patch. And, I also plan to implement some CMA failure debugging feature
using this interface.I'd like to give the credit for all developers contributed this feature,
but, it's not easy because I don't know exact history. Sorry about that.
Below is people who has "Signed-off-by" in the patches in Andrew's tree.Contributor:
Alexander Nyberg
Mel Gorman
Dave Hansen
Minchan Kim
Michal Nazarewicz
Andrew Morton
Jungsoo SonSigned-off-by: Joonsoo Kim
Cc: Mel Gorman
Cc: Johannes Weiner
Cc: Minchan Kim
Cc: Dave Hansen
Cc: Michal Nazarewicz
Cc: Jungsoo Son
Cc: Ingo Molnar
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
When we debug something, we'd like to insert some information to every
page. For this purpose, we sometimes modify struct page itself. But,
this has drawbacks. First, it requires re-compile. This makes us
hesitate to use the powerful debug feature so development process is
slowed down. And, second, sometimes it is impossible to rebuild the
kernel due to third party module dependency. At third, system behaviour
would be largely different after re-compile, because it changes size of
struct page greatly and this structure is accessed by every part of
kernel. Keeping this as it is would be better to reproduce errornous
situation.This feature is intended to overcome above mentioned problems. This
feature allocates memory for extended data per page in certain place
rather than the struct page itself. This memory can be accessed by the
accessor functions provided by this code. During the boot process, it
checks whether allocation of huge chunk of memory is needed or not. If
not, it avoids allocating memory at all. With this advantage, we can
include this feature into the kernel in default and can avoid rebuild and
solve related problems.Until now, memcg uses this technique. But, now, memcg decides to embed
their variable to struct page itself and it's code to extend struct page
has been removed. I'd like to use this code to develop debug feature, so
this patch resurrect it.To help these things to work well, this patch introduces two callbacks for
clients. One is the need callback which is mandatory if user wants to
avoid useless memory allocation at boot-time. The other is optional, init
callback, which is used to do proper initialization after memory is
allocated. Detailed explanation about purpose of these functions is in
code comment. Please refer it.Others are completely same with previous extension code in memcg.
Signed-off-by: Joonsoo Kim
Cc: Mel Gorman
Cc: Johannes Weiner
Cc: Minchan Kim
Cc: Dave Hansen
Cc: Michal Nazarewicz
Cc: Jungsoo Son
Cc: Ingo Molnar
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
11 Dec, 2014
2 commits
-
Now that the external page_cgroup data structure and its lookup is gone,
the only code remaining in there is swap slot accounting.Rename it and move the conditional compilation into mm/Makefile.
Signed-off-by: Johannes Weiner
Acked-by: Michal Hocko
Acked-by: Vladimir Davydov
Acked-by: David S. Miller
Acked-by: KAMEZAWA Hiroyuki
Cc: "Kirill A. Shutemov"
Cc: Tejun Heo
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Memory is internally accounted in bytes, using spinlock-protected 64-bit
counters, even though the smallest accounting delta is a page. The
counter interface is also convoluted and does too many things.Introduce a new lockless word-sized page counter API, then change all
memory accounting over to it. The translation from and to bytes then only
happens when interfacing with userspace.The removed locking overhead is noticable when scaling beyond the per-cpu
charge caches - on a 4-socket machine with 144-threads, the following test
shows the performance differences of 288 memcgs concurrently running a
page fault benchmark:vanilla:
18631648.500498 task-clock (msec) # 140.643 CPUs utilized ( +- 0.33% )
1,380,638 context-switches # 0.074 K/sec ( +- 0.75% )
24,390 cpu-migrations # 0.001 K/sec ( +- 8.44% )
1,843,305,768 page-faults # 0.099 M/sec ( +- 0.00% )
50,134,994,088,218 cycles # 2.691 GHz ( +- 0.33% )
stalled-cycles-frontend
stalled-cycles-backend
8,049,712,224,651 instructions # 0.16 insns per cycle ( +- 0.04% )
1,586,970,584,979 branches # 85.176 M/sec ( +- 0.05% )
1,724,989,949 branch-misses # 0.11% of all branches ( +- 0.48% )132.474343877 seconds time elapsed ( +- 0.21% )
lockless:
12195979.037525 task-clock (msec) # 133.480 CPUs utilized ( +- 0.18% )
832,850 context-switches # 0.068 K/sec ( +- 0.54% )
15,624 cpu-migrations # 0.001 K/sec ( +- 10.17% )
1,843,304,774 page-faults # 0.151 M/sec ( +- 0.00% )
32,811,216,801,141 cycles # 2.690 GHz ( +- 0.18% )
stalled-cycles-frontend
stalled-cycles-backend
9,999,265,091,727 instructions # 0.30 insns per cycle ( +- 0.10% )
2,076,759,325,203 branches # 170.282 M/sec ( +- 0.12% )
1,656,917,214 branch-misses # 0.08% of all branches ( +- 0.55% )91.369330729 seconds time elapsed ( +- 0.45% )
On top of improved scalability, this also gets rid of the icky long long
types in the very heart of memcg, which is great for 32 bit and also makes
the code a lot more readable.Notable differences between the old and new API:
- res_counter_charge() and res_counter_charge_nofail() become
page_counter_try_charge() and page_counter_charge() resp. to match
the more common kernel naming scheme of try_do()/do()- res_counter_uncharge_until() is only ever used to cancel a local
counter and never to uncharge bigger segments of a hierarchy, so
it's replaced by the simpler page_counter_cancel()- res_counter_set_limit() is replaced by page_counter_limit(), which
expects its callers to serialize against themselves- res_counter_memparse_write_strategy() is replaced by
page_counter_limit(), which rounds down to the nearest page size -
rather than up. This is more reasonable for explicitely requested
hard upper limits.- to keep charging light-weight, page_counter_try_charge() charges
speculatively, only to roll back if the result exceeds the limit.
Because of this, a failing bigger charge can temporarily lock out
smaller charges that would otherwise succeed. The error is bounded
to the difference between the smallest and the biggest possible
charge size, so for memcg, this means that a failing THP charge can
send base page charges into reclaim upto 2MB (4MB) before the limit
would have been reached. This should be acceptable.[akpm@linux-foundation.org: add includes for WARN_ON_ONCE and memparse]
[akpm@linux-foundation.org: add includes for WARN_ON_ONCE, memparse, strncmp, and PAGE_SIZE]
Signed-off-by: Johannes Weiner
Acked-by: Michal Hocko
Acked-by: Vladimir Davydov
Cc: Tejun Heo
Cc: David Rientjes
Cc: Stephen Rothwell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
12 Oct, 2014
1 commit
-
Pull tinification fix from Josh "Paper Bag" Triplett:
"Fixup to use PATCHv2 of 'mm: Support compiling out madvise and
fadvise'"* tag 'tiny/no-advice-fixup-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/josh/linux:
mm: Support fadvise without CONFIG_MMU
11 Oct, 2014
1 commit
-
Commit d3ac21cacc24790eb45d735769f35753f5b56ceb ("mm: Support compiling
out madvise and fadvise") incorrectly made fadvise conditional on
CONFIG_MMU. (The merged branch unintentionally incorporated v1 of the
patch rather than the fixed v2.) Apply the delta from v1 to v2, to
allow fadvise without CONFIG_MMU.Reported-by: Johannes Weiner
Signed-off-by: Josh Triplett
10 Oct, 2014
2 commits
-
Always mark pages with PageBalloon even if balloon compaction is disabled
and expose this mark in /proc/kpageflags as KPF_BALLOON.Also this patch adds three counters into /proc/vmstat: "balloon_inflate",
"balloon_deflate" and "balloon_migrate". They accumulate balloon
activity. Current size of balloon is (balloon_inflate - balloon_deflate)
pages.All generic balloon code now gathered under option CONFIG_MEMORY_BALLOON.
It should be selected by ballooning driver which wants use this feature.
Currently virtio-balloon is the only user.Signed-off-by: Konstantin Khlebnikov
Cc: Rafael Aquini
Cc: Andrey Ryabinin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
dump_page() and dump_vma() are not specific to page_alloc.c, move them out
so page_alloc.c won't turn into the unofficial debug repository.Signed-off-by: Sasha Levin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
18 Aug, 2014
1 commit
-
Many embedded systems will not need these syscalls, and omitting them
saves space. Add a new EXPERT config option CONFIG_ADVISE_SYSCALLS
(default y) to support compiling them out.bloat-o-meter:
add/remove: 0/3 grow/shrink: 0/0 up/down: 0/-2250 (-2250)
function old new delta
sys_fadvise64 57 - -57
sys_fadvise64_64 691 - -691
sys_madvise 1502 - -1502Signed-off-by: Josh Triplett
07 Aug, 2014
2 commits
-
Add zpool api.
zpool provides an interface for memory storage, typically of compressed
memory. Users can select what backend to use; currently the only
implementations are zbud, a low density implementation with up to two
compressed pages per storage page, and zsmalloc, a higher density
implementation with multiple compressed pages per storage page.Signed-off-by: Dan Streetman
Tested-by: Seth Jennings
Cc: Minchan Kim
Cc: Nitin Gupta
Cc: Weijie Yang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Currently, there are two users on CMA functionality, one is the DMA
subsystem and the other is the KVM on powerpc. They have their own code
to manage CMA reserved area even if they looks really similar. From my
guess, it is caused by some needs on bitmap management. KVM side wants
to maintain bitmap not for 1 page, but for more size. Eventually it use
bitmap where one bit represents 64 pages.When I implement CMA related patches, I should change those two places
to apply my change and it seem to be painful to me. I want to change
this situation and reduce future code management overhead through this
patch.This change could also help developer who want to use CMA in their new
feature development, since they can use CMA easily without copying &
pasting this reserved area management code.In previous patches, we have prepared some features to generalize CMA
reserved area management and now it's time to do it. This patch moves
core functions to mm/cma.c and change DMA APIs to use these functions.There is no functional change in DMA APIs.
Signed-off-by: Joonsoo Kim
Acked-by: Michal Nazarewicz
Acked-by: Zhang Yanfei
Acked-by: Minchan Kim
Reviewed-by: Aneesh Kumar K.V
Cc: Alexander Graf
Cc: Aneesh Kumar K.V
Cc: Gleb Natapov
Acked-by: Marek Szyprowski
Tested-by: Marek Szyprowski
Cc: Paolo Bonzini
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
05 Jun, 2014
1 commit
-
mm/memory.c is overloaded: over 4k lines. get_user_pages() code is
pretty much self-contained let's move it to separate file.No other changes made.
Signed-off-by: Kirill A. Shutemov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
20 May, 2014
1 commit
-
Continue moving some of the block files that are scattered around.
bounce.c contains only code for bouncing the contents of a bio.
It's block proper code, not mm code.Suggested-by: Ming Lei
Signed-off-by: Jens Axboe
13 Apr, 2014
1 commit
-
Pull vfs updates from Al Viro:
"The first vfs pile, with deep apologies for being very late in this
window.Assorted cleanups and fixes, plus a large preparatory part of iov_iter
work. There's a lot more of that, but it'll probably go into the next
merge window - it *does* shape up nicely, removes a lot of
boilerplate, gets rid of locking inconsistencie between aio_write and
splice_write and I hope to get Kent's direct-io rewrite merged into
the same queue, but some of the stuff after this point is having
(mostly trivial) conflicts with the things already merged into
mainline and with some I want more testing.This one passes LTP and xfstests without regressions, in addition to
usual beating. BTW, readahead02 in ltp syscalls testsuite has started
giving failures since "mm/readahead.c: fix readahead failure for
memoryless NUMA nodes and limit readahead pages" - might be a false
positive, might be a real regression..."* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
missing bits of "splice: fix racy pipe->buffers uses"
cifs: fix the race in cifs_writev()
ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
kill generic_file_buffered_write()
ocfs2_file_aio_write(): switch to generic_perform_write()
ceph_aio_write(): switch to generic_perform_write()
xfs_file_buffered_aio_write(): switch to generic_perform_write()
export generic_perform_write(), start getting rid of generic_file_buffer_write()
generic_file_direct_write(): get rid of ppos argument
btrfs_file_aio_write(): get rid of ppos
kill the 5th argument of generic_file_buffered_write()
kill the 4th argument of __generic_file_aio_write()
lustre: don't open-code kernel_recvmsg()
ocfs2: don't open-code kernel_recvmsg()
drbd: don't open-code kernel_recvmsg()
constify blk_rq_map_user_iov() and friends
lustre: switch to kernel_sendmsg()
ocfs2: don't open-code kernel_sendmsg()
take iov_iter stuff to mm/iov_iter.c
process_vm_access: tidy up a bit
...
08 Apr, 2014
2 commits
-
This patch creates a generic implementation of early_ioremap() support
based on the existing x86 implementation. early_ioremp() is useful for
early boot code which needs to temporarily map I/O or memory regions
before normal mapping functions such as ioremap() are available.Some architectures have optional MMU. In the no-MMU case, the remap
functions simply return the passed in physical address and the unmap
functions do nothing.Signed-off-by: Mark Salter
Acked-by: Catalin Marinas
Acked-by: H. Peter Anvin
Cc: Borislav Petkov
Cc: Dave Young
Cc: Will Deacon
Cc: Ingo Molnar
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This patch is a continuation of efforts trying to optimize find_vma(),
avoiding potentially expensive rbtree walks to locate a vma upon faults.
The original approach (https://lkml.org/lkml/2013/11/1/410), where the
largest vma was also cached, ended up being too specific and random,
thus further comparison with other approaches were needed. There are
two things to consider when dealing with this, the cache hit rate and
the latency of find_vma(). Improving the hit-rate does not necessarily
translate in finding the vma any faster, as the overhead of any fancy
caching schemes can be too high to consider.We currently cache the last used vma for the whole address space, which
provides a nice optimization, reducing the total cycles in find_vma() by
up to 250%, for workloads with good locality. On the other hand, this
simple scheme is pretty much useless for workloads with poor locality.
Analyzing ebizzy runs shows that, no matter how many threads are
running, the mmap_cache hit rate is less than 2%, and in many situations
below 1%.The proposed approach is to replace this scheme with a small per-thread
cache, maximizing hit rates at a very low maintenance cost.
Invalidations are performed by simply bumping up a 32-bit sequence
number. The only expensive operation is in the rare case of a seq
number overflow, where all caches that share the same address space are
flushed. Upon a miss, the proposed replacement policy is based on the
page number that contains the virtual address in question. Concretely,
the following results are seen on an 80 core, 8 socket x86-64 box:1) System bootup: Most programs are single threaded, so the per-thread
scheme does improve ~50% hit rate by just adding a few more slots to
the cache.+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 50.61% | 19.90 |
| patched | 73.45% | 13.58 |
+----------------+----------+------------------+2) Kernel build: This one is already pretty good with the current
approach as we're dealing with good locality.+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 75.28% | 11.03 |
| patched | 88.09% | 9.31 |
+----------------+----------+------------------+3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.
+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 70.66% | 17.14 |
| patched | 91.15% | 12.57 |
+----------------+----------+------------------+4) Ebizzy: There's a fair amount of variation from run to run, but this
approach always shows nearly perfect hit rates, while baseline is just
about non-existent. The amounts of cycles can fluctuate between
anywhere from ~60 to ~116 for the baseline scheme, but this approach
reduces it considerably. For instance, with 80 threads:+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 1.06% | 91.54 |
| patched | 99.97% | 14.18 |
+----------------+----------+------------------+[akpm@linux-foundation.org: fix nommu build, per Davidlohr]
[akpm@linux-foundation.org: document vmacache_valid() logic]
[akpm@linux-foundation.org: attempt to untangle header files]
[akpm@linux-foundation.org: add vmacache_find() BUG_ON]
[hughd@google.com: add vmacache_valid_mm() (from Oleg)]
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: adjust and enhance comments]
Signed-off-by: Davidlohr Bueso
Reviewed-by: Rik van Riel
Acked-by: Linus Torvalds
Reviewed-by: Michel Lespinasse
Cc: Oleg Nesterov
Tested-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
04 Apr, 2014
1 commit
-
The VM maintains cached filesystem pages on two types of lists. One
list holds the pages recently faulted into the cache, the other list
holds pages that have been referenced repeatedly on that first list.
The idea is to prefer reclaiming young pages over those that have shown
to benefit from caching in the past. We call the recently usedbut
ultimately was not significantly better than a FIFO policy and still
thrashed cache based on eviction speed, rather than actual demand for
cache.This patch solves one half of the problem by decoupling the ability to
detect working set changes from the inactive list size. By maintaining
a history of recently evicted file pages it can detect frequently used
pages with an arbitrarily small inactive list size, and subsequently
apply pressure on the active list based on actual demand for cache, not
just overall eviction speed.Every zone maintains a counter that tracks inactive list aging speed.
When a page is evicted, a snapshot of this counter is stored in the
now-empty page cache radix tree slot. On refault, the minimum access
distance of the page can be assessed, to evaluate whether the page
should be part of the active list or not.This fixes the VM's blindness towards working set changes in excess of
the inactive list. And it's the foundation to further improve the
protection ability and reduce the minimum inactive list size of 50%.Signed-off-by: Johannes Weiner
Reviewed-by: Rik van Riel
Reviewed-by: Minchan Kim
Reviewed-by: Bob Liu
Cc: Andrea Arcangeli
Cc: Christoph Hellwig
Cc: Dave Chinner
Cc: Greg Thelen
Cc: Hugh Dickins
Cc: Jan Kara
Cc: KOSAKI Motohiro
Cc: Luigi Semenzato
Cc: Mel Gorman
Cc: Metin Doslu
Cc: Michel Lespinasse
Cc: Ozgun Erdogan
Cc: Peter Zijlstra
Cc: Roman Gushchin
Cc: Ryan Mallon
Cc: Tejun Heo
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
02 Apr, 2014
1 commit
-
Signed-off-by: Al Viro
31 Jan, 2014
1 commit
-
This patch moves zsmalloc under mm directory.
Before that, description will explain why we have needed custom
allocator.Zsmalloc is a new slab-based memory allocator for storing compressed
pages. It is designed for low fragmentation and high allocation success
rate on large object, but
Acked-by: Nitin Gupta
Reviewed-by: Konrad Rzeszutek Wilk
Cc: Bob Liu
Cc: Greg Kroah-Hartman
Cc: Hugh Dickins
Cc: Jens Axboe
Cc: Luigi Semenzato
Cc: Mel Gorman
Cc: Pekka Enberg
Cc: Rik van Riel
Cc: Seth Jennings
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
11 Sep, 2013
1 commit
-
Several subsystems use the same construct for LRU lists - a list head, a
spin lock and and item count. They also use exactly the same code for
adding and removing items from the LRU. Create a generic type for these
LRU lists.This is the beginning of generic, node aware LRUs for shrinkers to work
with.[glommer@openvz.org: enum defined constants for lru. Suggested by gthelen, don't relock over retry]
Signed-off-by: Dave Chinner
Signed-off-by: Glauber Costa
Reviewed-by: Greg Thelen
Acked-by: Mel Gorman
Cc: "Theodore Ts'o"
Cc: Adrian Hunter
Cc: Al Viro
Cc: Artem Bityutskiy
Cc: Arve Hjønnevåg
Cc: Carlos Maiolino
Cc: Christoph Hellwig
Cc: Chuck Lever
Cc: Daniel Vetter
Cc: David Rientjes
Cc: Gleb Natapov
Cc: Greg Thelen
Cc: J. Bruce Fields
Cc: Jan Kara
Cc: Jerome Glisse
Cc: John Stultz
Cc: KAMEZAWA Hiroyuki
Cc: Kent Overstreet
Cc: Kirill A. Shutemov
Cc: Marcelo Tosatti
Cc: Mel Gorman
Cc: Steven Whitehouse
Cc: Thomas Hellstrom
Cc: Trond Myklebust
Signed-off-by: Andrew MortonSigned-off-by: Al Viro